I have a backup script that runs every 2 hours. I want to use CloudWatch to track its successful executions, and CloudWatch Alarms to notify me when something goes wrong with it.
The script puts a data point into a CloudWatch metric after each successful backup:
mon-put-data --namespace Backup --metric-name $metric --unit Count --value 1
I have an alarm that goes into the ALARM state when the "Sum" statistic of the metric is less than 2 over a 6-hour period.
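For reference, the alarm definition corresponds roughly to the command sketched here. This is a reconstruction, not the exact command I ran: the alarm name is the one that appears in the alarm history, while the metric name default, the single evaluation period and the SNS topic ARN are placeholders or assumptions. The sketch only prints the command and sends nothing to AWS.

```shell
#!/bin/sh
# Sketch of the alarm definition for the legacy CloudWatch CLI.
# Placeholders (assumptions, not taken from my real setup):
metric=${metric:-backup-svn}                                   # metric name
topic_arn="arn:aws:sns:us-east-1:123456789012:backup-alerts"   # hypothetical SNS topic

period=$((6 * 60 * 60))   # 6 hours in seconds = 21600

# ALARM when Sum < 2 over one 6-hour period.
alarm_cmd="mon-put-metric-alarm alarm-backup-svn \
--namespace Backup --metric-name $metric \
--statistic Sum --period $period --evaluation-periods 1 \
--threshold 2 --comparison-operator LessThanThreshold \
--alarm-actions $topic_arn"

# Print the command for inspection instead of running it.
echo "$alarm_cmd"
```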
To test this setup, a day later I stopped putting data into the metric (i.e. I commented out the mon-put-data command). As expected, the alarm eventually went into the ALARM state and I received an email notification.
The problem is that after a while the alarm returned to the OK state, even though no new data was being added to the metric!
Both transitions (OK => ALARM, then ALARM => OK) were logged, and I reproduce the history entries below. Note that although both show "period: 21600" (i.e. 6 hours), the second one shows a 12-hour span between startDate and queryDate. I can see that this might explain the transition, but I don't understand why CloudWatch would consider a 12-hour window when computing a statistic with a 6-hour period!
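To make the mismatch concrete, the window covered by each evaluation can be computed from the startDate and queryDate values in the two newState entries. This is only a small sketch: it assumes a `date` that accepts `-d "YYYY-MM-DD HH:MM:SS"` (as GNU coreutils does), and the helper name `elapsed_seconds` is my own.

```shell
#!/bin/sh
# Seconds between an evaluation's startDate and queryDate (both UTC).
# Assumes GNU-style `date -d`; fractional seconds are dropped.
elapsed_seconds() {
    start=$(date -u -d "$1" +%s)
    query=$(date -u -d "$2" +%s)
    echo $((query - start))
}

period=21600   # alarm period in seconds (6 hours)

# newState of the OK => ALARM transition
alarm_window=$(elapsed_seconds "2013-03-06 09:12:00" "2013-03-06 15:12:01")
# newState of the ALARM => OK transition
ok_window=$(elapsed_seconds "2013-03-06 05:46:00" "2013-03-06 17:46:01")

echo "ALARM evaluation: ${alarm_window}s = $((alarm_window / period)) period(s)"
echo "OK evaluation: ${ok_window}s = $((ok_window / period)) period(s)"
```

This prints 21601s (1 period) for the ALARM evaluation and 43201s (2 periods) for the OK evaluation, so the second evaluation really does span two full periods.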
What am I missing here? How do I set up the alarm to achieve what I want (i.e. receive a notification if the backups fail)?
{
  "Timestamp": "2013-03-06T15:12:01.069Z",
  "HistoryItemType": "StateUpdate",
  "AlarmName": "alarm-backup-svn",
  "HistoryData": {
    "version": "1.0",
    "oldState": {
      "stateValue": "OK",
      "stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (3.0).",
      "stateReasonData": {
        "version": "1.0",
        "queryDate": "2013-03-05T21:12:44.081+0000",
        "startDate": "2013-03-05T15:12:00.000+0000",
        "statistic": "Sum",
        "period": 21600,
        "recentDatapoints": [
          3
        ],
        "threshold": 3
      }
    },
    "newState": {
      "stateValue": "ALARM",
      "stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
      "stateReasonData": {
        "version": "1.0",
        "queryDate": "2013-03-06T15:12:01.052+0000",
        "startDate": "2013-03-06T09:12:00.000+0000",
        "statistic": "Sum",
        "period": 21600,
        "recentDatapoints": [
          1
        ],
        "threshold": 2
      }
    }
  },
  "HistorySummary": "Alarm updated from OK to ALARM"
}
The second one I just can't understand:
{
  "Timestamp": "2013-03-06T17:46:01.063Z",
  "HistoryItemType": "StateUpdate",
  "AlarmName": "alarm-backup-svn",
  "HistoryData": {
    "version": "1.0",
    "oldState": {
      "stateValue": "ALARM",
      "stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
      "stateReasonData": {
        "version": "1.0",
        "queryDate": "2013-03-06T15:12:01.052+0000",
        "startDate": "2013-03-06T09:12:00.000+0000",
        "statistic": "Sum",
        "period": 21600,
        "recentDatapoints": [
          1
        ],
        "threshold": 2
      }
    },
    "newState": {
      "stateValue": "OK",
      "stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (2.0).",
      "stateReasonData": {
        "version": "1.0",
        "queryDate": "2013-03-06T17:46:01.041+0000",
        "startDate": "2013-03-06T05:46:00.000+0000",
        "statistic": "Sum",
        "period": 21600,
        "recentDatapoints": [
          3
        ],
        "threshold": 2
      }
    }
  },
  "HistorySummary": "Alarm updated from ALARM to OK"
}