Reliability is the first thing to get right with cloud services. It’s not just a nice-to-have; the cloud wouldn’t be viable without a credible core of reliability.
And yet no system can be one hundred percent reliable. With appropriate management and proactive monitoring, however, it is possible to reach extremely high levels of sustained uptime.
CloudWatch is the AWS monitoring service, and it can help you sustain high availability for workloads running in the AWS environment. Read on to see why CloudWatch is a good idea, along with some practical advice on setting it up.
Early Issue Detection: Identify problems before they impact users. Uptime monitoring alerts you promptly when a service goes down, allowing for quick resolution.
Improved Reliability: Maintain consistent service availability and build user trust.
Reduced Downtime: Minimise the effect of outages by addressing issues promptly. The sooner a failure is noticed, the shorter the outage users experience.
Data-Driven Decisions: Use monitoring data to make informed decisions about infrastructure and application improvements.
We needed an easy and reliable way to check if our site was up and responding, and we preferred a solution that worked with our AWS setup. Though regular health checks or uptime monitors were available, we were looking for something that was simple to configure and could send alerts straight to our existing support system. Above all, we wanted the monitoring to fit directly into how we already handle support issues.
This led us to CloudWatch Synthetics canaries: automated scripts that run on a schedule and check an endpoint the way a user would.
We deployed a canary to ping our site at regular intervals. A sustained failure triggers a CloudWatch alarm, which notifies our support desk by email and, in turn, creates a Jira support ticket. From there, our team handles the problem. The objective? Shift detection time from hours to mere minutes. In the realm of uptime, every minute is critical.
We used OpenTofu (the open-source fork of Terraform) to codify and deploy the whole monitoring setup. Here's a breakdown of the key elements:
Prerequisites
The canary: We configured an AWS CloudWatch canary to hit our NewRedo website's homepage at a regular interval. This essentially mimics a user visiting our website. It checks for an HTTP 200 response and measures the response time.
The alarm: The CloudWatch alarm is set up with the following condition: SuccessPercent < 100 for 2 datapoints within 10 minutes. This means if the canary fails twice in a ten-minute window, the alarm fires. It's a balanced setup that filters out one-off blips but catches persistent failures quickly.
Alerting via Email to Jira: When the alarm triggers, an SNS topic sends an email to our Jira support desk. Jira auto-generates a support ticket, ensuring the issue is logged, prioritised, and handled promptly.
OpenTofu Deployment: We used OpenTofu to manage the entire configuration: canary scripts, IAM roles, CloudWatch alarms, and SNS topics. This makes the setup repeatable and version-controlled, and fits seamlessly into our infrastructure-as-code approach.
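To make the canary check and the alarm condition above more concrete, here is a minimal Python sketch of the logic only. It is not our canary script or OpenTofu configuration: the names `CheckResult`, `http_get_status`, `run_check` and `alarm_fires` are hypothetical, and we assume one SuccessPercent datapoint per 5-minute period, so that two evaluation periods cover the ten-minute window.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CheckResult:
    ok: bool           # True when the page answered with HTTP 200
    status: int        # HTTP status code, or 0 if no response was received
    elapsed_ms: float  # measured response time in milliseconds


def http_get_status(url: str, timeout_s: float = 10.0) -> int:
    """Return the HTTP status for a GET request, or 0 on a network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0         # DNS failure, refused connection, timeout, ...


def run_check(url: str, fetch: Callable[[str], int] = http_get_status) -> CheckResult:
    """One canary-style run: request the page, record status and timing."""
    start = time.perf_counter()
    status = fetch(url)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return CheckResult(ok=(status == 200), status=status, elapsed_ms=elapsed_ms)


def alarm_fires(
    success_percents: Sequence[float],
    threshold: float = 100.0,
    datapoints_to_alarm: int = 2,
    evaluation_periods: int = 2,  # assumption: 2 periods x 5 minutes = 10 minutes
) -> bool:
    """True when enough recent datapoints fall below the threshold."""
    window = list(success_percents)[-evaluation_periods:]
    breaching = sum(1 for value in window if value < threshold)
    return breaching >= datapoints_to_alarm
```

With these assumptions, a single failed period such as `[100, 0, 100]` does not fire the alarm, while two failed periods in a row such as `[100, 0, 0]` do. The `fetch` parameter exists only so the check can be exercised without touching the network.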
To test the configuration, we used a system that had the setup in place. We deliberately brought this system down for a short time and checked the CloudWatch dashboard as well as the notification channels to make sure all the alerts were firing as expected.
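A drill like this can also be timed rather than watched by hand. The sketch below is an illustration, not the procedure we followed: `wait_for_alarm` is a hypothetical helper that polls a caller-supplied `get_state` function until it reports `"ALARM"` (one of the CloudWatch alarm states, alongside `"OK"` and `"INSUFFICIENT_DATA"`). In practice `get_state` might wrap a call such as boto3's `describe_alarms`; here it is left as a plain callable so the code needs nothing beyond the standard library.

```python
import time
from typing import Callable, Optional


def wait_for_alarm(
    get_state: Callable[[], str],
    timeout_s: float = 900.0,
    poll_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[float]:
    """Poll get_state() until it returns "ALARM".

    Returns the seconds waited before the alarm state was observed,
    or None if another poll would exceed the timeout.
    """
    start = clock()
    while True:
        if get_state() == "ALARM":
            return clock() - start
        if clock() - start + poll_s > timeout_s:
            return None  # give up: the alarm did not fire in time
        sleep(poll_s)
```

Started at the moment the test system is taken down, the returned value gives a rough detection time for the drill; a `None` result suggests the alarm or its threshold needs another look.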
In conclusion, we have had this setup in place for some time now and it has already paid off: it runs quietly in the background and only alerts us when we need to know.