Amazon CloudWatch is an AWS monitoring service for your applications, services, and resources. It collects performance and operational data and compiles that data into interpretable metrics.
From the CloudWatch homepage, you can visualize your metrics data in pre-built charts. You can also create custom dashboards to highlight the specific metrics or sets of metrics you need. These dashboards provide system-wide visibility into your operations, while alarms alert you to resource or status changes.
Amazon CloudWatch is made of three main components: storage, CloudWatch Alarms, and an analytics and visualization engine. It serves as a metrics repository, holding the metrics data your services deliver to it and exposing that data through analytics or alarms. You can also use CloudWatch to ingest external data sources, keeping all of your metrics monitoring in one place.
Many AWS services, including EC2, Kinesis, and S3, automatically send metrics to CloudWatch in one-minute or five-minute increments. For the first 12 months of your account, you can access a limited amount of these metrics for free. After that, there is a charge. To access your metrics data, you can use the Management Console or ingest CloudWatch data with third-party tools.
If you have resources in multiple regions, CloudWatch operates separately in each one. This means metrics from each region are stored separately. However, you can aggregate metrics using the CloudWatch cross-region features as needed.
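To make the idea of combining per-region metrics concrete, here is a minimal client-side sketch. It is not the CloudWatch cross-region feature itself; it only assumes you have already fetched datapoints for the same metric from each region as (timestamp, value) pairs. The region names and values are hypothetical sample data.

```python
from collections import defaultdict
from datetime import datetime, timezone


def aggregate_regions(per_region):
    """Sum datapoints that share a timestamp across all regions."""
    totals = defaultdict(float)
    for datapoints in per_region.values():
        for timestamp, value in datapoints:
            totals[timestamp] += value
    # Return the combined series ordered by time.
    return sorted(totals.items())


# Hypothetical sample data: request counts per five-minute period.
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)
sample = {
    "us-east-1": [(t0, 10.0), (t1, 20.0)],
    "eu-west-1": [(t0, 5.0)],
}
combined = aggregate_regions(sample)
print(combined)
```

Periods that exist in only one region are kept as-is, so a missing region lowers the total rather than dropping the period.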
Related content: read our guide to CloudWatch Logs Insights, which can help you run queries on log data collected by CloudWatch
When using CloudWatch, the following concepts are important to understand. They help ensure that you interpret metrics correctly and can apply insights effectively.
Namespaces are containers used to store metrics. These containers separate metrics from your various services and applications so that you always know which asset a metric belongs to. When adding data sources to CloudWatch, you need to define a namespace for each source and metric you collect.
Metrics are time-series data sent to CloudWatch. When accessing metrics, you can view data points individually or as an ordered set, depending on whether you want point-in-time information or trends.
Each metric you collect is defined by a name, a namespace, and zero or more dimensions. Each data point carries a timestamp and (if relevant) a unit of measure. Once metrics are collected, you cannot manually delete them. Instead, data is automatically purged after 15 months.
Timestamps are metadata used to determine when a metric was created or ingested. You can define timestamps from two hours in the future to two weeks in the past. Alternatively, CloudWatch can automatically assign a timestamp based on when the metric was received. Timestamps are used to order metrics data for analysis and to trigger alarms.
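The accepted timestamp window described above can be checked before data is sent. The following is a minimal sketch; the function and constant names are illustrative and not part of any AWS SDK.

```python
from datetime import datetime, timedelta, timezone

MAX_PAST = timedelta(weeks=2)    # oldest timestamp described in the text
MAX_FUTURE = timedelta(hours=2)  # furthest future timestamp described in the text


def is_accepted_timestamp(ts, now):
    """Return True if ts lies between two weeks ago and two hours from now."""
    return now - MAX_PAST <= ts <= now + MAX_FUTURE


now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
print(is_accepted_timestamp(now - timedelta(days=3), now))   # inside the window
print(is_accepted_timestamp(now - timedelta(days=20), now))  # too old
```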
Dimensions are name/value pairs that categorize metric characteristics. Each metric you create can have up to 30 dimensions defined (older documentation cites the previous limit of 10). You can use these dimensions to distinguish between multiple instances of the same service and to filter metrics by service use. For example, you can use the InstanceId dimension on your EC2 metrics to tell individual instances apart when monitoring.
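The sketch below shows how a namespace, metric name, dimensions, unit, and optional timestamp fit together in a single data point, shaped like the request that boto3's `put_metric_data` accepts. The namespace `MyApp/Orders`, the metric name, and the instance ID are hypothetical, and `publish` assumes boto3 is installed and AWS credentials are configured.

```python
def build_metric_datum(name, value, unit, dimensions, timestamp=None):
    """Build one MetricData entry for a put_metric_data request."""
    datum = {
        "MetricName": name,
        # Dimensions are sent as a list of name/value pairs.
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Value": value,
        "Unit": unit,
    }
    if timestamp is not None:
        # When omitted, CloudWatch assigns the time of receipt.
        datum["Timestamp"] = timestamp
    return datum


def publish(namespace, datum):
    """Send the datum to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("cloudwatch")
    client.put_metric_data(Namespace=namespace, MetricData=[datum])


# Hypothetical custom metric for one EC2 instance.
datum = build_metric_datum(
    "OrdersProcessed", 12, "Count", {"InstanceId": "i-0123456789abcdef0"}
)
print(datum)
# publish("MyApp/Orders", datum)  # uncomment to send to a real account
```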
Alarms in CloudWatch are thresholds you define for specific metrics. Alarms trigger according to state changes, not current values. You can use these alarms to start, stop, or terminate resources or to notify your team that something has changed. To automate responses, you need to tie CloudWatch Events (now Amazon EventBridge) to your service. Then you can use events to trigger actions.
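To illustrate what "state changes, not current values" means, here is a simplified simulation. It is a sketch of the idea only, not the actual CloudWatch evaluation logic: each value is compared to a threshold, and an event is recorded only when the resulting state differs from the previous one. The threshold and sample values are made up.

```python
def state_for(value, threshold):
    """Map a single evaluated value to an alarm state."""
    return "ALARM" if value > threshold else "OK"


def transitions(values, threshold):
    """Return (index, old_state, new_state) for every state change."""
    events = []
    previous = "INSUFFICIENT_DATA"  # state before any data is evaluated
    for index, value in enumerate(values):
        current = state_for(value, threshold)
        if current != previous:
            events.append((index, previous, current))
        previous = current
    return events


# Three consecutive breaches produce one ALARM transition, not three.
events = transitions([70, 85, 90, 95, 60], threshold=80)
for event in events:
    print(event)
```

Under this model, a notification tied to the ALARM state fires once when the threshold is first crossed, and again only after the metric recovers and breaches a second time.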
When monitoring your EC2 instances, the following metrics can help you ensure that your instances are performing as expected.
The NetworkIn, NetworkOut, NetworkPacketsIn, and NetworkPacketsOut metrics, in combination, can give you a solid understanding of how your network connections are performing. The network in/out metrics report your traffic volume in bytes, while the packets in/out metrics report it as a number of packets.
When watching these metrics, you should look for sharp increases, decreases, or long periods of time without traffic. When creating alarms for these values, consider using the Average statistic and alert when two points in a given period deviate.
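One way to express "Average statistic, alert when two points deviate" is an alarm that fires when 2 of the last 3 five-minute averages breach a threshold. The sketch below builds the arguments for boto3's `put_metric_alarm`. The alarm name, the 3-period window, the threshold, and the SNS topic ARN are assumptions chosen for illustration, and `create_alarm` requires boto3 and AWS credentials.

```python
def network_alarm_params(instance_id, threshold_bytes, sns_topic_arn):
    """Build put_metric_alarm arguments for high inbound traffic."""
    return {
        "AlarmName": f"{instance_id}-network-in-high",  # hypothetical name
        "Namespace": "AWS/EC2",
        "MetricName": "NetworkIn",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # seconds per datapoint
        "EvaluationPeriods": 3,    # look at the last 3 datapoints
        "DatapointsToAlarm": 2,    # alarm when 2 of them breach
        "Threshold": float(threshold_bytes),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params):
    """Create the alarm (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)


params = network_alarm_params(
    "i-0123456789abcdef0",
    threshold_bytes=500_000_000,
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:ops-alerts",  # placeholder
)
print(params["AlarmName"], params["DatapointsToAlarm"], "of", params["EvaluationPeriods"])
```

A second alarm with `LessThanThreshold` could cover the "long periods without traffic" case.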
If you use burstable performance instances (such as the T family), you may benefit from the CPUCreditBalance metric. This metric measures your current burst credit balance. Burst credits let your instances temporarily process more operations than their baseline allows. If you run out of burst credits, performance suddenly drops to the baseline for that instance type.
When watching this metric, you need to catch changes before your credits run out. The best way to do this is to set an alarm for when the metric's Average statistic drops below 25% of your maximum balance.
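The 25% rule translates into a simple threshold calculation. In this sketch the maximum credit balance is an input you look up for your own instance type; the value 144 below is only an illustrative figure, not a statement about any particular instance.

```python
def credit_alarm_threshold(max_credit_balance, fraction=0.25):
    """Return the CPUCreditBalance value below which to alarm."""
    if max_credit_balance <= 0:
        raise ValueError("max_credit_balance must be positive")
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return max_credit_balance * fraction


# Illustrative maximum balance; check the value for your instance type.
threshold = credit_alarm_threshold(144)
print(threshold)
```

The result would be used as the `Threshold` of an alarm on the Average statistic with a `LessThanThreshold` comparison.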
The CPUUtilization metric measures the total percentage of your instance's CPU resources in use. This metric should fluctuate over time but should generally remain below 80% of your total available CPU. If it is consistently 50% or lower, this is an indication that you can downscale your instance.
When watching this metric, avoid point-in-time values and focus on time periods. You should set alarms for Average statistics that stay above a certain percentage, depending on your instance's priority.
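The guidance above can be sketched as a check over a window of period averages rather than a single value. Here "consistently" is interpreted as "every datapoint in the window", which is an assumption; the 80% and 50% cutoffs come from the text, while the labels are hypothetical.

```python
def classify_cpu(averages, high=80.0, low=50.0):
    """Classify a window of CPUUtilization averages (percent)."""
    if not averages:
        raise ValueError("at least one datapoint is required")
    if all(value > high for value in averages):
        return "overloaded"
    if all(value <= low for value in averages):
        return "downscale-candidate"
    return "ok"


print(classify_cpu([85.0, 91.5, 88.0]))  # every period above 80%
print(classify_cpu([30.0, 42.0, 18.5]))  # every period at or below 50%
print(classify_cpu([30.0, 95.0, 60.0]))  # a single spike is not a trend
```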
When monitoring your EBS volumes, the following metrics can help you ensure that you have the right amount of resources available.
The VolumeTotalReadTime and VolumeTotalWriteTime metrics measure the amount of time your volume spends performing read/write operations. While these metrics are not very informative on their own, together with the read and write operations metrics they can help you monitor your volume latency.
To calculate the average latency per operation, you can use the following formula:
(VolumeTotalReadTime + VolumeTotalWriteTime) / (VolumeReadOps + VolumeWriteOps)
You can set alarms based on this formula by using the Sum statistic for all metrics involved.
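As a worked example of the formula, the sketch below computes average latency per operation from the Sum of each metric over one period. The sample numbers are invented, and returning `None` for an idle period is a design choice made here to avoid dividing by zero.

```python
def average_latency(total_read_time, total_write_time, read_ops, write_ops):
    """Average seconds per I/O operation for one period, or None if idle."""
    total_ops = read_ops + write_ops
    if total_ops == 0:
        return None  # no operations, so latency is undefined
    return (total_read_time + total_write_time) / total_ops


# Invented sums for a single period: 2.0 s of I/O time over 400 operations.
latency = average_latency(
    total_read_time=1.5, total_write_time=0.5, read_ops=300, write_ops=100
)
print(f"{latency * 1000:.1f} ms per operation")
```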
The VolumeQueueLength metric measures how many volume operations you have queued. You should expect a small queue most or all of the time, since a queue of zero means an idle volume. The goal is to alert before your queue gets too large.
When measuring this metric, you should set an alarm based on the Average statistic. If your average consistently increases over a given period with no decreases, this indicates an issue.
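A simple way to detect "consistently increases with no decreases" is to check that each period average is strictly higher than the one before it. This is a minimal sketch; the minimum window of three points is an assumption you would tune.

```python
def is_steadily_increasing(averages, min_points=3):
    """True if every average is higher than the previous one."""
    if len(averages) < min_points:
        return False  # too little data to call it a trend
    return all(later > earlier for earlier, later in zip(averages, averages[1:]))


print(is_steadily_increasing([1.0, 1.8, 2.5, 4.1]))  # queue keeps growing
print(is_steadily_increasing([1.0, 1.8, 1.2, 4.1]))  # queue drained once
```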
The VolumeThroughputPercentage metric measures how much of your provisioned IOPS is available. This is different from the percentage in active use. To fall within service level agreements, your volumes should remain within 10% of your provisioned amount 99.9% of the time. If you are seeing greater variation than this, it can indicate a problem with your application or the service itself.
When monitoring this metric, use the Average statistic and watch for values less than 90%. Because AWS generally handles issues with insufficient availability, this typically does not need to be a high priority alert.
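To compare observed behavior with the 99.9% figure, you can compute the share of period averages that stayed at or above 90%. The sketch below does only that arithmetic; the sample values are invented and the function name is illustrative.

```python
def share_within_target(percentages, floor=90.0):
    """Fraction of datapoints at or above the floor percentage."""
    if not percentages:
        raise ValueError("at least one datapoint is required")
    within = sum(1 for value in percentages if value >= floor)
    return within / len(percentages)


# Invented period averages: one of four periods fell below 90%.
share = share_within_target([100.0, 98.0, 85.0, 95.0])
print(f"{share:.1%} of periods met the target")
```

A share below 0.999 over a long window suggests more variation than the text describes as acceptable.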
NetApp Cloud Insights is an infrastructure monitoring tool that gives you visibility into your complete infrastructure. With Cloud Insights, you can monitor, troubleshoot and optimize all your resources including your public clouds and your private data centers.
Unlike CloudWatch, NetApp Cloud Insights provides: