Monitor AWS EC2 instances using the AWS CloudWatch Agent and custom metrics. Create CloudWatch Alarms using CloudFormation templates.
Many third-party tools can monitor EC2 instances. Even so, I would like to give you all the information you need to quickly set up EC2 monitoring using AWS-native resources and tools: AWS CloudWatch, the CloudWatch Agent, and CloudFormation.
First of all, you should know that, at the time of writing, only the following metrics are available in CloudWatch by default after you launch an EC2 instance: CPUCreditBalance, NetworkPacketsIn, NetworkOut, DiskReadOps, StatusCheckFailed_Instance, DiskReadBytes, NetworkIn, StatusCheckFailed, NetworkPacketsOut, DiskWriteBytes, CPUSurplusCreditsCharged, CPUCreditUsage, CPUSurplusCreditBalance, DiskWriteOps, StatusCheckFailed_System, CPUUtilization.
To monitor disk space or memory, you must install the CloudWatch Agent on the instance. The agent periodically sends these custom metrics to AWS CloudWatch.
Below you will find all the necessary scripts and CloudFormation templates, which will let you set everything up in a few minutes, even if you are new to AWS and have never used AWS CloudWatch before.
First, let’s briefly describe each step:
- Attach an IAM Role to the EC2 instance. It allows the CloudWatch Agent installed on the instance to send custom metrics to AWS CloudWatch.
- Using the provided examples, create JSON config files and save them to S3. These config files are used during CloudWatch Agent installation and tell the agent exactly which custom metrics to send to AWS CloudWatch from your instance.
- Install the CloudWatch Agent by running the provided Bash or PowerShell scripts.
- Create SNS topics and add subscribers who will be notified when a metric crosses its threshold.
- Create CloudWatch Alarms using CloudFormation templates.
Step #1: Make sure that the IAM Role attached to your EC2 instance has the AWS managed policy CloudWatchAgentServerPolicy attached.
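As a rough illustration of what Step #1 involves, the sketch below builds the trust policy an EC2 instance role typically carries and checks whether a list of attached policy ARNs includes CloudWatchAgentServerPolicy. The function names are hypothetical and nothing here calls AWS; it only shows the shape of the data, under the assumption that you fetch the attached policy ARNs yourself (for example from the IAM console).

```python
import json
from typing import List

# ARN of the AWS managed policy named in Step #1.
MANAGED_POLICY_ARN = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"


def ec2_trust_policy() -> dict:
    """Trust policy that lets the EC2 service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def has_agent_policy(attached_policy_arns: List[str]) -> bool:
    """Return True if the managed agent policy is among the attached ARNs."""
    return MANAGED_POLICY_ARN in attached_policy_arns


if __name__ == "__main__":
    # Print the trust policy so it can be pasted into a role definition.
    print(json.dumps(ec2_trust_policy(), indent=2))
    print(has_agent_policy([MANAGED_POLICY_ARN]))
```

If the check comes back False for your role, attach the managed policy before installing the agent; otherwise the agent will run but its metrics will be rejected.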
Step #2: Create JSON config files using the information provided below. There are two of them because the Linux and Windows configurations use slightly different syntax. Save both files (cloudwatchagent-linux-config.json and cloudwatchagent-windows-config.json) to an S3 bucket and make sure they are publicly accessible. Note the URL of each file; you will need it in the next step.
Linux OS:
{
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "disk": {
        "measurement": [
          "used_percent"
        ],
        "metrics_collection_interval": 300,
        "resources": [
          "*"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ],
        "metrics_collection_interval": 300
      }
    }
  }
}
Windows OS:
{
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "LogicalDisk": {
        "measurement": [
          "% Free Space"
        ],
        "metrics_collection_interval": 300,
        "resources": [
          "*"
        ]
      },
      "Memory": {
        "measurement": [
          "% Committed Bytes In Use"
        ],
        "metrics_collection_interval": 300
      }
    }
  }
}
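If you would rather generate the two config files than copy them by hand, a minimal sketch could look like the following. The function name build_agent_config and its parameters are assumptions made for illustration; the keys and values mirror the two JSON documents above, and the collection interval is exposed only as an example of something you might want to vary.

```python
import json


def build_agent_config(os_family: str, interval: int = 300) -> dict:
    """Build a CloudWatch Agent metrics config for 'linux' or 'windows'."""
    if os_family == "linux":
        collected = {
            "disk": {
                "measurement": ["used_percent"],
                "metrics_collection_interval": interval,
                "resources": ["*"],
            },
            "mem": {
                "measurement": ["mem_used_percent"],
                "metrics_collection_interval": interval,
            },
        }
    elif os_family == "windows":
        collected = {
            "LogicalDisk": {
                "measurement": ["% Free Space"],
                "metrics_collection_interval": interval,
                "resources": ["*"],
            },
            "Memory": {
                "measurement": ["% Committed Bytes In Use"],
                "metrics_collection_interval": interval,
            },
        }
    else:
        raise ValueError(f"unsupported os_family: {os_family!r}")

    return {
        "metrics": {
            # The agent substitutes the real instance ID at runtime.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": collected,
        }
    }


if __name__ == "__main__":
    # Print the Linux config; redirect the output to a file to save it.
    print(json.dumps(build_agent_config("linux"), indent=2))
```

Generating both files from one function keeps the shared parts (the InstanceId dimension and the interval) in sync between the Linux and Windows variants.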
Step #3: To install the CloudWatch Agent on Linux (Ubuntu, CentOS, Amazon Linux) and Windows, use the following userdata scripts.
In the Linux userdata scripts, replace https://YOUR-PUBLIC-URL-HERE-WITH-CONFIG-FILE (see line 9 of each script) with the S3 URL of your Linux config file, cloudwatchagent-linux-config.json.
Ubuntu OS:
#!/bin/bash
mkdir tempcloudwatch
cd tempcloudwatch
apt install wget -y
apt install unzip -y
wget https://s3.amazonaws.com/amazoncloudwatch-agent/linux/amd64/latest/AmazonCloudWatchAgent.zip
unzip AmazonCloudWatchAgent.zip
sudo ./install.sh
wget https://YOUR-PUBLIC-URL-HERE-WITH-CONFIG-FILE -O config.json
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:config.json -s
CentOS / AmazonLinux OS:
#!/bin/bash
mkdir tempcloudwatch
cd tempcloudwatch
yum install wget -y
yum install unzip -y
wget https://s3.amazonaws.com/amazoncloudwatch-agent/linux/amd64/latest/AmazonCloudWatchAgent.zip
unzip AmazonCloudWatchAgent.zip
sudo ./install.sh
wget https://YOUR-PUBLIC-URL-HERE-WITH-CONFIG-FILE -O config.json
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:config.json -s
In the Windows userdata script, replace https://YOUR-PUBLIC-URL-HERE-WITH-CONFIG-FILE (see line 13 of the script) with the S3 URL of your Windows config file, cloudwatchagent-windows-config.json.
Windows OS:
<powershell>
mkdir "c:\cwagent"
wget "https://s3.amazonaws.com/amazoncloudwatch-agent/windows/amd64/latest/AmazonCloudWatchAgent.zip" -OutFile "C:\cwagent\cwagent.zip"
Add-Type -AssemblyName System.IO.Compression.FileSystem
function Unzip
{
param([string]$zipfile, [string]$outpath)
[System.IO.Compression.ZipFile]::ExtractToDirectory($zipfile, $outpath)
}
Unzip "C:\cwagent\cwagent.zip" "C:\cwagent"
cd "C:\cwagent"
.\install.ps1
wget https://YOUR-PUBLIC-URL-HERE-WITH-CONFIG-FILE -OutFile "C:\Program Files\Amazon\AmazonCloudWatchAgent\config.json"
cd "C:\Program Files\Amazon\AmazonCloudWatchAgent"
.\amazon-cloudwatch-agent-ctl.ps1 -a fetch-config -m ec2 -c file:config.json -s
</powershell>
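The placeholder substitution described for the userdata scripts can be automated. The sketch below is one possible approach, not part of the original scripts: render_userdata is a hypothetical helper, and the bucket URL in the usage example is made up. It refuses to render when the URL is not HTTPS or when the template does not contain the placeholder, so a silent no-op is less likely.

```python
from urllib.parse import urlparse

# Placeholder string used in the userdata scripts above.
PLACEHOLDER = "https://YOUR-PUBLIC-URL-HERE-WITH-CONFIG-FILE"


def render_userdata(template: str, config_url: str) -> str:
    """Replace the placeholder URL in a userdata script with config_url."""
    parsed = urlparse(config_url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"expected an https URL, got: {config_url!r}")
    if PLACEHOLDER not in template:
        raise ValueError("template does not contain the placeholder URL")
    return template.replace(PLACEHOLDER, config_url)


if __name__ == "__main__":
    # Only the line that downloads the config is shown, for brevity.
    template = "wget https://YOUR-PUBLIC-URL-HERE-WITH-CONFIG-FILE -O config.json\n"
    # Hypothetical bucket URL; use the URL you saved in Step #2.
    url = "https://example-bucket.s3.amazonaws.com/cloudwatchagent-linux-config.json"
    print(render_userdata(template, url), end="")
```

The same helper works for the PowerShell script, since the placeholder string is identical there.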
Step #4: Create two SNS topics.
Create two AWS SNS topics. I recommend having one SNS topic for critical alarms and another for warning alarms. This is entirely up to you, though: if the subscribers of both topics are going to be the same, you can simply pass the ARN of a single topic for both the criticalsnsarn and warningsnsarn parameters when you create the CloudFormation stacks in the next step.
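The "one topic or two" choice can be expressed as a small helper that produces values for the criticalsnsarn and warningsnsarn parameters. This is a sketch under assumptions: resolve_topic_arns is a hypothetical name, the ARNs in the usage example are invented, and the regular expression is only a loose format check, not proof that a topic exists.

```python
import re
from typing import Dict, Optional

# Loose shape check for an SNS topic ARN (partition, region, account, name).
SNS_ARN_RE = re.compile(
    r"arn:(aws|aws-cn|aws-us-gov):sns:[a-z0-9-]+:\d{12}:[A-Za-z0-9_-]{1,256}(\.fifo)?"
)


def resolve_topic_arns(critical: str, warning: Optional[str] = None) -> Dict[str, str]:
    """Return both stack parameters, reusing the critical topic if no
    separate warning topic is given."""
    warning = warning or critical
    for arn in (critical, warning):
        if not SNS_ARN_RE.fullmatch(arn):
            raise ValueError(f"not a valid SNS topic ARN: {arn!r}")
    return {"criticalsnsarn": critical, "warningsnsarn": warning}


if __name__ == "__main__":
    # Hypothetical ARN; substitute the topic(s) you created in Step #4.
    print(resolve_topic_arns("arn:aws:sns:us-east-1:123456789012:ec2-critical"))
```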
Step #5: Create CloudWatch Alarms using CloudFormation templates.
The following CloudFormation templates create CPU, Memory, SystemStatus, InstanceStatus, and DiskSpace CloudWatch alarms. Most alarms come in two types, WARNING and CRITICAL. The difference between them is simple: each type can have a separate SNS topic, and the warning alarms fire at a lower usage level than the critical ones. Basically, critical alarms should page the on-call person and warning alarms should only send an email.
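To make the WARNING and CRITICAL levels concrete, here is a simplified sketch of how a single usage reading maps to a severity, using the 90% and 95% thresholds from the templates. It deliberately ignores periods and evaluation counts, which real CloudWatch alarms also take into account, and the function names are hypothetical. The second helper shows why the Windows disk alarms use thresholds of 10 and 5: that metric reports free space, so the usage threshold is inverted.

```python
# Usage thresholds (percent) taken from the templates in this post.
WARNING_THRESHOLD = 90.0
CRITICAL_THRESHOLD = 95.0


def severity(used_percent: float) -> str:
    """Classify one usage reading as OK, WARNING, or CRITICAL."""
    if used_percent >= CRITICAL_THRESHOLD:
        return "CRITICAL"
    if used_percent >= WARNING_THRESHOLD:
        return "WARNING"
    return "OK"


def free_space_threshold(used_threshold: float) -> float:
    """Convert a 'percent used' threshold to a 'percent free' threshold."""
    return 100.0 - used_threshold


if __name__ == "__main__":
    for reading in (42.0, 91.5, 97.0):
        print(reading, severity(reading))
    print(free_space_threshold(WARNING_THRESHOLD))
```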
As stack parameters, you will need to provide the Instance Id, the Instance Name, and the ARNs of the SNS topics from step #4. For the disk space stack you will also need to provide the disk name. I recommend using the Instance Name as the stack name, because each CloudWatch alarm will then be named in the format EC2InstanceName-MetricType-randomcharacters (for example: SERVER01-DiskSpaceWARNING-1J60H7KL3CC0T).
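The stack parameters can be assembled and sanity-checked before you create a stack. The sketch below is an illustration with hypothetical helper names; it produces the ParameterKey/ParameterValue list shape that the CloudFormation CreateStack API accepts, applies the 1 to 50 character limit from the templates, and checks that the instance name is usable as a stack name (letters, digits, and hyphens, starting with a letter, at most 128 characters).

```python
import re
from typing import Dict, List

INSTANCE_ID_RE = re.compile(r"i-(?:[0-9a-f]{8}|[0-9a-f]{17})")
STACK_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9-]{0,127}")


def is_valid_stack_name(name: str) -> bool:
    """CloudFormation stack names: alphanumerics and hyphens, letter first."""
    return STACK_NAME_RE.fullmatch(name) is not None


def stack_parameters(instance_id: str, instance_name: str,
                     critical_arn: str, warning_arn: str,
                     **extra: str) -> List[Dict[str, str]]:
    """Build the parameter list; extra holds disk-stack parameters
    such as volume, path, and fstype."""
    if not INSTANCE_ID_RE.fullmatch(instance_id):
        raise ValueError(f"unexpected instance id: {instance_id!r}")
    if not 1 <= len(instance_name) <= 50:
        raise ValueError("instance name must be 1 to 50 characters")
    values = {
        "instanceid": instance_id,
        "instancename": instance_name,
        "criticalsnsarn": critical_arn,
        "warningsnsarn": warning_arn,
        **extra,
    }
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in values.items()]


if __name__ == "__main__":
    # All values below are made-up examples.
    params = stack_parameters(
        "i-0123456789abcdef0", "SERVER01",
        "arn:aws:sns:us-east-1:123456789012:ec2-critical",
        "arn:aws:sns:us-east-1:123456789012:ec2-warning",
        volume="xvda1", path="/", fstype="ext4",
    )
    print(is_valid_stack_name("SERVER01"), len(params))
```

An instance name containing spaces or underscores would pass the length check but fail is_valid_stack_name, in which case you would need a different stack name.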
Finally, because each CloudFormation resource has "DeletionPolicy" set to "Retain", you can delete the stack once its status has changed to CREATE_COMPLETE. The CloudWatch alarms will be kept and will not be deleted with the stack.
CloudFormation template for Linux OS: CPU / Memory / StatusChecks
AWSTemplateFormatVersion: '2010-09-09'
Description: Linux CloudWatch Alarms - CPU Memory Instance and System Status
#------------------------------------------------------------------------------
Parameters:
#------------------------------------------------------------------------------
  instanceid:
    Description: "Choose an instance id"
    Type: AWS::EC2::Instance::Id
  instancename:
    Description: "Please provide EC2 instance name"
    Type: "String"
    MinLength: '1'
    MaxLength: '50'
  criticalsnsarn:
    Description: "Please provide an ARN of SNS topic - CRITICAL Type"
    Type: "String"
  warningsnsarn:
    Description: "Please provide an ARN of SNS topic - WARNING Type"
    Type: "String"
#------------------------------------------------------------------------------
Resources:
#------------------------------------------------------------------------------
  CPUAlarmWARNING:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - High CPU Usage 90%"
      AlarmActions:
        - !Ref warningsnsarn
      OKActions:
        - !Ref warningsnsarn
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '1'
      Threshold: '90'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
  CPUAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - High CPU Usage 95%"
      AlarmActions:
        - !Ref criticalsnsarn
      OKActions:
        - !Ref criticalsnsarn
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '2'
      Threshold: '95'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
  MemoryAlarmWARNING:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - High Memory Usage 90%"
      AlarmActions:
        - !Ref warningsnsarn
      OKActions:
        - !Ref warningsnsarn
      MetricName: "mem_used_percent"
      Namespace: CWAgent
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '1'
      Threshold: '90'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
  MemoryAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - High Memory Usage 95%"
      AlarmActions:
        - !Ref criticalsnsarn
      OKActions:
        - !Ref criticalsnsarn
      MetricName: "mem_used_percent"
      Namespace: CWAgent
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '2'
      Threshold: '95'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
  SystemStatusAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - instance recovery process has been triggered because of failed System Status Check"
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed_System
      Statistic: Minimum
      Period: '60'
      EvaluationPeriods: '2'
      ComparisonOperator: GreaterThanThreshold
      Threshold: '0'
      AlarmActions:
        - !Sub "arn:aws:automate:${AWS::Region}:ec2:recover"
        - !Ref warningsnsarn
      OKActions:
        - !Ref warningsnsarn
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
  InstanceStatusAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - Instance Status Check Failed - please investigate. Troubleshooting: https://goo.gl/Ea27Gd"
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed_Instance
      Statistic: Minimum
      Period: '60'
      EvaluationPeriods: '3'
      ComparisonOperator: GreaterThanThreshold
      Threshold: '0'
      AlarmActions:
        - !Ref criticalsnsarn
      OKActions:
        - !Ref criticalsnsarn
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
CloudFormation template for Linux OS: DiskSpace Alarms
AWSTemplateFormatVersion: '2010-09-09'
Description: Linux CloudWatch Diskspace Alarms
#------------------------------------------------------------------------------
Parameters:
#------------------------------------------------------------------------------
  instanceid:
    Description: "Choose an instance id"
    Type: AWS::EC2::Instance::Id
  instancename:
    Description: "Please provide EC2 instance name"
    Type: "String"
    MinLength: '1'
    MaxLength: '50'
  criticalsnsarn:
    Description: "Please provide an ARN of SNS topic - CRITICAL Type"
    Type: "String"
  warningsnsarn:
    Description: "Please provide an ARN of SNS topic - WARNING Type"
    Type: "String"
  volume:
    Description: "Provide disk's/folder's name (ex.: xvda1)"
    Type: "String"
    Default: "xvda1"
  path:
    Description: "Provide path"
    Type: "String"
    Default: "/"
  fstype:
    Description: "Choose fstype - ext4 or xfs -> Ubuntu and AmazonLinux use ext4, CentOS use xfs"
    Type: String
    AllowedValues:
      - ext4
      - xfs
      - btrfs
    ConstraintDescription: You must specify ext4, xfs, or btrfs.
#-------------------------------------------------------------------------------
Resources:
#-------------------------------------------------------------------------------
  DiskSpaceAlarmWARNING:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - over 90% of ${volume} volume space is in use"
      AlarmActions:
        - !Ref warningsnsarn
      OKActions:
        - !Ref warningsnsarn
      MetricName: "disk_used_percent"
      Namespace: CWAgent
      Statistic: Average
      Period: '300'
      EvaluationPeriods: '1'
      Threshold: '90'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
        - Name: device
          Value: !Ref volume
        - Name: path
          Value: !Ref path
        - Name: fstype
          Value: !Ref fstype
#-------------------------------------------------------------------------------
  DiskSpaceAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - over 95% of ${volume} volume space is in use"
      AlarmActions:
        - !Ref criticalsnsarn
      OKActions:
        - !Ref criticalsnsarn
      MetricName: "disk_used_percent"
      Namespace: CWAgent
      Statistic: Average
      Period: '300'
      EvaluationPeriods: '1'
      Threshold: '95'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
        - Name: device
          Value: !Ref volume
        - Name: path
          Value: !Ref path
        - Name: fstype
          Value: !Ref fstype
#-------------------------------------------------------------------------------
CloudFormation template for Windows OS: CPU / Memory / StatusChecks
AWSTemplateFormatVersion: '2010-09-09'
Description: Windows CloudWatch Alarms - CPU Memory Instance and System Status
#------------------------------------------------------------------------------
Parameters:
#------------------------------------------------------------------------------
  instanceid:
    Description: "Choose an instance id"
    Type: AWS::EC2::Instance::Id
  instancename:
    Description: "Please provide EC2 instance name"
    Type: "String"
    MinLength: '1'
    MaxLength: '50'
  criticalsnsarn:
    Description: "Please provide an ARN of SNS topic - CRITICAL Type"
    Type: "String"
  warningsnsarn:
    Description: "Please provide an ARN of SNS topic - WARNING Type"
    Type: "String"
#------------------------------------------------------------------------------
Resources:
#------------------------------------------------------------------------------
  CPUAlarmWARNING:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - High CPU Usage 90%"
      AlarmActions:
        - !Ref warningsnsarn
      OKActions:
        - !Ref warningsnsarn
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '1'
      Threshold: '90'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
  CPUAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - High CPU Usage 95%"
      AlarmActions:
        - !Ref criticalsnsarn
      OKActions:
        - !Ref criticalsnsarn
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '2'
      Threshold: '95'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
  MemoryAlarmWARNING:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - High Memory Usage 90%"
      AlarmActions:
        - !Ref warningsnsarn
      OKActions:
        - !Ref warningsnsarn
      MetricName: "Memory % Committed Bytes In Use"
      Namespace: CWAgent
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '1'
      Threshold: '90'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
        - Name: objectname
          Value: Memory
#------------------------------------------------------------------------------
  MemoryAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - High Memory Usage 95%"
      AlarmActions:
        - !Ref criticalsnsarn
      OKActions:
        - !Ref criticalsnsarn
      MetricName: "Memory % Committed Bytes In Use"
      Namespace: CWAgent
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '2'
      Threshold: '95'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
        - Name: objectname
          Value: Memory
#------------------------------------------------------------------------------
  SystemStatusAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - instance recovery process has been triggered because of failed System Status Check"
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed_System
      Statistic: Minimum
      Period: '60'
      EvaluationPeriods: '2'
      ComparisonOperator: GreaterThanThreshold
      Threshold: '0'
      AlarmActions:
        - !Sub "arn:aws:automate:${AWS::Region}:ec2:recover"
        - !Ref warningsnsarn
      OKActions:
        - !Ref warningsnsarn
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
  InstanceStatusAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - Instance Status Check Failed - please investigate. Troubleshooting: https://goo.gl/Ea27Gd"
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed_Instance
      Statistic: Minimum
      Period: '60'
      EvaluationPeriods: '3'
      ComparisonOperator: GreaterThanThreshold
      Threshold: '0'
      AlarmActions:
        - !Ref criticalsnsarn
      OKActions:
        - !Ref criticalsnsarn
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
CloudFormation template for Windows OS: DiskSpace Alarms
AWSTemplateFormatVersion: '2010-09-09'
Description: Windows CloudWatch Diskspace Alarms
#-------------------------------------------------------------------------------
Parameters:
#-------------------------------------------------------------------------------
  instanceid:
    Description: "Choose an instance id"
    Type: AWS::EC2::Instance::Id
  instancename:
    Description: "Please provide EC2 instance name"
    Type: "String"
    MinLength: '1'
    MaxLength: '50'
  criticalsnsarn:
    Description: "Please provide an ARN of SNS topic - CRITICAL Type"
    Type: "String"
  warningsnsarn:
    Description: "Please provide an ARN of SNS topic - WARNING Type"
    Type: "String"
  volume:
    Description: "Provide Disk name (ex.: C:)"
    Type: "String"
    Default: "C:"
    MinLength: '1'
    MaxLength: '5'
#-------------------------------------------------------------------------------
Resources:
#-------------------------------------------------------------------------------
  DiskSpaceWARNING:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - over 90% of ${volume} Drive space is in use"
      AlarmActions:
        - !Ref warningsnsarn
      OKActions:
        - !Ref warningsnsarn
      MetricName: "LogicalDisk % Free Space"
      Namespace: CWAgent
      Statistic: Average
      Period: '300'
      EvaluationPeriods: '1'
      Threshold: '10'
      ComparisonOperator: LessThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
        - Name: instance
          Value: !Ref volume
        - Name: objectname
          Value: LogicalDisk
#-------------------------------------------------------------------------------
  DiskSpaceCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - over 95% of ${volume} Drive space is in use"
      AlarmActions:
        - !Ref criticalsnsarn
      OKActions:
        - !Ref criticalsnsarn
      MetricName: "LogicalDisk % Free Space"
      Namespace: CWAgent
      Statistic: Average
      Period: '300'
      EvaluationPeriods: '1'
      Threshold: '5'
      ComparisonOperator: LessThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
        - Name: instance
          Value: !Ref volume
        - Name: objectname
          Value: LogicalDisk
#-------------------------------------------------------------------------------
In my opinion, it is always good to have StatusCheck CloudWatch alarms because they let you monitor instance health. In addition, the SystemStatusCheck alarm in the CloudFormation templates is configured so that if the system status check fails and the alarm enters the ALARM state, which usually means there is an issue at the AWS hypervisor level or below, the alarm triggers the EC2 recover action. This action recovers an unhealthy instance in much the same way as running EC2 stop and start commands. As a result, in most cases, the instance is migrated to a new underlying host when it starts.
Thank you for your time. I hope all the information was clear and that you did not have any issues setting up CloudWatch. Finally, I would like to share some URLs that might be helpful if you want to customize or add something. Good luck!
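The behaviour of the system status alarm can be approximated in a few lines. The sketch below is a simplified model, not how CloudWatch is implemented: it assumes one datapoint per 60-second period, ignores missing-data handling, and treats the alarm as breaching only when the most recent periods all exceed the threshold. The helper names are hypothetical; the action ARN format is the one used in the templates.

```python
from typing import Sequence


def recover_action_arn(region: str) -> str:
    """ARN of the EC2 recover action for a region, as in the templates."""
    return f"arn:aws:automate:{region}:ec2:recover"


def is_breaching(datapoints: Sequence[float],
                 threshold: float = 0,
                 evaluation_periods: int = 2) -> bool:
    """True when the last evaluation_periods datapoints are all greater
    than threshold (the GreaterThanThreshold comparison)."""
    if len(datapoints) < evaluation_periods:
        return False
    recent = datapoints[-evaluation_periods:]
    return all(value > threshold for value in recent)


if __name__ == "__main__":
    # StatusCheckFailed_System is 0 when healthy and 1 when the check fails.
    history = [0, 0, 1, 1]
    if is_breaching(history):
        print("would trigger", recover_action_arn("us-east-1"))
```

With EvaluationPeriods set to 2 and a 60-second period, a single failed check does not trigger recovery; the check has to fail for two consecutive minutes.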
- CloudWatch Agent configuration file creation: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create-cloudwatch-agent-configuration-file.html
- Modify CloudWatch Agent download link, depending on your architecture and platform: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-first-instance.html
- Modify CloudWatch Alarms settings in Cloudformation template: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-cw-alarm.html