Automatic Backup Validation with AWS Backup, AWS Lambda and Amazon EventBridge
Whilst the need to back up essential data is widely understood, backups are only useful if they: (i) can be successfully restored, (ii) can be restored within the required time, and (iii) contain the data they are believed to contain.
The only way to know whether backups are fit for purpose is to test them. This can be done manually, but it is laborious and time-consuming, and it depends on people with the required expertise being available. When there are many backups to verify, possibly every day, this quickly becomes a significant burden.
This article presents a framework for performing the automatic validation of backups in an AWS (Amazon Web Services) environment. After the completion of a scheduled or on-demand backup, the framework restores it, tests data within it, logs summary data such as the restore duration and the data validity, and raises alerts for any issues needing attention. The framework is illustrated as follows:
Two AWS accounts are used: a Production account and a Central Backup account. This separation gives the backups a high degree of isolation, which helps ensure their availability in the event of accidental or malicious activity in the Production account. In practice there may be many accounts that use the Central Backup account as a backup destination.
The data to be backed up in this example is the database of an RDS (Relational Database Service) instance, and AWS Backup is used to perform the backups. AWS Backup is a fully managed service that simplifies data protection at scale and centralises both the backup plans and the backups themselves. A single plan can span many AWS resource types, e.g. RDS, EC2, EBS and EFS.
For this framework, a scheduled AWS Backup plan is created in the Production account that automatically copies each backup to a backup vault in the Central Backup account as part of the backup process.
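As a rough illustration of such a plan, the sketch below builds the plan document that boto3's `create_backup_plan` accepts, with a copy action pointing at the central vault. The plan name, vault names, account ID, hourly schedule and retention period are all hypothetical values chosen for the example; the article does not state the real ones.

```python
def build_backup_plan(plan_name, source_vault, central_vault_arn,
                      schedule="cron(0 * * * ? *)", retain_days=7):
    """Return a BackupPlan document with a cross-account copy action.

    All argument values are illustrative assumptions, not the article's
    actual configuration.
    """
    return {
        "BackupPlanName": plan_name,
        "Rules": [
            {
                "RuleName": f"{plan_name}-rule",
                # Vault in the Production account that receives the backup.
                "TargetBackupVaultName": source_vault,
                "ScheduleExpression": schedule,
                "Lifecycle": {"DeleteAfterDays": retain_days},
                # Copy each recovery point to the Central Backup vault.
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": central_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": retain_days},
                    }
                ],
            }
        ],
    }


plan = build_backup_plan(
    "prod-rds-hourly",
    "prod-vault",
    "arn:aws:backup:eu-west-1:111122223333:backup-vault:central-vault",
)
# With AWS credentials available, the document could be submitted with:
#   boto3.client("backup").create_backup_plan(BackupPlan=plan)
```

Resources would still need to be assigned to the plan (a backup selection), and the central vault's access policy would need to allow copies from the Production account.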
The infrastructure used to automatically validate the backups comprises AWS Backup, AWS Lambda and Amazon EventBridge. This infrastructure is serverless, so no server instances need to be provisioned, paid for or managed in order to run it. The high-level sequence of events that takes place for each completed Production backup is as follows:
The summary data written to S3 can be used by ad-hoc queries or reports that run Amazon Athena SQL queries over the data. The same summary data written to CloudWatch Logs can be analysed via CloudWatch Logs Insights queries. These queries can be ad-hoc or can be incorporated into widgets in CloudWatch Dashboards.
CloudWatch alarms monitor framework infrastructure metrics and also the summary data written to CloudWatch Logs. Alarm notifications can be sent to an SNS topic (e.g. for email/SMS alerts) or to any other supported destination.
Framework Output
Once the framework has validated a backup it produces summary data in this format:
"recovery_point_arn" identifies the backup that was restored and "restored_instance_id" identifies the database instance it was restored to. "restore_start_time" and "restore_end_time" record when the restore started and finished. "restore_duration" is the time taken (in minutes) to perform the restore and "age_latest_db_tx" is the result (in minutes) of an example test that was performed on the restored data.
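To make the fields concrete, here is a minimal sketch of how such a record might be assembled. It assumes a flat JSON object and ISO 8601 timestamps, neither of which is confirmed by the article; the ARN and instance identifier are placeholders.

```python
import json
from datetime import datetime, timezone


def build_summary(recovery_point_arn, restored_instance_id,
                  restore_start, restore_end, age_latest_db_tx):
    """Assemble a validation summary using the field names described above."""
    duration = round((restore_end - restore_start).total_seconds() / 60)
    return {
        "recovery_point_arn": recovery_point_arn,
        "restored_instance_id": restored_instance_id,
        "restore_start_time": restore_start.isoformat(),  # assumed format
        "restore_end_time": restore_end.isoformat(),      # assumed format
        "restore_duration": duration,                     # minutes
        "age_latest_db_tx": age_latest_db_tx,             # minutes
    }


summary = build_summary(
    "arn:aws:backup:eu-west-1:111122223333:recovery-point:EXAMPLE",
    "restored-db-example",
    datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 10, 25, tzinfo=timezone.utc),
    age_latest_db_tx=62,
)
# One JSON object per line suits both an S3 object and a CloudWatch log event.
line = json.dumps(summary)
```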
The example test queries an ‘events’ table in the restored database to find the most recent event timestamp, compares it with ‘now’ and reports the difference. For this example, event data in the restored database is expected to be no more than 80 minutes old (which includes the time taken to back up, copy and restore the database).
The tests required in any given situation depend on the nature of the data being backed up. The testing is implemented within the framework’s Lambda function and so can be easily customised.
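The age calculation behind this test can be sketched as follows. Only the ‘events’ table and the 80-minute threshold come from the text; the `event_time` column name and the function names are hypothetical, and the database query itself is shown only as a string.

```python
from datetime import datetime, timezone

MAX_AGE_MINUTES = 80  # threshold quoted in the article

# Hypothetical query; the real timestamp column name is not given.
LATEST_EVENT_SQL = "SELECT MAX(event_time) FROM events"


def age_in_minutes(latest_event, now):
    """Whole minutes between the newest event and 'now'."""
    return int((now - latest_event).total_seconds() // 60)


def is_fresh(latest_event, now, max_age=MAX_AGE_MINUTES):
    """True when the restored data is recent enough to pass the test."""
    return age_in_minutes(latest_event, now) <= max_age


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
recent = datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)
stale = datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc)
```

In this sketch `recent` is 60 minutes old and passes, while `stale` is 90 minutes old and fails.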
Lambda Function
The Lambda function is written in Python 3.9 and is available in my GitHub repository here. It handles the “Copy Job Complete” event and the “Restore Job Complete” event, where the latter includes the data-specific test(s) to be carried out on the restored backup.
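The two-event structure can be pictured as a small dispatcher. This is not the code from the repository, just a sketch. The `detail-type` strings are the ones AWS Backup publishes to EventBridge, but the field that carries the job state (`state` or `status`) is an assumption that should be checked against real events, so the sketch accepts either.

```python
COPY_DETAIL_TYPE = "Copy Job State Change"
RESTORE_DETAIL_TYPE = "Restore Job State Change"


def route_event(event):
    """Decide which step of the validation flow an EventBridge event triggers."""
    detail = event.get("detail", {})
    # Assumption: the state is reported under "state" or "status".
    job_state = detail.get("state") or detail.get("status")
    if job_state != "COMPLETED":
        return "ignore"
    detail_type = event.get("detail-type")
    if detail_type == COPY_DETAIL_TYPE:
        return "start_restore"      # a new copy arrived in the central vault
    if detail_type == RESTORE_DETAIL_TYPE:
        return "validate_restore"   # run the data tests, write the summary
    return "ignore"


def lambda_handler(event, context):
    # Real handlers would call AWS APIs here; this sketch only reports the route.
    return {"action": route_event(event)}
```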
It writes trace output to CloudWatch Logs to aid with any troubleshooting required, where the level of detail can be changed by adjusting the Python logging level.
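One common way to make the logging level adjustable without redeploying code is to read it from an environment variable. The sketch below assumes a hypothetical `LOG_LEVEL` variable and logger name; the actual function may configure this differently.

```python
import logging
import os


def configure_logger(name="backup_validation"):
    """Create a logger whose level comes from the LOG_LEVEL environment variable."""
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, logging.INFO)
    if not isinstance(level, int):
        level = logging.INFO  # fall back if the name is not a logging level
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


os.environ["LOG_LEVEL"] = "debug"
logger = configure_logger()
logger.debug("restore job started")  # emitted only at DEBUG level
```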
Monitoring - CloudWatch Dashboards
A monitoring dashboard has been created in CloudWatch in the Central Backup account:
The top row of metrics and the alarms widget to the right give confidence that the framework is working as expected, where the latter also includes alarms for excessive restore durations and failed data validation tests. The main table shows selected data from the latest backup validation results (each row in the table can be expanded to see the full summary data, e.g. including the restore point ARN).
This is a simple dashboard that aggregates the results for all backups received into the Central Backup backup vault. In practice a number of dashboards may be required which partition results by individual backup source instance or by a subset of backup source instances.
Monitoring - CloudWatch Alarms
The rightmost alarm status widget in the dashboard above shows the five alarms that have been created; these cover both the backup validation infrastructure and the contents of the backup validation results.
The top two alarms are defined using CloudWatch metric filters created over the backup validation summary data written to CloudWatch Logs.
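A metric filter of this kind could be defined along the following lines. The sketch builds the arguments for the CloudWatch Logs `put_metric_filter` call and assumes the summary data is logged as JSON so that a pattern such as `{ $.age_latest_db_tx > 80 }` can match on a field. The log group, metric name and namespace are hypothetical.

```python
def metric_filter_kwargs(log_group, field, threshold, metric_name,
                         namespace="BackupValidation"):
    """Arguments for put_metric_filter that count records breaching a threshold."""
    return {
        "logGroupName": log_group,
        "filterName": f"{metric_name}-filter",
        # JSON filter pattern, e.g. { $.age_latest_db_tx > 80 }
        "filterPattern": f"{{ $.{field} > {threshold} }}",
        "metricTransformations": [
            {
                "metricName": metric_name,
                "metricNamespace": namespace,
                "metricValue": "1",   # emit 1 per matching log event
                "defaultValue": 0.0,  # emit 0 when nothing matches
            }
        ],
    }


kwargs = metric_filter_kwargs(
    "/aws/lambda/backup-validation",  # hypothetical log group
    "age_latest_db_tx", 80, "StaleBackupData",
)
# With AWS credentials available:
#   boto3.client("logs").put_metric_filter(**kwargs)
```

An alarm on the resulting metric (for example, sum greater than zero over one period) would then raise the notification.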
All alarms send a notification to an Amazon SNS topic, which can trigger email or SMS alerts and feed notifications into other systems.
Reporting - Athena
Ad-hoc queries and reports can be run on the validation summary data written to S3, using SQL queries in Athena, e.g.:
A database and table schema were created in the AWS Glue catalog to enable this.
In addition to running queries within the Athena console (above), third-party reporting and business intelligence tools can connect to Athena via its ODBC and JDBC drivers.
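Queries can also be submitted programmatically. The sketch below builds the arguments for Athena's `start_query_execution` call to list validations whose data was older than a threshold. The database, table and S3 output location are hypothetical names, and the column names assume the Glue table mirrors the summary fields described earlier.

```python
def athena_query_kwargs(database, table, max_age_minutes, output_location):
    """Arguments for start_query_execution listing stale validation results."""
    query = (
        "SELECT recovery_point_arn, restored_instance_id, "
        "restore_duration, age_latest_db_tx "
        f"FROM {table} "
        f"WHERE age_latest_db_tx > {int(max_age_minutes)} "
        "ORDER BY restore_end_time DESC"
    )
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


athena_kwargs = athena_query_kwargs(
    "backup_validation",            # hypothetical Glue database
    "validation_results",           # hypothetical table
    80,
    "s3://example-athena-results/", # hypothetical bucket
)
# With AWS credentials available:
#   boto3.client("athena").start_query_execution(**athena_kwargs)
```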
Reporting - CloudWatch Logs Insights
Ad-hoc queries can also be run on the validation summary data written to CloudWatch Logs using CloudWatch Logs Insights, e.g.:
Such queries can optionally be used to drive widgets within CloudWatch Dashboards.
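As an illustration, a Logs Insights query over the last 24 hours of results might look like the sketch below, which builds the arguments for the CloudWatch Logs `start_query` call. It assumes the summary records are logged as JSON (so the fields are discovered automatically) and uses a hypothetical log group name.

```python
from datetime import datetime, timedelta, timezone


def insights_query_kwargs(log_group, now, hours=24, limit=20):
    """Arguments for start_query returning the most recent validation results."""
    query = (
        "fields @timestamp, restored_instance_id, restore_duration, age_latest_db_tx "
        "| sort @timestamp desc "
        f"| limit {int(limit)}"
    )
    start = now - timedelta(hours=hours)
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(now.timestamp()),
        "queryString": query,
    }


insights_kwargs = insights_query_kwargs(
    "/aws/lambda/backup-validation",  # hypothetical log group
    datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc),
)
# With AWS credentials available:
#   boto3.client("logs").start_query(**insights_kwargs)
```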
Summary
Backups need testing to ensure they are fit for purpose. A framework has been presented that automates backup validation and outputs the validation results in an easily consumable form. Both the framework infrastructure and the backup validation results are automatically monitored, so that operational personnel are notified of any problems.
The framework is cost-effective because it is automated and thus minimises the need for human involvement. Also, because it is serverless, no server instances need to be provisioned, paid for or managed to run it.
Automation also means that backups are validated more quickly and more consistently than if the process were carried out manually.