Automatic Backup Validation with AWS Backup, AWS Lambda and Amazon EventBridge

Andrew Smith

Published Mar 28, 2022

Whilst the need for backing up essential data is widely understood, backups are only useful if they: (i) can be successfully restored, (ii) can be restored within any required time duration, and (iii) contain the data they are believed to contain.

The only way to know whether backups are fit for purpose is to test them. This can be done manually, but it is a laborious, time-consuming process that depends on people with the required expertise being available. When there are many backups to verify, possibly every day, this quickly becomes a significant burden.

This article presents a framework for performing the automatic validation of backups in an AWS (Amazon Web Services) environment. After the completion of a scheduled or on-demand backup, the framework restores it, tests data within it, logs summary data such as the restore duration and the data validity, and raises alerts for any issues needing attention. The framework is illustrated as follows:

Two AWS accounts are used - a Production account and a Central Backup account. This separation allows for a high degree of isolation of the backups, which helps ensure their availability in the case of accidental or malicious activity in the Production account. In practice there may be many accounts that use the Central Backup account as a backup destination.

The data to be backed up in this example is the database of an RDS (Relational Database Service) instance, and AWS Backup is used to perform the backups. AWS Backup is a fully managed service that simplifies data protection at scale and centralises both the backup plans and the backups themselves. A single plan can span many AWS resource types, e.g. RDS, EC2, EBS and EFS.

For this framework, a scheduled AWS Backup plan is created in the Production account that automatically copies each backup to a backup vault in the Central Backup account as part of the backup process.
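As a rough illustration, a plan of this kind could be described to AWS Backup's CreateBackupPlan API with a document like the one built below. This is a minimal sketch, not the plan used for this framework: the plan name, vault names, account ID, hourly schedule and seven-day retention are all hypothetical.

```python
def build_backup_plan(plan_name, local_vault, central_vault_arn,
                      schedule="cron(0 * * * ? *)", retention_days=7):
    """Return a BackupPlan document with a cross-account copy action.

    All argument values are illustrative assumptions.
    """
    return {
        "BackupPlanName": plan_name,
        "Rules": [
            {
                "RuleName": f"{plan_name}-rule",
                # Vault in the Production account that receives the backup.
                "TargetBackupVaultName": local_vault,
                # Hypothetical schedule: once an hour, on the hour.
                "ScheduleExpression": schedule,
                "Lifecycle": {"DeleteAfterDays": retention_days},
                # Copy every recovery point to the Central Backup vault.
                "CopyActions": [
                    {"DestinationBackupVaultArn": central_vault_arn}
                ],
            }
        ],
    }


plan = build_backup_plan(
    "prod-rds",
    "prod-vault",
    "arn:aws:backup:eu-west-1:222222222222:backup-vault:central-vault",
)
```

The resulting dictionary would be passed as the BackupPlan argument to create_backup_plan on a boto3 "backup" client in the Production account. The destination vault also has to permit copies from the Production account, which is not shown here.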

The infrastructure used to automatically validate the backups comprises AWS Backup, AWS Lambda and Amazon EventBridge. This infrastructure is serverless, so no server instances need to be provisioned, paid for or managed in order to run it. The high-level sequence of events that takes place for each completed Production backup is as follows:

1. Once the backup has been copied to the backup vault in the Central Backup account, AWS Backup puts a “Copy Job Complete” event onto the EventBridge event bus in the Production account. This bus has a rule that matches these events and forwards them to an EventBridge event bus in the Central Backup account.
2. The Central Backup event bus has a rule that matches these events and invokes a Lambda function, which uses AWS Backup to restore the backup within the Central Backup account. Once the restore is complete, AWS Backup puts a “Restore Job Complete” event onto the local event bus.
3. The local event bus has a rule that matches these events and invokes a Lambda function to open the database and test representative data within it. Summary data about the restore operation and the testing of the restored data is written to Amazon S3 and to Amazon CloudWatch Logs. The restored RDS instance is then deleted.
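The rules in the steps above could be expressed as EventBridge event patterns along the following lines. This is a sketch under assumptions: the detail-type strings and the "state"/"status" field names are how I understand AWS Backup to publish job events, and should be checked against the AWS Backup documentation before use. The matches() helper is a hypothetical stand-in that implements only exact-value matching, a small subset of what EventBridge supports, so the patterns can be tried locally.

```python
# Assumed event shapes for AWS Backup job events (verify before relying on them).
COPY_COMPLETE_PATTERN = {
    "source": ["aws.backup"],
    "detail-type": ["Copy Job State Change"],
    "detail": {"state": ["COMPLETED"]},
}

RESTORE_COMPLETE_PATTERN = {
    "source": ["aws.backup"],
    "detail-type": ["Restore Job State Change"],
    "detail": {"status": ["COMPLETED"]},
}


def matches(pattern, event):
    """Return True if every field in the pattern matches the event.

    Pattern leaves are lists of allowed values; nested dicts are
    matched recursively. Only exact-value matching is implemented.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical event, trimmed to the fields the patterns inspect.
sample_event = {
    "source": "aws.backup",
    "detail-type": "Copy Job State Change",
    "detail": {"state": "COMPLETED"},
}
copy_rule_fires = matches(COPY_COMPLETE_PATTERN, sample_event)
restore_rule_fires = matches(RESTORE_COMPLETE_PATTERN, sample_event)
```

In the Production account the target of the copy rule would be the Central Backup event bus; in the Central Backup account the target of both rules would be the Lambda function.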

The summary data written to S3 can be queried with Amazon Athena SQL, either ad hoc or from reports. The same summary data written to CloudWatch Logs can be analysed via CloudWatch Logs Insights queries. These queries can be ad-hoc or can be incorporated into widgets in CloudWatch Dashboards.

CloudWatch alarms monitor framework infrastructure metrics and also the summary data written to CloudWatch Logs. Alarm notifications can be sent to an SNS topic (e.g. for email/SMS alerts) or to any other supported destination.

Framework Output

Once the framework has validated a backup it produces summary data in this format:

“recovery_point_arn” identifies the backup that was restored and “restored_instance_id” is the identifier of the restored database instance. “restore_start_time” and “restore_end_time” record when the restore started and finished. “restore_duration” is the time taken (in minutes) to perform the restore and “age_latest_db_tx” is the result (in minutes) of an example test that was performed on the restored data.
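To make the fields concrete, the following sketch assembles a record with these six field names and serialises it as a single JSON line. The field names come from the description above; everything else is an assumption made for illustration, including the ISO 8601 timestamp format, the rounding of the duration, and the made-up ARN and instance identifier.

```python
import json
from datetime import datetime, timezone


def build_summary(recovery_point_arn, restored_instance_id,
                  restore_start, restore_end, age_latest_db_tx):
    """Build one validation summary record as a plain dict."""
    # Restore duration in whole minutes (rounding is an assumption).
    duration_minutes = round((restore_end - restore_start).total_seconds() / 60)
    return {
        "recovery_point_arn": recovery_point_arn,
        "restored_instance_id": restored_instance_id,
        "restore_start_time": restore_start.isoformat(),
        "restore_end_time": restore_end.isoformat(),
        "restore_duration": duration_minutes,
        "age_latest_db_tx": age_latest_db_tx,
    }


summary = build_summary(
    "arn:aws:backup:eu-west-1:222222222222:recovery-point:EXAMPLE",  # made up
    "restore-test-db",                                               # made up
    datetime(2022, 3, 28, 9, 30, tzinfo=timezone.utc),
    datetime(2022, 3, 28, 9, 55, tzinfo=timezone.utc),
    45,
)
# One JSON object per line suits both an S3 object and a log entry.
print(json.dumps(summary))
```

Writing each record as one JSON object keeps it straightforward to query from both Athena and CloudWatch Logs Insights.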

The example test queries an ‘events’ table in the restored database to find the most recent event timestamp, compares it with ‘now’ and reports the difference. For this example it’s expected that event data in the restored database should be no more than 80 minutes old (which includes the time taken to back up, copy and restore the database).
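The logic of this test can be sketched as follows. To keep the example self-contained it runs against an in-memory SQLite database standing in for the restored RDS instance; the ‘event_time’ column name, the ISO 8601 text timestamps and the ‘passed’ flag are assumptions for illustration, not details taken from the actual function.

```python
import sqlite3
from datetime import datetime, timezone

MAX_AGE_MINUTES = 80  # threshold from the example above


def age_of_latest_event_minutes(conn, now):
    """Return the age in whole minutes of the newest row in 'events'."""
    (latest_text,) = conn.execute(
        "SELECT MAX(event_time) FROM events"
    ).fetchone()
    if latest_text is None:
        raise ValueError("events table is empty")
    latest = datetime.fromisoformat(latest_text)
    return int((now - latest).total_seconds() // 60)


def validate(conn, now):
    """Run the example test and report the age and whether it passed."""
    age = age_of_latest_event_minutes(conn, now)
    return {"age_latest_db_tx": age, "passed": age <= MAX_AGE_MINUTES}


# Stand-in for the restored database, with two hypothetical events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, event_time TEXT)")
conn.executemany(
    "INSERT INTO events (event_time) VALUES (?)",
    [("2022-03-28T08:30:00+00:00",), ("2022-03-28T09:15:00+00:00",)],
)
now = datetime(2022, 3, 28, 10, 0, tzinfo=timezone.utc)
print(validate(conn, now))  # {'age_latest_db_tx': 45, 'passed': True}
```

Against a real restored instance the connection would come from the appropriate database driver, and ‘now’ would be the current UTC time.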

The test(s) required for any given situation depend on the nature of the data being backed up. The testing is implemented within the framework’s Lambda function and so can be easily customised.

Lambda Function

The Lambda function is written in Python 3.9 and is available in my GitHub repository here. It handles the “Copy Job Complete” event and the “Restore Job Complete” event, where the latter includes the data-specific test(s) to be carried out on the restored backup.
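The overall shape of such a function might look like the sketch below; it is not the code from the repository. It assumes the two events arrive with the detail-type values “Copy Job State Change” and “Restore Job State Change”, that the copy event carries a destinationRecoveryPointArn field, and that the RDS restore metadata uses a DBInstanceIdentifier key. The vault name, role ARN, new instance identifier and the validate callback are hypothetical. The backup client is passed in by the caller (for example a boto3 “backup” client) so the sketch has no external dependencies.

```python
def start_restore(event, backup_client, vault_name, role_arn, new_instance_id):
    """Start a restore of the recovery point named in a copy-complete event."""
    # Assumed field name in the copy job event.
    arn = event["detail"]["destinationRecoveryPointArn"]
    metadata = backup_client.get_recovery_point_restore_metadata(
        BackupVaultName=vault_name, RecoveryPointArn=arn
    )["RestoreMetadata"]
    # Restore under a new identifier (assumed metadata key for RDS).
    metadata["DBInstanceIdentifier"] = new_instance_id
    job = backup_client.start_restore_job(
        RecoveryPointArn=arn,
        Metadata=metadata,
        IamRoleArn=role_arn,
        ResourceType="RDS",
    )
    return job["RestoreJobId"]


def handle_event(event, backup_client, vault_name, role_arn,
                 new_instance_id, validate):
    """Route an EventBridge event to the restore or validation step."""
    detail_type = event.get("detail-type")
    if detail_type == "Copy Job State Change":
        return start_restore(event, backup_client, vault_name,
                             role_arn, new_instance_id)
    if detail_type == "Restore Job State Change":
        # 'validate' runs the data-specific test(s) and reports the results.
        return validate(event)
    raise ValueError(f"unexpected event type: {detail_type!r}")
```

A Lambda entry point would create the client once at module level and call handle_event from its handler, passing the configuration values in from environment variables.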

It writes trace output to CloudWatch Logs to aid with any troubleshooting required, where the level of detail can be changed by adjusting the Python logging level.
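One common way to make the logging level adjustable without a code change is to read it from an environment variable, as sketched below. The variable name LOG_LEVEL and the logger name are assumptions; the actual function may configure logging differently.

```python
import logging
import os


def configure_logging(default="INFO"):
    """Return a logger whose level is taken from the LOG_LEVEL variable."""
    # Hypothetical variable name; accepts e.g. DEBUG, INFO, WARNING.
    level_name = os.environ.get("LOG_LEVEL", default).upper()
    logger = logging.getLogger("backup_validation")
    logger.setLevel(level_name)  # raises ValueError for an unknown name
    return logger


logger = configure_logging()
logger.debug("full event payload would be logged here")
logger.info("restore job started")
```

Setting the variable to DEBUG while troubleshooting, and back to INFO afterwards, changes the amount of trace output without redeploying the code.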

Monitoring - CloudWatch Dashboards

A monitoring dashboard has been created in CloudWatch in the Central Backup account:

The top row of metrics and the alarms widget to the right give confidence that the framework is working as expected; the alarms widget also includes alarms for excessive restore durations and failed data validation tests. The main table shows selected data from the latest backup validation results (each row in the table can be expanded to see the full summary data, e.g. including the recovery point ARN).

This is a simple dashboard that aggregates the results for all backups received into the Central Backup backup vault. In practice a number of dashboards may be required which partition results by individual backup source instance or by a subset of backup source instances.

Monitoring - CloudWatch Alarms

The rightmost alarm status widget in the dashboard above shows the five alarms that have been created; these cover both the backup validation infrastructure and the contents of the backup validation results.

“age-latest-db-tx” triggers when the data validation test on a restored backup fails
“restore-duration” triggers when the time taken to restore a backup exceeds a defined threshold
“lambda-errors” triggers if the framework’s Lambda function raises an exception
“missing-lambda” triggers if there have been no Lambda invocations within a defined time period
“missing-restores” triggers if there have been no restores within a defined time period

The top two alarms are defined using CloudWatch metric filters created over the backup validation summary data written to CloudWatch Logs.
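A metric filter and alarm of this kind could be defined with parameters like those built below, assuming the summary data is logged as JSON with the field names described earlier. The namespace, metric names, the hourly period and the 30-minute restore threshold are hypothetical; the 80-minute threshold is the one from the example test.

```python
def metric_filter_params(log_group, field, metric_name,
                         namespace="BackupValidation"):
    """Parameters for CloudWatch Logs put_metric_filter.

    Publishes the numeric value of one JSON field as a metric.
    """
    return {
        "logGroupName": log_group,
        "filterName": metric_name,
        # Matches JSON log events where the field is a non-negative number.
        "filterPattern": f"{{ $.{field} >= 0 }}",
        "metricTransformations": [
            {
                "metricName": metric_name,
                "metricNamespace": namespace,
                "metricValue": f"$.{field}",
            }
        ],
    }


def threshold_alarm_params(alarm_name, metric_name, threshold,
                           namespace="BackupValidation"):
    """Parameters for CloudWatch put_metric_alarm on the filtered metric."""
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Maximum",
        "Period": 3600,           # hypothetical evaluation window (seconds)
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


age_filter = metric_filter_params("/backup-validation", "age_latest_db_tx",
                                  "AgeLatestDbTx")
age_alarm = threshold_alarm_params("age-latest-db-tx", "AgeLatestDbTx", 80)
duration_alarm = threshold_alarm_params("restore-duration",
                                        "RestoreDuration", 30)
```

The dictionaries would be unpacked into put_metric_filter on a boto3 “logs” client and put_metric_alarm on a “cloudwatch” client respectively.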

All alarms send a notification to an Amazon SNS topic, from where email or SMS alerts can be triggered as well as notification feeds into other systems.
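The “missing-lambda” alarm is an example of alarming on the absence of activity. A sketch of how it could be defined follows: it watches the standard Lambda Invocations metric and treats missing data as breaching, so that a period with no invocations at all still triggers the alarm. The function name, topic ARN and one-day period are hypothetical.

```python
def missing_invocations_alarm(function_name, topic_arn, period_seconds=86400):
    """Parameters for put_metric_alarm that fire when a function is not invoked."""
    return {
        "AlarmName": "missing-lambda",
        "Namespace": "AWS/Lambda",
        "MetricName": "Invocations",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        # Fewer than one invocation in the period means none at all.
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        # No datapoints is exactly the failure being detected.
        "TreatMissingData": "breaching",
        # Notify the SNS topic when the alarm goes into ALARM state.
        "AlarmActions": [topic_arn],
    }


alarm = missing_invocations_alarm(
    "backup-validation",                                     # made up
    "arn:aws:sns:eu-west-1:222222222222:backup-validation",  # made up
)
```

The period should be chosen to comfortably exceed the interval between scheduled backups, otherwise the alarm will fire between normal runs.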

Reporting - Athena

Ad-hoc queries and reports can be run on the validation summary data written to S3, using SQL queries in Athena, e.g.:

A database and table schema were created in the AWS Glue catalog to enable this.
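As an illustration of the kind of query that can be run, the sketch below submits a SQL statement over the summary fields through the Athena API. The database name, table name and results location are hypothetical, and the Athena client is supplied by the caller (for example a boto3 “athena” client).

```python
# Hypothetical table name; columns are the summary fields described earlier.
RECENT_RESULTS_SQL = """
SELECT restored_instance_id,
       restore_start_time,
       restore_duration,
       age_latest_db_tx
FROM validation_results
ORDER BY restore_start_time DESC
LIMIT 20
"""


def run_query(athena_client, sql, database, output_location):
    """Submit a query to Athena and return its execution id."""
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Athena runs queries asynchronously, so a caller would poll get_query_execution with the returned id and then fetch rows with get_query_results.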

In addition to running queries within the Athena console (above), third-party reporting and business intelligence tools can connect to Athena via its ODBC and JDBC drivers.

Reporting - CloudWatch Logs Insights

Ad-hoc queries can also be run on the validation summary data written to CloudWatch Logs using CloudWatch Logs Insights, e.g.:

Such queries can optionally be used to drive widgets within CloudWatch Dashboards.
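A Logs Insights query over the summary data might look like the one in the sketch below, which lists the most recent validation results. It assumes the summary is logged as JSON so that its fields are discovered automatically; the log group name and the 24-hour window are hypothetical, and the logs client is supplied by the caller (for example a boto3 “logs” client).

```python
import time

# Most recent validation results, newest first.
RECENT_RESULTS_QUERY = (
    "fields @timestamp, restored_instance_id, restore_duration, age_latest_db_tx"
    " | filter ispresent(restore_duration)"
    " | sort @timestamp desc"
    " | limit 20"
)


def start_insights_query(logs_client, log_group, hours=24, now=None):
    """Start a Logs Insights query over the last `hours` and return its id."""
    end = int(now if now is not None else time.time())
    response = logs_client.start_query(
        logGroupName=log_group,
        startTime=end - hours * 3600,  # epoch seconds
        endTime=end,
        queryString=RECENT_RESULTS_QUERY,
    )
    return response["queryId"]
```

The same query string can be pasted into the Logs Insights console or used as the query of a dashboard log widget.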

Summary

Backups need testing to ensure they are fit for purpose. A framework has been presented that automates backup validation and outputs validation results in an easily consumable manner. Both the framework infrastructure and the backup validation results are automatically monitored, so that operational personnel are notified in case of problems.

The framework is cost effective because it is automated and thus minimises the need for human involvement. Also, because it is serverless, no server instances need to be provisioned, paid for or managed to run it.

The use of automation also results in backups being validated more quickly and more consistently than if the process were carried out manually.
