Serverless ETL using AWS Lambda
Data Warehouse and Business Intelligence systems require data to be transformed before it is published.
Most architectures contain heavy data integration and transformation processes that are implemented with ETL tools or Big Data pipelines.
Modern serverless technologies make it possible to build efficient and scalable processing frameworks without provisioning infrastructure. Billing is based on the number of function executions and the processing time.
Let's look at an architecture where all processing is done using AWS Lambda functions only.
This simplified architecture aims to load data from files into an RDBMS (e.g. Redshift). Event-based processing transforms and loads data as soon as it is available and immediately publishes it to Redshift, where it is consumed by BI tools.
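To make the event-based part concrete, here is a minimal sketch of a function that reacts to a new file. It assumes the function is triggered by an S3 event notification, which the article does not confirm; the names `extract_s3_objects` and `handler` are hypothetical. It only shows how the bucket and key of the arriving file could be read from the event, not the transformation itself.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs found in an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record.get("s3")
        if s3_info is None:
            continue  # not an S3 record, ignore it
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications (a space becomes '+').
        key = unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects


def handler(event, context):
    """Hypothetical Lambda entry point: reports which files triggered it."""
    targets = extract_s3_objects(event)
    # A real function would transform each file here; this sketch only lists them.
    return {"files": [f"s3://{bucket}/{key}" for bucket, key in targets]}
```

Keeping the event parsing in its own function makes it possible to test it locally with a hand-written event dictionary, without any AWS access.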
A single execution of an AWS Lambda function is limited to:
- 1.5 GB of RAM
- 0.5 GB temp space
- 5 min of maximum execution time (the function is terminated after 5 minutes)
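One way to live with the execution time limit is to stop before the function is terminated and hand the unfinished work to another invocation. The sketch below is an illustration under assumptions, not the project's actual code: `process_until_deadline`, `work` and the `safety_ms` margin are hypothetical, while `get_remaining_time_in_millis()` is the method the Lambda Python runtime provides on the context object.

```python
def process_until_deadline(items, context, work, safety_ms=10_000):
    """Process items until the remaining execution time drops below safety_ms.

    Returns (results, leftover); leftover holds the items that were not
    processed and could be passed to a follow-up invocation.
    """
    results = []
    for index, item in enumerate(items):
        # Check the time budget before starting each item.
        if context.get_remaining_time_in_millis() < safety_ms:
            return results, list(items[index:])
        results.append(work(item))
    return results, []
```

The margin has to be larger than the time a single item can take, otherwise the function may still be terminated in the middle of an item.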
These limits are a challenge and shape how the functions are constructed. Data files that arrive in the data ingestion S3 bucket are split into smaller files. Each file is then processed by a separate function. The results are collected in an intermediate S3 bucket, combined by the next set of functions, and stored in the data lake S3 bucket, which contains data ready for reporting. Afterwards, further functions trigger the data upload to the Redshift database.
Sounds familiar? It is similar to the paradigm used in classic MapReduce processing: split, process and combine. However, there is no need to have a Big Data cluster available to process the data.
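The split, process and combine steps can be sketched locally in plain Python. This is only an illustration of the pattern: in the architecture described above each step would run as separate Lambda functions exchanging files through S3, and the function names, the byte budget and the per-key sum used as the "transformation" are all assumptions made for the example.

```python
from collections import Counter


def split_lines(lines, max_bytes):
    """Split step: group lines into chunks of at most max_bytes (UTF-8).

    A single line larger than max_bytes still becomes a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for line in lines:
        line_size = len(line.encode("utf-8"))
        if current and size + line_size > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        chunks.append(current)
    return chunks


def process_chunk(lines):
    """Process step: turn one chunk of 'key,amount' lines into partial sums."""
    partial = Counter()
    for line in lines:
        key, amount = line.strip().split(",")
        partial[key] += int(amount)
    return partial


def combine(partials):
    """Combine step: merge the partial sums produced for every chunk."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values per key
    return dict(total)
```

Because every chunk is processed independently, the process step can run in parallel, one function per file, and only the combine step needs to see all partial results.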
Summary
Serverless architectures are the future of real-time systems, including BI applications. The TCO of such solutions is extremely low. However, implementing business logic requires more resources than in standard ETL tools.
This architecture can be ported to Microsoft Azure.
This article describes a selected aspect of the Lingaro Storm project.
Nice article, nice use case. I wonder whether using containers or even EC2 for the actual data processing would be a cheaper option. A lightweight Lambda function would then just start up execution instances rather than do the whole data processing, assuming this is not real-time data and some delay for spin-up is acceptable. Was it considered as an option?
Greg, good to see that you invested your time in serverless architecture. TCO is indeed significantly lower. How do you trigger the Lambda functions? Are you using SQS / SNS for that? By porting to Azure, do you mean using Azure Functions? That would be a rewrite rather than a port, right?