Serverless ETL using AWS Lambda

Grzegorz Tkaczyk

Published May 24, 2017

Data warehouse and business intelligence systems need data to be transformed before it can be published.

Most architectures contain heavy data integration and transformation processes, implemented with ETL tools or Big Data pipelines.

Modern serverless technologies make it possible to build efficient, scalable processing frameworks without provisioning any infrastructure. Billing is based on the number of function executions and on processing time, weighted by the memory allocated to the function.
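To make the billing model concrete, here is a rough cost estimate in Python. The two rates and the 100 ms rounding increment are assumptions for illustration, not figures from this article; check the current AWS price list before relying on them. The free tier is ignored.

```python
import math

# Assumed rates, for illustration only. Verify against current AWS pricing.
PRICE_PER_MILLION_REQUESTS = 0.20     # USD per 1,000,000 invocations (assumption)
PRICE_PER_GB_SECOND = 0.0000166667    # USD per GB-second of compute (assumption)
BILLING_INCREMENT_MS = 100            # assumed rounding granularity for duration


def estimate_cost(invocations, avg_duration_ms, memory_mb):
    """Estimate the cost in USD of a batch of Lambda invocations."""
    # Duration is rounded up to the next billing increment.
    billed_ms = math.ceil(avg_duration_ms / BILLING_INCREMENT_MS) * BILLING_INCREMENT_MS
    # Compute is charged per GB-second: duration times allocated memory.
    gb_seconds = invocations * (billed_ms / 1000) * (memory_mb / 1024)
    request_cost = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    compute_cost = gb_seconds * PRICE_PER_GB_SECOND
    return request_cost + compute_cost


if __name__ == "__main__":
    # One million one-second executions with 1 GB of memory.
    print(f"{estimate_cost(1_000_000, 1000, 1024):.2f} USD")
```

Because cost scales with both duration and memory, many short, small functions can be cheaper than a few long, large ones.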

Let's look at an architecture in which all processing is done by AWS Lambda functions.

This simplified architecture loads data from files into an RDBMS (e.g. Redshift). Event-based processing transforms and loads data as soon as it becomes available and publishes it immediately to Redshift, where BI tools consume it.
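As a sketch of the event-based entry point, the handler below assumes the function is subscribed to S3 object-created notifications. The article does not say how its functions are triggered, so this wiring is an assumption, and the transformation step is only a placeholder.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 notification event."""
    objects = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if s3 is None:
            continue  # not an S3 record, skip it
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become '+').
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects


def handler(event, context):
    """Hypothetical Lambda entry point: react to each newly arrived file."""
    objects = extract_s3_objects(event)
    for bucket, key in objects:
        # Placeholder: the real transformation and load would start here.
        print(f"new object: s3://{bucket}/{key}")
    return {"processed": len(objects)}
```

Keeping the event parsing in its own function makes it easy to test locally without any AWS resources.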

A single AWS Lambda function execution is subject to the following limits (as of this writing):

1.5 GB of RAM
0.5 GB of temporary disk space
5 minutes of maximum execution time (the function is terminated after 5 minutes)
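
One way to live with the execution time limit is to stop early and hand the unfinished work to another invocation. The sketch below uses the `get_remaining_time_in_millis()` method of the Lambda context object; the safety margin and the `process_item` callback are hypothetical, and how the remaining items get re-queued is left open.

```python
SAFETY_MARGIN_MS = 10_000  # assumed buffer for cleanup before the hard timeout


def process_with_deadline(items, context, process_item, margin_ms=SAFETY_MARGIN_MS):
    """Process items until time is nearly up.

    Returns (number_processed, remaining_items). `items` must be a sequence.
    """
    done = 0
    for index, item in enumerate(items):
        # Stop before the runtime terminates the function mid-item.
        if context.get_remaining_time_in_millis() < margin_ms:
            return done, list(items[index:])
        process_item(item)
        done += 1
    return done, []
```

The caller can pass the remaining items to a follow-up invocation, so no single execution has to finish the whole job.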

These limits are a challenge and shape how the functions are built. Data files that arrive in the data ingestion S3 bucket are split into smaller files, and each file is then processed by a separate function. The results are collected in an intermediate S3 bucket, combined by the next set of functions, and stored in the data lake S3 bucket, which holds data ready for reporting. Afterwards, further functions trigger the upload of the data to the Redshift database.
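The splitting step could look roughly like this. It is a minimal sketch that groups lines into chunks under a byte budget; the function name and the size limit are hypothetical, and writing each chunk back to S3 is omitted. CSV headers are not handled here and would need to be repeated in every chunk.

```python
def split_lines(lines, max_chunk_bytes):
    """Yield lists of lines whose total UTF-8 size stays within max_chunk_bytes.

    A single line larger than the limit is emitted as a chunk of its own.
    """
    chunk, size = [], 0
    for line in lines:
        line_size = len(line.encode("utf-8"))
        # Close the current chunk if this line would push it over the limit.
        if chunk and size + line_size > max_chunk_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += line_size
    if chunk:
        yield chunk


if __name__ == "__main__":
    rows = ["id,amount\n", "1,10\n", "2,20\n", "3,30\n"]
    for number, part in enumerate(split_lines(rows, 12)):
        print(number, part)
```

Choosing the chunk size is a trade-off: it must be small enough for one function to finish within the memory and time limits, but large enough to avoid paying for a flood of tiny invocations.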

Sounds familiar? It resembles the paradigm used in classic MapReduce processing: split, process and combine. However, no Big Data cluster is needed to process the data.
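
The process and combine steps can be illustrated locally in plain Python. This is only an analogy under assumed data: the (region, amount) rows and the sum-per-region aggregation are made up, and in the architecture above each `process_chunk` call would run in its own Lambda function, with partial results exchanged through S3.

```python
from collections import Counter


def process_chunk(rows):
    """'Process' step: aggregate one chunk independently of all others."""
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals


def combine(partials):
    """'Combine' step: merge the partial aggregates into the final result."""
    result = Counter()
    for partial in partials:
        result.update(partial)  # Counter.update adds values per key
    return dict(result)


if __name__ == "__main__":
    chunks = [[("EU", 10), ("US", 5)], [("EU", 1)]]
    print(combine(process_chunk(chunk) for chunk in chunks))
```

The pattern works because the aggregation is associative: partial sums can be merged in any order and still give the same total.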

Summary

Serverless architectures are the future of real-time systems, including BI applications. The TCO of such solutions is extremely low. However, implementing business logic requires more resources than with standard ETL tools.

This architecture can be ported to Microsoft Azure.

This article describes a selected aspect of the Lingaro Storm project.

Oleksandr Darchuk 8y

Nice article, nice use case. I wonder whether using containers or even EC2 for the actual data processing would be a cheaper option? That is, a lightweight Lambda function would just start up execution instances rather than do the whole data processing itself, assuming this is not real-time data and some spin-up delay is acceptable. Was that considered as an option?

Artur Kolazda 8y

Greg, good to see that you invested your time in serverless architecture. TCO is indeed significantly lower. How do you trigger Lambda? Are you using SQS / SNS for that? By porting to Azure, do you mean using Azure Functions? That would be a rewrite rather than a port, right?
