Demystifying the AWS Datalake
AWS provides a rich ecosystem of services for delivering data products. In fact, the number of services is so large that navigating the landscape can be daunting for an organisation.
One of the guiding principles of cloud-based Datalakes is the separation of compute resources from storage resources. In general, the storage options are the easier set to choose between:
Object Storage (S3)
Key-Value/Document Stores (DynamoDB, DocumentDB, OpenSearch)
Data Warehouses (Redshift)
The intended purpose of a Datalake is to act as a repository of all information across an organisation: to centralise data from many heterogeneous systems, to break down data silos, and to provide a central authority on what data is available and at what quality. A Datalake therefore needs maximum flexibility in how it stores data, and object storage is the typical choice. A common alternative is to centralise an organisation's most frequently accessed data in a Data Warehouse, with older data archived to object storage; integration between the warehouse (Redshift) and object storage (S3) via Redshift Spectrum then allows the archived data to be queried from within the warehouse.
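To make the flexibility of object storage concrete, a common convention is to lay data out under partitioned key prefixes so that query engines can prune by partition. The following is a minimal sketch; the zone and table names are illustrative, not prescribed by AWS:

```python
from datetime import date

def object_key(zone: str, table: str, run_date: date, part: int) -> str:
    """Build a partitioned object key of the kind commonly used on S3.

    The zone/table naming scheme here is a hypothetical convention,
    not an AWS requirement.
    """
    return (
        f"{zone}/{table}/"
        f"dt={run_date.isoformat()}/"
        f"part-{part:05d}.parquet"
    )

key = object_key("curated", "orders", date(2024, 1, 31), 0)
print(key)  # curated/orders/dt=2024-01-31/part-00000.parquet
```

Keys shaped this way let engines such as Athena or Redshift Spectrum skip whole date partitions when a query filters on `dt`.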
Less commonly or indirectly used storage options for Datalakes and data products:
Block storage (EBS)
While there are several storage choices for AWS data platforms, as noted, their use cases are clear and consensus around their usage has largely been established. How data is acted on (transformed, loaded, moved, cleaned, and enriched) remains a harder question to answer. In some cases it depends on the properties of the analysis the data is required for; in others, how the data is captured, and by whom, determines the tool choice; sometimes the volume of data is the primary decision factor, and at other times the velocity determines the mechanism. Regardless, there are many options for working with your data on the AWS cloud, and with the recent release of EMR Serverless and the continued evolution of Redshift, the range of options, and with it the difficulty of choosing, keeps growing.
Datalake compute is typically separated into two tiers: orchestration and transformation.
Orchestration defines the schedules, orders, dependencies, and dispatch of how data moves through the Datalake. Typically this is where retry and recovery are managed. Since this component is the centralised driver of activity within the Datalake, it is generally separated and isolated from the systems that implement the transform. The orchestration machinery is also responsible for updating catalogues, maintaining data lineage definitions, updating freshness reporting tools and other metadata management tasks.
Transformation, on the other hand, concerns itself with the low level operations on the data itself: renaming and restructuring data, joining tables together, converting formats, filtering and aggregations.
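The low-level operations listed above (renaming, format conversion, filtering, aggregation) can be illustrated with a small standard-library sketch. The records and field names are invented for the example:

```python
from collections import defaultdict

# Toy rows standing in for records read from the lake (names illustrative).
rows = [
    {"cust_id": 1, "amt": "10.50", "region": "AU"},
    {"cust_id": 2, "amt": "3.00",  "region": "NZ"},
    {"cust_id": 1, "amt": "7.25",  "region": "AU"},
]

# Rename/restructure fields and convert formats (string amounts to floats).
renamed = [
    {"customer_id": r["cust_id"], "amount": float(r["amt"]), "region": r["region"]}
    for r in rows
]

# Filter to a single region.
au_only = [r for r in renamed if r["region"] == "AU"]

# Aggregate: total amount per customer.
totals = defaultdict(float)
for r in au_only:
    totals[r["customer_id"]] += r["amount"]

print(dict(totals))  # {1: 17.75}
```

At Datalake scale the same shape of work is expressed in Spark, SQL, or Glue jobs rather than plain Python, but the operations are the same.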
The intent of separating orchestration from transformation is to ensure the uptime and continued operation of the orchestration, which in turn manages, monitors, and attempts to recover from failures of the transforms. This separation also allows the two activities to be scaled and optimised independently: the orchestration scales with the number and interconnectedness of datasets and transformations within the Datalake, while the transforms themselves scale with data sizes, the transformations required, and the layouts and formats in which the data is persisted on object storage.
Typical tool choices for orchestration on the AWS cloud include:
AWS Step Functions
Amazon Managed Workflows for Apache Airflow (MWAA)
AWS Glue workflows
Amazon EventBridge
This is without making mention of the plethora of third party orchestration and ETL management software available as SaaS solutions and on the AWS marketplace.
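As a concrete example of retry being handled at the orchestration layer, here is a minimal AWS Step Functions state machine definition that starts a Glue job and retries it on failure. The job name is hypothetical; the state machine structure follows the Amazon States Language:

```json
{
  "Comment": "Sketch: orchestration-level retry around a transform job",
  "StartAt": "RunTransform",
  "States": {
    "RunTransform": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "curate-orders" },
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 60,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "End": true
    }
  }
}
```

The transform (the Glue job) knows nothing about this retry policy; scheduling and recovery live entirely in the orchestration definition.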
Common options utilised for transforming the data itself include:
AWS Glue jobs (Spark)
Amazon EMR and EMR Serverless
AWS Lambda
Amazon Athena
Redshift SQL and stored procedures
This list is by no means comprehensive: it fails to address streaming workloads; there is no mention of third-party offerings; and it doesn't look more deeply at higher-level tools like dbt, which compile to views and stored procedures and significantly reduce the effort to build certain transformations.
These lists provide only an overview, some of the landmarks on the AWS data services map. Many considerations are necessary to ensure a successful deployment: what skills exist in your organisation, what DevOps and deployment practices are in place, the quality and completeness of the integration between your orchestration tool and the transformation processes utilised, and how the storage tiers are configured and the data formatted to maximise efficiency and minimise costs. The interplay between the people, the orchestration, the transformation, and the business processes around data initiatives is critical to success.
With this endless array of alternatives and combinations for data engineering teams to contend with, it is hardly surprising that many stick to the tools and techniques they are familiar with from on-premise deployments, missing out on the advantages cloud-native data solutions offer; others wallow in analysis paralysis; and others still battle unstructured adoption of tools without a longer-term strategy. While there is no one-size-fits-all answer, there are well-trodden pathways and best practices that a partner like TechConnect can use to accelerate your successful delivery of an organisation-wide data program. With many successful data projects in our history and the hard-won learnings that come from delivering real-world solutions, TechConnect can create and evolve a strategy for extracting value from your data and share in a long-term relationship to help grow the data capabilities of your organisation into the future.