Demystifying the AWS Datalake
AWS provides a rich ecosystem of services for delivering data products. In fact, the number of services is so large that navigating the landscape can be daunting for an organisation.
One of the guiding principles of cloud-based Datalakes is the separation of compute resources from storage resources. In general, the storage options are the easier set to choose between:
Object Storage (S3)
Key-Value/Document Stores (DynamoDB, DocumentDB, OpenSearch)
Data Warehouses (Redshift)
The intended purpose of a Datalake is to act as a repository of all information across an organisation: to centralise data from many heterogeneous systems, to break down data silos, and to provide a central authority on what data is available and at what quality. A Datalake therefore needs maximum flexibility in how it stores data, and object storage is the typical choice. A common alternative is to centralise an organisation's most frequently accessed data in a Data Warehouse, with older data archived to object storage; integration between the warehouse (Redshift) and object storage (S3) via Redshift Spectrum then allows the archived data to be queried from within the warehouse.
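To make the flexibility of object storage concrete, a common convention is to lay data out under partitioned key prefixes so that query engines can prune by partition. The following is a minimal sketch; the zone and table names are illustrative, not prescribed by AWS:

```python
from datetime import date

def object_key(zone: str, table: str, run_date: date, part: int) -> str:
    """Build a partitioned object key of the kind commonly used on S3.

    The zone/table naming scheme here is a hypothetical convention,
    not an AWS requirement.
    """
    return (
        f"{zone}/{table}/"
        f"dt={run_date.isoformat()}/"
        f"part-{part:05d}.parquet"
    )

key = object_key("curated", "orders", date(2024, 1, 31), 0)
print(key)  # curated/orders/dt=2024-01-31/part-00000.parquet
```

Keys shaped this way let engines such as Athena or Redshift Spectrum skip whole date partitions when a query filters on `dt`.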
Less commonly or indirectly used storage options for Datalakes and data products:
Block storage (EBS)
While there are several storage choices for AWS data platforms, as noted, their use cases are clear and consensus around their usage has largely been established. How data is acted on (transformed, loaded, moved, cleaned, and enriched) remains a harder question to answer. In some cases it depends on the properties of the analysis the data is required for; in others, how the data is captured, and by whom, determines the tool choice; sometimes the volume of data is the primary decision factor, and at other times the velocity determines the mechanism. Regardless, there are many options for working with your data on the AWS cloud, and with the recent release of EMR Serverless and the continued evolution of Redshift, the range of options, and with it the difficulty of choosing, keeps growing.
Datalake compute is typically separated into two tiers: orchestration and transformation.
Orchestration defines the schedules, orders, dependencies, and dispatch of how data moves through the Datalake. Typically this is where retry and recovery are managed. Since this component is the centralised driver of activity within the Datalake, it is generally separated and isolated from the systems that implement the transform. The orchestration machinery is also responsible for updating catalogues, maintaining data lineage definitions, updating freshness reporting tools and other metadata management tasks.
Transformation, on the other hand, concerns itself with the low level operations on the data itself: renaming and restructuring data, joining tables together, converting formats, filtering and aggregations.
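The low-level operations listed above (renaming, format conversion, filtering, aggregation) can be illustrated with a small standard-library sketch. The records and field names are invented for the example:

```python
from collections import defaultdict

# Toy rows standing in for records read from the lake (names illustrative).
rows = [
    {"cust_id": 1, "amt": "10.50", "region": "AU"},
    {"cust_id": 2, "amt": "3.00",  "region": "NZ"},
    {"cust_id": 1, "amt": "7.25",  "region": "AU"},
]

# Rename/restructure fields and convert formats (string amounts to floats).
renamed = [
    {"customer_id": r["cust_id"], "amount": float(r["amt"]), "region": r["region"]}
    for r in rows
]

# Filter to a single region.
au_only = [r for r in renamed if r["region"] == "AU"]

# Aggregate: total amount per customer.
totals = defaultdict(float)
for r in au_only:
    totals[r["customer_id"]] += r["amount"]

print(dict(totals))  # {1: 17.75}
```

At Datalake scale the same shape of work is expressed in Spark, SQL, or Glue jobs rather than plain Python, but the operations are the same.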
The intent of separating orchestration from transformation is to ensure the uptime and continued operation of the orchestration, which in turn manages, monitors, and attempts to recover from failures of the transforms. This separation also allows the two activities to be scaled and optimised independently: the orchestration scales with the number and interconnectedness of datasets and transformations within the Datalake, while the transforms themselves scale with data sizes, the transformations required, and the layouts and formats in which the data is persisted on object storage.
Typical tool choices for orchestration on the AWS cloud include:
AWS Step Functions
Amazon Managed Workflows for Apache Airflow (MWAA)
AWS Glue workflows
Amazon EventBridge
This is without making mention of the plethora of third party orchestration and ETL management software available as SaaS solutions and on the AWS marketplace.
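As a concrete example of retry being handled at the orchestration layer, here is a minimal AWS Step Functions state machine definition that starts a Glue job and retries it on failure. The job name is hypothetical; the state machine structure follows the Amazon States Language:

```json
{
  "Comment": "Sketch: orchestration-level retry around a transform job",
  "StartAt": "RunTransform",
  "States": {
    "RunTransform": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "curate-orders" },
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 60,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "End": true
    }
  }
}
```

The transform (the Glue job) knows nothing about this retry policy; scheduling and recovery live entirely in the orchestration definition.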
Common options utilised for transforming the data itself include:
AWS Glue jobs (Spark)
Amazon EMR and EMR Serverless
AWS Lambda
Amazon Athena
Redshift SQL and stored procedures
This list is by no means comprehensive: it fails to address streaming workloads; there is no mention of third-party offerings; and it doesn't look more deeply at higher-level tools like dbt, which compile to views and stored procedures and significantly reduce the effort to build certain transformations.
These lists provide only an overview, some of the landmarks on the AWS data services map. Many considerations are necessary to ensure a successful deployment: what skills exist in your organisation, what DevOps and deployment practices are in place, the quality and completeness of the integration between your orchestration tool and the transformation processes utilised, and how the storage tiers are configured and the data formatted to maximise efficiency and minimise costs. The interplay between the people, the orchestration, the transformation, and the business processes around data initiatives is critical to success.
With this endless array of alternatives and combinations for data engineering teams to contend with, it is hardly surprising that many stick to the tools and techniques they are familiar with from on-premise deployments, missing out on the advantages cloud-native data solutions offer; others wallow in analysis paralysis; and others still battle unstructured adoption of tools without a longer-term strategy. While there is no one-size-fits-all answer, there are well-trodden pathways and best practices that a partner like TechConnect can use to accelerate your successful delivery of an organisation-wide data program. With many successful data projects in our history and the hard-won learnings that come from delivering real-world solutions, TechConnect can create and evolve a strategy for extracting value from your data and share in a long-term relationship to help grow the data capabilities of your organisation into the future.