On SQL behind Cloud Data Lakes

Intro

If you follow recent trends in Big Data engineering, you have probably noticed that over the last 18 months all three main cloud service providers (Amazon, Google and Microsoft) have intensified development of their Data Lake related services.

Note: it would be incorrect to state that the solutions are directly comparable due to fundamental differences in philosophy, implementation and integration with other cloud services.

One of the core services in all three solutions is a SQL-as-a-Service engine, which enables the creation of a Data Lake architecture around it.

Idea behind

A good SQL-as-a-Service solution should meet the following criteria:

  • Run complex SQL (ideally ANSI SQL) queries directly against cloud data storage
  • No, or near-zero, cost while the data lake is not in use
  • Almost limitless data size. Well, a declared capacity to process petabytes is next to infinity for most of us
  • High availability with dynamic, effortless scalability

Sounds like a Holy Grail...

Almost there...

Maybe later...

Not now at least.

Let us see how far the industry has progressed in this direction.

#1. The first star. Google’s BigQuery

In terms of thought leadership, first place should definitely go to Google with its BigQuery, based on the fact that it was introduced long before the competitors.

BigQuery V2 is born

BigQuery V2 was released in October 2011 and already supported a REST API, a web UI and the ability to export results into Google Cloud Storage in CSV format.

Support of standard SQL

It is worth pointing out that Google has never stopped developing BigQuery, and by the end of September 2016 it announced support for standard SQL, compliant with the SQL:2011 standard.

To my mind, this was a major milestone: one can now readily consider small GCP BigQuery projects with a significantly lower risk of having to master a new flavour of SQL.

The beauty of this release was also the support of UDFs (user-defined functions). Sorry, no Python or R in UDFs: exclusively JavaScript, via the powerful V8 engine.
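The UDF mechanism itself is conceptually the same one relational engines have offered for years. Below is a minimal local sketch of the idea, using Python's sqlite3 module as a stand-in engine; in BigQuery the function body would be JavaScript registered via CREATE TEMP FUNCTION ... LANGUAGE js, and the table and function names here are purely illustrative:

```python
import sqlite3

# In BigQuery the rough equivalent (JavaScript, not Python) would be:
#   CREATE TEMP FUNCTION url_domain(url STRING) RETURNS STRING
#   LANGUAGE js AS """ return url.split('/')[2]; """;
# Here the same scalar-UDF idea is shown locally with sqlite3.

def url_domain(url):
    """Extract the host part of an http(s) URL."""
    return url.split("/")[2]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (url TEXT)")
conn.execute("INSERT INTO hits VALUES ('https://cloud.google.com/bigquery')")

# Register the Python function so SQL can call it like a built-in.
conn.create_function("url_domain", 1, url_domain)
row = conn.execute("SELECT url_domain(url) FROM hits").fetchone()
print(row[0])  # cloud.google.com
```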

Further integration with Google Services

Should you be lucky enough to work in digital marketing, BigQuery now offers the possibility of querying AdWords, YouTube or log-level data (Data Transfer) directly.

Recent perks

SQL is like the English language: rather simple to start learning, but it takes time to master properly. As a result, almost anybody can write simple SQL queries, but it takes considerable skill to tune and optimise their performance.

Even if you have experience with relational databases, you cannot tune Data Lake SQL the same way: these engines do not use indices, and every query run is a full scan.
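To make the contrast concrete, here is a local sketch using Python's sqlite3 as a stand-in for a classic relational engine: the same query flips from a full scan to an index search once an index exists, and that index is exactly the lever a Data Lake engine does not give you. Table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i % 100, "x") for i in range(1000)])

query = "SELECT count(*) FROM events WHERE user_id = 42"

# Without an index the classic engine also has only one option: scan everything.
plan_no_index = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]
print(plan_no_index)    # a "SCAN ..." plan, i.e. a full scan

# With an index the plan changes; a lake engine cannot make this move at all.
conn.execute("CREATE INDEX idx_user ON events(user_id)")
plan_with_index = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]
print(plan_with_index)  # a "SEARCH ... INDEX idx_user ..." plan
```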

Unlike the AWS and Azure solutions, BigQuery offers pretty good guidance on dos and don'ts, as well as several ways to visualise the performance of each segment of a query.

Bonus

The Data Engineering on Google Cloud Platform Specialisation at Coursera is free for one month. In other words, no need to pay 5 × 65 AUD to complete the Specialisation.

The offer is valid until 31 October 2018.

#2. Enterprise ready. Microsoft Data Lake

A bit on origins

Microsoft Data Lake gen1 was introduced on April 29, 2015 (Wiki).

Even then it was already a rather mature solution: it emerged from an internal service called Cosmos (Behind the scenes of Azure Data Lake), which had been in use for quite a while.

Alike others

I must admit that my experience with Azure in general, and MS Data Lake in particular, is the least comprehensive compared to GCP and AWS.

For decades Microsoft was one of the symbols of antagonism towards the Open Source community: its solutions were proprietary and not portable to non-Windows systems.

When Microsoft decided to shed this image and prove to the whole world that it wanted to become more Linux-friendly, it was quite hard to believe.

As a result of all these efforts, the Data Lake stack now contains lots of tools that did not originate with Microsoft:

  • Microsoft offers Microsoft R Open, an enhanced distribution of R
  • R and Python can be used inside MS SQL Server and Azure ML
  • Azure HDInsight is a fully managed service optimised for the Apache stack: Spark, Kafka, HBase, etc.

But still distinct

If you look closer at the diagram describing the analytics layer of the Microsoft Data Lake, you will notice that it contains a U-SQL icon.

Well, Microsoft already has T-SQL, its own extension of SQL: different from ANSI SQL, but still recognisably SQL.

Does switching to the next letter of the alphabet, to U-SQL, change a lot?

Oh, yes. Quite a lot.

According to the MS documentation (U-SQL Language Reference), U-SQL is a mix of SQL and C#, and a rather heavy mix at that: the expression language inside SELECT clauses and WHERE predicates is C#.

In other words, you can abandon any hope of reusing your ANSI SQL experience as-is on a U-SQL project.
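To give a flavour, here is a minimal sketch in the spirit of the U-SQL Language Reference; the paths, column names and the ToUpper() call are illustrative, not taken from any real project. Note how the projection and the predicate are plain C# expressions:

```
@rows =
    EXTRACT name string, amount int
    FROM "/input/sales.csv"
    USING Extractors.Csv();

@big =
    SELECT name.ToUpper() AS name,   // C# string method, not SQL UPPER()
           amount
    FROM @rows
    WHERE amount > 100;              // C# boolean expression

OUTPUT @big TO "/output/big_sales.csv" USING Outputters.Csv();
```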

Work in progress

Since 27 June 2018, Azure Data Lake gen2 has been in preview (Introduction to Azure Data Lake Storage Gen2 Preview). The core focus is on positioning gen2 as a service for building enterprise data lakes (EDL).

Microsoft is clearly determined to conquer this space, and if you are already using the Microsoft stack, gen2 looks like a decent option.

#3. Athena. The Goddess of wisdom has arrived

AWS announced Athena in 2016 and has been progressing extremely fast, adding features and developing related services.



Creation of Athena. Modern way

According to Greek mythology,

Athena was "born" from Zeus's forehead as a result of him having swallowed her mother Metis

One must assume it was a rather efficient way to create a deity, as Athena's glory has lasted for millennia.

Nowadays, the AWS Athena creators had to follow a slightly more up-to-date path. According to Amazon's web pages, AWS Athena was created by the AWS team (it seems that Zeus was out of reach) and its ancestor is PrestoDB (good news: no need to swallow anybody).

Last but not least

AWS Athena celebrated its birthday on 30 November 2016.

That is 19 months after the Microsoft Data Lake announcement and more than five years after the announcement of BigQuery.

PrestoDB being an open-source product, one can find companies such as Facebook and Twitter among its community of users and developers. No gods or even demigods involved, but still rather impressive.

Long awaited but still at the very beginning of its path as of November 2016.

From SQL service to data lakes

The ability to execute SQL queries over S3 data is great. It allows you to explore vast amounts of data quickly, but it still leaves all the preparatory work aside.

Since August 2017, Athena has been complemented by Glue, a fully scalable ETL solution. Now one can import, clean, transform, repackage and visualise data on a schedule, all within one ecosystem.

Just as Athena builds on the familiar ANSI SQL standard, Glue comes with Apache Spark behind the scenes, thus allowing you to re-utilise your previous experience with Hive, PySpark and Scala.

Recent improvements

Surprisingly enough, lots of small but important improvements are not publicly announced and thus go unnoticed by the wider community.

AWS Athena has just recently gained the ability to create views. The Crawler now allows an S3 folder to be mapped into the Hive metastore (essentially, this means adding an Athena database table) in just a few clicks: a huge improvement over the earlier need to type all the mappings manually.
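The CREATE VIEW statement Athena accepts has the usual SQL shape, so existing SQL habits transfer directly. A minimal sketch of the idea, using Python's sqlite3 as a stand-in engine; the table, view and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_logs (status INTEGER, bytes INTEGER)")
conn.executemany("INSERT INTO raw_logs VALUES (?, ?)",
                 [(200, 512), (404, 0), (200, 1024)])

# A view pre-packages a common query so analysts need not repeat it.
conn.execute("""
    CREATE VIEW ok_traffic AS
    SELECT count(*) AS hits, sum(bytes) AS total_bytes
    FROM raw_logs
    WHERE status = 200
""")
hits, total = conn.execute("SELECT hits, total_bytes FROM ok_traffic").fetchone()
print(hits, total)  # 2 1536
```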

AWS Glue has also demonstrated significant improvements recently. Glue's dynamic frames have finally gained the ability to export into partitioned Parquet files, which allows Data Lake architects to fine-tune their solutions even further.
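Partitioned output follows the Hive-style key=value folder convention that Athena's partition pruning relies on. Here is a minimal sketch of that layout using only the Python standard library, assuming plain CSV files on local disk rather than Parquet on S3, with illustrative column names:

```python
import csv
import os
import tempfile

# Partitioned output means one folder per partition value, Hive-style:
#   <root>/event_date=2018-10-01/part-0.csv
# The partition key lives in the path, not inside the files.
rows = [
    {"event_date": "2018-10-01", "user": "a"},
    {"event_date": "2018-10-01", "user": "b"},
    {"event_date": "2018-10-02", "user": "c"},
]

root = tempfile.mkdtemp()
for date in sorted({r["event_date"] for r in rows}):
    part_dir = os.path.join(root, f"event_date={date}")
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "part-0.csv"), "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["user"])
        writer.writeheader()
        writer.writerows({"user": r["user"]}
                         for r in rows if r["event_date"] == date)

print(sorted(os.listdir(root)))
# ['event_date=2018-10-01', 'event_date=2018-10-02']
```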

Still not at a production level

Despite the impression that AWS is heading in the right direction with its Athena-Glue tandem, it still has miles to cover.

Unlike Glue, Athena looks quite stable and complete in terms of the services it offers as of now. It would be rather handy to finally get an implementation of UDFs in Python and/or JavaScript, stored procedures and everyday DBA tools. Among the critical requirements would be the ability to easily trace the sources of memory-overflow-style issues. We should not forget that we are talking about a big data solution here, and SQL querying may require extra knowledge.

The weakest point of the Athena-Glue pair, in the author's opinion, is definitely AWS Glue.

Despite being positioned as an ETL tool, it does not enforce metadata compliance and may crash without providing any easy-to-interpret message when data types are inconsistent. What is even worse, there have been cases where an auto-generated Crawler-Glue pair ran without any crashes but silently lost some data during processing.

Final comment on AWS Glue-Athena pair

To summarise the concerns above: yes, you can use it for most cases, but always impose tight control over the quality of inputs and validate the consistency of inputs against outputs.
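A minimal sketch of what such a control could look like: comparing row counts and an order-independent key checksum between the input and output of an ETL step. All names here are illustrative, not part of any Glue API:

```python
def checksum(rows, key):
    """Order-independent fingerprint of one column across all rows."""
    return sum(hash(r[key]) for r in rows)

def validate_etl(source_rows, result_rows, key):
    """Raise if the ETL step dropped rows or lost key values."""
    if len(source_rows) != len(result_rows):
        raise ValueError(
            f"row count drift: {len(source_rows)} -> {len(result_rows)}")
    if checksum(source_rows, key) != checksum(result_rows, key):
        raise ValueError("key checksum drift: rows were altered or lost")
    return True

src = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
out = [{"id": 2, "v": "B"}, {"id": 1, "v": "A"}]  # transformed and reordered
print(validate_etl(src, out, key="id"))  # True
```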

This will be covered in the next article.


Trivia questions

Q: Which 2 companies used Greek names for their Data Lake stack?

A: AWS uses the name Athena and Microsoft uses Cosmos.


Q: Which companies (Microsoft, Amazon, Google) are the closest in terms of technology stack?

A: Google and AWS. Both are using lots of Apache products, Linux and open source. Not to mention the efforts to comply with ANSI SQL for Data Lake solutions.


Q: Which company has pioneered serverless SQL service?

A: Google with its BigQuery in 2011.


