Hello Athena
Last December I attended my second AWS re:Invent conference in Las Vegas. One of the many new services released was Athena, a managed query service powered by PrestoDB.
PrestoDB is a distributed ANSI SQL query engine for data sources ranging from gigabytes to petabytes. Much like Apache Drill and SparkSQL, it allows you to read and join a variety of data sources. It's also a schema-on-read engine, unlike traditional data warehousing solutions like Redshift or Snowflake.
Athena handles all of the infrastructure; you simply submit queries and pay $5 per TB of data scanned. Using a columnar storage format like Parquet along with Hive partitioning helps limit the amount of data scanned.
Today, I will be taking my first hands-on look at Athena to see if it can reduce the cost of some of our Spark jobs.
I have 1.7 TB of unpartitioned Parquet data, so a query across all the columns would cost $9.70.
The queries I will be submitting are all temporal in nature, so partitioning by year, month, day, and hour should improve query time and decrease cost. However, partitioning by day and hour would produce 8,760 partitions per year. In my experience, writing this many partitions to S3 is slow and error-prone, so I will partition by date (yyyy-MM-dd) for this test.
Storing your data in a columnar format like Parquet also reduces cost, because PrestoDB will only scan the columns referenced in your query.
Let's start by loading up our data into a Spark program.
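Here's a minimal sketch of that step; the SparkSession setup and the S3 bucket and prefix are placeholders for your own environment.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("athena-staging")
  .getOrCreate()

// Read the raw, unpartitioned Parquet data from S3 into a DataFrame.
// The bucket and prefix below are placeholders for your own location.
val events = spark.read.parquet("s3://my-bucket/events/raw/")
```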
Next, we can use SparkSQL to add a column for our partitioning strategy.
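A sketch of what that looks like, assuming the data has an event timestamp column (I'll call it `event_ts` here) and naming the new partition column `dt`:

```scala
// Expose the DataFrame to SparkSQL.
events.createOrReplaceTempView("events")

// Derive a yyyy-MM-dd partition column from the event timestamp.
// `event_ts` is a placeholder for whatever your timestamp column is named.
val eventsWithDate = spark.sql(
  "SELECT *, date_format(event_ts, 'yyyy-MM-dd') AS dt FROM events"
)
```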
And finally we can partition our data and save it back to S3.
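Continuing the sketch, `partitionBy` gives us the Hive-style `dt=yyyy-MM-dd/` directory layout on S3 that Athena understands:

```scala
// Write the data back to S3, one directory per dt value, e.g.
// s3://my-bucket/events/partitioned/dt=2016-12-01/part-*.parquet
eventsWithDate.write
  .mode("overwrite")
  .partitionBy("dt")
  .parquet("s3://my-bucket/events/partitioned/")
```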
This Spark job will kick off a big network shuffle. You can watch your cluster's network stats in Ganglia while you wait, or go grab a coffee.
Creating Tables
Athena uses terms from traditional relational databases like table and database; however, the underlying structure is quite different. For example, tables are created with Hive DDL, and Athena maintains an internal catalog of the data. You can safely drop tables without affecting the original data stored on S3.
Let's create a table for the data we just staged.
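A sketch of the Hive DDL; the column names are placeholders, but the important pieces are the PARTITIONED BY clause and the LOCATION pointing at the partitioned data we just wrote:

```sql
CREATE EXTERNAL TABLE events (
  user_id  string,
  event_ts timestamp,
  payload  string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/partitioned/';
```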
If you run a query at this point you will likely get empty results. This is because the partitions have not yet been loaded into the Athena catalog. We resolve this issue by running:
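```sql
-- 'events' is the placeholder table name from the DDL above.
-- MSCK REPAIR TABLE scans the table's S3 location and registers any
-- Hive-style partitions (dt=...) it finds in the catalog.
MSCK REPAIR TABLE events;
```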
OK, we are now ready to test some of our queries.
Submitting Queries to Athena
At the time of writing, Athena only supports submitting queries via the UI or JDBC. Presto itself also offers a command-line interface, and the Teradata distribution has added ODBC support.
First, you should know that Presto doesn't automatically convert types the way some other databases do. We'll need to explicitly cast our date strings to DATE.
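For example, a date-range count along these lines, using the placeholder table and column names from above:

```sql
-- dt is stored as a string, so cast it to DATE before comparing it
-- against date values.
SELECT count(*)
FROM events
WHERE CAST(dt AS DATE)
      BETWEEN DATE '2016-12-01' AND DATE '2016-12-07';
```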
The query above took 3 seconds to complete and scanned 1.5 GB of data, costing a whopping 1 cent. We can see that partitioning is indeed working, because our full S3 footprint is over 1.7 TB.
My original thought was to partition by date and hour. However, at these prices, and given the time it takes Spark to write a large number of partitions to S3, I think date partitioning is the best choice.
Comparing Performance and Cost with our Spark Job
My go-to Spark cluster consists of one m3.xlarge master and four r3.4xlarge workers, all at spot prices. This gives me a combined 488 GB of memory and 1.2 TB of SSD storage. This cluster is not large enough to hold our entire dataset, but all of our queries look at a smaller subset of that data.
The maximum per-hour cost of this cluster is $5, but current spot pricing is around $0.36 per instance, so my cluster will probably cost about $1.80/hr.
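For reference, here's roughly how the equivalent lookup runs through SparkSQL on that cluster, reusing the placeholder paths and names from the staging job:

```scala
// Read the partitioned data and run the same date-range count.
val partitioned = spark.read.parquet("s3://my-bucket/events/partitioned/")
partitioned.createOrReplaceTempView("events")

spark.sql(
  """SELECT count(*)
    |FROM events
    |WHERE dt BETWEEN '2016-12-01' AND '2016-12-07'""".stripMargin
).show()
```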
The same query in Spark took 45 seconds.
Conclusion
If you can stage your data in a columnar format like Parquet, with partitions that match your queries, Athena is a clear winner for ad hoc analytical queries.
My next test for Athena will be benchmarking join performance.