Hello Athena
AWS re:Invent Keynote Dec 1st 2016 Photo: Jonathan Spooner

Hello Athena

Last December I attended my second AWS re:Invent conference in Las Vegas. One of the many new services released was Athena a managed query service powered by PrestoDB.

PrestoDB is a distributed ANSI SQL query engine for data sources ranging from gigabytes to petabytes. Much like Apache Drill and SparkSQL, it allows you to read and join a variety of data sources. It's a also a schema on read database unlike traditional data warehousing solutions we use like Redshift or Snowflake.

Athena handles all of the infrastructure and you simply submit querys and pay $5 per TB of data scanned. The use of a columnar storage formats like Parquet and Hive partitioning will help limit the about of data scanned.

Today, I will be taking my first hands on look at Athena to see if we can decrease cost over some of our Spark Jobs.

I have 1.7 TB of unpartitioned Parquet data so a query across all the columns would cost $9.70.

The queries I will be submitting are all temporal in nature so partitioning by year, month, day and hour should improve query time and decrease cost. However partitioning by day and hour would produce 8,760 partitions. In my experience writing this many partitions to S3 is slow and error prone so I will partition by Date (yyyy-MM-dd) for this test.

You should also know that storing your data in a columnar format like Parquet can also reduce cost because PrestoDB will only scan the necessary columns.

Let's start by loading up our data into a Spark program.

Next, we can use SparkSQL to add a column for our partitioning strategy.

And finally we can partition our data and save it back to S3.

This Spark job will kick off a big network shuffle. You can watch your cluster network stats in Ganglia while you wait or you can go grab coffee.

Creating Tables

Athena uses terms from traditional relational databases like table and database, however the underlining structure is much different. For example tables are created with the Hive DDL and it manages an internal catalog of the data. You can safely drop tables without effecting the original data stored on S3.

Let's create a table for the data we just staged.

You will likely run a query at this time and get empty results. This is because the partitions are not loaded yet in the Athena Catalog Manager. We resolve this issue by running.

Ok we are now ready to test some of our queries.

Submitting Queries to Athena

At the time of writing Athena only supports submitting queries via the UI or JDBC. Presto it's self also offers a command line interface and the Terradata distribution has added support for ODBC.

First you need to know that Presto doesn't automatically convert types like other databases. We'll need to explicitly cast our date strings as DATE.

The query above took 3 seconds to complete and scanned 1.5GB of data costing a whopping 1 cent. We see that partitioning is indeed working because our S3 footprint is currently over 1.7TB.

My original thought was to partition by date and hour. However with these prices and the time it takes for Spark to write large number of partitions to S3 I think date partitioning is the best choice.

Let's compare Performance and Cost with our Spark Job

My go to Spark Cluster consists of one master m3.xlarge and four r3.8xlarge all using spot prices. This gives me a combined 488 GB of memory and 1.2 TB of SSD disk space. This cluster is not large enough to hold our entire dataset but all of our queries will be looking a smaller subset of that data.

The maximum per hour cost of this cluster is $5 but current spot pricing is around $0.36 so my cluster will probably cost $1.80hr.

The same query in Spark took 45 seconds.

Conclusion

If you can stage your data in a columnar format like Parquet and with partitions that match your queries Athena is a clear winner for adHoc analytical queries.

My next test test for Athena will be benchmarking join performance.



To view or add a comment, sign in

More articles by Jonathan Spooner

  • Hello Athena Part II

    In my previous post Hello Athena I took an initial look at AWS Athena to see if we could reduce cost on some of our…

    1 Comment
  • Alexa ask Santa Claus...

    This morning I received an email about an upcoming Hello Alexa event in Los Angeles and I remember I never posted this…

    1 Comment
  • The Future of Your Vehicle Data

    Ford Motor Company hosted a developer conference and hackathon at Ctia Super Mobility in Las Vegas. The speakers…

    3 Comments
  • BACK in the DAY

    Over the past two or three years my good friend Ozzie Ausband has been putting his blood and sweat into a skateboarding…

    4 Comments
  • Ford Developer Conference at Ctia Super Mobility

    Yesterday, Ford Motor Company hosted a developer conference and hackathon at Ctia Super Mobility in Las Vegas. The…

  • Software Engineer for Innovation Team

    Are you right for this role? Are you a software engineer by trade but entrepreneur at heart? The right candidate will…

Others also viewed

Explore content categories