Part 2: Efficient Data Exploration Using Amazon Athena
Welcome back to my blog series, "Get Hands-On: Learn Skills for the AWS Data Engineer Associate Certification."
In this post, we’ll walk through the basics of setting up Amazon Athena, creating databases and tables, and optimizing your queries using efficient data formats like Parquet. Whether you’re preparing for the AWS Data Engineer Associate Certification or just looking to sharpen your data querying skills, this guide will give you practical, hands-on experience with Athena that you can immediately apply to your projects.
Section 1: What is Amazon Athena?
I like to describe Athena as a great tool for ad-hoc analysis: no complex ETL processing or Glue jobs needed. At its core, Athena is a serverless query service that lets you analyze data directly in Amazon S3 using standard SQL, and you only pay for the data you actually scan.
Here are some of the cool things you can do with Athena: query raw logs and files sitting in S3 without loading them anywhere first, validate data before building pipelines around it, and run one-off analytical queries without provisioning any infrastructure.
Section 2: Setting Up Athena
Let's get started in the AWS Console by creating a database in Athena.
CREATE DATABASE jorges_database;
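To confirm the database was created, you can run a quick check (purely optional):

SHOW DATABASES;
-- jorges_database should appear in the list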
You'll also see the new database appear in the Data pane on the left side of the query editor, where you can set it as the active database for the queries that follow.
Section 3: Querying Data from S3
Next, let's create an external table in Athena that points to our data in S3. When we define an external table, Athena stores only the schema; the actual data stays in S3.
Let's use the sample data from part 1 of this blog series, with a couple of small tweaks: I'm removing the [ and ] symbols at the beginning and end of the file (when I kept them, Athena interpreted them as rows), as well as the trailing commas after each record, so that each line is a standalone JSON object. Here's the data; save it as a .json file and upload it to S3.
{"id": 1, "name": "Logan", "age": 197, "department": "Security"},
{"id": 2, "name": "Wade Wilson", "age": 35, "department": "Human Resources"},
{"id": 3, "name": "Charles Xavier", "age": 93, "department": "Executive"}
Now let's run this SQL command to create the table:
CREATE EXTERNAL TABLE my_table (
    id INT,
    name STRING,
    age INT,
    department STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
    'ignore.malformed.json' = 'true'  -- skip rows that fail to parse instead of failing the whole query
)
LOCATION 's3://jorge-glue-input-bucket/'  -- the S3 folder containing the .json file
TBLPROPERTIES ('has_encrypted_data'='false');
SQL Explanation
Let's break down the key components in the table we just created.
ROW FORMAT: This clause specifies how the rows of data are structured and stored in the table. It dictates how Athena should parse and interpret each row of your data.
SERDE: This specifies the Serializer/Deserializer (SerDe) that Athena uses to read data from your input files and convert it into a format that can be queried with SQL. Different SerDes are available for different data formats, such as JSON, Parquet, and CSV.
LOCATION: This clause points to the S3 bucket or directory where your actual underlying data is stored.
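Before moving on, it's worth a quick sanity check that Athena can actually parse the rows (the table name here is the one we just created):

SELECT * FROM my_table LIMIT 10;
-- Expect the three rows from the JSON file; rows of all NULLs usually point to a SerDe or format mismatch.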
Section 4: Using Athena's CTAS Feature to Convert to Parquet
As we learned in part 1 of this blog series, Parquet is a better format for both Athena and Glue than CSV or JSON. You may be wondering, "Why bother with this?" In short, it makes your queries faster and less expensive. In my experience, it's no fun waiting more than a minute for your query results to come back.
As a refresher, Parquet offers these advantages:
1. Columnar Storage: Parquet stores data by columns, so Athena scans only what’s needed.
2. Data Compression: Parquet's columnar structure makes it highly compressible, which leads to smaller file sizes. Smaller files mean less data scanned, and less data scanned means you pay less.
3. Optimized for Analytics: Parquet performs well with filtering and aggregations; the more complex your queries get, the more apparent Parquet's benefits become.
Athena has a very straightforward way to convert your JSON data into Parquet: a "Create Table As Select" (CTAS) query, in which we specify Parquet as the format along with an output location. Let's do that now.
CREATE TABLE my_table_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://your-bucket/processed-data/'  -- must point to an empty S3 location, or the query fails
) AS
SELECT * FROM my_table;
Let's take a look at the new processed file. When I did this, I wondered: is it really a Parquet file? Here's what happened when I tried to open it in a text editor:
Yes, it is. Note that it's a binary file, so it isn't human-readable when opened in a text editor on Windows, though you can make out some data entries, as well as the word hive_schema toward the bottom. That's metadata that allows Apache Hive, and by extension Athena, to query the data efficiently without scanning the entire dataset.
One more note: if you query each table with a SELECT * statement, you may notice that the Parquet results come back about 200 milliseconds slower. This is expected: we're dealing with a very small dataset, and Parquet adds decompression overhead. As you query larger, more complex datasets, Parquet's speed benefits become pronounced.
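To get a feel for where the columnar layout helps, here's an illustrative query that touches only one column (the table and column names are the ones we created above):

SELECT department, COUNT(*) AS headcount
FROM my_table_parquet
GROUP BY department;
-- Athena reads only the department column from the Parquet files here;
-- running the same query against the JSON table scans every byte of the file.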
Section 5: Best Practices for Using Athena
Now that we've familiarized ourselves with Athena and how to use CTAS, let's review some best practices:
1. Partition Your Data: Organize your data into partitions (by date, region, or another commonly filtered column) so Athena scans only the partitions a query needs; see the sketch after this list.
2. Use Compression and Columnar Formats: Store data as compressed Parquet (or ORC) to reduce both the data scanned and your query times.
3. Reuse Query Results: Enable Athena's query result reuse so repeated identical queries are served from cached results instead of re-scanning the data.
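To illustrate the partitioning point above, here's a rough sketch of how the earlier CTAS query could partition its output by department (the table name and bucket path are placeholders, not part of the original setup):

CREATE TABLE my_table_parquet_partitioned
WITH (
    format = 'PARQUET',
    external_location = 's3://your-bucket/partitioned-data/',  -- placeholder; must be an empty location
    partitioned_by = ARRAY['department']  -- partition columns must come last in the SELECT list
) AS
SELECT id, name, age, department FROM my_table;

A query that filters on department will then read only the matching partition's files instead of the whole dataset.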
Section 6: Troubleshooting Common Issues
Here’s how to address common issues in Athena:
Permissions Errors
Make sure the IAM principal running your Athena queries has the required s3:GetObject, s3:PutObject, and s3:ListBucket permissions, and check that your S3 bucket policy allows access.
Data Format Issues
If you encounter blank or incorrect query results, the cause is often mismatched SerDe settings or data format inconsistencies. Make sure the SerDe in your table definition matches the data format, and that your data files match the table schema.
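When results look wrong, it also helps to see exactly which S3 object each row came from. Athena exposes a "$path" pseudo-column for this; just a quick debugging trick:

SELECT "$path", id, name FROM my_table LIMIT 5;
-- "$path" shows the S3 object each row was read from,
-- which makes it easy to spot stray files under the table's LOCATION.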
Conclusion
In this post, we got set up with Amazon Athena, created a database, and explored some of its features. We discussed how Parquet can optimize your Athena usage, and learned how to convert JSON data to Parquet using a simple CREATE TABLE AS SELECT statement directly in Athena. We also reviewed best practices like partitioning, compression, and query result reuse, along with tips for troubleshooting common issues. Stay tuned for the next installment for more hands-on learning!