Part 2: Efficient Data Exploration Using Amazon Athena

Welcome back to my blog series, "Get Hands-On: Learn Skills for the AWS Data Engineer Associate Certification."

In this post, we’ll walk through the basics of setting up Amazon Athena, creating databases and tables, and optimizing your queries using efficient data formats like Parquet. Whether you’re preparing for the AWS Data Engineer Associate Certification or just looking to sharpen your data querying skills, this guide will give you practical, hands-on experience with Athena that you can immediately apply to your projects.

Section 1: What is Amazon Athena?

I like to describe Athena as a great tool for ad-hoc analysis: no complex ETL processing or Glue jobs needed. At its core, it's a serverless query service that lets you analyze data directly in Amazon S3 using standard SQL, and you only pay for the data you actually scan.


Here are some cool things you can do with Athena:

  • Query data in S3 without ETL
  • Analyze CloudTrail Logs, VPC Flow Logs, and Application Logs
  • Perform Federated Queries against multiple data sources at once, including even other cloud providers and on-premises data
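As a quick sketch of that last point: once a federated connector is registered as an Athena data source, you can join it with S3-backed tables in a single query. The catalog name (mysql_catalog), schema, and tables below are hypothetical; the exact catalog reference depends on how the connector is registered.

```sql
-- Hypothetical federated join: "mysql_catalog" is a registered
-- Athena data source backed by a Lambda connector.
SELECT o.id, c.customer_name
FROM my_table o                          -- S3-backed table in the Glue Data Catalog
JOIN mysql_catalog.shop.customers c      -- external MySQL table via the connector
    ON o.id = c.customer_id;
```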


Section 2: Setting Up Athena

Let's get started in the AWS Console by creating a database in Athena.

  • Go to the Athena landing page by typing 'Athena' in the AWS Console search bar.
  • In the get started block, select ‘Query your data with Trino SQL’. This is the classic Athena experience; Trino was formerly known as PrestoSQL.

Note: this section of the landing page was recently updated.

  • Create a new database by executing a simple SQL command:

CREATE DATABASE jorges_database;        
A fairly straightforward experience, but a few details are worth calling out:

  • In Athena, a Database groups tables and defines metadata about data in S3, but doesn’t store the data itself. Assigning tables to a database helps organize and manage them for easier querying.
  • Athena uses the AWS Glue Data Catalog by default for storing databases and table definitions. If you don't create a database, Athena uses the default Glue database, labeled "default."
  • Reuse query results: When you run a query in Athena, results are cached for around 60 minutes by default. If you select this option and run the same query within that time frame, your data will not get scanned again. Since Athena charges based on data scanned, this can produce some nice savings.
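To see the new database in action, you can list what's in the catalog and qualify table names with the database name (this assumes a table has already been created in it, which we'll do in the next section):

```sql
SHOW DATABASES;                          -- lists databases in the Glue Data Catalog
SHOW TABLES IN jorges_database;          -- lists tables registered in our database
SELECT * FROM jorges_database.my_table   -- database-qualified table reference
LIMIT 10;
```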


Section 3: Querying Data from S3

Next, let's create an external table in Athena that points to our data in S3. When we define an external table, Athena creates the schema, but the actual data stays on S3.

Let's use the sample data from part 1 of the blog, with one small tweak: I'm removing the [ and ] symbols at the beginning and end of the file. When I kept them, Athena interpreted them as rows. Here's the data; save it as a .json file and upload it to S3.

{"id": 1, "name": "Logan", "age": 197, "department": "Security"},
{"id": 2, "name": "Wade Wilson", "age": 35, "department": "Human Resources"},
{"id": 3, "name": "Charles Xavier", "age": 93, "department": "Executive"}        

Now let's run this SQL command to create the table:

CREATE EXTERNAL TABLE my_table (
    id INT,
    name STRING,
    age INT,
    department STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
    'ignore.malformed.json' = 'true'
)
LOCATION 's3://jorge-glue-input-bucket/'
TBLPROPERTIES ('has_encrypted_data'='false');        

SQL Explanation

 Let's break down the key components in the table we just created.

ROW FORMAT: this clause specifies how the rows of data are structured and stored in the table. It dictates how Athena should parse and interpret each row of your data.

SERDE: specifies the Serializer/Deserializer (SerDe) that Athena uses to read data from your input files and convert it into a format that can be queried with SQL. Different SerDes are available for different data formats, such as JSON, Parquet, CSV, etc.

LOCATION: this clause points to the S3 bucket or directory where your actual underlying data is stored.
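For contrast, here's a sketch of roughly what the same table could look like if the data were CSV instead of JSON; note the different SerDe. (One caveat: OpenCSVSerDe reads every value as a string, so the columns are declared STRING and cast at query time. The S3 path is a placeholder.)

```sql
CREATE EXTERNAL TABLE my_table_csv (
    id STRING,          -- OpenCSVSerDe treats all values as strings;
    name STRING,        -- cast at query time, e.g. CAST(id AS INT)
    age STRING,
    department STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',')
LOCATION 's3://jorge-glue-input-bucket/csv/'
TBLPROPERTIES ('skip.header.line.count' = '1');
```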


Section 4: Using Athena's CTAS Feature to convert to Parquet

As we learned in part 1 of this blog series, Parquet is an optimal format for both Athena and Glue as opposed to CSV or JSON. You may be wondering: "Why bother with this?". In short, it makes your queries faster and less expensive. In my experience, it's no fun waiting more than a minute for your query results to come back.

As a refresher, Parquet offers these advantages:

1. Columnar Storage: Parquet stores data by columns, so Athena scans only what’s needed.

2. Data Compression: Parquet's columnar structure makes it highly compressible, which leads to smaller file sizes. Smaller files mean less data scanned, and less that you have to pay.

3. Optimized for Analytics: Parquet scales well with filtering and aggregations; the more complex your data gets, the more apparent Parquet's benefits become.

Athena has a very straightforward way to convert your JSON data into Parquet: a "Create Table As Select" (CTAS) query, where we specify the format as Parquet along with an output location. Let's do that now.

CREATE TABLE my_table_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://your-bucket/processed-data/'
) AS
SELECT * FROM my_table;        
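With the Parquet table created, the columnar benefit is easy to demonstrate: selecting only the columns you need means Athena scans only those columns' data.

```sql
-- Scans only the name and department columns of the Parquet files,
-- not the entire rows, so less data is billed.
SELECT name, department
FROM my_table_parquet
WHERE department = 'Security';
```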

Let's take a look at the new processed file. When I did, I wondered: is it really a Parquet file? Here's what happened when I tried to open it in a text editor:


Clear as mud!

Yes, it is. Parquet is a binary format, so it doesn't open as human-readable text in Windows, though you can make out some data entries, as well as the word hive_schema toward the bottom. That's metadata that allows Apache Hive, and by extension Athena, to efficiently query the data without having to scan through the entire dataset.

One more note: if you query each table with a SELECT * statement, you may notice that the Parquet results load about 200 milliseconds slower. This is expected, both because we are dealing with a very small dataset and because Parquet adds decompression overhead. As you query larger, more complex datasets, Parquet's speed benefits become pronounced.


Section 5: Best Practices for Using Athena

Now that we've familiarized ourselves with Athena and how to use CTAS, let's review some best practices:

  • Partitioning Data: Limits data scanned by Athena, improving query performance and reducing costs by focusing only on relevant partitions.
  • Query Caching: Speeds up repeated queries by caching results, saving time and costs by avoiding data rescanning.
  • Security: Protect your data by using IAM roles with least-privilege permissions and VPC endpoints for secure traffic routing.
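Partitioning pairs naturally with the CTAS pattern from Section 4. Here's a sketch (the output path is a placeholder; in Athena CTAS, partition columns must be listed last in the SELECT):

```sql
CREATE TABLE my_table_partitioned
WITH (
    format = 'PARQUET',
    external_location = 's3://your-bucket/partitioned-data/',
    partitioned_by = ARRAY['department']  -- partition column(s) go last in the SELECT
) AS
SELECT id, name, age, department
FROM my_table;
```

Queries that filter on department will then read only the matching partition's files.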


Section 6: Troubleshooting Common Issues

Here’s how to address common issues in Athena:

Permissions Errors

Make sure that the IAM role associated with Athena has the required s3:GetObject, s3:PutObject, and s3:ListBucket permissions, and check your S3 bucket policy to allow access. 
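As a rough sketch, the S3 side of such a policy might look like this (bucket names are placeholders; the principal also needs Athena permissions such as athena:StartQueryExecution, plus access to the query results bucket):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::your-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::your-bucket"
    }
  ]
}
```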


Data Format Issues

If you encounter issues with blank or incorrect data, it may be due to mismatched SERDE settings or data format inconsistencies. Make sure that the SERDE in your table definition matches the data format, and that your data files match the table schema.
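A quick way to confirm which SerDe a table is actually using is to ask Athena for its DDL:

```sql
SHOW CREATE TABLE my_table;  -- prints the ROW FORMAT SERDE and SERDEPROPERTIES in effect
```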


Conclusion

In this post, we got set up with Amazon Athena, created a database, and explored some of its features. We discussed how using Parquet can optimize your Athena usage, and learned how to convert JSON data to Parquet using a simple CTAS query directly in Athena. We also reviewed best practices like partitioning, query caching, and security, along with tips for troubleshooting common issues. Stay tuned for the next installment for more hands-on learning!

More articles by Jorge Rodriguez