Data Engineering Certification Prep Series – Tip #11

Jayesh Shinde

Published Sep 19, 2025

Migrating Hive Metastore to AWS Glue Data Catalog for a Serverless Future

Problem

A company is planning to migrate its on-premises Apache Hadoop clusters to Amazon EMR. Along with this, the company needs to migrate its Hive metastore, which is currently stored on-premises. The new solution must be:

Persistent (metadata should not be lost when clusters shut down)
Serverless (no operational overhead of managing databases)
Cost-effective

Options

A. Use AWS DMS to migrate the Hive metastore into Amazon S3. Configure AWS Glue Data Catalog to scan Amazon S3.

B. Configure a Hive metastore in Amazon EMR. Migrate the existing Hive metastore into Amazon EMR. Use AWS Glue Data Catalog to store the company’s external data catalog.

C. Configure an external Hive metastore in Amazon EMR. Migrate the existing Hive metastore into Amazon EMR. Use Amazon Aurora MySQL to store the catalog.

D. Configure a new Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use the new metastore as the company’s data catalog.

Options Analysis

A. Use AWS DMS to migrate the Hive metastore into Amazon S3. Configure AWS Glue Data Catalog to scan Amazon S3.

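The Hive metastore is not a set of data files; it is metadata held in a relational database (commonly MySQL or PostgreSQL).
AWS DMS would land those metastore tables in S3 as flat files, and a Glue crawler scanning them would catalog the dump itself rather than recreate the Hive table definitions.
Glue needs table and schema definitions, not raw database dumps.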
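To make the difference concrete, here is a minimal sketch of what "schema definitions" means in practice: rows from the metastore's relational tables have to be joined and reshaped into a Glue table definition. The row and column names (`TBLS`, `SDS`, `SERDES`, `COLUMNS_V2`, `PARTITION_KEYS`) follow the commonly used Hive metastore schema and may differ between Hive versions; the sample table, bucket, and the helper `to_glue_table_input` are hypothetical and only illustrate the idea.

```python
from typing import Any


def to_glue_table_input(
    tbl: dict[str, Any],
    sd: dict[str, Any],
    serde: dict[str, Any],
    columns: list[dict[str, Any]],
    partition_keys: list[dict[str, Any]],
) -> dict[str, Any]:
    """Assemble one Glue-style table definition from joined metastore rows."""
    ordered_cols = sorted(columns, key=lambda c: c["INTEGER_IDX"])
    ordered_keys = sorted(partition_keys, key=lambda p: p["INTEGER_IDX"])
    return {
        "Name": tbl["TBL_NAME"],
        "TableType": tbl["TBL_TYPE"],
        "StorageDescriptor": {
            "Columns": [
                {"Name": c["COLUMN_NAME"], "Type": c["TYPE_NAME"]}
                for c in ordered_cols
            ],
            # An on-premises hdfs:// location would need rewriting to s3://
            # as part of the migration; that step is not shown here.
            "Location": sd["LOCATION"],
            "InputFormat": sd["INPUT_FORMAT"],
            "OutputFormat": sd["OUTPUT_FORMAT"],
            "SerdeInfo": {"SerializationLibrary": serde["SLIB"]},
        },
        "PartitionKeys": [
            {"Name": p["PKEY_NAME"], "Type": p["PKEY_TYPE"]}
            for p in ordered_keys
        ],
    }


# Hypothetical sample rows, as they might come back from the metastore DB.
table_input = to_glue_table_input(
    tbl={"TBL_NAME": "orders", "TBL_TYPE": "EXTERNAL_TABLE"},
    sd={
        "LOCATION": "s3://example-bucket/warehouse/orders/",
        "INPUT_FORMAT": "org.apache.hadoop.mapred.TextInputFormat",
        "OUTPUT_FORMAT": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    },
    serde={"SLIB": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"},
    columns=[
        {"COLUMN_NAME": "amount", "TYPE_NAME": "double", "INTEGER_IDX": 1},
        {"COLUMN_NAME": "order_id", "TYPE_NAME": "bigint", "INTEGER_IDX": 0},
    ],
    partition_keys=[{"PKEY_NAME": "dt", "PKEY_TYPE": "string", "INTEGER_IDX": 0}],
)
print(table_input["Name"], table_input["StorageDescriptor"]["Columns"])
```

A raw dump in S3 carries none of this structure in a form Glue can use directly, which is why option A falls short.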

B. Configure a Hive metastore in Amazon EMR. Migrate the existing Hive metastore into Amazon EMR. Use AWS Glue Data Catalog to store the company’s external data catalog.

AWS Glue Data Catalog is a serverless, persistent catalog that integrates natively with EMR, Athena, and Redshift Spectrum.
The existing Hive metadata is migrated into the Glue Data Catalog, which EMR then uses as its metastore.
This removes the need to manage an external Hive metastore database.
It is cost-effective and serverless: Glue is pay-per-use, with no database instances to run.
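As a rough illustration of how EMR is pointed at the Glue Data Catalog, the sketch below builds the configuration classifications that set the Hive metastore client factory to the Glue implementation, for both Hive and Spark SQL. The helper name `glue_catalog_configurations` is hypothetical; the property name and factory class are the ones documented for EMR, but check them against the EMR release in use.

```python
import json

# Client factory class that makes Hive (and Spark SQL) on EMR talk to the
# AWS Glue Data Catalog instead of a MySQL-backed metastore.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations() -> list[dict]:
    """Return EMR configuration classifications for Hive and Spark."""
    props = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    return [
        {"Classification": "hive-site", "Properties": dict(props)},
        {"Classification": "spark-hive-site", "Properties": dict(props)},
    ]


emr_configurations = glue_catalog_configurations()
print(json.dumps(emr_configurations, indent=2))
```

The resulting list could be supplied as the cluster's configurations at launch, for example through the `Configurations` parameter of boto3's `run_job_flow` or the `--configurations` option of `aws emr create-cluster`.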

C. Configure an external Hive metastore in Amazon EMR. Migrate the existing Hive metastore into Amazon EMR. Use Amazon Aurora MySQL to store the catalog.

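Aurora MySQL is persistent and highly available, but it is not serverless in the same sense as the Glue Data Catalog: there is still a database cluster to provision, size, secure, and pay for.
This adds cost and operational overhead compared with Glue.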
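For contrast, this is roughly what the external-metastore route asks you to maintain. The sketch builds a `hive-site` classification with the JDBC connection properties Hive needs to reach a MySQL-compatible database. The helper name, endpoint, database name, and credentials are all hypothetical placeholders, and the driver class is the one commonly shown for EMR; treat the whole block as an assumption to verify rather than a reference configuration.

```python
def external_metastore_configuration(
    host: str,
    user: str,
    password: str,
    database: str = "hive",
    port: int = 3306,
) -> dict:
    """Return a hive-site classification for a MySQL-compatible metastore DB."""
    jdbc_url = (
        f"jdbc:mysql://{host}:{port}/{database}?createDatabaseIfNotExist=true"
    )
    return {
        "Classification": "hive-site",
        "Properties": {
            "javax.jdo.option.ConnectionURL": jdbc_url,
            "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
            "javax.jdo.option.ConnectionUserName": user,
            "javax.jdo.option.ConnectionPassword": password,
        },
    }


# Placeholder values only; a real setup would pull credentials from a
# secrets store rather than hard-coding them.
aurora_config = external_metastore_configuration(
    host="metastore.example.internal",
    user="hive_user",
    password="change-me",
)
print(aurora_config["Properties"]["javax.jdo.option.ConnectionURL"])
```

Every value here (endpoint, credentials, network access, database sizing) is something the team has to own, which is the overhead the Glue option avoids.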

D. Configure a new Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use the new metastore as the company’s data catalog.

EMR clusters are often ephemeral. By default the Hive metastore lives in a MySQL database on the cluster's primary node, so it is lost when the cluster shuts down unless it is backed by an external database.
This violates the persistence and serverless requirements.

Correct Answer: B

Use AWS Glue Data Catalog as the persistent, serverless solution for Hive metastore migration.

Key Takeaways

Migrating a Hive metastore to the AWS Glue Data Catalog is the recommended approach for serverless, persistent metadata management.
Avoid managing an external database for the Hive metastore unless you have a strong operational requirement.
Glue integrates natively with EMR and other AWS analytics services, reducing cost and complexity.
