Data Engineering Certification Prep Series – Tip #11

Migrating Hive Metastore to AWS Glue Data Catalog for a Serverless Future

Problem

A company is planning to migrate its on-premises Apache Hadoop clusters to Amazon EMR. Along with this, the company needs to migrate its Hive metastore, which is currently stored on-premises. The new solution must be:

  • Persistent (metadata should not be lost when clusters shut down)
  • Serverless (no operational overhead of managing databases)
  • Cost-effective

Options

A. Use AWS DMS to migrate the Hive metastore into Amazon S3. Configure AWS Glue Data Catalog to scan Amazon S3.

B. Configure a Hive metastore in Amazon EMR. Migrate the existing Hive metastore into Amazon EMR. Use AWS Glue Data Catalog to store the company’s external data catalog.

C. Configure an external Hive metastore in Amazon EMR. Migrate the existing Hive metastore into Amazon EMR. Use Amazon Aurora MySQL to store the catalog.

D. Configure a new Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use the new metastore as the company’s data catalog.

Options Analysis

A. Use AWS DMS to migrate the Hive metastore into Amazon S3. Configure AWS Glue Data Catalog to scan Amazon S3.

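  • The Hive metastore is not a set of data files; it is metadata held in a relational database (typically MySQL or PostgreSQL).
  • Copying those database tables into S3 does not recreate the Hive table definitions in Glue. A Glue scan of that location would catalog the exported metastore rows as if they were a dataset.
  • Glue needs the table definitions themselves (columns, partition keys, storage location, SerDe), not a raw dump of the metastore database.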
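To make the last point concrete, here is a minimal Python sketch of what a table definition looks like in the shape Glue expects (a `TableInput` structure). The input dictionary layout, the `orders` table, and the bucket name are hypothetical and only for illustration; a real migration would read these values from the existing metastore.

```python
import json


def hive_table_to_glue_input(hive_table: dict) -> dict:
    """Map a simplified Hive table description to a Glue TableInput dict."""

    def to_columns(cols):
        # Glue describes each column as a {"Name": ..., "Type": ...} pair.
        return [{"Name": name, "Type": hive_type} for name, hive_type in cols]

    return {
        "Name": hive_table["name"],
        "TableType": hive_table.get("table_type", "EXTERNAL_TABLE"),
        "PartitionKeys": to_columns(hive_table.get("partition_keys", [])),
        "StorageDescriptor": {
            "Columns": to_columns(hive_table["columns"]),
            "Location": hive_table["location"],
            "InputFormat": hive_table["input_format"],
            "OutputFormat": hive_table["output_format"],
            "SerdeInfo": {"SerializationLibrary": hive_table["serde"]},
        },
    }


# Hypothetical table, used only to show the shape of the metadata.
example_hive_table = {
    "name": "orders",
    "columns": [("order_id", "bigint"), ("amount", "double")],
    "partition_keys": [("order_date", "string")],
    "location": "s3://example-bucket/warehouse/orders/",
    "input_format": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
    "output_format": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
    "serde": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
}

table_input = hive_table_to_glue_input(example_hive_table)
print(json.dumps(table_input, indent=2))
```

With boto3, a structure like this could be passed as `TableInput` to the Glue client's `create_table` call, along with a `DatabaseName`. A DMS export of the metastore tables to S3 produces nothing in this shape.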

B. Configure a Hive metastore in Amazon EMR. Migrate the existing Hive metastore into Amazon EMR. Use AWS Glue Data Catalog to store the company’s external data catalog.

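  • AWS Glue Data Catalog is a serverless, persistent, Hive-compatible metadata store that integrates natively with Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum.
  • The existing Hive metadata is migrated into the Glue Data Catalog, and Hive on EMR is configured to use it as its metastore.
  • There is no metastore database to provision, patch, or back up, and the metadata outlives any individual cluster.
  • Cost-effective: the Data Catalog is billed by usage (objects stored and requests made) rather than as an always-on database.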
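As a rough sketch of how an EMR cluster is pointed at the Glue Data Catalog, the snippet below builds the configuration classifications that set the metastore client factory for Hive and Spark. The function name is my own; the classification names and the factory class are the ones documented for Amazon EMR, but check them against the EMR release you use.

```python
import json

# Client factory class that makes Hive (and Spark SQL) on EMR use Glue as the metastore.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations() -> list:
    """Return EMR configuration objects that select the Glue Data Catalog."""
    properties = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    return [
        {"Classification": "hive-site", "Properties": dict(properties)},
        {"Classification": "spark-hive-site", "Properties": dict(properties)},
    ]


print(json.dumps(glue_catalog_configurations(), indent=2))
```

The printed JSON can be saved to a file and supplied through the `--configurations` option of `aws emr create-cluster`, or the list can be passed as `Configurations` to boto3's `run_job_flow`. No database endpoint or credentials appear anywhere in it, which is the practical meaning of "serverless" here.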

C. Configure an external Hive metastore in Amazon EMR. Migrate the existing Hive metastore into Amazon EMR. Use Amazon Aurora MySQL to store the catalog.

  • Aurora MySQL is persistent and highly available, but it is still a database you must provision, size, secure, and maintain, so it is not serverless in the way Glue Data Catalog is.
  • It adds cost and operational overhead compared with Glue.
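For contrast, an external metastore on Aurora MySQL needs connection details in the cluster's `hive-site` configuration. The sketch below assumes a hypothetical endpoint, database name, and user; the `javax.jdo.option.*` property names are standard Hive settings, and the MariaDB driver class follows the example in the EMR documentation, so verify it for your release.

```python
import json


def external_metastore_configuration(host: str, database: str,
                                     user: str, password: str) -> list:
    """Return an EMR hive-site configuration for a MySQL-compatible metastore."""
    jdbc_url = f"jdbc:mysql://{host}:3306/{database}?createDatabaseIfNotExist=true"
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "javax.jdo.option.ConnectionURL": jdbc_url,
                "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
                "javax.jdo.option.ConnectionUserName": user,
                "javax.jdo.option.ConnectionPassword": password,
            },
        }
    ]


# Hypothetical values; in practice the password should come from a secret store.
config = external_metastore_configuration(
    host="example-aurora-endpoint",
    database="hive",
    user="hive_user",
    password="REPLACE_ME",
)
print(json.dumps(config, indent=2))
```

Every value in this configuration is something the team has to create and look after: the database endpoint, its network access, the credentials and their rotation. That is the operational overhead this option carries and the Glue option avoids.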

D. Configure a new Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use the new metastore as the company’s data catalog.

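  • EMR clusters are often ephemeral. By default, Hive on EMR keeps its metastore in a MySQL database on the primary node, so the metadata is lost when the cluster terminates unless it is stored externally.
  • This violates both the persistence and the serverless requirements.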


Correct Answer: B

Use AWS Glue Data Catalog as the persistent, serverless solution for Hive metastore migration.


Key Takeaways

  • Migrating the Hive metastore to AWS Glue Data Catalog is the recommended approach when metadata must be persistent and serverless.
  • Avoid managing an external database for the Hive metastore unless you have a strong operational requirement for one.
  • Glue integrates natively with EMR, Athena, and Redshift Spectrum, which reduces cost and complexity.

