From the course: Data Engineering on AWS: Data Cataloging, Processing, Analytics, and Visualization
Glue overview - Amazon Web Services (AWS) Tutorial
Glue overview
- [Instructor] AWS Glue is a serverless, fully managed, and cloud-optimized ETL, or extract, transform, and load, service. One of the hardest parts of any analytics or data warehousing application is setting up and maintaining a reliable ETL process. Glue helps you understand your data, suggests transformations, and generates ETL code, so you can spend less time coding.

Glue consists of a central metadata repository, known as the Glue Data Catalog. The catalog is fundamental to Glue: it contains all the metadata and sits right here in the center. Glue has a Spark ETL engine, which is completely serverless and generates Python code. Glue's flexible scheduler runs your ETL jobs and handles dependency resolution, job monitoring, and retries.

If you look at the diagram, the Glue Data Catalog sits in the middle, and it has connections, tables, settings, and transformations. From the Glue Data Catalog, you can create jobs. Jobs can use the metadata to create scripts, and these scripts can be scheduled to move data from source to target. The Glue Data Catalog can be accessed through the console or programmatically. Together, this automates the heavy lifting involved in discovering, categorizing, cleansing, enriching, and moving data.

Glue automatically discovers your data, determines your schema, and builds your data catalog. The Glue Data Catalog provides out-of-the-box integration with Amazon Athena, EMR, and Redshift Spectrum. The ETL code that Glue generates is just Python code that is entirely customizable, reusable, and portable. You can edit this code using your favorite IDE or notebook, and share it with others using GitHub. Finally, Glue is serverless. There are no resources for you to manage, and you pay only for the resources your jobs use while they run.

Now, when should we use Glue? Glue can be used to build a data warehouse to organize, cleanse, validate, and format data. You can transform, as well as move, data into your data store.
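To make the programmatic access to the Glue Data Catalog a little more concrete, here is a minimal sketch in Python using boto3. It is not part of the course material. The database name `sales_db` and the helper name `list_catalog_tables` are hypothetical, and the sketch assumes that boto3 is installed and AWS credentials are configured when no client is passed in.

```python
def list_catalog_tables(database_name, glue_client=None):
    """Return {table_name: [(column_name, column_type), ...]} for one catalog database.

    `glue_client` is optional so the function can be tried with a stub;
    by default a real boto3 Glue client is created.
    """
    if glue_client is None:
        # Imported lazily so the function can be exercised without boto3 installed.
        import boto3
        glue_client = boto3.client("glue")

    tables = {}
    # get_tables is a paginated API; the paginator follows NextToken for us.
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            # Some tables carry no StorageDescriptor or no columns.
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            tables[table["Name"]] = [(c["Name"], c.get("Type", "")) for c in columns]
    return tables
```

Calling `list_catalog_tables("sales_db")` against a real account would return the tables and column types that crawlers or jobs have registered in that (hypothetical) database.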
You can also load data from disparate sources into your data warehouse for regular reporting and analysis.

Glue can also be used to run serverless queries against your S3 data lake. Glue can catalog your S3 data, making it available for querying with Athena and Redshift Spectrum. With crawlers, your metadata stays in sync with the underlying data. Athena and Redshift Spectrum can directly query your S3 data lake with the help of the Glue Data Catalog. With Glue, you can access and analyze data through one unified interface without loading it into multiple data silos.

You can also use Glue when you want to create event-driven data pipelines. You can run your ETL jobs as soon as new data becomes available in an S3 bucket by invoking AWS Glue ETL jobs from a Lambda function. You can also register this new dataset in the Glue Data Catalog as part of your ETL jobs.

You can also use Glue to understand your data assets. You can store your data using various AWS services and still maintain a unified view of your data using the Glue Data Catalog. You can view the data catalog to quickly search and discover the datasets that you own, and maintain the relevant metadata in one central repository. The data catalog also serves as a drop-in replacement for your external Apache Hive Metastore.
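The event-driven pattern described above, where a Lambda function starts a Glue ETL job when new data lands in S3, can be sketched roughly as follows. This is an illustration under stated assumptions, not the course's own code. The job name `example-etl-job` and the job argument `--source_path` are hypothetical, and the function assumes it is wired to S3 object-created notifications.

```python
import urllib.parse

# Hypothetical Glue job name; replace with a job that exists in your account.
GLUE_JOB_NAME = "example-etl-job"


def lambda_handler(event, context, glue_client=None):
    """Start one Glue job run for every S3 object reported in the event."""
    if glue_client is None:
        # Imported lazily so the handler can be tried locally with a stub client.
        import boto3
        glue_client = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            # "--source_path" is an assumed argument that the job script would read.
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return {"job_run_ids": run_ids}
```

Lambda calls the handler with only `event` and `context`. The optional `glue_client` parameter exists so the logic can be checked without touching AWS.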