Data Pipelining in AWS
A few days back, I was reading articles on AWS Data Pipeline. I found a few, but most of them provided insights using the CLI only.
So, let's build a simple data pipeline using the AWS console.
Scope: Load data from RDS to S3, with some data manipulation using SQL
Input : RDS
Output : S3
Extraction/Transformation plan: Using SQL queries.
Step 1: Go to data pipeline from AWS services and create a new pipeline.
Note: Data Pipeline is currently available in a few regions only, so select an appropriate one.
Step 2: Fill in all the necessary details, like the name, scheduling information, etc.
Note: For "Source" you can choose either "Build using a template" or "Build using Architect".
AWS provides a few templates; in our case we will build our own using Architect.
Step 3: Click "Edit in Architect" and it will open a window where we can design our workflow.
Step 4: From the "Add" button, include a Copy activity.
Step 5: Go to Activities, and there you can see your newly created Copy activity.
In "Output", give the output data node name or create a new data node. Do the same for "Input".
Step 6: Go to the data nodes and select S3DataNode as the "Type" for the output node. From "Add an optional field", select "Directory Path" and give your S3 path (s3://<yourS3bucketname>).
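Under the hood, each node you configure in Architect becomes an object in the pipeline definition. A minimal sketch of what the exported S3 output node might look like ("S3OutputNode" is a hypothetical id, and the bucket path is the same placeholder as above):

```python
import json

# Sketch of the S3 output data node as it might appear in an exported
# pipeline definition; "S3OutputNode" is a hypothetical id and the
# bucket path is a placeholder.
s3_output_node = {
    "id": "S3OutputNode",
    "type": "S3DataNode",
    "directoryPath": "s3://<yourS3bucketname>",
}

print(json.dumps(s3_output_node, indent=2))
```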
For the input data node, select SqlDataNode. In the "Table" text box, provide your table name.
From "Add an optional field", select "Select Query" and provide your SQL script here.
Note: You can join multiple tables in the select query. In the "Table" text box, there is no need to provide all the table names; provide just one of the joined tables.
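To make the join note concrete, here is a sketch of an input SqlDataNode. The table and column names are invented for illustration; only one of the joined tables ("orders") appears in the "table" field, while the other is referenced inside the query itself:

```python
# Illustrative SqlDataNode: "orders" and "customers" are made-up tables.
# Only one of the joined tables goes in the "table" field; the other is
# referenced inside the select query itself.
sql_input_node = {
    "id": "SqlInputNode",
    "type": "SqlDataNode",
    "table": "orders",
    "selectQuery": (
        "SELECT o.order_id, o.amount, c.customer_name "
        "FROM orders o JOIN customers c ON o.customer_id = c.customer_id"
    ),
}

print(sql_input_node["selectQuery"])
```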
Now click "Add an optional field" for the input data node and select "Database".
Select "Create new database". A new database will be created under "Others".
Go to "Others" and fill in all the required fields for the database instance. You can also provide some additional optional fields, like the database name.
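The database object you fill in under "Others" might look roughly like this. Every value is a placeholder, and the exact field names are assumptions based on the RdsDatabase object type:

```python
# Sketch of an RdsDatabase object; every value here is a placeholder.
# The leading '*' on "*password" marks the field as encrypted in
# Data Pipeline definitions.
rds_database = {
    "id": "RdsDb",
    "type": "RdsDatabase",
    "rdsInstanceId": "my-rds-instance",
    "username": "admin",
    "*password": "change-me",
    "databaseName": "mydb",   # optional field
}

print(sorted(rds_database))
```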
Step 7: You need to provide the EC2/EMR instance on which your pipeline will run.
Go to Activities, where your Copy activity is present, and click "Add an optional field". Click "Runs On" and select a resource or create a new resource.
A new resource will be created under "Resources". Select an EC2 instance or EMR cluster accordingly.
Note: This will create a dynamic instance for the job, which will get terminated after the job is completed. You can also provide your own instance on which the data pipeline will run.
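As a sketch, the dynamically created resource and its link to the Copy activity could look like this; the instance type and timeout are illustrative values, not recommendations:

```python
# Hypothetical Ec2Resource: the job runs on this instance, and
# "terminateAfter" ensures it is shut down once the job completes.
ec2_resource = {
    "id": "Ec2Instance",
    "type": "Ec2Resource",
    "instanceType": "t1.micro",   # placeholder instance type
    "terminateAfter": "1 Hour",   # placeholder timeout
}

# The Copy activity points at the resource via its "Runs On" field.
copy_activity = {
    "id": "RdsToS3Copy",
    "type": "CopyActivity",
    "runsOn": {"ref": "Ec2Instance"},
}

print(copy_activity["runsOn"])
```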
Your data pipeline is ready. Save it and activate it according to your requirements.
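Putting the steps above together, the saved pipeline definition is essentially one JSON document connecting all of these objects. Here is a minimal, illustrative sketch; every id and value is a placeholder, not the exact output of the console:

```python
import json

# Minimal illustrative pipeline definition tying the steps together:
# an RDS database, a SQL input node, an S3 output node, an EC2 resource,
# and the copy activity that connects them. All ids and values are
# placeholders.
pipeline_definition = {
    "objects": [
        {"id": "RdsDb", "type": "RdsDatabase",
         "rdsInstanceId": "my-rds-instance", "username": "admin"},
        {"id": "SqlInput", "type": "SqlDataNode",
         "table": "orders", "database": {"ref": "RdsDb"}},
        {"id": "S3Output", "type": "S3DataNode",
         "directoryPath": "s3://<yourS3bucketname>"},
        {"id": "Ec2Instance", "type": "Ec2Resource",
         "terminateAfter": "1 Hour"},
        {"id": "RdsToS3Copy", "type": "CopyActivity",
         "input": {"ref": "SqlInput"}, "output": {"ref": "S3Output"},
         "runsOn": {"ref": "Ec2Instance"}},
    ]
}

print(json.dumps(pipeline_definition, indent=2))
```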
-Arpit Goel