Data Pipelining in AWS
A few days back, I was reading articles on AWS Data Pipeline. I found a few, but most of them provided insights using the CLI only.
So, let's build a simple data pipeline using the AWS console.
Scope: Load data from RDS to S3, with some data manipulation using SQL
Input : RDS
Output : S3
Extraction/Transformation plan: Using SQL queries.
Step 1: Go to data pipeline from AWS services and create a new pipeline.
Note: Data Pipeline is currently available in a few regions only, so select an appropriate one.
Step 2: Fill in all the necessary details, like the name, scheduling information, etc.
Note: For "Source" you can choose either "Build using a template" or "Build using Architect".
AWS provides a few templates; in our case we will build our own using Architect.
Step 3: Click "Edit in Architect" and it will open a window where we can design our workflow.
Step 4: From the "Add" button, include a Copy activity.
Step 5: Go to Activities, and there you can see your newly created Copy activity.
In "Output", give the output data node name or create a new data node. Do the same for "Input".
Step 6: Go to the data nodes and select S3DataNode as the "Type" for the output node. From "Add an optional field", select "Directory Path" and give your S3 path (s3://<yourS3bucketname>).
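Under the hood, each node you configure in Architect becomes an object in the pipeline definition. A minimal sketch of what the exported S3 output node might look like ("S3OutputNode" is a hypothetical id, and the bucket path is the same placeholder as above):

```python
import json

# Sketch of the S3 output data node as it might appear in an exported
# pipeline definition; "S3OutputNode" is a hypothetical id and the
# bucket path is a placeholder.
s3_output_node = {
    "id": "S3OutputNode",
    "type": "S3DataNode",
    "directoryPath": "s3://<yourS3bucketname>",
}

print(json.dumps(s3_output_node, indent=2))
```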
For the input data node, select SqlDataNode. In the "Table" text box, provide your table name.
From "Add an optional field", select "Select Query" and provide your SQL script here.
Note: You can join multiple tables in the select query. In the "Table" text box, there is no need to provide all the table names; provide just one of the joined tables.
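To make the join note concrete, here is a sketch of an input SqlDataNode. The table and column names are invented for illustration; only one of the joined tables ("orders") appears in the "table" field, while the other is referenced inside the query itself:

```python
# Illustrative SqlDataNode: "orders" and "customers" are made-up tables.
# Only one of the joined tables goes in the "table" field; the other is
# referenced inside the select query itself.
sql_input_node = {
    "id": "SqlInputNode",
    "type": "SqlDataNode",
    "table": "orders",
    "selectQuery": (
        "SELECT o.order_id, o.amount, c.customer_name "
        "FROM orders o JOIN customers c ON o.customer_id = c.customer_id"
    ),
}

print(sql_input_node["selectQuery"])
```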
Now click "Add an optional field" for the input data node and select "Database".
Select "Create new database". A new database will be created under "Others".
Go to "Others" and fill in all the required fields for the database instance. You can also provide some additional optional fields, like the database name.
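The database object you fill in under "Others" might look roughly like this. Every value is a placeholder, and the exact field names are assumptions based on the RdsDatabase object type:

```python
# Sketch of an RdsDatabase object; every value here is a placeholder.
# The leading '*' on "*password" marks the field as encrypted in
# Data Pipeline definitions.
rds_database = {
    "id": "RdsDb",
    "type": "RdsDatabase",
    "rdsInstanceId": "my-rds-instance",
    "username": "admin",
    "*password": "change-me",
    "databaseName": "mydb",   # optional field
}

print(sorted(rds_database))
```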
Step 7: You need to provide the EC2/EMR instance on which your pipeline will run.
Go to Activities, where your Copy activity is present, and click "Add an optional field". Click "Runs On" and select a resource or create a new resource.
A new resource will be created under "Resources". Select an EC2 instance or EMR cluster accordingly.
Note: This will create a dynamic instance for the job, which will get terminated after the job is completed. You can also provide your own instance on which the data pipeline will run.
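As a sketch, the dynamically created resource and its link to the Copy activity could look like this; the instance type and timeout are illustrative values, not recommendations:

```python
# Hypothetical Ec2Resource: the job runs on this instance, and
# "terminateAfter" ensures it is shut down once the job completes.
ec2_resource = {
    "id": "Ec2Instance",
    "type": "Ec2Resource",
    "instanceType": "t1.micro",   # placeholder instance type
    "terminateAfter": "1 Hour",   # placeholder timeout
}

# The Copy activity points at the resource via its "Runs On" field.
copy_activity = {
    "id": "RdsToS3Copy",
    "type": "CopyActivity",
    "runsOn": {"ref": "Ec2Instance"},
}

print(copy_activity["runsOn"])
```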
Your data pipeline is ready. Save it and activate it according to your requirements.
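Putting the steps above together, the saved pipeline definition is essentially one JSON document connecting all of these objects. Here is a minimal, illustrative sketch; every id and value is a placeholder, not the exact output of the console:

```python
import json

# Minimal illustrative pipeline definition tying the steps together:
# an RDS database, a SQL input node, an S3 output node, an EC2 resource,
# and the copy activity that connects them. All ids and values are
# placeholders.
pipeline_definition = {
    "objects": [
        {"id": "RdsDb", "type": "RdsDatabase",
         "rdsInstanceId": "my-rds-instance", "username": "admin"},
        {"id": "SqlInput", "type": "SqlDataNode",
         "table": "orders", "database": {"ref": "RdsDb"}},
        {"id": "S3Output", "type": "S3DataNode",
         "directoryPath": "s3://<yourS3bucketname>"},
        {"id": "Ec2Instance", "type": "Ec2Resource",
         "terminateAfter": "1 Hour"},
        {"id": "RdsToS3Copy", "type": "CopyActivity",
         "input": {"ref": "SqlInput"}, "output": {"ref": "S3Output"},
         "runsOn": {"ref": "Ec2Instance"}},
    ]
}

print(json.dumps(pipeline_definition, indent=2))
```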
-Arpit Goel