Fetch data from Redshift using AWS Glue
Recently I worked on fetching data from Redshift using AWS Glue and came across multiple challenges, and it was quite a learning experience. So I am writing this article with all my learnings and the challenges I faced during this project.
For big data services on AWS, especially Spark, there are two options:
- EMR
- Glue
An EMR cluster is more or less like having a Cloudera/Hortonworks distribution on your on-premise servers, except it is hosted on AWS infrastructure. Its cost is a bit high, as you will have dedicated servers that are up all the time.
Glue is serverless, so you pay only for the time you use it, and you do not have to worry about where the servers are running or how many executors and cores to use. You can just focus on your problem and leave everything else to AWS. The only tricky part is that as soon as a Glue job starts you will be billed for 10 minutes, even if your job completes before that. This happens because when a Glue job starts, a cluster is spun up which stays active for 10 minutes by default. We can decide the size of the cluster by giving the number of DPUs for the Glue job. I will write another blog with details about how DPUs work and how to select the optimal number of DPUs for your job.
Redshift is a managed data warehouse cluster on the cloud which can scale from a few hundred gigabytes to petabytes of data. I have a Redshift cluster created and need to fetch data from it to compare against the data I have in the S3 data lake and reprocess any missing or updated data.
Spark provides a direct way to connect to Redshift and get the data using JDBC. But when we do it through Glue there are some tricky things we need to handle. If we want to access data from any RDBMS like Redshift or MySQL, Glue cannot connect to these databases directly, because Redshift and MySQL always run inside a VPC and Glue needs to know the VPC details to connect to them. So we need to create a Glue connection first and then add this connection to the Glue ETL job so that it can connect to the required database and pull the data.
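For reference, this is roughly what that direct JDBC read looks like in plain Spark, outside Glue. It is only a sketch: the endpoint, database, table and credentials are placeholders, and the driver class shown assumes the Redshift JDBC driver jar is available on the classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-jdbc-read").getOrCreate()

# Placeholder endpoint, table and credentials -- replace with your own values.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:redshift://<cluster-endpoint>:5439/<dbname>")
    .option("dbtable", "<schema>.<table>")
    .option("user", "<username>")
    .option("password", "<password>")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")  # assumes the Redshift JDBC driver jar is on the classpath
    .load()
)
df.show(5)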
Let's go step by step through everything that is required to achieve this:
1. Glue connection
2. Inbound rule for Glue to connect to Redshift
3. IAM role for the Glue job to use Redshift services
4. Unload role used by the Redshift cluster to unload data from Redshift to the S3 data lake
5. Glue ETL job using the Glue connection
1. Glue Connection -- Before creating a connection we need a few details, listed below:
- JDBC url of the database
- subnet
- security group
- username
- password
The JDBC URL and subnet details can be found on the Redshift page in the AWS console. Just type Redshift in the services search and it will take you to the Redshift page. Select your cluster and you will find all the details mentioned above there.
Now, with all these details, let's create a connection in Glue. Go to the Glue page from the AWS console and you will see the connections option under Data Catalog --> Databases --> Connections. Create a new connection from here by providing the required details.
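If you prefer to script this instead of clicking through the console, here is a rough boto3 sketch. The cluster identifier, connection name, credentials, subnet, security group and availability zone below are all placeholders you would replace with your own values.

import boto3

redshift = boto3.client("redshift")
glue = boto3.client("glue")

# Look up the JDBC endpoint of the cluster (placeholder cluster identifier).
cluster = redshift.describe_clusters(ClusterIdentifier="<my-cluster>")["Clusters"][0]
endpoint = cluster["Endpoint"]
jdbc_url = "jdbc:redshift://{host}:{port}/{db}".format(
    host=endpoint["Address"], port=endpoint["Port"], db=cluster["DBName"]
)

# Create a JDBC connection in the Glue Data Catalog pointing at that cluster.
glue.create_connection(
    ConnectionInput={
        "Name": "redshift-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": "<username>",
            "PASSWORD": "<password>",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "<subnet-id>",
            "SecurityGroupIdList": ["<security-group-id>"],
            "AvailabilityZone": "<availability-zone>",
        },
    }
)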
Once the connection is created, test it. You will get the error below:
At least one security group must open all ingress ports. To limit traffic, the source security group in your inbound rule can be restricted to the same security group
This error indicates that a self-referencing inbound rule has not been added to the security group you are using. So let's add it.
2. Inbound Rule for Redshift -- As Redshift runs inside a VPC, we need to create a self-referencing inbound rule for the security group used by Redshift. We already have the details of that security group from the Redshift page. To add an inbound rule to it, go to the EC2 page, select Security Groups under Network & Security, and search for the security group name you got from the Redshift page. Another way to reach this page is to simply click on the security group from the Redshift page and it will take you there directly. At the bottom you will see four tabs: description, inbound, outbound, and tags. Click on inbound and then edit. Now add the inbound rule for the security group as per the screenshot below, giving the security group name (the same security group you are editing, because it is self-referencing) in the highlighted field.
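The same self-referencing rule can also be added with boto3 if you prefer not to use the console. This is only a sketch and the security group id below is a placeholder.

import boto3

ec2 = boto3.client("ec2")

# Allow all traffic whose source is the security group itself (self-referencing rule).
ec2.authorize_security_group_ingress(
    GroupId="<redshift-security-group-id>",
    IpPermissions=[
        {
            "IpProtocol": "-1",  # all protocols / all ports
            "UserIdGroupPairs": [{"GroupId": "<redshift-security-group-id>"}],
        }
    ],
)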
Once this rule is saved, go back to Glue connections and test the connection created in the first step; it should now be successful.
3. IAM Role -- While creating the IAM role, keep two things in mind:
- As we are connecting to an RDBMS inside a VPC, the cluster that gets created on demand will use EC2 instances from your account, which doesn't happen for a normal Glue ETL job, so we need to give access to EC2. While the Glue ETL job is running, an ENI (elastic network interface) is created, which can be seen under EC2 --> Network & Security --> Network Interfaces. Normally this network interface gets deleted automatically within about 10 minutes, but in this particular scenario, where we are connecting to a VPC, Glue has to delete it itself, so your role must have access to delete the network interfaces it has created. If you miss this action in your role, new ENIs will pile up every time your job runs. The subnet has only a limited number of IP addresses; as the ENIs are not deleted, the IPs are never released back, so you will soon run out of IPs and your connection will start failing. So you must add the delete network interface action to your IAM policy. I am pasting here a policy with all the EC2 access a Glue job will need.
{
"Effect": "Allow",
"Action": [
"ec2:DescribeSubnets",
"ec2:DescribeVpcEndpoints",
"ec2:DescribeRouteTables",
"ec2:CreateNetworkInterface",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeSecurityGroups",
"ec2:DescribeVpcAttribute",
"ec2:DescribeVpcs",
"ec2:DescribeKeyPairs",
"ec2:DescribeInstances",
"ec2:DescribeImages",
"ec2:DeleteNetworkInterface"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ec2:CreateTags",
"ec2:DeleteTags"
],
"Condition": {
"ForAllValues:StringEquals": {
"aws:TagKeys": [
"aws-glue-service-resource"
]
}
},
"Resource": [
"arn:aws:ec2:*:*:network-interface/*",
"arn:aws:ec2:*:*:security-group/*",
"arn:aws:ec2:*:*:instance/*"
]
}
- Along with EC2, your role must also have Redshift access. Here are the actions you will need for Redshift.
{
"Effect": "Allow",
"Action": [
"redshift:DescribeEventSubscriptions",
"redshift:DescribeOrderableClusterOptions",
"redshift:DescribeHsmClientCertificates",
"redshift:ViewQueriesInConsole",
"redshift:DescribeTags",
"redshift:DescribeClusterParameterGroups",
"redshift:DescribeSnapshotCopyGrants",
"redshift:DescribeReservedNodes",
"redshift:DescribeClusters",
"redshift:CancelQuerySession",
"redshift:GetClusterCredentials",
"redshift:DescribeClusterSecurityGroups",
"redshift:DeleteTags",
"redshift:DescribeTableRestoreStatus",
"redshift:DescribeEventCategories",
"redshift:DescribeClusterSubnetGroups",
"redshift:DescribeReservedNodeOfferings",
"redshift:DescribeClusterVersions",
"redshift:DescribeClusterSnapshots",
"redshift:DescribeDefaultClusterParameters",
"redshift:DescribeEvents",
"redshift:DescribeHsmConfigurations",
"redshift:PurchaseReservedNodeOffering",
"redshift:CreateTags"
],
"Resource": [
"arn:aws:redshift:*:*:cluster:*",
"arn:aws:redshift:*:*:dbgroup:*",
"arn:aws:redshift:*:*:dbuser:*",
"arn:aws:redshift:*:*:dbname:**"
]
}
- Apart from this, you can add more to the policy based on your requirements, for example if you want to see logs in CloudWatch or access DynamoDB or Athena, and of course S3 access.
4. Unload role used by the Redshift cluster to unload data from Redshift to the S3 data lake
Whenever a Redshift cluster is created, a role is assigned to it which is used to unload data. Now, we already have the IAM role created for our Glue job, with full S3 access so that it can write data to the desired S3 bucket. But your code will still fail to write the data if access to the bucket is not also given to the unload role, because this unload role is used internally by Redshift to first unload the data, and only then does the Glue role actually put the data at the target location. So never miss the correct S3 write access in the unload role. I faced this issue after launching my product to prod and had to figure it out there.
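As a sketch of the kind of S3 access the unload role needs, an inline policy like the one below can be attached to it with boto3. The role name, policy name and bucket here are placeholders, and the exact set of actions may need adjusting for your setup.

import json
import boto3

iam = boto3.client("iam")

# Grant the Redshift unload role write access to the temp/target bucket (placeholder names).
iam.put_role_policy(
    RoleName="<redshift-unload-role>",
    PolicyName="redshift-unload-s3-access",
    PolicyDocument=json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
                    "Resource": [
                        "arn:aws:s3:::<your-bucket>",
                        "arn:aws:s3:::<your-bucket>/*",
                    ],
                }
            ],
        }
    ),
)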
5. Glue ETL job using the Glue Connection
Now we are done with all the prerequisites for the Glue ETL job to connect to Redshift. Before creating the job, you need to keep the code in S3, because while creating a Glue job it needs the script to be present in S3; you can then modify the code according to your requirements. So save the code below in S3 so that you can create the Glue ETL job using this script.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import SQLContext

def main():
    sc = SparkContext()
    sqlContext = SQLContext(sc)

    # Replace the placeholder values below with your own details.
    # role_arn_for_redshift is the unload role created with the Redshift cluster (see step 4) and can be found on the Redshift page.
    # redshift_url is the JDBC URL of the particular cluster you want to access, also found on the Redshift page.
    # temp_log_path must be an s3 location and the unload role must have write access to it, as it is used to unload the data temporarily.
    redshift_url = "jdbc:redshift://<cluster-endpoint>:5439/<dbname>"
    user = "<username>"
    pwd = "<password>"
    role_arn_for_redshift = "arn:aws:iam::<account-id>:role/<unload-role>"
    table = "<schema>.<table>"
    redshift_query = "select * from {table}".format(table=table)
    temp_log_path = "s3://<bucket>/<temp-prefix>/"

    df_redshift = sqlContext.read.format("com.databricks.spark.redshift") \
        .option("url", "{url}?user={user}&password={pwd}".format(user=user, pwd=pwd, url=redshift_url)) \
        .option("aws_iam_role", role_arn_for_redshift) \
        .option("query", redshift_query) \
        .option("tempdir", temp_log_path) \
        .load()

    print('number of records from redshift for table {table} is {count}'.format(table=table, count=df_redshift.count()))
    sc.stop()

if __name__ == "__main__":
    main()
Now create an ETL job, give the path where this script is located, select the IAM role created in step 3, and click Next.
Here you need to select the connection you created in the first step by clicking the select button next to it, and then click on save and edit script.
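The same job can also be created programmatically. This is a rough boto3 sketch: the job name, script path, role and connection name are placeholders, and MaxCapacity is the number of DPUs mentioned earlier.

import boto3

glue = boto3.client("glue")

# Create the Glue ETL job using the script uploaded to S3 and the connection from step 1.
glue.create_job(
    Name="redshift-fetch-job",
    Role="<glue-job-iam-role>",  # IAM role created in step 3
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://<bucket>/<prefix>/redshift_fetch.py",
        "PythonVersion": "3",
    },
    Connections={"Connections": ["redshift-connection"]},  # connection created in step 1
    MaxCapacity=2.0,  # number of DPUs allocated to the job
)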
You are ready to test the code now. Run the Glue job to test it. Be patient while running it, as it is going to take around 10 minutes to start because it is connecting to the VPC. Once it is connected, you can keep re-running the code within that 10 minute window. If you do not run it within the 10 minute window, the connection to the VPC will be closed and the next run will again take around 10 minutes to establish it.
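You can also kick off and monitor a run from boto3 rather than the console. A minimal sketch with a placeholder job name:

import time
import boto3

glue = boto3.client("glue")

# Start a run of the job and poll its status until it finishes.
run = glue.start_job_run(JobName="redshift-fetch-job")
run_id = run["JobRunId"]

while True:
    state = glue.get_job_run(JobName="redshift-fetch-job", RunId=run_id)["JobRun"]["JobRunState"]
    print("job run state:", state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)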
I hope this article will be useful for anyone trying to do this for the first time. I found it quite difficult initially, as all the information is not present in one place. Happy sharing :)