Configuration for Data Sources in Databricks (Python)
Hi Everyone,
In this article, I will be writing about the common data sources that we use on Databricks. I will be providing the configuration for the following data sources:
- Azure SQL Database
- Azure Data Lake Gen1
- Azure Data Lake Gen2
- Azure Blob Storage
1. Azure SQL Database
To connect to Azure SQL DB, you need to provide the following details: jdbcHostname, jdbcPort, and jdbcDatabase, and in the properties you need to pass the user and password.
Please find the code below:
jdbcHostname = "*******.database.windows.net"
jdbcPort = "1433"
jdbcDatabase = "******"
properties = {
"user": "amit",
"password": "******"
}
url = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
query_results = "(select * from dbo.EMP) as EMP"
#To read from Azure SQL database
df = spark.read.jdbc(url=url, table=query_results, properties=properties)
display(df)
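The same url and properties can be reused to write a DataFrame back to Azure SQL. A minimal sketch (the target table dbo.EMP_BACKUP and the write mode are assumptions for illustration, not part of this setup):

```python
def build_jdbc_url(hostname, port, database):
    # Same URL format as the read example above.
    return "jdbc:sqlserver://{0}:{1};database={2}".format(hostname, port, database)

def write_to_sql(df, url, properties):
    # mode="overwrite" replaces the target table; use mode="append" to add rows instead.
    # Table name "dbo.EMP_BACKUP" is a hypothetical example.
    df.write.jdbc(url=url, table="dbo.EMP_BACKUP", mode="overwrite", properties=properties)
```

In the notebook you would call `write_to_sql(df, url, properties)` after building `url` exactly as shown above.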
2. Azure Data Lake Gen1
To connect to Azure Data Lake Gen1, you need to provide the following details: the service principal ID (client ID), the service principal key (credential), and the directory (tenant) ID.
To get the service principal ID, service principal key, and directory ID, you need to register an application under Azure Active Directory. Once you register an app, you will get all the necessary details to create your configuration. The service principal ID (application ID) and the directory ID are shown on the app's overview page.
Similarly, once the app is registered, you can create a key for it; this will be your service principal key. Once you click on Add, you will get a secret (key) that you should copy right away, as it won't be available after that.
Once this is completed, we need to give this app (databrickspoc) access to our data lake. To achieve this, go to your data lake account and, under Access Control (IAM), click on Add -> Add role assignment. Select the role (I selected Owner in my case), search for the app you created in the previous step, and click on Save.
Once all three connection details are available, we can create our configuration to connect to Azure Data Lake Gen1.
Please find the code below:
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
"dfs.adls.oauth2.client.id": "*****************",
"dfs.adls.oauth2.credential": "***********",
"dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/*************/oauth2/token"}
dbutils.fs.mount(
source = "adl://adlsgen12020.azuredatalakestore.net/adlsgen1",
mount_point = "/mnt/mountadlsgen1",
extra_configs = configs)
#To read from Azure Data lake Gen1
df = spark.read.option("header", "true").csv("/mnt/mountadlsgen1/Employee.txt").limit(5)
display(df)
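One caveat with mounting: calling dbutils.fs.mount against a mount point that is already mounted raises an error, so it is worth checking dbutils.fs.mounts() first. A minimal sketch (the helper name and the idea of passing dbutils.fs in as a parameter are my own, not from this article):

```python
def mount_if_absent(fs, source, mount_point, configs):
    """Mount source at mount_point only if nothing is mounted there yet.

    fs is expected to behave like dbutils.fs: mounts() returns entries
    with a mountPoint attribute, and mount() performs the mount.
    Returns True if a new mount was created, False if one already existed.
    """
    if any(m.mountPoint == mount_point for m in fs.mounts()):
        return False  # already mounted; leave it alone
    fs.mount(source=source, mount_point=mount_point, extra_configs=configs)
    return True
```

In a notebook you would call it as `mount_if_absent(dbutils.fs, "adl://adlsgen12020.azuredatalakestore.net/adlsgen1", "/mnt/mountadlsgen1", configs)`.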
3. Azure Data Lake Gen2
To connect to Azure Data Lake Gen2, you need to provide the same details as above: the service principal ID (client ID), the service principal key (credential), and the directory (tenant) ID (refer to the steps in the Gen1 section).
Once all three connection details are available, we can create our configuration to connect to Azure Data Lake Gen2.
Please find the code below:
servicePrincipalId = "***********************"
servicePrincipalKey = "*****************************"
configs = {
"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": servicePrincipalId,
"fs.azure.account.oauth2.client.secret": servicePrincipalKey,
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/*************/oauth2/token"}
dbutils.fs.mount(
source = "abfss://source@a***.dfs.core.windows.net/",
mount_point = "/mnt/mountadlsgen2",
extra_configs = configs)
#To read from Azure Data lake Gen2
df = spark.read.option("header", "true").csv("/mnt/mountadlsgen2/Employee.txt").limit(5)
display(df)
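Mounting is not the only option for Gen2: the same OAuth settings can be applied directly to the Spark session, scoped to one storage account, and then abfss:// paths can be read without a mount. A sketch of building those per-account options (the helper names and placeholder values are assumptions, not from this article):

```python
def gen2_oauth_conf(storage_account, client_id, client_secret, directory_id):
    # Per-account form of the keys: each option name is suffixed with the
    # storage account's DFS endpoint, so it only affects that account.
    suffix = storage_account + ".dfs.core.windows.net"
    return {
        "fs.azure.account.auth.type." + suffix: "OAuth",
        "fs.azure.account.oauth.provider.type." + suffix:
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id." + suffix: client_id,
        "fs.azure.account.oauth2.client.secret." + suffix: client_secret,
        "fs.azure.account.oauth2.client.endpoint." + suffix:
            "https://login.microsoftonline.com/" + directory_id + "/oauth2/token",
    }

def apply_conf(spark, conf):
    # Set every option on the session; after this, paths like
    # abfss://<container>@<account>.dfs.core.windows.net/... are readable directly.
    for key, value in conf.items():
        spark.conf.set(key, value)
```

After `apply_conf(spark, gen2_oauth_conf(...))`, something like `spark.read.option("header", "true").csv("abfss://source@<account>.dfs.core.windows.net/Employee.txt")` should work with no mount point involved.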
4. Azure Blob Storage
To connect to Azure Blob Storage, you need to provide a SAS key. To generate one, go to your Storage Account, search for "Shared access signature", click on Generate SAS and connection string, and copy the Blob service SAS URL.
Now we are ready to set up Azure Blob Storage in Databricks.
Please find the code below:
sasURL = "https://ma***geaccount2020.blob.core.windows.net/?sv=2019-10-10&ss=bfqt&srt=sco&sp=rwdlacupx&se=2020-06-22T13:11:48Z&st=2020-06-22T05:11:48Z&spr=https&sig=***b%2FuFoO5T2JGHA***"
sasKey = sasURL[sasURL.index('?'): len(sasURL)]
storageAccount = "maytorageaccount2020"
containerName = "source"
mountPoint = "/mnt/azureblobstorage"
dbutils.fs.mount(
source = f"wasbs://{containerName}@{storageAccount}.blob.core.windows.net/",
mount_point = mountPoint,
extra_configs = {f"fs.azure.sas.{containerName}.{storageAccount}.blob.core.windows.net": sasKey}
)
#To read from Azure blob
df = spark.read.option("header", "true").csv("/mnt/azureblobstorage/Employee.txt").limit(5)
display(df)
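The slice at `sasURL.index('?')` works, but splitting on the first '?' reads a little more clearly and fails loudly when the URL carries no token at all. A small sketch (the helper name is mine, not from this article):

```python
def sas_token_from_url(sas_url):
    """Extract the SAS token (including the leading '?') from a Blob service SAS URL."""
    base, sep, query = sas_url.partition("?")
    if not sep:
        # A SAS URL without a query string has no token to extract.
        raise ValueError("SAS URL has no '?<token>' part: " + sas_url)
    return "?" + query
```

With this helper, `sasKey = sas_token_from_url(sasURL)` replaces the index-based slice in the mount code above.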
Hope this article helped you :)
Regards,
Amit