Configuration for Data Sources in Databricks (Python)
Hi Everyone,
In this article, I will be writing about the common data sources that we use on Databricks. I will be providing the configuration for the following data sources:
- Azure SQL Database
- Azure Data Lake Gen1
- Azure Data Lake Gen2
- Azure Blob Storage
1. Azure SQL Database
To connect to Azure SQL DB, you need to provide the following details: jdbcHostname, jdbcPort, and jdbcDatabase, and in the properties you need to pass the user and password.
Please find the code below:
jdbcHostname = "*******.database.windows.net"
jdbcPort = "1433"
jdbcDatabase = "******"
properties = {
"user": "amit",
"password": "******"
}
url = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
query_results = "(select * from dbo.EMP) as EMP"
#To read from Azure SQL database
df = spark.read.jdbc(url=url, table=query_results, properties=properties)
display(df)
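The same url and properties can be reused to write a DataFrame back to Azure SQL. A minimal sketch (the target table dbo.EMP_BACKUP and the write mode are assumptions for illustration, not part of this setup):

```python
def build_jdbc_url(hostname, port, database):
    # Same URL format as the read example above.
    return "jdbc:sqlserver://{0}:{1};database={2}".format(hostname, port, database)

def write_to_sql(df, url, properties):
    # mode="overwrite" replaces the target table; use mode="append" to add rows instead.
    # Table name "dbo.EMP_BACKUP" is a hypothetical example.
    df.write.jdbc(url=url, table="dbo.EMP_BACKUP", mode="overwrite", properties=properties)
```

In the notebook you would call `write_to_sql(df, url, properties)` after building `url` exactly as shown above.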
2. Azure Data Lake Gen1
To connect to Azure Data Lake Gen1, you need to provide the following details: the service principal ID (client ID), the service principal key (credential), and the directory (tenant) ID.
To get the service principal ID, service principal key, and directory ID, you need to register an application under Azure Active Directory. Once you register an app, you will get all the necessary details to create your configuration. The service principal ID (application ID) and the directory ID are shown on the app's overview page.
Similarly, once the app is registered, you can create a key for it; this will be your service principal key. Once you click on Add, you will get a secret (key) that you should copy right away, as it won't be available after that.
Once this is completed, we need to give this app (databrickspoc) access to our data lake. To achieve this, go to your data lake account and, under Access Control (IAM), click on Add -> Add role assignment. Select the role (I selected Owner in my case), search for the app you created in the previous step, and click on Save.
Once all three connection details are available, we can create our configuration to connect to Azure Data Lake Gen1.
Please find the code below:
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
"dfs.adls.oauth2.client.id": "*****************",
"dfs.adls.oauth2.credential": "***********",
"dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/*************/oauth2/token"}
dbutils.fs.mount(
source = "adl://adlsgen12020.azuredatalakestore.net/adlsgen1",
mount_point = "/mnt/mountadlsgen1",
extra_configs = configs)
#To read from Azure Data lake Gen1
df = spark.read.option("header", "true").csv("/mnt/mountadlsgen1/Employee.txt").limit(5)
display(df)
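One caveat with mounting: calling dbutils.fs.mount against a mount point that is already mounted raises an error, so it is worth checking dbutils.fs.mounts() first. A minimal sketch (the helper name and the idea of passing dbutils.fs in as a parameter are my own, not from this article):

```python
def mount_if_absent(fs, source, mount_point, configs):
    """Mount source at mount_point only if nothing is mounted there yet.

    fs is expected to behave like dbutils.fs: mounts() returns entries
    with a mountPoint attribute, and mount() performs the mount.
    Returns True if a new mount was created, False if one already existed.
    """
    if any(m.mountPoint == mount_point for m in fs.mounts()):
        return False  # already mounted; leave it alone
    fs.mount(source=source, mount_point=mount_point, extra_configs=configs)
    return True
```

In a notebook you would call it as `mount_if_absent(dbutils.fs, "adl://adlsgen12020.azuredatalakestore.net/adlsgen1", "/mnt/mountadlsgen1", configs)`.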
3. Azure Data Lake Gen2
To connect to Azure Data Lake Gen2, you need to provide the same details as above: the service principal ID (client ID), the service principal key (credential), and the directory (tenant) ID (refer to the steps in the Gen1 section).
Once all three connection details are available, we can create our configuration to connect to Azure Data Lake Gen2.
Please find the code below:
servicePrincipalId = "***********************"
servicePrincipalKey = "*****************************"
configs = {
"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": servicePrincipalId,
"fs.azure.account.oauth2.client.secret": servicePrincipalKey,
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/*************/oauth2/token"}
dbutils.fs.mount(
source = "abfss://source@a***.dfs.core.windows.net/",
mount_point = "/mnt/mountadlsgen2",
extra_configs = configs)
#To read from Azure Data lake Gen2
df = spark.read.option("header", "true").csv("/mnt/mountadlsgen2/Employee.txt").limit(5)
display(df)
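Mounting is not the only option for Gen2: the same OAuth settings can be applied directly to the Spark session, scoped to one storage account, and then abfss:// paths can be read without a mount. A sketch of building those per-account options (the helper names and placeholder values are assumptions, not from this article):

```python
def gen2_oauth_conf(storage_account, client_id, client_secret, directory_id):
    # Per-account form of the keys: each option name is suffixed with the
    # storage account's DFS endpoint, so it only affects that account.
    suffix = storage_account + ".dfs.core.windows.net"
    return {
        "fs.azure.account.auth.type." + suffix: "OAuth",
        "fs.azure.account.oauth.provider.type." + suffix:
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id." + suffix: client_id,
        "fs.azure.account.oauth2.client.secret." + suffix: client_secret,
        "fs.azure.account.oauth2.client.endpoint." + suffix:
            "https://login.microsoftonline.com/" + directory_id + "/oauth2/token",
    }

def apply_conf(spark, conf):
    # Set every option on the session; after this, paths like
    # abfss://<container>@<account>.dfs.core.windows.net/... are readable directly.
    for key, value in conf.items():
        spark.conf.set(key, value)
```

After `apply_conf(spark, gen2_oauth_conf(...))`, something like `spark.read.option("header", "true").csv("abfss://source@<account>.dfs.core.windows.net/Employee.txt")` should work with no mount point involved.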
4. Azure Blob Storage
To connect to Azure Blob Storage, you need to provide a SAS key. To generate one, go to your Storage Account, search for "Shared access signature", click on Generate SAS and connection string, and copy the Blob service SAS URL.
Now we are ready to set up Azure Blob Storage in Databricks.
Please find the code below:
sasURL = "https://ma***geaccount2020.blob.core.windows.net/?sv=2019-10-10&ss=bfqt&srt=sco&sp=rwdlacupx&se=2020-06-22T13:11:48Z&st=2020-06-22T05:11:48Z&spr=https&sig=***b%2FuFoO5T2JGHA***"
sasKey = sasURL[sasURL.index('?'): len(sasURL)]
storageAccount = "maytorageaccount2020"
containerName = "source"
mountPoint = "/mnt/azureblobstorage"
dbutils.fs.mount(
source = f"wasbs://{containerName}@{storageAccount}.blob.core.windows.net/",
mount_point = mountPoint,
extra_configs = {f"fs.azure.sas.{containerName}.{storageAccount}.blob.core.windows.net": sasKey}
)
#To read from Azure blob
df = spark.read.option("header", "true").csv("/mnt/azureblobstorage/Employee.txt").limit(5)
display(df)
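The slice at `sasURL.index('?')` works, but splitting on the first '?' reads a little more clearly and fails loudly when the URL carries no token at all. A small sketch (the helper name is mine, not from this article):

```python
def sas_token_from_url(sas_url):
    """Extract the SAS token (including the leading '?') from a Blob service SAS URL."""
    base, sep, query = sas_url.partition("?")
    if not sep:
        # A SAS URL without a query string has no token to extract.
        raise ValueError("SAS URL has no '?<token>' part: " + sas_url)
    return "?" + query
```

With this helper, `sasKey = sas_token_from_url(sasURL)` replaces the index-based slice in the mount code above.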
Hope this article helped you :)
Regards,
Amit