Configuration for Data Sources in Databricks (Python)
Databricks and common data sources


Hi Everyone,

In this article, I will be writing about the common data sources that we use with Databricks, and I will be providing the configuration for the following data sources.

  • Azure SQL Database
  • Azure Data Lake Gen1
  • Azure Data Lake Gen2
  • Azure Blob Storage

1. Azure SQL Database

To connect to Azure SQL Database, you need to provide the JDBC hostname, port and database name, and pass the user and password in the connection properties.

Please find the below code-

# Connection details for Azure SQL Database
jdbcHostname = "*******.database.windows.net"
jdbcPort = "1433"
jdbcDatabase = "******"
properties = {
    "user": "amit",
    "password": "******"
}

# Build the SQL Server JDBC URL
url = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)

# A subquery with an alias is pushed down and executed on the SQL Server side
query_results = "(select * from dbo.EMP) as EMP"

# To read from Azure SQL Database
df = spark.read.jdbc(url=url, table=query_results, properties=properties)

display(df)
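
If you also need to load data back into Azure SQL Database, the same url and properties can be reused with the JDBC writer. A minimal sketch (dbo.EMP_COPY is just a hypothetical target table):

# Write the DataFrame back to Azure SQL Database over JDBC
# (dbo.EMP_COPY is a hypothetical table; use mode="append" to add rows instead of replacing them)
df.write.jdbc(url=url, table="dbo.EMP_COPY", mode="overwrite", properties=properties)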

2. Azure Data Lake Gen1

To connect with Azure Data Lake Gen1, you need to provide the service principal ID (client ID), the service principal key (credential) and the directory (tenant) ID.

In order to get these three values, you need to register an application under Azure Active Directory. Once you register the app, you will get all the details necessary to create your configuration.

The application (client) ID and the directory (tenant) ID are shown on the Overview page of the app registration in the Azure portal.

Similarly, once the app is registered, you can create a key (client secret) for it. This will be your service principal key. Once you click on Add, you will get a secret (key) that you should copy right away, as it won't be available after that.
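
Rather than pasting this secret directly into a notebook, you can keep it in a Databricks secret scope and read it at runtime with dbutils.secrets.get. A minimal sketch, assuming a scope named databricks-poc-scope with a key named sp-key already exists (both names are placeholders):

# Fetch the service principal key from a Databricks secret scope
# ("databricks-poc-scope" and "sp-key" are hypothetical names)
service_principal_key = dbutils.secrets.get(scope="databricks-poc-scope", key="sp-key")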

Once this is completed, we need to give this app (databrickspoc) access to our data lake. To do this, go to your data lake account and, under Access Control (IAM), click Add -> Add role assignment. Select the role (I have selected Owner in my case), search for the app you created in the previous step and click Save.


Once we have all three connection details available, we can create our configuration to connect to Azure Data Lake Gen1.


Please find the below code-

configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
           "dfs.adls.oauth2.client.id": "*****************",
           "dfs.adls.oauth2.credential": "***********",
           "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/*************/oauth2/token"}

# Mount the ADLS Gen1 folder into the Databricks file system
dbutils.fs.mount(
    source = "adl://adlsgen12020.azuredatalakestore.net/adlsgen1",
    mount_point = "/mnt/mountadlsgen1",
    extra_configs = configs)

# To read from Azure Data Lake Gen1
df = spark.read.option("header", "true").csv("/mnt/mountadlsgen1/Employee.txt")
display(df.limit(5))
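
Once the mount is created, you can confirm it or remove it again with the dbutils file system utilities. A quick sketch:

# List all existing mount points to confirm the mount succeeded
display(dbutils.fs.mounts())

# Unmount when the mount point is no longer needed
dbutils.fs.unmount("/mnt/mountadlsgen1")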

3. Azure Data Lake Gen2

To connect with Azure Data Lake Gen2, you need the same three details: the service principal ID (client ID), the service principal key (credential) and the directory (tenant) ID (refer to the steps mentioned above).

Once we have all three connection details available, we can create our configuration to connect to Azure Data Lake Gen2.


Please find the code-

serviceprincpleid = "***********************"
serviceprincplekey = "*****************************"

configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": serviceprincpleid,
           "fs.azure.account.oauth2.client.secret": serviceprincplekey,
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/*************/oauth2/token"}

# Mount the ADLS Gen2 container into the Databricks file system
dbutils.fs.mount(
    source = "abfss://source@a***.dfs.core.windows.net/",
    mount_point = "/mnt/mountadlsgen2",
    extra_configs = configs)

# To read from Azure Data Lake Gen2
df = spark.read.option("header", "true").csv("/mnt/mountadlsgen2/Employee.txt")
display(df.limit(5))
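
As an alternative to mounting, ADLS Gen2 can also be accessed directly by setting the same OAuth options on the Spark session and reading through an abfss:// path. A minimal sketch that reuses the service principal variables above (the storage account name stays masked as a***, as in the mount example):

# Configure OAuth for this storage account on the current Spark session
spark.conf.set("fs.azure.account.auth.type.a***.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.a***.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.a***.dfs.core.windows.net", serviceprincpleid)
spark.conf.set("fs.azure.account.oauth2.client.secret.a***.dfs.core.windows.net", serviceprincplekey)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.a***.dfs.core.windows.net",
               "https://login.microsoftonline.com/*************/oauth2/token")

# Read directly from the container without a mount point
df = spark.read.option("header", "true").csv("abfss://source@a***.dfs.core.windows.net/Employee.txt")
display(df.limit(5))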

4. Azure Blob Storage

To connect with Azure Blob Storage, you need a SAS (shared access signature) key. To generate one, go to your storage account, search for "Shared access signature", click on "Generate SAS and connection string" and copy the Blob service SAS URL.


Now we are good to mount the Azure Blob Storage container in Databricks.


Please find the below code-

sasURL = "https://ma***geaccount2020.blob.core.windows.net/?sv=2019-10-10&ss=bfqt&srt=sco&sp=rwdlacupx&se=2020-06-22T13:11:48Z&st=2020-06-22T05:11:48Z&spr=https&sig=***b%2FuFoO5T2JGHA***"

# The SAS token is everything from the '?' onwards in the SAS URL
sasKey = sasURL[sasURL.index('?'):]

storageAccount = "maytorageaccount2020"
containerName = "source"
mountPoint = "/mnt/azureblobstorage"

# Mount the blob container into the Databricks file system
dbutils.fs.mount(
  source = f"wasbs://{containerName}@{storageAccount}.blob.core.windows.net/",
  mount_point = mountPoint,
  extra_configs = {f"fs.azure.sas.{containerName}.{storageAccount}.blob.core.windows.net": sasKey}
)

# To read from Azure Blob Storage
df = spark.read.option("header", "true").csv("/mnt/azureblobstorage/Employee.txt")
display(df.limit(5))
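
Blob Storage can also be read without mounting by registering the SAS token on the Spark session for that container. A minimal sketch, reusing the variables defined above:

# Register the SAS token for this container on the current Spark session
spark.conf.set(f"fs.azure.sas.{containerName}.{storageAccount}.blob.core.windows.net", sasKey)

# Read directly from the container through its wasbs:// URL
df = spark.read.option("header", "true").csv(
    f"wasbs://{containerName}@{storageAccount}.blob.core.windows.net/Employee.txt")
display(df.limit(5))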

Hope this article helped you :)

Regards,

Amit
