Azure Databricks - Accessing File Data
This is part of a planned series of posts covering my thoughts and musings about Azure Databricks.
How do you access arbitrary file data in your data lake from Azure Databricks when it isn't already integrated into your Hive catalogs? There are a couple of primary methods:
The goal of this discussion is simple: safety, security, and simplicity. Here are my security concerns with each method:
If we want to use session-scoped credentials and make it easy for our users to do so, how do we do that? My running assumption is that you're using Repos and already have a secret scope set up.
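If you want to sanity-check that assumption, dbutils can list the secret scopes in your workspace and the keys inside a given scope; the scope name below is just a placeholder:

# List the secret scopes available in this workspace
print(dbutils.secrets.listScopes())

# List the secret keys stored in a particular scope (placeholder name)
print(dbutils.secrets.list("my-scope-name"))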
First, let's create a folder structure and a .py file to house our functions:
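(One hypothetical layout; the exact structure is up to you, as long as it matches the import path used later in this post, with azure.py housing the function.)

/Workspace/Repos/<repo_folder_name>/path/
    databricks/
        __init__.py
        functions/
            __init__.py
            azure.py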
Next, add some code that sets up your Spark session credentials using your existing secret scope and service principal:
from pyspark.sql import SparkSession

def set_session_scope(scope: str, client_id: str, client_secret: str, tenant_id: str, storage_account_name: str, container_name: str) -> str:
    """Connects to Azure Key Vault, authenticates, and sets the Spark session to use the specified service principal for read/write to ADLS.

    Args:
        scope: The Azure Key Vault secret scope name
        client_id: The key name of the secret for the client id
        client_secret: The key name of the secret for the client secret
        tenant_id: The key name of the secret for the tenant id
        storage_account_name: The name of the storage account resource to read/write from
        container_name: The name of the container resource in the storage account to read/write from

    Returns:
        abfs_path (str): The abfss:// path to the storage account and container.
        As a side effect, the Spark session configs get set appropriately.
    """
    spark = SparkSession.builder.getOrCreate()

    # Get a dbutils handle whether we're running on a cluster or interactively in a notebook
    try:
        from pyspark.dbutils import DBUtils
        dbutils = DBUtils(spark)
    except ImportError:
        import IPython
        dbutils = IPython.get_ipython().user_ns["dbutils"]

    # Pull the service principal credentials out of the secret scope
    client_id = dbutils.secrets.get(scope=scope, key=client_id)
    client_secret = dbutils.secrets.get(scope=scope, key=client_secret)
    tenant_id = dbutils.secrets.get(scope=scope, key=tenant_id)

    # Configure the session to authenticate to ADLS Gen2 via OAuth client credentials
    spark.conf.set(f"fs.azure.account.auth.type.{storage_account_name}.dfs.core.windows.net", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account_name}.dfs.core.windows.net", client_id)
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account_name}.dfs.core.windows.net", client_secret)
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account_name}.dfs.core.windows.net", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

    abfs_path = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/"
    return abfs_path
Now that we have our function to set our credential scope and allow us to access abfs, how do we initialize this in our downstream notebooks or DLT pipelines?
scope = "my-scope-name"
import sys, os
repo_folder_name = dbutils.secrets.get(scope=scope, key="repo_folder_name")
sys.path.append(os.path.abspath(f'/Workspace/Repos/{repo_folder_name}/path'))
from databricks.functions.azure import set_session_scope"
This gets our function into our notebook or DLT pipeline and we're ready to connect to abfs. My version of the code returns the abfss:// path as a string for ease of use:
# Set session scope and connect to abfss to read source data
client_id = "my-client-id"
client_secret = "my-client-secret"
tenant_id = "my-tenant-id"
storage_account_name = "mystorageaccount"
container_name = "mycontainer"
folder_path = ""  # you can add path/to/folder/here

abfs_path = set_session_scope(
    scope=scope,
    client_id=client_id,
    client_secret=client_secret,
    tenant_id=tenant_id,
    storage_account_name=storage_account_name,
    container_name=container_name,
)

list_of_files = dbutils.fs.ls(abfs_path + folder_path)
And voilà, we have access to our abfs file data. Access is controlled by secret scopes containing specific service principals that have read and/or write permissions to specific locations: minimized security concerns, maximized management levers, and easy to use.
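From here, reading the data is just an ordinary Spark read against the abfss path. Here's a minimal sketch, assuming the folder contains CSV files with a header row; swap the format and options for whatever your source data actually looks like:

# Read the files we just listed into a dataframe (assumes CSV with a header row)
df = (
    spark.read
    .format("csv")
    .option("header", "true")
    .load(abfs_path + folder_path)
)
display(df)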
For further reading on the topic I'd highly suggest perusing the ADLS access patterns with Databricks discussion over at https://github.com/hurtn/datalake-ADLS-access-patterns-with-Databricks