Azure Databricks - Accessing File Data

This is part of a (to be) series of topics covering my thoughts and musings about Azure Databricks.

How do you access arbitrary file data in your data lake from Azure Databricks when it isn't already integrated into your hive catalogs? There are a few primary methods:

  1. Mount the lake location directly to dbfs (databricks file system)
  2. Mount the lake location directly to dbfs, but secure it by only allowing clusters with Table Access Control
  3. Use an external table registered in your hive metastore
  4. Use a cluster scoped credential to access data lake storage directly
  5. Use a session scoped credential to access data lake storage directly

The goal with this discussion is simple: safety, security and simplicity. Here are my security concerns with each method:

  1. Mount points - Using mount points immediately exposes all data at the targeted mount location to every user of the workspace (a minimal mount sketch follows this list). Your only management lever, if security is an issue, is to restrict users in the Databricks workspace.
  2. Table Access Control Clusters Only - Here you're restricted to a single cluster type, and to use it you need a Premium Databricks workspace. Your management lever is promoting folks who need to access lake storage to the Admin role, which bypasses Table Access Controls. This isn't always a viable option.
  3. External Tables - This is a great option if the scope of your storage needs is small and your data is already in a readable tabular format. Much harder to iterate over.
  4. Cluster Scoped Credentials - The management lever here is a dedicated cluster for each permission group, which means extra administration overhead. This could result in a proliferation of clusters and a lot of extra Terraform code to manage.
  5. Session Scoped Credentials - My choice for managing access and the one I'll discuss further. Using service principals and secret scopes along with premade functions, we can access adls securely on the fly, as needed, without exposing access to other users. The management levers here are both secret scopes and service principals.
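For context on option 1, here's a minimal sketch of what a workspace-wide mount looks like; the scope, key names, account, container and tenant id are placeholders, not anything from this article's setup. Once created, everything under /mnt/lake is visible to every user of the workspace:

# Minimal mount sketch (illustrative names only): once mounted, /mnt/lake is readable by all workspace users
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="my-scope", key="sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="my-scope", key="sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://mycontainer@mystorageaccount.dfs.core.windows.net/",
    mount_point="/mnt/lake",
    extra_configs=configs,
)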

If we want to use session scoped credentials and make it easy for our users to do so, how do we do that? My running assumption is you're using repos and you have a secret scope already set up.
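If you want to sanity-check that assumption from a notebook first, a quick peek at what dbutils can see (swap in your own scope name):

# Confirm the secret scope is visible and holds the keys we expect
print([s.name for s in dbutils.secrets.listScopes()])
print([k.key for k in dbutils.secrets.list("my-scope-name")])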

First, let's create a folder structure and create a .py file to house our functions:

Create a folder space, then create a "File" type at your destination. Don't forget __init__.py.
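For reference, the layout I'm assuming looks roughly like this (repo and folder names are illustrative; the only hard requirement is that the package path matches the import we'll use later):

<repo-root>/
└── databricks/
    ├── __init__.py
    └── functions/
        ├── __init__.py
        └── azure.py   # houses set_session_scope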

Next, add some code to set up your spark session credentials to access your existing secret scope and service principal:

from pyspark.sql import SparkSession


def set_session_scope(
    scope: str,
    client_id: str,
    client_secret: str,
    tenant_id: str,
    storage_account_name: str,
    container_name: str,
) -> str:
    """Pulls service principal credentials from the secret scope (Azure Key Vault-backed)
    and sets the Spark session to use that service principal for read/write to adls.

    Args:
        scope: The Azure Key Vault-backed secret scope name
        client_id: The key name of the secret for the client id
        client_secret: The key name of the secret for the client secret
        tenant_id: The key name of the secret for the tenant id
        storage_account_name: The name of the storage account resource to read/write from
        container_name: The name of the container resource in the storage account to read/write from

    Returns:
        abfs_path (str): The abfss:// path to the storage account and container.
        As a side effect, the Spark session configs are set for OAuth access to the account.
    """
    spark = SparkSession.builder.getOrCreate()

    # Grab a dbutils handle whether we're running on a cluster (pyspark.dbutils) or in a notebook context
    try:
        from pyspark.dbutils import DBUtils
        dbutils = DBUtils(spark)
    except ImportError:
        import IPython
        dbutils = IPython.get_ipython().user_ns["dbutils"]

    # Resolve the key names into actual secret values
    client_id = dbutils.secrets.get(scope=scope, key=client_id)
    client_secret = dbutils.secrets.get(scope=scope, key=client_secret)
    tenant_id = dbutils.secrets.get(scope=scope, key=tenant_id)

    # Session-scoped OAuth configuration for the target storage account
    spark.conf.set(f"fs.azure.account.auth.type.{storage_account_name}.dfs.core.windows.net", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account_name}.dfs.core.windows.net", client_id)
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account_name}.dfs.core.windows.net", client_secret)
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account_name}.dfs.core.windows.net", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

    abfs_path = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/"

    return abfs_path

Now that we have a function to set our session credentials and give us access to abfs, how do we initialize it in our downstream notebooks or DLT pipelines?

scope = "my-scope-name"

import sys, os
repo_folder_name = dbutils.secrets.get(scope=scope, key="repo_folder_name")
sys.path.append(os.path.abspath(f'/Workspace/Repos/{repo_folder_name}/path'))
from databricks.functions.azure import set_session_scope

This gets our function into our notebook or DLT pipeline, and we're ready to connect to abfs. My version of the code returns the abfs path as a text string for ease of use:

# Set session scope and connect to abfss to read source data

client_id = "my-client-id"
client_secret = "my-client-secret"
tenant_id = "my-tenant-id"
storage_account_name = "mystorageaccount"
container_name = "mycontainer"
folder_path = "" # you can add path/to/folder/here

abfs_path = set_session_scope(
    scope=scope,
    client_id=client_id,
    client_secret=client_secret,
    tenant_id=tenant_id,
    storage_account_name=storage_account_name,
    container_name=container_name,
)

list_of_files = dbutils.fs.ls(abfs_path + folder_path)        
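From there, reading the data itself is a one-liner; for example, assuming the folder holds parquet files:

# Read the source data now that the session is configured (parquet is just an assumption here)
df = spark.read.format("parquet").load(abfs_path + folder_path)
display(df)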

And voila, we have access to our abfs file data. Access is controlled by secret scopes containing specific service principals that have read and/or write permissions to specific locations. Minimized security concerns, maximized management levers, easy to use.

For further reading on the topic I'd highly suggest perusing the ADLS access patterns with Databricks discussion over at https://github.com/hurtn/datalake-ADLS-access-patterns-with-Databricks
