Accessing Azure Data Lake through Databricks: Authentication Methods Explained

Azure Data Lake Storage (ADLS) is a powerful, scalable solution for handling vast amounts of data. When accessing ADLS through Databricks, different authentication methods can be employed depending on the security requirements and use cases. In this article, we'll explore four major methods of authenticating Databricks to ADLS: Access Keys, SAS Tokens, Azure Service Principals, and Unity Catalog. We'll demonstrate both session-scoped and cluster-scoped configurations, providing practical Spark configuration and code examples for each.

1. Accessing ADLS via Access Keys

Access keys provide full access to the storage account, similar to a superuser. While this method is simple, it is the least secure of the options covered here, and the key should be stored in Azure Key Vault (or a Databricks secret scope) rather than hard-coded in notebooks.

Steps:

  1. Obtain Access Key: From the Azure portal, go to your Storage Account → Access Keys. Copy one of the two access keys provided.
  2. Configure Spark with Access Key: Set the access key in spark.conf to configure Databricks for accessing Azure Data Lake using the access key.
  3. Access Data: This allows you to access files in Azure Data Lake Storage as long as the access key remains valid.

Note: Replace placeholders like <your-storage-account-name> and <your-access-key> with the actual values from your setup.

Session-Scoped Access with Access Keys

In this approach, we configure the access key directly in a notebook. This key is valid only for the current session. The access key can be obtained from the Azure portal under the "Access keys" section of your storage account.

# Step 1: Define storage account details
storage_account_name = "<your-storage-account-name>"
container_name = "<your-container-name>"
access_key = "<your-access-key>"

# Step 2: Set Spark configuration with the access key
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net", access_key)

# Step 3: Define the file path (using ABFSS protocol for secure access)
file_path = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/<your-file-path>"

# Step 4: Use dbutils to list files in the container (optional)
display(dbutils.fs.ls(f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/"))

# Step 5: Read data from the Azure Data Lake file using Spark
df = spark.read.format("csv").option("header", "true").load(file_path)

# Step 6: Show data from the DataFrame
df.show()
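
To avoid hard-coding the key in notebooks, you can read it from a Databricks secret scope (for example, one backed by Azure Key Vault). The scope and secret names below are placeholders for illustration:

# Optional: fetch the access key from a secret scope instead of hard-coding it
# (the scope and key names here are placeholders - replace with your own)
access_key = dbutils.secrets.get(scope="<your-secret-scope>", key="<your-access-key-secret>")
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net", access_key)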
        

Cluster-Scoped Access with Access Keys

Cluster-scoped configuration sets the access key at the cluster level, making it available for every notebook attached to the cluster. This is done by editing the cluster configuration:
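
For example, a minimal cluster Spark configuration entry could look like the one below, assuming the access key is stored in a Databricks secret scope (the scope and secret names are placeholders; you could also paste the key directly, although a secret reference is safer):

spark.hadoop.fs.azure.account.key.<storage-account>.dfs.core.windows.net {{secrets/<secret-scope>/<access-key-name>}}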

Once the cluster is restarted, any notebook attached to the cluster can access the data without specifying the key in the notebook.

2. Accessing ADLS via SAS Tokens

Shared Access Signatures (SAS) provide more granular access control than access keys, allowing control over specific resources (like blobs) and permissions (e.g., read-only access) with a defined expiration time.

Steps:

  1. Generate SAS Token: From the Azure portal, go to your Storage Account → Containers. Generate a SAS token with the required permissions (read/write access) for a container or blob.
  2. Configure Spark with SAS Token: Set the SAS token in spark.conf to configure Databricks to use the token.
  3. Access Data: You can access the data as long as the SAS token remains valid, which is typically for a limited time.

Note: Replace placeholders like <your-storage-account-name>, <your-container-name>, and <your-sas-token> with the actual values from your setup.

Session-Scoped Access with SAS Tokens

First, generate a SAS token in the Azure portal by navigating to your storage account, selecting the container, and generating a SAS token. You can then configure the SAS token in the notebook session:

# Step 1: Define storage account and container details
storage_account_name = "<your-storage-account-name>"
container_name = "<your-container-name>"
sas_token = "<your-sas-token>"

# Step 2: Set Spark configuration with the SAS token
# The ABFS driver expects SAS authentication to go through a SAS token provider;
# the SAS token should not include the "?" at the start when used here
spark.conf.set(f"fs.azure.account.auth.type.{storage_account_name}.dfs.core.windows.net", "SAS")
spark.conf.set(f"fs.azure.sas.token.provider.type.{storage_account_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set(f"fs.azure.sas.fixed.token.{storage_account_name}.dfs.core.windows.net", sas_token)

# Step 3: Define the file path (using ABFSS protocol for secure access)
file_path = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/<your-file-path>"

# Step 4: Use dbutils to list files in the container (optional)
display(dbutils.fs.ls(f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/"))

# Step 5: Read data from the Azure Data Lake file using Spark
df = spark.read.format("csv").option("header", "true").load(file_path)

# Step 6: Show data from the DataFrame
df.show()
        

Cluster-Scoped Access with SAS Tokens

For cluster-scoped SAS token configuration, add the following to the cluster configuration:
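
A minimal sketch of such a configuration, assuming the SAS token is stored in a Databricks secret scope (the scope and key names are placeholders):

spark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net SAS
spark.hadoop.fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider
spark.hadoop.fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net {{secrets/<secret-scope>/<sas-token-key>}}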

Restart the cluster, and now all notebooks attached to this cluster can access the data using the SAS token without re-authentication.

3. Accessing ADLS via Azure Service Principal

Azure Service Principals provide one of the most secure ways of authenticating to Azure services. Service Principals are registered in Azure Active Directory (AAD, now Microsoft Entra ID) and can be assigned role-based access control (RBAC) roles for granular permission management.

Steps:

  1. Create a Service Principal: Register an application in Azure Active Directory and generate a client secret for it.
  2. Assign Role to the Service Principal: On the storage account, assign the Service Principal a role such as "Storage Blob Data Contributor".
  3. Configure Spark with Service Principal: Set up Service Principal authentication in spark.conf to configure Databricks for secure access using OAuth.
  4. Access Data: This allows authenticated access using Service Principal credentials.

Note: Replace placeholders like <your-storage-account-name>, <your-client-id>, <your-client-secret>, and <your-tenant-id> with the actual values from your setup.

Session-Scoped Access with Azure Service Principal

To use a Service Principal, you need to register an application in AAD, generate a client secret, and assign the appropriate role (e.g., "Storage Blob Data Contributor") to the Service Principal.

# Step 1: Define Service Principal credentials and storage account details
storage_account_name = "<your-storage-account-name>"
client_id = "<your-client-id>"  # The Application (Client) ID of the Service Principal
tenant_id = "<your-tenant-id>"  # The Tenant ID or Directory ID from Azure AD
client_secret = "<your-client-secret>"  # The Client Secret generated for the Service Principal

# Step 2: Set Spark configuration for authentication with Service Principal
spark.conf.set(f"fs.azure.account.auth.type.{storage_account_name}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account_name}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account_name}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account_name}.dfs.core.windows.net", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# Step 3: Define the file path (using ABFSS protocol for secure access)
file_path = f"abfss://<container-name>@{storage_account_name}.dfs.core.windows.net/<your-file-path>"

# Step 4: Use dbutils to list files in the container (optional)
display(dbutils.fs.ls(f"abfss://<container-name>@{storage_account_name}.dfs.core.windows.net/"))

# Step 5: Read data from the Azure Data Lake file using Spark
df = spark.read.format("csv").option("header", "true").load(file_path)

# Step 6: Show data from the DataFrame
df.show()
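
In practice, the client secret should also not be hard-coded in a notebook; a small sketch of reading it from a secret scope (the scope and key names are placeholders):

# Recommended: fetch the client secret from a secret scope (placeholder names)
client_secret = dbutils.secrets.get(scope="<your-secret-scope>", key="<your-client-secret-key>")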
        

Cluster-Scoped Access with Azure Service Principal

For cluster-scoped configuration, add the following to the cluster configuration:

spark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/<secret-scope>/<service-credential-key>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token

Restart the cluster, and the Service Principal credentials will be available for all notebooks attached to the cluster.

4. Accessing ADLS via Unity Catalog

Unity Catalog in Databricks provides a unified governance solution for managing access to data across Databricks workspaces and cloud storage accounts. It supports fine-grained permissions and simplifies access control.

Unity Catalog is managed through the Databricks workspace, and you do not need to set individual Spark configurations. Once you have set up Unity Catalog and granted the necessary permissions, accessing data is straightforward.

Steps to Access Data Lake through Unity Catalog

Enable Unity Catalog in Databricks

Before accessing Azure Data Lake using Unity Catalog, you must enable and configure Unity Catalog on your Databricks workspace.

  1. Create a Metastore for Unity Catalog: In the Databricks account console, create a metastore for the region where your workspaces run.
  2. Configure the Metastore with a Storage Account: Point the metastore at an ADLS container for managed data, together with the credential (for example, a managed identity) that Databricks uses to access it.
  3. Assign Unity Catalog Permissions: Attach the metastore to your workspaces and grant the appropriate privileges to the users and groups that will manage data (see the example below).
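
Once the metastore is attached to a workspace and permissions are assigned, you can create catalogs and schemas to organize your data; a short sketch using the placeholder names used throughout this article:

-- Create a catalog and a schema to hold Data Lake tables
CREATE CATALOG IF NOT EXISTS catalog_name;
CREATE SCHEMA IF NOT EXISTS catalog_name.schema_name;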

Grant Permissions for Users or Groups to Access Data

Now that Unity Catalog is enabled and your metastore is set up, grant permissions to users or groups that will be accessing the Azure Data Lake data through Unity Catalog.

  • For example, you can give a user read or write permissions on a specific schema, table, or dataset.

-- Grant access to a user or group
GRANT SELECT ON catalog_name.schema_name.table_name TO user_or_group;
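
Note that to query the table, the principal also needs access to the parent objects; a small sketch using the same placeholder names:

-- The principal also needs access to the parent catalog and schema
GRANT USE CATALOG ON CATALOG catalog_name TO user_or_group;
GRANT USE SCHEMA ON SCHEMA catalog_name.schema_name TO user_or_group;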

Configure Data Lake Access with Unity Catalog

In Unity Catalog, you manage access to external data (such as Azure Data Lake) via External Locations. The external location is linked to a storage account, and Unity Catalog uses the metadata to enforce access control.

Create External Location for Data Lake: This is where you associate the external Azure Data Lake storage with Unity Catalog.

CREATE EXTERNAL LOCATION external_location_name
URL 'abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL storage_credential_name)
COMMENT 'External location for Azure Data Lake';
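
If users should also be able to read files directly from the abfss:// path (for example with dbutils.fs or spark.read), you can additionally grant file-level privileges on the external location; a small sketch:

-- Optional: allow direct path-based reads through the external location
GRANT READ FILES ON EXTERNAL LOCATION external_location_name TO user_or_group;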

Create Storage Credential: Storage credentials define how Unity Catalog authenticates to the external location (Azure Data Lake). For Azure, you typically create these credentials using a Managed Identity (via a Databricks access connector) or a Service Principal.

CREATE STORAGE CREDENTIAL storage_credential_name
WITH AZURE_MANAGED_IDENTITY 'managed_identity_principal_id'
COMMENT 'Credential for Azure Managed Identity';

  • Alternatively, you can back the storage credential with a Service Principal instead of a Managed Identity. Unity Catalog then uses the created STORAGE CREDENTIAL to access the data.
  • Create External Table in Unity Catalog: Use Unity Catalog to register the data lake files as tables so that users can query them using SQL.

CREATE TABLE catalog_name.schema_name.table_name
USING DELTA
LOCATION 'abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-path-to-delta-table>';

Access Azure Data Lake Using Unity Catalog

Once the external location and tables are set up, you can access the data stored in Azure Data Lake through Unity Catalog using SQL or DataFrames in notebooks:

  • SQL Query: You can use SQL to query the external table created in Unity Catalog.

SELECT * FROM catalog_name.schema_name.table_name;        

  • PySpark DataFrame: Alternatively, you can read the data into a Spark DataFrame using PySpark.

df = spark.sql("SELECT * FROM catalog_name.schema_name.table_name")
df.show()
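
Equivalently, you can load the table by its three-level name (same placeholder names as above):

# Load the Unity Catalog table directly by name
df = spark.table("catalog_name.schema_name.table_name")
display(df)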

Enforce Fine-Grained Access Control

Unity Catalog allows you to define granular permissions on catalogs, schemas, tables, and columns. You can enforce row-level or column-level security policies based on users or groups.

  • Row-Level Security Example: Unity Catalog enforces row-level security by attaching a row filter function to a table.

-- Define a row filter function, then attach it to the table
CREATE FUNCTION catalog_name.schema_name.row_filter(column_name STRING)
RETURN IS_ACCOUNT_GROUP_MEMBER('user_group_name') OR column_name = 'some_value';

ALTER TABLE catalog_name.schema_name.table_name
SET ROW FILTER catalog_name.schema_name.row_filter ON (column_name);

  • Column-Level Security Example: Column-level security is enforced by applying a column mask function to a sensitive column.

-- Define a masking function, then attach it to the column
CREATE FUNCTION catalog_name.schema_name.mask_column(column_name STRING)
RETURN CASE WHEN IS_ACCOUNT_GROUP_MEMBER('user_or_group') THEN column_name ELSE '****' END;

ALTER TABLE catalog_name.schema_name.table_name
ALTER COLUMN column_name SET MASK catalog_name.schema_name.mask_column;

Monitor and Audit Access via Unity Catalog

Unity Catalog provides built-in auditing features that allow you to track who accessed what data. This helps ensure data governance and compliance with internal policies and regulations.

  • You can configure audit log delivery for your Azure Databricks workspace through Azure diagnostic settings, and review Unity Catalog activity in the audit logs or system tables (see the example query below).
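
For example, if system tables are enabled for your account, recent Unity Catalog activity can be inspected with a query along these lines (the table and column names follow the Databricks system schema; treat this as a sketch):

-- Most recent audit events (assumes system tables are enabled)
SELECT event_time, user_identity.email, action_name
FROM system.access.audit
ORDER BY event_time DESC
LIMIT 100;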

Accessing Data via Unity Catalog

Using Unity Catalog, you can query tables and manage data access policies easily, without any per-notebook or per-cluster credential configuration.

Unity Catalog integrates seamlessly with Azure AD and Databricks, offering secure, enterprise-grade access management.

Conclusion

Azure Data Lake can be accessed through Databricks using various authentication methods: Access Keys, SAS Tokens, Azure Service Principals, and Unity Catalog. While Access Keys are simple to use, SAS Tokens and Service Principals provide better security and fine-grained control. Unity Catalog is the most scalable and secure option, ideal for managing access across large organizations.

When choosing an authentication method, consider the level of security and control required. Session-scoped configurations provide flexibility for individual notebooks, while cluster-scoped configurations offer streamlined access across all notebooks attached to a cluster.

