Accessing Azure Data Lake through Databricks: Authentication Methods Explained
Azure Data Lake (ADLS) is a powerful, scalable solution for handling vast amounts of data. When accessing ADLS through Databricks, different authentication methods can be employed depending on the security requirements and use cases. In this article, we'll explore four major methods of authenticating Databricks to ADLS: Access Keys, SAS Tokens, Azure Service Principals, and Unity Catalog. We'll demonstrate both session-scoped and cluster-scoped configurations, providing practical Spark configuration and code examples for each.
1. Accessing ADLS via Access Keys
Access keys grant full, account-level access to the storage account, similar to a superuser credential. The method is simple but coarse-grained: anyone holding the key can read and write everything in the account, so keys should be stored in Azure Key Vault or a Databricks secret scope rather than hard-coded in notebooks.
Steps:
Note: Replace placeholders like <your-storage-account-name> and <your-access-key> with the actual values from your setup.
Session-Scoped Access with Access Keys
In this approach, we configure the access key directly in a notebook. This key is valid only for the current session. The access key can be obtained from the Azure portal under the "Access keys" section of your storage account.
# Step 1: Define storage account details
storage_account_name = "<your-storage-account-name>"
container_name = "<your-container-name>"
access_key = "<your-access-key>"
# Step 2: Set Spark configuration with the access key
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net", access_key)
# Step 3: Define the file path (using ABFSS protocol for secure access)
file_path = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/<your-file-path>"
# Step 4: Use dbutils to list files in the container (optional)
display(dbutils.fs.ls(f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/"))
# Step 5: Read data from the Azure Data Lake file using Spark
df = spark.read.format("csv").option("header", "true").load(file_path)
# Step 6: Show data from the DataFrame
df.show()
Cluster-Scoped Access with Access Keys
Cluster-scoped configuration sets the access key at the cluster level, making it available for every notebook attached to the cluster. This is done by editing the cluster configuration:
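A minimal sketch of the cluster's Spark configuration entry (it mirrors the session-scoped key; the secret reference assumes the access key is stored in a Databricks secret scope, and the scope/key names are placeholders):
spark.hadoop.fs.azure.account.key.<storage-account>.dfs.core.windows.net {{secrets/<secret-scope>/<access-key-name>}}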
Once the cluster is restarted, any notebook attached to the cluster can access the data without specifying the key in the notebook.
2. Accessing ADLS via SAS Tokens
Shared Access Signatures (SAS) provide more granular access control than access keys: you can scope access to specific resources (such as a single container or blob) and specific permissions (e.g., read-only), with a defined expiration time.
Steps:
Note: Replace placeholders like <your-storage-account-name>, <your-container-name>, and <your-sas-token> with the actual values from your setup.
Session-Scoped Access with SAS Tokens
First, generate a SAS token in the Azure portal by navigating to your storage account, selecting the container, and generating a SAS token. You can then configure the SAS token in the notebook session:
# Step 1: Define storage account and container details
storage_account_name = "<your-storage-account-name>"
container_name = "<your-container-name>"
sas_token = "<your-sas-token>"
# Step 2: Set Spark configuration with the SAS token
# The SAS token should not include the leading "?" when used here
spark.conf.set(f"fs.azure.account.auth.type.{storage_account_name}.dfs.core.windows.net", "SAS")
spark.conf.set(f"fs.azure.sas.token.provider.type.{storage_account_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set(f"fs.azure.sas.fixed.token.{storage_account_name}.dfs.core.windows.net", sas_token)
# Step 3: Define the file path (using ABFSS protocol for secure access)
file_path = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/<your-file-path>"
# Step 4: Use dbutils to list files in the container (optional)
display(dbutils.fs.ls(f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/"))
# Step 5: Read data from the Azure Data Lake file using Spark
df = spark.read.format("csv").option("header", "true").load(file_path)
# Step 6: Show data from the DataFrame
df.show()
Cluster-Scoped Access with SAS Tokens
For cluster-scoped SAS token configuration, add the following to the cluster configuration:
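A sketch of the corresponding cluster Spark configuration entries (the key names follow the ABFS fixed-SAS-token pattern; the SAS token is referenced from a Databricks secret scope, and the scope/key names are placeholders):
spark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net SAS
spark.hadoop.fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider
spark.hadoop.fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net {{secrets/<secret-scope>/<sas-token-key>}}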
Restart the cluster, and now all notebooks attached to this cluster can access the data using the SAS token without re-authentication.
3. Accessing ADLS via Azure Service Principal
Azure Service Principals provide a more secure and manageable way of authenticating to Azure services than account keys or SAS tokens. Service Principals are registered in Azure Active Directory (AAD) and can be assigned role-based access control (RBAC) roles for granular permission management.
Steps:
Note: Replace placeholders like <your-storage-account-name>, <your-client-id>, <your-client-secret>, and <your-tenant-id> with the actual values from your setup.
Session-Scoped Access with Azure Service Principal
To use a Service Principal, you need to register an application in AAD, generate a client secret, and assign the appropriate role (e.g., "Storage Blob Data Contributor") to the Service Principal.
# Step 1: Define Service Principal credentials and storage account details
storage_account_name = "<your-storage-account-name>"
client_id = "<your-client-id>" # The Application (Client) ID of the Service Principal
tenant_id = "<your-tenant-id>" # The Tenant ID or Directory ID from Azure AD
client_secret = "<your-client-secret>" # The Client Secret generated for the Service Principal
# Step 2: Set Spark configuration for authentication with Service Principal
spark.conf.set(f"fs.azure.account.auth.type.{storage_account_name}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account_name}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account_name}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account_name}.dfs.core.windows.net", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
# Step 3: Define the file path (using ABFSS protocol for secure access)
file_path = f"abfss://<container-name>@{storage_account_name}.dfs.core.windows.net/<your-file-path>"
# Step 4: Use dbutils to list files in the container (optional)
display(dbutils.fs.ls(f"abfss://<container-name>@{storage_account_name}.dfs.core.windows.net/"))
# Step 5: Read data from the Azure Data Lake file using Spark
df = spark.read.format("csv").option("header", "true").load(file_path)
# Step 6: Show data from the DataFrame
df.show()
Cluster-Scoped Access with Azure Service Principal
For cluster-scoped configuration, add the following entries to the cluster's Spark configuration (the client secret is referenced from a Databricks secret scope rather than stored in plain text):
spark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/<secret-scope>/<service-credential-key>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token
Restart the cluster, and the Service Principal credentials will be available to all notebooks attached to it.
4. Accessing ADLS via Unity Catalog
Unity Catalog in Databricks provides a unified governance solution for managing access to data across Databricks workspaces and cloud storage accounts. It supports fine-grained permissions and simplifies access control.
Unity Catalog is managed through the Databricks workspace, and you do not need to set individual Spark configurations. Once you have set up Unity Catalog and granted the necessary permissions, accessing data is straightforward.
Steps to Access Data Lake through Unity Catalog
Enable Unity Catalog in Databricks
Before accessing Azure Data Lake using Unity Catalog, you must enable and configure Unity Catalog on your Databricks workspace.
Grant Permissions for Users or Groups to Access Data
Now that Unity Catalog is enabled and your metastore is set up, grant permissions to users or groups that will be accessing the Azure Data Lake data through Unity Catalog.
-- Grant access to a user or group
GRANT SELECT ON catalog_name.schema_name.table_name TO user_or_group;
Configure Data Lake Access with Unity Catalog
In Unity Catalog, you manage access to external data (such as Azure Data Lake) via External Locations. The external location is linked to a storage account, and Unity Catalog uses the metadata to enforce access control.
Create an External Location for the Data Lake: this is where you associate the external Azure Data Lake storage with Unity Catalog.
CREATE EXTERNAL LOCATION external_location_name
URL 'abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL storage_credential_name)
COMMENT 'External location for Azure Data Lake';
Create a Storage Credential: storage credentials define how Unity Catalog authenticates to the external storage, and the credential must exist before an external location can reference it. On Azure, you typically create one using a managed identity (via a Databricks access connector) or a Service Principal.
CREATE STORAGE CREDENTIAL storage_credential_name
WITH AZURE_MANAGED_IDENTITY 'managed_identity_principal_id'
COMMENT 'Credential for Azure Managed Identity';
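Once the credential and external location exist, you can also grant file-level privileges on the external location itself; a sketch (the group name is a placeholder):
-- Allow a group to read files at this external location
GRANT READ FILES ON EXTERNAL LOCATION external_location_name TO `data_engineers`;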
Create an External Table: with the external location in place, register a table whose data lives in the lake.
CREATE TABLE catalog_name.schema_name.table_name
USING DELTA
LOCATION 'abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-path-to-delta-table>';
Access Azure Data Lake Using Unity Catalog
Once the external location and tables are set up, you can access the data stored in Azure Data Lake through Unity Catalog using SQL or DataFrames in notebooks:
SELECT * FROM catalog_name.schema_name.table_name;
df = spark.sql("SELECT * FROM catalog_name.schema_name.table_name")
df.show()
Enforce Fine-Grained Access Control
Unity Catalog allows you to define granular permissions on catalogs, schemas, tables, and columns. You can enforce row-level or column-level security policies based on users or groups.
-- Row-level security: define a boolean SQL UDF, then attach it to the table as a row filter
-- (function names here are illustrative)
CREATE FUNCTION row_filter_fn(column_name STRING)
RETURN IS_ACCOUNT_GROUP_MEMBER('user_group_name') OR column_name = 'some_value';
ALTER TABLE catalog_name.schema_name.table_name SET ROW FILTER row_filter_fn ON (column_name);
-- Column-level security: attach a masking function to a sensitive column
CREATE FUNCTION mask_fn(column_name STRING)
RETURN CASE WHEN IS_ACCOUNT_GROUP_MEMBER('user_or_group') THEN column_name ELSE '***' END;
ALTER TABLE catalog_name.schema_name.table_name ALTER COLUMN column_name SET MASK mask_fn;
Monitor and Audit Access via Unity Catalog
Unity Catalog provides built-in auditing features that allow you to track who accessed what data. This helps ensure data governance and compliance with internal policies and regulations.
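If system tables are enabled in your workspace, audit events can be queried directly with SQL; a minimal sketch against the system.access.audit table (column names may vary by Databricks release):
-- Recent audit events, newest first
SELECT event_time, user_identity.email, action_name, request_params
FROM system.access.audit
ORDER BY event_time DESC
LIMIT 100;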
Unity Catalog integrates seamlessly with Azure AD and Databricks, letting you query tables and manage data access policies from a single place with secure, enterprise-grade access management.
Conclusion
Azure Data Lake can be accessed through Databricks using various authentication methods: Access Keys, SAS Tokens, Azure Service Principals, and Unity Catalog. While Access Keys are simple to use, SAS Tokens and Service Principals provide better security and fine-grained control. Unity Catalog is the most scalable and secure option, ideal for managing access across large organizations.
When choosing an authentication method, consider the level of security and control required. Session-scoped configurations provide flexibility for individual notebooks, while cluster-scoped configurations offer streamlined access across all notebooks attached to a cluster.