Create a Secure Data Lake with Azure Data Lake and Azure Databricks
The idea of a data lake approach in a modern data architecture is becoming increasingly popular. As organizations are faced with greater and greater volumes and variety of data, the idea of a storage architecture that doesn’t require complex data preparation up front is very appealing. In the past, building a data lake was challenging because it required lots of hardware and the software was difficult to install, maintain and scale.
The cloud has addressed these challenges by providing Platform as a Service (PaaS) offerings that give you all the flexibility and capability of their on-premises counterparts but put the onus of operating and managing the infrastructure largely on the cloud provider.
Secure Data Lake Architecture
I’m not going to spend time in this post reviewing a full modern data architecture. That would entail discussions of data ingestion, batch and stream processing and analytics. I’ll save those topics for a future post. Today, I want to specifically focus on securing data in the data lake. Maybe it’s out of order, but it was top of mind for me today.
Data security is very high on the list of topics that are of interest to enterprise customers when discussing a data platform strategy. GDPR, the California Consumer Privacy Act and the China Cybersecurity Law require businesses to tightly control and audit access to data. Building a secure data lake architecture is another area that was very challenging with on-premises technologies. Platform as a Service offerings in the public cloud are making it far easier.
Three technologies in Azure make building a secure data lake a snap…
Azure Data Lake Storage
Azure Storage is one of the core services of the Azure cloud. By core, I mean Azure couldn’t exist without it. It provides storage services at low cost across all Azure regions at virtually unlimited scale. Azure Storage comprises a number of different storage technologies, but at the core of it all is Blob storage. Blob storage is a kind of “catch-all” storage: you can store any sort of BLOB (Binary Large Object) in it. It doesn’t have a real directory/file structure (although we can make it look like it does), and it doesn’t have object-level security other than through the use of keys. If you have access to the key, you have access to anything secured by that key. Blob is great for general-purpose storage, archive and application storage, but it’s not a file system.
Azure Data Lake Storage (ADLS), on the other hand, sits on top of the same underlying infrastructure as Blob, but ADLS IS a true file system. It has directories, sub-directories and items just like any file system you are used to. The other thing it has is Access Control Lists (ACLs). ACLs in ADLS provide a POSIX-compliant permissions system over data stored in Azure Data Lake Storage. The credential you use to access data in the ADLS account is your Azure Active Directory credential.
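To make the POSIX model concrete: each ACL entry grants some combination of read (r), write (w) and execute (x), where execute on a directory means permission to traverse it, and reading a file requires read on the file plus traverse on every parent folder. Here's a minimal sketch of those semantics in plain Python (this is just an illustration of the rules, not the Azure SDK):

```python
# Sketch of POSIX-style ACL semantics as used by ADLS Gen2.
# "r" = read, "w" = write, "x" = execute (traverse, for directories).

def can(perms: str, action: str) -> bool:
    """Check whether an rwx string such as 'r-x' allows an action."""
    flag = {"read": "r", "write": "w", "traverse": "x"}[action]
    return flag in perms

def can_read_file(parent_dir_perms: list, file_perms: str) -> bool:
    """Reading a file requires traverse (x) on every parent directory
    in the path AND read (r) on the file itself."""
    return (all(can(p, "traverse") for p in parent_dir_perms)
            and can(file_perms, "read"))

# A user with r-x on /secure and r-- on the file can read it;
# with no permissions on an intermediate folder, they cannot.
can_read_file(["r-x", "r-x"], "r--")  # → True
can_read_file(["r-x", "---"], "r--")  # → False
```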
Azure Active Directory
When I’m asked to differentiate Azure from other clouds, one of the first things I mention is Azure Active Directory. AAD is the same Active Directory that many, many businesses use for identity and authorization on-premises extended to the cloud. With AAD, you can have a single sign-on experience from your local computer to Azure to any services secured by Azure AD. It makes complex tasks (like creating a secure data lake) far easier.
Azure Databricks
To perform any sort of processing on data in the data lake, you need some sort of compute platform. For data lake analytics, the platform of choice has been Hadoop. Microsoft has a Hadoop offering called HDInsight that also simplifies big data processing. HDInsight is very powerful and flexible and provides all of the goodness that the Hadoop platform can offer. However, it can still be challenging to administer, scale and secure.
Apache Spark has taken the lead for big data analytics. Spark can run on the Hadoop platform, but there is also a pure Platform as a Service offering for Spark known as Azure Databricks. Databricks is not a Microsoft product; Microsoft partners with Databricks to deliver their solution as a service in Azure. That means all of the overhead of administering the environment falls on Microsoft, and all you worry about is your data and analysis. One of the investments Microsoft and Databricks have made together is integration with Azure Active Directory. What this means to you: when your data analysts sign into the Databricks environment with their AAD accounts and your data is stored and secured in Azure Data Lake Storage, the users’ credentials are passed through seamlessly from one service to the other, ensuring that whatever data security rules you put in place are enforced.
Create a Secure Set of Folders
The process of creating the ADLS account and file system is outside the scope of this article. For a walkthrough of that process, please see: https://docs.microsoft.com/en-us/azure/storage/common/storage-account-create
From the Azure Portal, the easiest place to interact with your data lake is the Storage Explorer (preview) feature on the Storage Account blade. This feature provides many of the same capabilities as Azure Storage Explorer, a cross-platform application for interacting with Azure Storage. Other options are the Azure CLI, PowerShell or the Azure APIs.
You can see I have created a folder called “secure”. To create an Access Control List on this folder I can just right-click the folder and click “Manage Access”.
To grant myself access to the new folder, I just type my email address in the “Add user, group or service principal” box and click Add. Then I can assign the permissions on the folder. I also selected “Default” permissions. This automatically cascades these permissions to new objects when they are created under this folder. You obviously shouldn’t assign permissions to individual users in a production environment. Use groups to keep things tidy.
Under the first folder, I created two additional folders. “chris_has_access” inherited the permissions from the “secure” folder because of the “Default” rule. For the “chris_doesnt_have_access” folder, I changed the permissions.
The last thing I did was upload a file to each folder. To do this, I had to use the “Containers” blade of the Azure Portal. Currently, the Azure Storage Explorer (Preview) feature doesn’t support file upload. If you use the desktop version of Storage Explorer, this isn’t a problem.
I can hear you yelling at this point…WAIT A MINUTE! You created a rule that said you don’t have access to that folder, but you uploaded a file to it! That’s correct and it illustrates the difference between two types of permissions in Azure. Every Azure service is governed by Role Based Access Control (RBAC). Think of these roles as Administrator roles for the cloud. Because I have full rights on my Azure subscription, I can load data into the data lake. I can take this permission away from an administrator to further lock this down but that is out of scope for this discussion. The ACL permissions are what we’re interested in and in the next section, you’ll see how those apply to someone who would be accessing data from a platform like Databricks.
Accessing Data in the Secure Data Lake
Now that I have data in the data lake, I’m going to switch to Databricks. In Databricks, you create clusters of virtual machines to interact with data. Databricks makes this really easy by allowing you to pick from different runtime versions and virtual machine SKUs in order to create a cluster that exactly fits your needs.
There are two different cluster modes. Standard clusters support a small number of users and are best for individuals or small teams that are doing exploratory data analysis. High Concurrency clusters support many users working at the same time. They are great for larger analytics teams or to support high-concurrency BI applications.
To allow credential passthrough to Azure Data Lake Storage, all you need to do is check the "Azure Data Lake Storage Credential Passthrough" checkbox. If you choose a Standard cluster, it will be single user. If you want to share the cluster with multiple users, choose High Concurrency.
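For reference, the checkbox maps to a Spark configuration flag on the cluster, so you can enable the same thing when creating a cluster programmatically. Here's a sketch of the relevant part of a Clusters API request body (the name, runtime and VM SKU values are illustrative placeholders, not recommendations):

```python
# Sketch of a Databricks cluster spec enabling ADLS credential passthrough.
# The "spark.databricks.passthrough.enabled" conf is what the
# "Credential Passthrough" checkbox sets; other values are placeholders.
cluster_spec = {
    "cluster_name": "secure-lake-cluster",      # hypothetical name
    "spark_version": "5.5.x-scala2.11",         # example runtime version
    "node_type_id": "Standard_DS3_v2",          # example VM SKU
    "num_workers": 2,
    "spark_conf": {
        "spark.databricks.passthrough.enabled": "true",
    },
}
```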
Once your cluster is running, you’re ready to connect to the ADLS account and start analyzing data. Since everything executes in the context of the logged-in user, you never need to include passwords, storage keys or service principal secrets in your code or persist them on your cluster.
Accessing Data
We interact with data in Databricks through notebooks. Notebooks are an interactive coding environment that let you write Python, SQL, Scala or R to explore, process and visualize data.
In the notebook below, you can see two cells. The first one enables passthrough authentication. The second cell reads all of the data from the “chris_has_access” folder into a dataframe and then displays the dataframe. You can see that I was able to read the file I placed in the folder with no problems.
Here’s the code to enable passthrough. Notice that there’s no reference to my ADLS account and no credentials or secrets.
# Configure passthrough authentication
configs = {
  "fs.azure.account.auth.type": "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class": spark.conf.get(
    "spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}
The path on the data lake uses the format, “abfss://<file system>@<account name>.dfs.core.windows.net/<path>”.
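To avoid typos in that URI, you can build it with a small helper. This is just a convenience sketch; the function name is mine, not part of any API:

```python
def abfss_path(file_system: str, account: str, path: str = "") -> str:
    """Build an ADLS Gen2 URI of the form
    abfss://<file system>@<account name>.dfs.core.windows.net/<path>."""
    return f"abfss://{file_system}@{account}.dfs.core.windows.net/{path.lstrip('/')}"

# For the folder used in the example below:
abfss_path("datalake", "ccadlsg2", "secure/chris_has_access/")
# → "abfss://datalake@ccadlsg2.dfs.core.windows.net/secure/chris_has_access/"
```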
# Access ADLS directly
df = spark.read.format("csv") \
  .load("abfss://datalake@ccadlsg2.dfs.core.windows.net/secure/chris_has_access/")
display(df)
This is fairly straightforward but a little clunky. In the next example, I’ll show you how to mount the storage so that you can treat it like an attached file system.
Mounting ADLS to Databricks
“Mounting” the storage to Databricks makes your file system look like a folder local to the cluster, even though it is still remote storage. Creating the mount takes a single command, and once it’s done, it is persistent: you can use file system commands to explore the folder structure. More importantly, you don’t have to remember “abfss://blah.blah.blah”. One note on file system commands: when you are using ADLS with passthrough security, you have to issue file system commands through the Databricks dbutils API. In the example below, I’m issuing the filesystem (“fs”) command “ls” to list the contents of the “secure” folder.
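The mount command itself looks like the sketch below. This follows the pattern in the Databricks ADLS passthrough documentation; the container and account names are the ones from my example, the mount point name is my choice, and dbutils only exists inside a Databricks notebook:

```python
# Mount the "datalake" file system with passthrough credentials.
# Runs only inside a Databricks notebook (dbutils is notebook-scoped).
configs = {
  "fs.azure.account.auth.type": "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class": spark.conf.get(
    "spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

dbutils.fs.mount(
  source = "abfss://datalake@ccadlsg2.dfs.core.windows.net/",
  mount_point = "/mnt/datalake",   # hypothetical mount point name
  extra_configs = configs)

# List the contents of the secure folder through the mount.
dbutils.fs.ls("/mnt/datalake/secure")
```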
HEY! There’s that “chris_doesnt_have_access” folder! What if you try to list what’s in that folder? Great question!
It may not be readily apparent, but drilling into the error output, you’ll find a 403 (permission denied) response: the storage service refused the request because my account doesn’t have read access to that folder.
What you just saw is credential passthrough in action. Because I am logged into Databricks with my Azure AD account and because credential passthrough to Azure Data Lake Store is enabled, I can access data I’m allowed to see and data I’m not allowed to see is kept secure from me. This has been a really simple example, but you can see how this could be adapted to an enterprise data security model that compartmentalizes data and restricts access by role. Most importantly, it uses the same security tools and capabilities you are using throughout your organization.
Conclusion
At first look, file- and folder-level access control seem like they should be table stakes for any data platform. But in highly distributed big data systems, this hasn’t always been the case, and access is often controlled with an all-or-nothing approach. In addition, the larger the platform gets, the harder it can be to secure because of the volume of data and the sheer number of machines in the infrastructure.
Azure has simplified this by integrating Azure Active Directory across services and by separating compute (Databricks) from storage (ADLS) so that they can scale independently. Building your data lake solution on Azure can cut months off of your project timeline and reduce management hours and operating costs significantly.
There are some limitations and gotchas to this approach, so make sure to read the documentation and plan accordingly.
Here are some documentation links if you’d like to know more:
Azure Data Lake Storage Gen2 Overview: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction
Access control in Azure Data Lake Storage Gen2: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-access-control
Authenticate to Azure Data Lake Storage using Azure Active Directory Credentials: https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/adls-passthrough
What is Azure Databricks: https://docs.microsoft.com/en-us/azure/azure-databricks/what-is-azure-databricks
Simplify Data Lake Access with Azure AD Credential Passthrough: https://databricks.com/blog/2019/10/24/simplify-data-lake-access-with-azure-ad-credential-passthrough.html