GDPR compliance using Azure Databricks

Depending on the geography, different data-protection regulations such as GDPR and CCPA apply. When designing data solutions, we must take security and compliance requirements into account and act on them. Most cloud providers follow security guidelines and keep data safe, but during data processing it sometimes becomes necessary to keep personal data on the platform. Keeping PII data safe and compliant with GDPR is a core requirement in any project.

Here are the security questions commonly asked when dealing with GDPR compliance. Before discussing “securing the data” in detail in this blog, let's briefly look at the capabilities available in Azure to secure the environment.


How can we secure the environment?

Azure keeps data safe and lets us define who should be given access. Data is always encrypted at rest, and different tools are available to enable different security capabilities in Azure. Proper authorization for the application that handles the data is the key to making the platform secure.

  • Azure Databricks is integrated with Azure Active Directory, which provides the capability to disallow any unauthorized or anonymous access.
  • Use Azure Key Vault to store secrets such as the SAS keys for ADLS Gen2 or storage accounts. These keys are needed to create mount points in the Databricks environment. Databricks provides integration with Azure Key Vault, which is the preferred way of handling secrets.
  • Isolate the environment: privatize the network using Azure Private Link. This brings the Databricks environment into your on-premises/cloud virtual network by mapping it to a private endpoint. Traffic between your virtual network and the service traverses the Microsoft backbone network, eliminating exposure to the public Internet.
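As a sketch of the Key Vault integration, assuming a Databricks secret scope named kv-scope backed by Key Vault and an illustrative storage account name, reading the SAS key and creating the mount point might look like this in a notebook (dbutils exists only inside Databricks):

```python
# Illustrative sketch -- scope, key, account, and container names are assumptions.
# Fetch the SAS key from the Key Vault-backed secret scope instead of hardcoding it.
sas_key = dbutils.secrets.get(scope="kv-scope", key="adls-sas-key")

# Mount the storage container using the secret; the key never appears in the notebook.
dbutils.fs.mount(
    source="wasbs://data@mystorageacct.blob.core.windows.net",
    mount_point="/mnt/data",
    extra_configs={
        "fs.azure.sas.data.mystorageacct.blob.core.windows.net": sas_key
    },
)
```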

How can we secure the data?

Here are a few data subject rights in GDPR –

  • Right of access
  • Right to rectification
  • Right to erasure/ right to be forgotten

The challenge with a data lake is that it holds a large amount of data with different levels of classification and sensitivity, and PII often sits alongside the other data. One of the biggest challenges is to protect this PII data while the rest stays accessible to engineers and users. We can combine different design approaches and Databricks features to secure the PII data; this addresses the “right of access” principle of GDPR. We will use a data masking approach to obfuscate the data.

To handle the “right to be forgotten” requirement, the biggest challenge is: how can we erase data from the data lake at the record level when we have a large volume of data? Databricks Delta tables provide ACID capabilities that can be used for this, and the pseudonymization technique can handle deletion of PII data from the system. Pseudonymize all customer personal information in your working tables (prior to receiving a data subject request) and, on request, delete the customer entry from the “lookup table” that maps the customer to the pseudonym, while ensuring that data in the working tables cannot be used to reconstruct the customer's identity.
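The idea can be sketched in a few lines of plain Python (table and column names here are illustrative, not from the article's notebook): working tables carry only a surrogate key, and forgetting a customer means dropping their lookup entry.

```python
import uuid

lookup = {}          # pseudonym -> customer identity (e.g., email)
working_table = []   # downstream/transactional rows, holding only the pseudonym

def pseudonymize(email, amount):
    """Create a surrogate key with no derivable link back to the identity."""
    pseudo_id = str(uuid.uuid4())
    lookup[pseudo_id] = email
    working_table.append({"pseudo_id": pseudo_id, "amount": amount})
    return pseudo_id

def forget(email):
    """Right to be forgotten: drop the lookup entries; working rows become anonymous."""
    for pid in [p for p, e in lookup.items() if e == email]:
        del lookup[pid]

pid = pseudonymize("alice@example.com", 42.0)
forget("alice@example.com")
# The working table still holds the row, but it can no longer be tied to the customer.
```

Because the pseudonym is random rather than derived from the identity, deleting the mapping is enough; no working-table rewrite is needed.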

Here is a reference architecture that implements two approaches, pseudonymization and data masking, to secure the data. The key objectives we are trying to achieve are -

  • Efficient Delete Operation.
  • Secure the PII data.
  • Easy way to decrypt the PII data. 

Data Flow Diagram


1-    Raw files might contain sensitive PII data and are ingested into a Databricks Delta table the moment they arrive on the platform. We can have a policy that deletes the raw files at a regular interval. To implement this, we can use a scheduled Azure Data Factory pipeline that reads the containers/folders, looks at the modified date of each file to decide whether it should be deleted, and uses the Delete activity in Data Factory.

2-    Pseudonymization of PII data: segregate the sensitive PII information into a separate table. For any personal data in the system, we should create a pseudo key that uniquely identifies each record before we move the data to the downstream (transactional) tables. The idea is that in the data ingestion framework, the moment we read PII data from the raw data lake/raw files, we first create the pseudo key, and then pass only the pseudo key to identify the customers in the downstream tables.

3-    We have a lookup table that holds a unique encryption key for each customer.

4-    Mask the PII data in our sensitive data table using the encryption keys.
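Steps 3 and 4 can be sketched together with the same Fernet scheme the notebook relies on, here via Python's cryptography package (customer IDs and column names are illustrative assumptions):

```python
from cryptography.fernet import Fernet

# Step 3: lookup table with one encryption key per customer (assumed IDs).
encryption_keys = {"cust-001": Fernet.generate_key()}

def mask(customer_id, email):
    """Step 4: mask the PII column with that customer's key."""
    return Fernet(encryption_keys[customer_id]).encrypt(email.encode())

def unmask(customer_id, token):
    """Only possible while the customer's key still exists in the lookup table."""
    return Fernet(encryption_keys[customer_id]).decrypt(token).decode()

token = mask("cust-001", "alice@example.com")
assert unmask("cust-001", token) == "alice@example.com"

# Crypto-shredding (step 6): deleting the key makes the masked value unrecoverable.
del encryption_keys["cust-001"]
```

Because each customer has their own key, deleting one key anonymizes exactly one customer without touching anyone else's data.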

5-    The downstream tables hold the Pseudo key to identify the customers.

6-    How does the delete operation happen? When a delete request comes in, we delete the encryption key for that customer from the lookup table. The moment the key is deleted, we lose the ability to decrypt the customer's information and the records become anonymous. Once the encryption key is deleted, we also make sure it is physically removed from the Delta table by running the VACUUM operation, which prevents restoring a previous version of the table through Databricks' time-travel feature.

VACUUM gdpr.encryption_keys RETAIN 0 HOURS        
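Note that Delta Lake's retention-duration safety check normally blocks a VACUUM with a retention shorter than the seven-day default, so it has to be disabled first. A sketch of the full delete sequence, assuming the lookup table and CustomerID column shown in the diagram:

```sql
-- delete the customer's key from the lookup table (crypto-shredding)
DELETE FROM gdpr.encryption_keys WHERE CustomerID = 'cust-001';

-- allow a retention shorter than the 7-day default, then purge the old files
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM gdpr.encryption_keys RETAIN 0 HOURS;
```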

7-    We will have a mechanism to decrypt the PII data using the encryption keys.

8-    The decrypted data should not be visible to everyone; hence we should have Databricks table ACLs in place to restrict the access level on the decrypted view.
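Steps 7 and 8 together might look like the following Databricks SQL sketch, assuming a registered decryption UDF named decrypt_udf and illustrative table, column, and group names:

```sql
-- view that joins the masked PII with the per-customer keys and decrypts on read
CREATE OR REPLACE VIEW gdpr.customer_pii_decrypted AS
SELECT c.pseudo_id,
       decrypt_udf(c.email_encrypted, k.encryption_key) AS email
FROM gdpr.customer_pii c
JOIN gdpr.encryption_keys k ON c.pseudo_id = k.pseudo_id;

-- table ACL: only a privileged group may read the decrypted view
GRANT SELECT ON VIEW gdpr.customer_pii_decrypted TO `pii-readers`;
```

Customers whose keys have been deleted simply drop out of the join, so the view never shows stale or half-decrypted rows.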

Implementation:


A few assumptions:

  1. We considered the email address as the only PII data; the notebook implementation can be extended to any other PII columns/fields.
  2. The gdpr.sales table is not shown in the notebook; in the diagram it is just to showcase the mapping between the customer table and the transactional table.
  3. Databricks Runtime version 7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12) was used for the implementation.

Source Code

Github link

Execution Result and Databricks Notebook


JAR file for the persistent Hive function

Results

At the end of the notebook execution, we should have the packages below installed on the cluster.


decypteudf_2_12_1_0.jar - this is the Hive function JAR that needs to be built locally and uploaded to the Databricks cluster.

The Fernet Maven package for Java has to be installed on the cluster to support the Hive function's dependencies.

Raw Data with Pseudo key


Encryption Keys for each customer


Encrypted data post Encryption


Databricks View to See the Decrypted Data



How can we optimize the delete operation on the Delta table?

In this official Databricks blog, some tips are given to optimize the delete operation on a Delta table. When we have millions of records in the Delta table, deleting data becomes a costly operation, so we should apply a few best practices before we start deleting data.

autoBroadcastJoinThreshold

Before starting the delete operation, increase the spark.sql.autoBroadcastJoinThreshold property from its 10 MB default. It configures the maximum size in bytes of a table that will be broadcast to all worker nodes when performing a join, so a larger value lets the small set of records to delete be broadcast. The snippet below raises it to roughly 1 GB; setting the value to -1 disables broadcasting again, and that should be done once the delete operation is finished.

Before the delete operation.

%sql
set spark.sql.autoBroadcastJoinThreshold = 1024000000        

Post delete operation.

%sql
set spark.sql.autoBroadcastJoinThreshold = -1        

Z-Order indexing

When we have a large dataset to delete, the delete operation has to scan many smaller files in the Delta lake to find the exact files that need to be rewritten. With a Delta table holding GBs of data, there can be thousands of files to scan, which takes a long time. Hence, it is recommended to run Z-Order indexing on the key columns before starting the delete operation.

Key column: the column that appears in the WHERE clause, or the column used to join the intermediate delete table as mentioned in the section above.

Consider that the CustomerID field uniquely identifies the records in the bigger transactional table. Before starting the delete operation:

spark.sql("optimize gdpr.raw_customer_data_test zorder by CustomerID")        

The above command restructures the Delta table so that records with the same CustomerID are co-located in the same files, reducing the number of files the delete has to touch.
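With the table Z-ordered on CustomerID, the delete itself can then target the key column directly. A sketch, assuming the IDs to forget are collected in a small hypothetical table gdpr.delete_requests:

```sql
-- the small delete_requests table is broadcast-joined against the big table,
-- and Z-ordering limits the scan to the files containing those CustomerIDs
DELETE FROM gdpr.raw_customer_data_test
WHERE CustomerID IN (SELECT CustomerID FROM gdpr.delete_requests);
```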

Hope this helps!
