GDPR compliance using Azure Databricks
Depending on the geography, different data protection regulations such as GDPR and CCPA apply. When designing data solutions, we must take security and compliance requirements into account and take action to meet them. Most cloud providers follow established security guidelines and keep data safe, but during data processing it sometimes becomes necessary to keep personal data on the platform. Keeping PII data safe and compliant with GDPR requirements is therefore one of the main requirements in any project.
Two security questions come up again and again when dealing with GDPR compliance: how do we secure the environment, and how do we secure the data? Before we discuss securing the data in detail in this blog, let's briefly look at the capabilities available in Azure to secure the environment.
How can we secure the environment?
Azure keeps the data safe and lets us define who should be granted access. The data is always encrypted at rest, and different tools are available to enable additional security capabilities in Azure. Having proper authorization for the applications that handle the data is the key to making the platform secure.
How can we secure the data?
GDPR grants data subjects several rights; the two most relevant here are the right of access and the right to erasure (the “right to be forgotten”).
The challenge with a data lake is that it holds a large amount of data with different levels of classification and sensitivity, and PII often sits alongside the rest of the data. One of the biggest challenges is handling this PII data while the other data remains accessible to engineers and users. We can combine different design approaches with Databricks features to secure the PII data. This takes care of the “right to access” principle of GDPR; we will use a data masking approach to obfuscate the data.
To handle the “right to be forgotten” requirement, the biggest challenge is erasing data from the data lake at the record level when we have a large volume of data. Databricks Delta tables provide the ACID capabilities needed for this, and the pseudonymization technique can be used to handle the deletion of PII data from the system: pseudonymize all customer personal information in your working tables (prior to receiving a data subject request) and, on request, delete the customer entry from the “lookup table” that maps the customer to the pseudonym, while ensuring that data in the working tables cannot be used to reconstruct the customer’s identity.
Here is a reference architecture that implements two approaches, pseudonymization and data masking, to secure the data. The key objectives are walked through step by step below.
Data Flow Diagram
1- Raw files might have sensitive PII data. Raw files are ingested into the Databricks Delta table the moment they are received on the platform, and a policy can delete the raw files at a regular interval. To implement this, we can use scheduled Azure Data Factory pipelines: Data Factory reads the containers/folders, looks at the modified date of the files to determine whether they need to be deleted, and removes them with the Delete activity.
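The original post linked an example here. Purely as an illustration of the same retention policy inside a notebook (not the ADF Delete activity itself), here is a minimal sketch, assuming a hypothetical raw container path and a 7-day retention window:

import time

# Illustration only: delete raw files older than the retention window.
# The path and the 7-day window are hypothetical.
RAW_PATH = "abfss://raw@<storage-account>.dfs.core.windows.net/customers/"
RETENTION_MS = 7 * 24 * 60 * 60 * 1000

now_ms = int(time.time() * 1000)
for f in dbutils.fs.ls(RAW_PATH):
    # FileInfo.modificationTime (epoch millis) is available on recent Databricks runtimes.
    if not f.name.endswith("/") and now_ms - f.modificationTime > RETENTION_MS:
        dbutils.fs.rm(f.path)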
2- Pseudonymization of PII data: segregate the sensitive PII information into a separate table. For any personal data in the system, we create a pseudo key that uniquely identifies each record before the data moves to the downstream (transactional) tables. The idea is that in the data ingestion framework, the moment we read the PII data from the raw data lake/raw files, we first create the pseudo key and then pass only the pseudo key downstream to identify the customers.
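A minimal sketch of this step, assuming a hypothetical raw table, PII column names and secret scope; the pseudonym here is a salted SHA-256 hash, but any stable surrogate key would do:

from pyspark.sql import functions as F

raw_df = spark.table("gdpr.raw_customer_data")  # hypothetical raw table

# Derive a deterministic pseudo key from the natural customer identifier.
salt = dbutils.secrets.get("gdpr-scope", "pseudo-salt")  # hypothetical secret scope
pseudo_df = raw_df.withColumn(
    "pseudo_key", F.sha2(F.concat_ws("|", F.col("customer_id"), F.lit(salt)), 256)
)

# Downstream/transactional tables only ever see the pseudo key, never the raw PII.
pseudo_df.drop("customer_name", "email", "phone") \
    .write.format("delta").mode("append").saveAsTable("gdpr.transactions")

# The PII columns go to a separate sensitive table keyed by the same pseudo key.
pseudo_df.select("pseudo_key", "customer_name", "email", "phone") \
    .write.format("delta").mode("append").saveAsTable("gdpr.sensitive_customer_data")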
3- We have a lookup table that holds a unique encryption key for each customer.
4- Mask the PII data in our sensitive data table using the encryption keys.
5- The downstream tables hold the Pseudo key to identify the customers.
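A minimal sketch of steps 3-5 using Fernet, the same scheme the post's Hive UDF depends on; table and column names are hypothetical, and the Python cryptography package would need to be installed on the cluster in addition to the Java dependency listed later:

from cryptography.fernet import Fernet
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Step 3: one Fernet key per customer, stored in the lookup table.
gen_key = F.udf(lambda: Fernet.generate_key().decode("utf-8"), StringType()).asNondeterministic()
spark.table("gdpr.sensitive_customer_data").select("pseudo_key").distinct() \
    .withColumn("encryption_key", gen_key()) \
    .write.format("delta").mode("overwrite").saveAsTable("gdpr.encryption_keys")

# Step 4: encrypt the PII columns with each customer's own key.
def encrypt(value, key):
    if value is None or key is None:
        return None
    return Fernet(key.encode("utf-8")).encrypt(value.encode("utf-8")).decode("utf-8")

encrypt_udf = F.udf(encrypt, StringType())

(spark.table("gdpr.sensitive_customer_data")
    .join(spark.table("gdpr.encryption_keys"), "pseudo_key")
    .select(
        "pseudo_key",
        encrypt_udf("customer_name", "encryption_key").alias("customer_name"),
        encrypt_udf("email", "encryption_key").alias("email"),
    )
    .write.format("delta").mode("overwrite").saveAsTable("gdpr.masked_customer_data"))

# Step 5: downstream tables keep only pseudo_key, as written in the previous step.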
6- How does the delete operation happen? When a delete request comes in, we delete that customer's encryption key from the lookup table. The moment we delete the encryption key, we lose the ability to decrypt the customer's information and the records become anonymous. Once the encryption key is deleted, we make sure it is physically removed from the Delta lake as well by running the VACUUM operation on the table; this also prevents restoring a previous version of the table with the Databricks time-travel feature.
VACUUM gdpr.encryption_keys RETAIN 0 HOURS
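Putting the erasure step together as a sketch (the pseudo_key column and the placeholder value are hypothetical); note that VACUUM with RETAIN 0 HOURS only runs once Delta's retention safety check is relaxed:

# Remove the customer's encryption key; the encrypted PII becomes unrecoverable.
spark.sql("""
  DELETE FROM gdpr.encryption_keys
  WHERE pseudo_key = '<pseudo key of the customer to forget>'
""")

# VACUUM RETAIN 0 HOURS refuses to run unless the retention check is disabled.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM gdpr.encryption_keys RETAIN 0 HOURS")
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")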
7- We will have a mechanism to decrypt the PII data using the encryption keys.
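In the post this mechanism is a persistent Hive UDF (the JAR listed under Implementation below) wrapped in a view; a sketch of what that view could look like, with hypothetical function, table and column names:

# 'gdpr.decrypt_value' stands in for the function registered from the Hive UDF JAR.
spark.sql("""
  CREATE OR REPLACE VIEW gdpr.v_decrypted_customer_data AS
  SELECT m.pseudo_key,
         gdpr.decrypt_value(m.customer_name, k.encryption_key) AS customer_name,
         gdpr.decrypt_value(m.email, k.encryption_key)         AS email
  FROM gdpr.masked_customer_data m
  LEFT JOIN gdpr.encryption_keys k
    ON m.pseudo_key = k.pseudo_key
""")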
8- The decrypted data should not be visible to everyone, so Databricks ACLs should be in place to restrict the access level on the decrypted view.
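With table access control enabled on the cluster, access to the decrypted view can then be restricted to a dedicated group; the group name below is hypothetical:

# Requires a cluster with table access control enabled; 'pii_readers' is a hypothetical group.
spark.sql("GRANT SELECT ON VIEW gdpr.v_decrypted_customer_data TO `pii_readers`")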
Implementation:
A few assumptions:
Source Code
JAR file for the persistent Hive function
Results
At the end of the notebook execution, the following packages should be installed on the cluster:
decypteudf_2_12_1_0.jar - this is the Hive function, which needs to be built locally and uploaded to the Databricks cluster.
The Fernet Maven package for Java has to be installed on the cluster to satisfy the Hive function's dependencies.
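Once the JAR and its Fernet dependency are on the cluster, the persistent function can be registered roughly as follows; the class name, function name and DBFS path are hypothetical, only the CREATE FUNCTION ... USING JAR syntax is standard Spark SQL:

# Register the decrypt UDF from the uploaded JAR as a persistent function.
spark.sql("""
  CREATE FUNCTION IF NOT EXISTS gdpr.decrypt_value
  AS 'com.example.gdpr.DecryptUDF'
  USING JAR 'dbfs:/FileStore/jars/decypteudf_2_12_1_0.jar'
""")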
Raw Data with Pseudo key
Encryption Keys for each customer
Encrypted data post Encryption
Databricks View to See the Decrypted Data
How can we optimize Delete Operation on the Delta table?
In this official Databricks blog, some tips are given to optimize the delete operation on a Delta table. When a Delta table holds millions of records, deleting data from it becomes a costly operation, so we should apply a few best practices before we start deleting data.
autoBroadcastJoinThreshold
While starting the delete operation, increase the autoBroadcastJoinThreshold property from its default of 10 MB (to roughly 1 GB in the example below). It configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. Setting the value to -1 disables broadcasting, which is done once the delete operation is finished.
Before the delete operation.
%sql
set spark.sql.autoBroadcastJoinThreshold = 1024000000
Post delete operation.
%sql
set spark.sql.autoBroadcastJoinThreshold = -1
Z-Order indexing
When we have a large dataset to delete, the delete operation has to scan many small files in the Delta lake to find the exact files that need to be rewritten. A Delta table with GBs of data can easily have thousands of files to scan, which takes a long time. Hence, it is recommended to run Z-Order indexing on the key columns before starting the delete operation.
Key column: the column that appears in the WHERE clause, or the column used to join the intermediate delete table as mentioned in the section above.
Assuming the CustomerID field uniquely identifies the records in the larger transactional table, run the following before starting the delete operation:
spark.sql("optimize gdpr.raw_customer_data_test zorder by CustomerID")
The command rewrites the table's data files so that records with similar CustomerID values are co-located, which reduces the number of files the delete has to scan.
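Putting both optimizations to work, the large-table delete itself can be driven by a small intermediate table of customers to forget, which the raised broadcast threshold lets Spark broadcast; the intermediate table name is hypothetical, and MERGE is shown as one way to express the join-based delete:

# 'gdpr.customers_to_forget' is a hypothetical intermediate table of keys to erase.
spark.sql("""
  MERGE INTO gdpr.raw_customer_data_test AS t
  USING gdpr.customers_to_forget AS s
  ON t.CustomerID = s.CustomerID
  WHEN MATCHED THEN DELETE
""")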
Hope this helps!