Week 22 of #100WeeksofAzureDataAI: Classify data using Azure Purview custom classification💡
“Changing the world happens one life at a time.” John C. Maxwell & Rob Hoskins
Hi friends, this week I learned how to create a custom classification in Azure Purview using a Regular Expression, with security camera data from the City of Perth.
💻Tech and services in this article: Azure Purview, Azure Data Lake Storage Gen 2
😍Fun stuff at work: Still riding to raise funds with KM for Kids and Microsoft 🚲! I've hit 1000km and $1000 in 19 days! I also did an OpenHack for Modern Data Warehousing, where I learned about DevOps + Azure Synapse Analytics and met great solution architects from APAC!
🗝️Custom Classification in Purview using Regular Expression
In Azure Purview, classification is a way of organising data assets by giving them a unique tag/class. Classification comes in handy when we discover data and look at compliance insights. We can classify both structured data (CSV, JSON, SQL table, etc.) and unstructured data (doc, PDF, txt, etc.).
Be careful not to overuse classification, or it becomes noisy and misleading to data consumers. A few key use cases are outlined in the Tech Community blog Classify your data using Azure Purview - Microsoft Tech Community.
Although Purview comes with 200+ classifications (e.g. person name, Australia Business Number (ABN), credit cards, passports for different countries, etc.), organisations often have special use cases for classifying their data.
This is when we have to create a custom classification that can be reused at scale when scanning data sources. My customer has a requirement to classify a unique ID for their work order which isn't part of the system default.
For testing purposes, I grabbed the open Security Cameras dataset from Security Cameras - Datasets - data.wa.gov.au. This dataset represents general locations of surveillance cameras within the City of Perth. There are only 4 columns: X coordinate, Y coordinate, unique ID and Camera ID. Say we want to apply the classification to the Camera ID column, which has the pattern CL{Digit}{Digit}{Digit}.
I had to set aside some data for training because custom classification uses machine learning in the background to learn the Regular Expression pattern. Creating a custom classification is effortless; the detailed steps are in Create a custom classification and classification rule - Azure Purview | Microsoft Docs.
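One way to set aside a training file and a test file from the same dataset is sketched below. This is only an illustration: the header names, the generated rows, the 70/30 split and the helper names `split_rows` and `to_csv` are all assumptions, not something taken from the real camera file or from Purview.

```python
import csv
import io
import random


def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle rows reproducibly and split them into (training, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def to_csv(header, rows):
    """Serialise rows back to CSV text, with the header on the first line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical stand-in for the camera file: header names and values are invented.
header = ["X", "Y", "OBJECTID", "CAMERA_ID"]
rows = [[f"115.8{i}", f"-31.9{i}", str(i), f"CL{i:03d}"] for i in range(10)]

train_rows, test_rows = split_rows(rows)
training_csv = to_csv(header, train_rows)  # sample file for creating the rule
test_csv = to_csv(header, test_rows)       # separate file for testing the rule
```

Keeping the two files disjoint means the test step checks the rule against values it was not built from.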
Step 1: Create a new classification
Make sure we use a name-spacing convention in the form of organisation name, a dot, then classification name. For example, DeptOfOracle.SurveillanceCamera. Once created, the new classification can be found in the custom tab.
Now, if we click into it, there is no rule associated with it yet. We have to create a new rule and provide a data pattern to classify our data.
Step 2: Create custom classification rules
We have to define the rule pattern of the Camera ID, which is used by the scanning system. It will try to identify every instance where the Camera ID pattern, CL{Digit}{Digit}{Digit}, is found in our data assets. Notice that in "Classification name", DeptOfOracle.SurveillanceCamera is selected. This means we always have to create the classification first before we can create classification rules. Just remember that the rule needs a place to live in.
We can use either Regular Expression (Regex) or Dictionary (of Master data) to define the pattern. I don't have any dictionary so in this example, we're using Regex.
We need to upload a file that contains our training set. Azure Purview then suggests Regex patterns; for the Camera ID column it suggested ^CL[0-9]{3}$ (a value that starts with CL, followed by exactly 3 digits). We can use the Minimum match threshold to set the minimum percentage of distinct data values in a column that must match the pattern for the scanner to apply the classification. The default value is 60%. There's also an option to test the classification rule: we supply a test set (another CSV file) and Purview does the rest.
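To get a feel for how the pattern and the threshold interact, here is a small Python sketch. It only mimics the idea locally. How Purview computes the match percentage internally is not shown in this article, so the "share of distinct non-empty values that match" logic and the helper names `match_ratio` and `would_classify` are assumptions.

```python
import re

# The pattern suggested for the Camera ID column: CL followed by exactly 3 digits.
CAMERA_ID = re.compile(r"^CL[0-9]{3}$")


def match_ratio(values, pattern=CAMERA_ID):
    """Return the share of distinct, non-empty values that match the pattern."""
    distinct = {v.strip() for v in values if v and v.strip()}
    if not distinct:
        return 0.0
    hits = sum(1 for v in distinct if pattern.fullmatch(v))
    return hits / len(distinct)


def would_classify(values, threshold=0.60):
    """Assumed rule: classify the column when the ratio reaches the threshold."""
    return match_ratio(values) >= threshold


# Invented column values for illustration only.
camera_ids = ["CL001", "CL002", "CL010", "CL010", "CL123", "N/A"]
x_coords = ["115.85", "115.86", "115.87"]

print(match_ratio(camera_ids), would_classify(camera_ids))
print(match_ratio(x_coords), would_classify(x_coords))
```

In this made-up column, the duplicate CL010 is counted once, so 4 of the 5 distinct values match and the column clears the 60% default, while the coordinate column does not match at all.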
If all is well, once the rule is created, we should see it in the custom tab.
Step 3: Create a scan rule set
One of the great features in Azure Purview is the ability to auto-classify data assets during the scan. We need to create a scan rule set so that our DeptOfOracle.SurveillanceCamera classification can be auto-applied. Think of a scan rule set as a container that groups a set of rules, such as which classifications to use, which file types to scan and which data source to target (in this case, ADLS Gen2).
I only chose CSV here because I know that the data is only in CSV format.
This is where we include or exclude any rules we want. Definitely don't forget "SurveillanceCamera" under Custom Rules. Notice that we can only create a new rule set once we have already created 1) a classification and 2) a classification rule.
We can find this custom rule under the Custom tab.
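As a mental model of how the three pieces depend on each other, the relationship can be sketched with plain Python dataclasses. This is not the Purview API or SDK; every class and field name here is hypothetical and exists only to show that a rule points at a classification and a scan rule set collects rules.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Classification:
    """The tag itself, e.g. DeptOfOracle.SurveillanceCamera (hypothetical model)."""
    name: str


@dataclass(frozen=True)
class ClassificationRule:
    """A rule must reference an existing classification (hypothetical model)."""
    name: str
    classification: Classification
    regex: str
    min_match_threshold: float = 0.60


@dataclass
class ScanRuleSet:
    """A container grouping file types and rules for one source type."""
    name: str
    source_type: str
    file_types: List[str] = field(default_factory=list)
    custom_rules: List[ClassificationRule] = field(default_factory=list)

    def classifications(self) -> List[str]:
        """Names of the classifications this rule set can apply during a scan."""
        return [rule.classification.name for rule in self.custom_rules]


# Build the objects in the same order as the steps in this article.
camera_class = Classification("DeptOfOracle.SurveillanceCamera")
camera_rule = ClassificationRule("SurveillanceCamera", camera_class, r"^CL[0-9]{3}$")
rule_set = ScanRuleSet(
    name="SurveillanceCamera",
    source_type="ADLS Gen2",
    file_types=["CSV"],
    custom_rules=[camera_rule],
)
print(rule_set.classifications())
```

The constructor order mirrors the portal: the rule cannot be built without a classification, and the rule set has nothing custom to apply until the rule exists.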
Step 4: Scan a data source and select the custom classification rule!
We then scope our scan to the folder that has the camera file.
And this is how all things we created come together! We pick the SurveillanceCamera rule set then trigger the scan.
Et voilà! We can see that the custom classification is automatically applied to the asset.
Keep in mind that Purview Data Readers can only view classifiers and classification rules, while Purview Data Curators can create, update, and delete custom classifiers and classification rules.
Thank you for spending time to read. I hope you find something relevant or learn something you didn't know before. Don't hold back on your constructive feedback and suggestions. It will be greatly appreciated here 🙏. #Azure. Invent with purpose. ☁
Hi Jiaranai Keatnuxsuo (เจียระไน เกียรตินักสู้), I have read your blog on custom classification in Purview. I am also working on a data governance project on the Microsoft Purview platform. We have Azure Synapse Analytics as a data source. We have created over 200 custom classification rules. But when we are trying to create a scan rule set, we can select only 143 custom classification rules in one scan rule set. We cannot select more than 143. Have you faced this issue? And if faced, what have you done to overcome this problem?