Week 22 of #100WeeksofAzureDataAI: Classify data using Azure Purview custom classification💡

“Changing the world happens one life at a time.” John C. Maxwell & Rob Hoskins

Hi friends, this week I learned how to create a custom classification in Azure Purview using a regular expression, and tested it on security camera data from the City of Perth.

💻Tech and services in this article: Azure Purview, Azure Data Lake Storage Gen2

😍Fun stuff at work: Still riding to raise funds with KM for Kids and Microsoft 🚲! I've hit 1000km and $1000 in 19 days! I also did an OpenHack for Modern Data Warehousing, where I learned about DevOps + Azure Synapse Analytics and met great solution architects from APAC!


🗝️Custom Classification in Purview using Regular Expression

In Azure Purview, classification is a way of organising data assets by tagging them with a class that describes what they contain. Classification comes in handy when we discover data and look at compliance insights. We can classify both structured data (CSV, JSON, SQL tables, etc.) and unstructured data (doc, PDF, txt, etc.).

Be careful not to overuse classification, otherwise it becomes noisy and misleading to data consumers. The Tech Community blog post "Classify your data using Azure Purview" outlines a few key use cases:

  • Describe the type of data that exists in a data asset or schema. We can use classification to identify the content, e.g. this table has a column that holds credit card numbers.
  • Describe the phases in data preparation processes, e.g. this storage is classified as a raw, enriched or curated zone.
  • Set priorities and develop a plan to allocate budget and resources to meet security and compliance requirements.


Although Purview comes with 200+ built-in classifications (e.g. person name, Australia Business Number (ABN), credit card numbers, passport numbers for different countries, etc.), organisations often have special use cases for classifying their data.

This is when we have to create a custom classification that can be reused at scale when scanning data sources. My customer has a requirement to classify a unique work order ID, which isn't covered by the system defaults.

For testing purposes, I grabbed the open "Security Cameras" dataset from data.wa.gov.au. This dataset represents the general locations of surveillance cameras within the City of Perth. There are only 4 columns: X coordinate, Y coordinate, unique ID and Camera ID. Let's say we want to classify the Camera ID column, whose values follow the pattern CL{Digit}{Digit}{Digit}.


I had to set aside some data for training because custom classification uses machine learning in the background to learn the Regular Expression pattern. Creating a custom classification is straightforward; the detailed steps are in the Microsoft Docs article "Create a custom classification and classification rule - Azure Purview".
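
As a rough illustration of setting data aside, the sketch below splits some rows into a training sample and a test sample that could each be saved as a CSV file. The column names, the made-up rows and the 80/20 split are all assumptions for illustration; they are not taken from the real dataset or from anything Purview requires.

```python
import csv
import io
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and return (training_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def to_csv_text(header, rows):
    """Render a header and rows as CSV text, ready to write to a file."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical header and made-up rows that only mimic the Camera ID pattern.
header = ["X", "Y", "ID", "CAMERA_ID"]
rows = [["0", "0", str(i), f"CL{i:03d}"] for i in range(1, 11)]

train, test = split_rows(rows)
print(len(train), "training rows,", len(test), "test rows")
print(to_csv_text(header, test))
```

The training text would be saved as the file uploaded when creating the rule, and the test text as the file used to try the rule out afterwards.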

 

Step 1: Create a new classification

Use a namespacing convention such as OrganisationName.ClassificationName, for example DeptOfOracle.SurveillanceCamera. Once created, the new classification can be found in the Custom tab.


If we click into the new classification, there is no rule associated with it yet. We have to create a new rule and provide a data pattern to classify our data.


Step 2: Create custom classification rules

We have to define the rule pattern for the Camera ID, which the scanning system then uses to identify every place where the CL{Digit}{Digit}{Digit} pattern is found in our data assets. Notice that under "Classification name", DeptOfOracle.SurveillanceCamera is selected. This means we always have to create the classification first before we can create classification rules: the rule needs a place to live.

We can use either a Regular Expression (Regex) or a Dictionary (of master data) to define the pattern. I don't have a dictionary, so in this example we're using Regex.


We need to upload a file that contains our training set. Azure Purview then suggests Regex patterns; for the Camera ID column it suggested ^CL[0-9]{3}$ (a value that starts with CL followed by exactly 3 digits). The Minimum match threshold sets the minimum percentage of distinct data values in a column that must match the pattern before the scanner applies the classification. The default value is 60%. There's also an option to test the classification rule: we supply a test set (another CSV file) and Purview does the rest.
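
To get a feel for the pattern and the threshold before uploading anything, here is a minimal local sketch. It applies ^CL[0-9]{3}$ to the distinct values of a column and compares the match percentage against 60%. This only approximates the behaviour described above and is not Purview's actual scanner logic; the function names and sample values are hypothetical.

```python
import re

# The pattern suggested for the Camera ID column.
CAMERA_ID = re.compile(r"^CL[0-9]{3}$")


def match_percentage(values, pattern=CAMERA_ID):
    """Percentage of distinct, non-empty values that fully match the pattern."""
    distinct = {v.strip() for v in values if v and v.strip()}
    if not distinct:
        return 0.0
    matched = sum(1 for v in distinct if pattern.fullmatch(v))
    return 100.0 * matched / len(distinct)


def would_classify(values, threshold=60.0):
    """True when the match percentage reaches the minimum match threshold."""
    return match_percentage(values) >= threshold


# Made-up column values: three valid IDs, one wrong prefix, one too short.
column = ["CL001", "CL002", "CL013", "CAM7", "CL01", "CL001"]
print(match_percentage(column))  # duplicates are counted once
print(would_classify(column))
```

Counting distinct values rather than rows means a column dominated by one repeated valid ID does not pass on repetition alone.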


 

If all is well, once the rule is created, we should see it in the custom tab.



Step 3: Create a scan rule set

One of the great features of Azure Purview is that it can automatically classify data assets during a scan. We need to create a scan rule set so that our DeptOfOracle.SurveillanceCamera classification can be applied automatically. Think of a scan rule set as a container that groups a set of choices: which classification rules to use, which file types to scan and which data source type it applies to (in this case, ADLS Gen2).
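
One way to picture that container is the small conceptual model below. It is only a mental model written as a Python dataclass: the class, its field names and the rule set name are hypothetical, and it is not the Purview API or a REST payload.

```python
from dataclasses import dataclass, field


@dataclass
class ScanRuleSet:
    """Conceptual stand-in for a scan rule set (not a real Purview object)."""
    name: str
    source_type: str
    file_types: set = field(default_factory=set)
    classification_rules: set = field(default_factory=set)

    def covers_file(self, file_name):
        """True when the file's extension is one of the selected file types."""
        if "." not in file_name:
            return False
        extension = file_name.rsplit(".", 1)[-1].lower()
        return extension in self.file_types


# Hypothetical rule set mirroring the choices made in this article.
camera_rule_set = ScanRuleSet(
    name="SurveillanceCameraRuleSet",
    source_type="ADLS Gen2",
    file_types={"csv"},
    classification_rules={"SurveillanceCamera"},
)

print(camera_rule_set.covers_file("cameras.csv"))
print(camera_rule_set.covers_file("cameras.json"))
```

Keeping the file types and the rules in one named object is what lets the same choices be reused across many scans.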


 

I only chose CSV here because I know that the data is only in CSV format.


This is where we include or exclude the classification rules we want. Don't forget to select "SurveillanceCamera" under Custom Rules. Notice that we can only include it in a new rule set once we have already created (1) a classification and (2) a classification rule.


We can find this custom rule under the Custom tab.


Step 4: Scan a data source and select the custom scan rule set!


We then scope the scan to the folder that contains the camera file.


And this is how everything we created comes together! We pick the SurveillanceCamera rule set, then trigger the scan.


Et voilà! We can see that the custom classification has been automatically applied to the asset.


Keep in mind that Purview Data Readers can only view classifications and classification rules, while Purview Data Curators can create, update, and delete custom classifications and classification rules.

Thank you for taking the time to read. I hope you found something relevant or learned something you didn't know before. Don't hold back on your constructive feedback and suggestions; they are greatly appreciated here 🙏. #Azure. Invent with purpose. ☁

Hi Jiaranai Keatnuxsuo (เจียระไน เกียรตินักสู้), I have read your blog on custom classification in Purview. I am also working on a data governance project on the Microsoft Purview platform. We have Azure Synapse Analytics as a data source. We have created over 200 custom classification rules. But when we are trying to create a scan rule set, we can select only 143 custom classification rules in one scan rule set. We cannot select more than 143. Have you faced this issue? And if faced, what have you done to overcome this problem?
