What is 'Data Profiling'?


Data profiling is the process of understanding the data: its characteristics and its quality. It incorporates a collection of analysis and assessment algorithms that, when applied in the proper context, provide empirical insight into the potential issues that exist within a data set.

In short, it is like looking at the data's structure and content so that you can better understand how relevant and useful it is, what it is missing, and how it can be improved.

The question now is: how do you do it?

Before starting the data profiling process, we need to finalize the sources that will be used in the project. Usually this is based on the requirements of the target, such as the data warehouse, a business intelligence report, or some third-party data source. Once we identify the sources, each of the tables in those sources must be profiled. We need to prioritize the tables in each source based on their criticality or usage, and then profile them one by one.

Data profiling reports usually capture the following at the column level. The statistics captured are minimum, maximum, mean, mode, percentiles, standard deviation and frequency, that is, how many times a value repeats, along with variation and aggregated sums. The metadata-related profiles cover data types, length, uniqueness or duplicates, occurrence of null values, string patterns, etc. At the table level, the report covers primary key analysis, foreign key analysis, referential integrity analysis, and so on.
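As an illustration, the column-level statistics and metadata above can be computed with pandas. This is a minimal sketch using a made-up sample table; the `profile_column` helper and the column names are assumptions, not part of any particular tool.

```python
import pandas as pd

# Hypothetical sample data standing in for a source table.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 4],
    "country":     ["US", "US", "DE", None, "DE"],
    "amount":      [10.0, 25.5, 25.5, 40.0, None],
})

def profile_column(s: pd.Series) -> dict:
    """Capture the column-level statistics and metadata described above."""
    profile = {
        "dtype":      str(s.dtype),                      # data type
        "null_count": int(s.isna().sum()),               # occurrence of nulls
        "distinct":   int(s.nunique(dropna=True)),       # uniqueness
        "duplicates": int(s.duplicated().sum()),         # duplicate values
    }
    if pd.api.types.is_numeric_dtype(s):
        profile.update({
            "min":  s.min(),
            "max":  s.max(),
            "mean": s.mean(),
            "std":  s.std(),
        })
    else:
        # Frequency: how many times each value repeats.
        profile["top_frequencies"] = s.value_counts().head(3).to_dict()
    return profile

report = {col: profile_column(df[col]) for col in df.columns}
```

The resulting `report` dictionary is the column-level portion of a profiling report; mode, percentiles and the table-level key analysis would be added the same way.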

There are many tools that can be used to perform data profiling, or you can simply write SQL queries and perform the analysis yourself. The SQL approach works only if the source is a database, however, and not if you have heterogeneous sources such as Excel files, text files, etc.
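The hand-written SQL approach can be sketched as follows. This example uses an in-memory SQLite database with a made-up `orders` table; the table and column names are assumptions for illustration only.

```python
import sqlite3

# Hypothetical source table loaded into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "NEW"), (2, 25.5, "SHIPPED"), (3, None, "NEW"), (3, 40.0, None)],
)

# A typical profiling query: row count, null count, distinct values, value range.
row = conn.execute("""
    SELECT COUNT(*)                  AS row_count,
           COUNT(*) - COUNT(amount)  AS amount_nulls,
           COUNT(DISTINCT order_id)  AS distinct_order_ids,
           MIN(amount),
           MAX(amount)
    FROM orders
""").fetchone()
```

One such query per column quickly adds up, which is why a dedicated tool is preferable once there are many tables, or any non-database sources.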

It is recommended to use an efficient tool that can help you speed up the data profiling process. A few open-source data profiling tools are Talend Open Studio, Quadient DataCleaner and Apache Griffin. Data governance suites such as Collibra, Informatica, Talend, erwin by Quest and SAP Master Data Governance provide data profiling features too. For quick checks you can use R, with packages like dplyr, funModeling, haven, corrplot and tidyverse, and for Python there is pandas.

When to do data profiling?

It is recommended to always perform one round of data profiling before designing the data model, because it could change your understanding of the data completely. Data profiling is also not a one-time process; it is continuous. Developers must profile the data and understand its nature before starting to code. By understanding the data, a developer can spot anomalies and resolve them early, and can find inconsistencies in the data and perform the transformations required to standardize them.
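A common example of such an inconsistency is the same value coded several different ways across a source. This is a minimal sketch of detecting and standardizing it with pandas; the country codes and the mapping are hypothetical.

```python
import pandas as pd

# Hypothetical inconsistency surfaced by profiling: the same country
# appears under three different spellings in the source.
df = pd.DataFrame({"country": ["US", "U.S.", "usa", "DE", "Germany"]})

# A frequency analysis during profiling reveals the variants...
variants = df["country"].value_counts()

# ...and a mapping table standardizes them before the data is loaded.
standard = {"US": "US", "U.S.": "US", "usa": "US", "DE": "DE", "Germany": "DE"}
df["country_std"] = df["country"].map(standard)
```

Catching this during profiling, rather than after the data model is built, is exactly why the round of profiling belongs before design and coding.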

Data profiling is the first and most critical piece of any data integration, data warehousing, data migration or AI/ML model-building project, so always understand your data through the various data profiling methods before using it. It will save you a lot of time, effort and money. It also increases trust in the data among business decision makers and helps them make the right, effective decisions. Prioritization of candidate data stores and data sets for profiling should be based on the business needs and objectives expressed in the data quality plan.
