What is 'Data Profiling'?


Data profiling is the process of understanding the data: its characteristics and its quality. It incorporates a collection of analysis and assessment algorithms that, when applied in the proper context, provide empirical insight into the potential issues that exist within a data set.

In short, it is like looking at the data's structure and content so that you can better understand how relevant and useful it is, what it is missing, and how it can be improved.

The question now is: how do you do it?

Before starting the data profiling process, we need to finalize the sources that will be used in the project. Usually this is based on the requirements of the target, such as the data warehouse, a business intelligence report, or some third-party data source. Once we identify the sources, each of the tables in those sources must be profiled. We need to prioritize the tables in each source based on their criticality or usage, and then profile them one by one.

Data profiling reports usually capture the following at the column level. The statistics captured are minimum, maximum, mean, mode, percentiles, standard deviation and frequency, that is, how many times a value repeats, along with variation and aggregated sums. The metadata-related profiles cover data types, length, uniqueness or duplicates, occurrence of null values, string patterns, etc. At the table level, the report covers primary key analysis, foreign key analysis, referential integrity analysis, and so on.
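As an illustration, the column-level statistics and metadata above can be computed with pandas. This is a minimal sketch using a made-up sample table; the `profile_column` helper and the column names are assumptions, not part of any particular tool.

```python
import pandas as pd

# Hypothetical sample data standing in for a source table.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 4],
    "country":     ["US", "US", "DE", None, "DE"],
    "amount":      [10.0, 25.5, 25.5, 40.0, None],
})

def profile_column(s: pd.Series) -> dict:
    """Capture the column-level statistics and metadata described above."""
    profile = {
        "dtype":      str(s.dtype),                      # data type
        "null_count": int(s.isna().sum()),               # occurrence of nulls
        "distinct":   int(s.nunique(dropna=True)),       # uniqueness
        "duplicates": int(s.duplicated().sum()),         # duplicate values
    }
    if pd.api.types.is_numeric_dtype(s):
        profile.update({
            "min":  s.min(),
            "max":  s.max(),
            "mean": s.mean(),
            "std":  s.std(),
        })
    else:
        # Frequency: how many times each value repeats.
        profile["top_frequencies"] = s.value_counts().head(3).to_dict()
    return profile

report = {col: profile_column(df[col]) for col in df.columns}
```

The resulting `report` dictionary is the column-level portion of a profiling report; mode, percentiles and the table-level key analysis would be added the same way.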

There are many tools that can be used to perform data profiling, or you can simply write SQL queries and perform the analysis yourself. The SQL approach works only if the source is a database, however, and not if you have heterogeneous sources such as Excel files, text files, etc.
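The hand-written SQL approach can be sketched as follows. This example uses an in-memory SQLite database with a made-up `orders` table; the table and column names are assumptions for illustration only.

```python
import sqlite3

# Hypothetical source table loaded into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "NEW"), (2, 25.5, "SHIPPED"), (3, None, "NEW"), (3, 40.0, None)],
)

# A typical profiling query: row count, null count, distinct values, value range.
row = conn.execute("""
    SELECT COUNT(*)                  AS row_count,
           COUNT(*) - COUNT(amount)  AS amount_nulls,
           COUNT(DISTINCT order_id)  AS distinct_order_ids,
           MIN(amount),
           MAX(amount)
    FROM orders
""").fetchone()
```

One such query per column quickly adds up, which is why a dedicated tool is preferable once there are many tables, or any non-database sources.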

It is recommended to use an efficient tool that can help you speed up the data profiling process. A few open-source data profiling tools are Talend Open Studio, Quadient DataCleaner and Apache Griffin. Data governance suites such as Collibra, Informatica, Talend, erwin by Quest and SAP Master Data Governance provide data profiling features too. For quick checks you can use R, with packages like dplyr, funModeling, haven, corrplot and tidyverse, and for Python there is pandas.

When to do data profiling?

It is recommended to always perform one round of data profiling before designing the data model, because it could change your understanding of the data completely. Data profiling is also not a one-time process; it is continuous. Developers must profile the data and understand its nature before starting to code. By understanding the data, a developer can spot anomalies and resolve them early, and can find inconsistencies in the data and perform the transformations required to standardize them.
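A common example of such an inconsistency is the same value coded several different ways across a source. This is a minimal sketch of detecting and standardizing it with pandas; the country codes and the mapping are hypothetical.

```python
import pandas as pd

# Hypothetical inconsistency surfaced by profiling: the same country
# appears under three different spellings in the source.
df = pd.DataFrame({"country": ["US", "U.S.", "usa", "DE", "Germany"]})

# A frequency analysis during profiling reveals the variants...
variants = df["country"].value_counts()

# ...and a mapping table standardizes them before the data is loaded.
standard = {"US": "US", "U.S.": "US", "usa": "US", "DE": "DE", "Germany": "DE"}
df["country_std"] = df["country"].map(standard)
```

Catching this during profiling, rather than after the data model is built, is exactly why the round of profiling belongs before design and coding.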

Data profiling is the first and most critical piece of any data integration, data warehousing, data migration or AI/ML model-building project, so always understand your data through the various data profiling methods before using it. It will save you a lot of time, effort and money. It also increases trust in the data among business decision makers and helps them make the right, effective decisions. Prioritization of candidate data stores and data sets for profiling should be based on the business needs and objectives expressed in the data quality plan.
