Data Profiling in Power BI(Power Query Editor)
As per the April 2019 update, Microsoft has introduced a data profiling capability in Power BI desktop.
Data profiling helps us easily find the issues with our imported data from data sources in to Power BI.
We can do data profiling in the Power Query editor. So, whenever we are connecting to any data source using Import mode, click on Edit instead of Load so that we can leverage the Data profiling capabilities for any column in Power BI desktop.
Note:By default, Column profiling is based on top 1000 rows, you can change it to show the column profiling based on entire data set.
In Power Query Editor, Under View tab in Data Preview Section we can see the following data profiling functionalities-
- Column quality
- Column distribution
- Column profile
1.Column quality: In this section, we can easily see valid, Error and Empty percentage of data values associated with the Selected table. As you are able to see Green line below this column, it means data is 100 % valid. In some cases, if data is not valid, you can see a different percentage value for selected column. Also, the bottom line for that column will be Red in the case of Error and black or dark gray in the case of Empty. It can be mixed if Some data is Valid and some are empty and with Error.
Note: You can perform various transformations like Remove Errors, Keep Errors, Replace Errors, Keep Duplicates, Remove Duplicates, etc., by clicking on the eclipses (…) or right clicking on the visual.
2.Column Distribution: In this section we can easily see the inline value distribution histogram. For example, # distinct records, # unique records. Once you hover on distributed histogram, you can also see the percentage associated with distinct and unique data for the selected column.
Note: You can perform various transformations like Remove Errors, Keep Errors, Replace Errors, keep duplicates, Remove Duplicates, etc., by clicking on the eclipses (…) or right clicking on the visual.
3.Column Profile: This is one of the most important features. As you can see for the selected column, you can see Column statistics which includes distribution measures like #Count, #error, #empty, #distinct, #Unique, #Empty string, Min, Max
You will also able to see Column distribution that will show a bar graph with the selected column on axis and the bar height will show the count for category variable.
Note: Column profile shows the Column statistics and column distribution for the selected column.
Whereas you can see column quality and column distribution for all the columns present in table at a time.
All features at one glance
Note: In addition to this, there is an M function "Table.Profile" that gives more advanced profile details. For more details here
You can also refer Microsoft documentation on this https://docs.microsoft.com/en-us/powerquery-m/table-profile
Thank you for reading this article!
Great work Amit Kumar!