Data Exploration using data profiling tools on Power BI with Power Query
Data Exploration is important. It is part of the initial transformation stage of understanding your data before finally visualizing the data.
Some Data Exploration Activities on Power Query
There are different ways to explore and prepare your data but I will only talk about those from the data profiling view that can be done from the "View Tab" on the power query. On this tab, you can preview your data by column to better drill down on each entry to see how it affects the entire column as a whole.
This is initially set to work on the first 1000 rows of the column. However, the settings can be changed to the entire column from the lower-left corner of your editor window from Column profiling based on top 1000 rows to column profiling to Column profiling based on entire data set.
1. Data Quality: When you check "Column quality" on view tab, this will display for each column the percentages of valid values (that is a value that is not null), error (just like the name suggests, having errors like #Value!), Empty (null entries).
The percentages are calculated as the total number of Valid/Error/Empty divided by the total number of entries. This helps us understand the quality of data in the column.
2. Data Distribution: The "Column Distribution" shows a bar chart of how the data in each column is distributed by Unique or distinct measurement.
Recommended by LinkedIn
Distinct is a count of each entry once at a time even if it is repeated in the column. However, unique is a count of entries that were never repeated more than once. The higher the number of repeated entries for a value the longer the bar chart representation of that value.
Example: Let's say you have a column of animals like [dog, cow, dog, rabbit, rabbit, dog], there will 3 distinct data (that is dog, cow, and rabbit) but 1 unique data (cow). This helps explain it better.
3. Data Profile: Checking "Column Profile" shows the statistical summary of the selected column. It is a combination of indicators seen in column quality, column distribution, and a lot more depending on the column's data type.
A text column will include {Count, empty strings, min, and max}, a date column will also include {Count, empty strings, min and max, average}, a number column will include in addition {Count, NAN, zero, min, max, average, standard deviation}.
Although this approach helps us preview our data during data exploration, caution should be taken when working with a large dataset as this might make the process of exploration slower. You can read more about using data profiling tools on Microsoft Learn.
great read, well done!