Data Bits : Clean your data: wash, rinse, repeat
Overview
As our data volumes and uses continue to grow, ensuring that data is clean becomes even more critical. While clean data SHOULD have always been a requirement in our systems, in today's data-driven world it is even more necessary. Clean data is data that is accurate, consistent, complete and error-free. The increasing reliance on data for business intelligence and insights means that clean data ensures that analysis is correct (or at least as correct as possible).
For years the directions on shampoo intrigued me: Wash, Rinse, Repeat. Perhaps my career slips to far into my daily routine but that seems like an infinite loop to me because there is no exit condition. Regardless, the same directions should be applied to data cleansing: Wash, Rinse, Repeat.
Wash
The first cycle of washing your data removes all the "dirt". Data cleansing processes include:
Data Deduplication
Duplicate and redundant data can lead to inaccurate results, including incorrectly inflated metrics. Identify the unique record identifiers for each data set and ensure they are not repeated. If duplicates are found the rest of the data must be evaluated to determine the differences and resolve the problem. Potential options are to remove the "oldest" or synthesize the change from multiple records into a single one.
Data Correction
Validating correctness involves evaluating each data object for potential issues. Checks include: correct data type, data value is contained within required domain and many others. If issues are identified a standard process should be followed. Options range from correct the data based on pre-defined options to stop the process and alert to the issue.
Data Standardization
Data elements often need to follow a specific format or standard. Some potential data elements that may need this are: phone numbers and postal addresses. If identified options for resolution are similar to Data Correction and range from correct the data based on pre-defined options to stop the process and alert to the issue.
Rinse
In the rinse cycle the data will be reviewed to ensure the wash performed completely. If the wash cycle is functioning correctly this cycle will only alerts on new issues. Processes include:
Recommended by LinkedIn
Verify data is consistent
Confirming consistencies involves validating formats, data lengths (maximum and minimum) and data types. Any issues identified by this process should probably stop the process and send an alert.
Verify data is complete
Confirm that all required data fields contain a value. Assuming the Wash cycle was functioned correctly, existing values outside of the required domain should have already been either flagged for review or automatically corrected. When missing values are identified the process should send an alert and possibly stop the process for correction.
Verify data is accurate
To verify that the data is now accurate compare the washed data with a secondary data set (trusted source or benchmark). Cross-reference this washed data with the trusted source. Any "dirt" identified should alert on what was found and possibly stop the process for additional review.
Repeat
Finally, the repeat phase. Unlike the shampoo directions this truly will be an "infinite loop" for many processes. Processes include:
Document
Documentation is an oft neglected step in many development cycles. However, here it is important to know what issues were identified and how they were resolved. Teams change over the life of a process and by documenting we reduce the amount of "relearning" required due to these changes.
Automate
If at all possible, automate the wash and rinse cycles. Let's face it, doing the same task over and over manually is time-consuming, prone to errors and sometimes mind-numbing. Automate the process, including built-in status alerts, and remove this from your "to do" list.
Summary
Clean data is important, now more than ever. Allocate the time required to wash, rinse and repeat. Your data consumers will appreciate it!