Make time for data profiling
The concept of the data warehouse has been around for some time now, and data has become the lifeblood of the modern business. Disparate sources of data within companies have been integrated or, if not, are in the process of being integrated. The end goal is to give management a comprehensive, all-encompassing view of the business so that better business decisions and forecasts can be made through the medium of management reporting. While this may be the end goal, there are many steps along the way to producing this outcome. The first step is to model and create the target database according to tried and tested data warehouse principles, and then to find the data in the disparate source systems that will populate the new data model. To management the idea is simple: take the data from the source table and whack it into the appropriate place in the target table. Job done. The reality, however, is not quite as simple as it seems.
The first challenge is to find a coding language or application that will facilitate the extraction of data from the data source or sources, perform the required transformations and then load the data into the target database - a process we know as Extract, Transform and Load, or ETL. Business management naturally has a keen interest in the speed of delivery, since it reduces the cost of the project as a whole. Tools such as IBM DataStage, Talend, Microsoft SSIS or Informatica are therefore used to minimise development time.
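To make the ETL pattern itself concrete, here is a minimal sketch in plain Python, assuming a hypothetical customers.csv source file and a SQLite database standing in for the warehouse; the file name, column names and transformation rules are illustrative only, not a reference to any of the tools above:

```python
import csv
import sqlite3

# Hypothetical target: a SQLite table standing in for the warehouse.
conn = sqlite3.connect("warehouse.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS dim_customer (
           customer_id INTEGER PRIMARY KEY,
           full_name   TEXT,
           country     TEXT
       )"""
)

def transform(row):
    """Apply simple standardisation rules to one source record."""
    return (
        int(row["id"]),
        row["name"].strip().title(),      # normalise name casing
        row["country"].strip().upper(),   # e.g. 'uk ' -> 'UK'
    )

# Extract from the (assumed) source file, transform, then load.
with open("customers.csv", newline="") as src:
    rows = [transform(r) for r in csv.DictReader(src)]

conn.executemany(
    "INSERT OR REPLACE INTO dim_customer VALUES (?, ?, ?)", rows
)
conn.commit()
conn.close()
```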
'So just code the jobs using one of these tools and the job's done?'
Not so fast, Mr Manager. We have just ASSUMED that the data from the data sources is all 'clean' and uniform.
'Why would the data from the data sources be anything but clean and uniform?'
For many reasons. The first is that source data, customer details for example, are stored differently on different data sources. In order to merge data from these sources, things like names and addresses need to be standardised into a common format. Then let's say one of our data sources is an Excel spreadsheet. There is little control over the way data is entered into a spreadsheet. A user might press the 'Ctrl' key by accident while typing data into one of the columns, creating control characters that are hidden to the eye but troublesome for data migration and for the target database. Human error while capturing data is also possible in other data sources fed by business applications that lack the proper level of data quality checks at the point where the data is first captured.
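As an illustration of the hidden-character problem, the short Python sketch below strips non-printable control characters from an incoming field value. The cleaning rule is a simplifying assumption; a real migration would decide per field which characters (tabs, line breaks, and so on) to keep:

```python
import unicodedata

def clean_field(value: str) -> str:
    """Remove hidden control characters (Unicode category 'Cc'),
    keeping ordinary printable text intact."""
    return "".join(
        ch for ch in value if unicodedata.category(ch) != "Cc"
    )

# A value as it might arrive from a careless spreadsheet export:
raw = "Smith\x07 & Sons\x00 Ltd"   # contains hidden BEL and NUL characters
print(repr(clean_field(raw)))      # -> 'Smith & Sons Ltd'
```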
All this creates a nightmare scenario for developers involved with the migration of the data to the new data warehouse, but for some reason, management continually fails to take this into account.
'Surely any data problems can be ironed out during development?' they would argue. 'We have to fit the development to the timeline allowed by the project board, so just get on with it'.
What they fail to realise is that poor data quality in the sources can cause a disproportionate rise in development time as the project proceeds. Although ETL tools provide a quicker development environment than line-based coding, and their features are geared towards isolating and correcting erroneous data, the developer first has to build the migration code before discovering that it doesn't work as expected, through no fault in the coding itself. The result is hours spent locating the source of the problematic data. It may be just a few records in a million that cause the entire data flow to abort, and finding that 'needle in the haystack' delays progress and increases development time, much to the despair of the project board. This is just one example of the issues caused by the lack of prior data profiling.
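The usual defence against a whole load aborting on a handful of bad records is a reject-handling pattern: validate each row and divert failures to a reject file for later inspection rather than halting the flow. The sketch below shows the idea in Python under assumed field names (customer_id, name); it is an illustration of the pattern, not any particular tool's reject link:

```python
import csv

def validate(row):
    """Raise ValueError if the row cannot be loaded as-is."""
    if not row["customer_id"].isdigit():
        raise ValueError("non-numeric customer_id")
    if not row["name"].strip():
        raise ValueError("empty name")
    return row

loaded, rejected = 0, 0
with open("source.csv", newline="") as src, \
     open("rejects.csv", "w", newline="") as rej:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(rej, fieldnames=reader.fieldnames + ["error"])
    writer.writeheader()
    for row in reader:
        try:
            validate(row)          # the load step would follow here
            loaded += 1
        except ValueError as err:
            row["error"] = str(err)
            writer.writerow(row)   # divert the bad record, keep going
            rejected += 1

print(f"loaded {loaded} rows, rejected {rejected}")
```

The reject file gives the developer the 'needles' directly, instead of leaving them buried in a million-row haystack after an aborted run.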
'So what's the solution?' asks the manager. 'And how much will it cost?'
Hopefully the manager asks this question before the project begins and not some months down the line. The solution, dear Mr Manager, is to set aside a few weeks at the start of the project to determine things such as a) data quality, b) the distribution of key data, c) cross-table linking, and d) the formatting of master data such as customer names and addresses. This is just a high-level view, and there are many other nuances that can be uncovered and corrected through data profiling, but whichever way you look at it, it is an essential exercise that epitomises the old adage 'a stitch in time saves nine', or 'penny wise, pound foolish'. I'm sure there are more old adages that allude to the same principle but seem to be sadly lacking in management training. The reality is that the project steering committee is typically staffed mostly by people from the non-IT side of the business who don't understand the complexities involved in large-scale data migration. The responsibility for motivating the data profiling initiative therefore lies with the Project Manager, who should at least understand the data migration or ETL process. Sadly, the decision makers often think they know better than the technical staff they hire to do the job.
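To show what those four checks might look like in practice, here is a minimal profiling sketch in Python against SQLite. The staging tables stg_customer and stg_order and their columns are assumptions for illustration, not part of any real schema, and a real profiling exercise would go much further:

```python
import sqlite3

# Assumed staging database holding raw copies of the source data.
conn = sqlite3.connect("staging.db")

# a) Data quality: how many rows have missing or blank names?
nulls, total = conn.execute(
    """SELECT SUM(name IS NULL OR TRIM(name) = ''), COUNT(*)
       FROM stg_customer"""
).fetchone()
print(f"blank names: {nulls} of {total}")

# b) Distribution of key data: frequency of each country code.
for country, freq in conn.execute(
    "SELECT country, COUNT(*) FROM stg_customer GROUP BY country"
):
    print(country, freq)

# c) Cross-table linking: orders whose customer_id has no match
#    in the customer table (orphan foreign keys).
orphans = conn.execute(
    """SELECT COUNT(*) FROM stg_order o
       LEFT JOIN stg_customer c ON o.customer_id = c.customer_id
       WHERE c.customer_id IS NULL"""
).fetchone()[0]
print(f"orphan orders: {orphans}")

# d) Formatting of master data: the same customer name stored with
#    inconsistent casing or spacing signals standardisation work.
variants = conn.execute(
    """SELECT LOWER(TRIM(name)), COUNT(DISTINCT name)
       FROM stg_customer
       GROUP BY LOWER(TRIM(name))
       HAVING COUNT(DISTINCT name) > 1"""
).fetchall()
print(f"names with inconsistent formatting: {len(variants)}")

conn.close()
```

A few weeks spent running and acting on checks like these, before a single ETL job is coded, is precisely the stitch in time the adage promises.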