Is data science taking all the glory?
Data Science is currently a hot topic. I have been working with data for pretty much my whole career as an MDM architect, across multiple customers and industries. Essentially everybody is trying to do the same thing, which is to get the best out of their data, and I have helped clients achieve that.
More recently my work has taken me into the world of analytics. We are using Master Data Management (MDM) techniques and technology to improve how analytics are executed. Over the past two years I have been working in tandem with Capgemini’s data science team in this important domain.
Whilst working with this team I have asked myself a few times “so what exactly is data science, and why does it seem to be taking all the glory?”
According to Wikipedia, “Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured”. Now this sounds very similar to the world that I typically work in, with MDM. Data scientists need to take multiple sets of customer data, join them together, analyse them in various ways and then produce insight into the customer’s business that they wouldn’t previously have had access to.
Therefore, is data science taking all the glory? Well, before we decide, let us consider a couple of points. What do data scientists need? Computers? Applications? Data? Computers are commodity items these days, and applications might take some of the credit away, but when using scripting languages like R and Python, the credit probably stays with the scientist. However, I think it is with the data that there is a point to be made.
In order to get the most out of data science you want to spend as much time as possible on the science, rather than on the data. What do I mean by this? How often, as an analyst or manager, do you find that you’re spending all of your time managing data instead of using it? You want to be able to access the data immediately and consistently, and to have it as accurate as possible, so that the science can be applied.
This is where having a well-designed data platform is pivotal. Instead of having your scientists design and build siloed pots of data each time, give them a platform to work with and let them do what they are good at. Ultimately, managing data should not be a task taking up their time.
A data platform, built with colleagues from both Entity Group and Capgemini, has provided the foundations for my data science colleagues to receive the glory. I am not going to discuss the infrastructure, software and architecture that we have put in place for this, but I am going to talk about the data side of things which I think is far more interesting.
Within this data platform we have implemented what we call our Analytical Data Store (ADS). At first it might sound like any other data mart in a warehouse, but it is much more than that. Within the ADS we have several layers to support our scientists. Firstly, I am a great believer in keeping things simple. The ADS, therefore, has only three logical data layers.
The first is a historical layer, which captures all of our customer’s data sources needed for analysis. The second is what we call our canonical layer, a logical representation of the customer’s business domains. The third is our matched layer, which sits above the second and joins together entities across the business domains.
At the historical layer we do very little in the way of processing data, keeping it as true to the original data source as possible. This is useful for a number of reasons. It reduces the time spent on ingesting the data sources, and it also allows for analysis to be done on the original data, which is extremely important.
The next layer is essentially a canonical model: we transform the data from our historical layer into this model, applying data cleansing routines and standardising the data as much as possible. The canonical model is a logical representation of all of the important entities from across the customer’s entire business. This allows different business functions to analyse, and report on, the same sets of data in a consistent format. Bringing all of the customer’s business domains together in one place allows for the sharing of resources and, more importantly, of data.
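To make the historical-to-canonical step concrete, here is a minimal sketch of such a transform. The field names and cleansing rules are illustrative assumptions, not the actual model used on the platform; the point is that standardisation happens once, in one place, and lineage back to the historical layer is preserved.

```python
def to_canonical(raw: dict) -> dict:
    """Standardise a raw 'person' record into an assumed canonical shape."""
    return {
        "given_name": raw.get("first_name", "").strip().upper(),
        "family_name": raw.get("surname", "").strip().upper(),
        # Normalise the postcode so different systems agree on its format.
        "postcode": raw.get("post_code", "").replace(" ", "").upper(),
        # Keep lineage back to the historical layer.
        "source_system": raw["source"],
        "source_record_id": raw["id"],
    }

record = {"id": "42", "source": "CRM", "first_name": " stuart ",
          "surname": "macdonald", "post_code": "eh1 1aa"}
print(to_canonical(record))
```

Because every source passes through the same routine, downstream users see one consistent shape regardless of how each system originally captured the data.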
The next layer is the Matched Layer, which is where we get our value add. Here is where we are truly using MDM for analytics, which has recently been a hot topic for a number of MDM vendors (IBM, Informatica, Tibco, etc.). We use MDM to build single views of entities for our data scientists.
Why is this important? Well, for those not familiar with typical IT systems: there is not normally a “key” that can be used to join up all the data. We use MDM technology to provide this logical key by matching across data sets and joining them together. We match data across systems based on important entity data domains (e.g. citizen, party, patient, household, organisation). We then produce what we call a “single view of an entity”. This is a talking point in MDM, but essentially it boils down to two approaches: you can either create a single record of truth (Transactional MDM) or a single view of truth (Registry MDM).
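The idea of deriving a logical key by matching can be sketched as follows. This is illustrative only: a real MDM engine uses probabilistic matching, whereas here a crude exact rule on standardised name and date of birth stands in for the matching step, and the record shapes are assumptions. Records that match, directly or transitively, are grouped under one entity key.

```python
from itertools import combinations

def match(a: dict, b: dict) -> bool:
    # Stand-in for the real matching engine: exact name + date of birth.
    return (a["name"].upper(), a["dob"]) == (b["name"].upper(), b["dob"])

def assign_entity_keys(records: list[dict]) -> dict[str, int]:
    """Group records from different systems under one logical entity id."""
    parent = {r["id"]: r["id"] for r in records}  # union-find forest

    def find(x: str) -> str:
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if match(a, b):
            parent[find(b["id"])] = find(a["id"])  # merge the two groups

    roots: dict[str, int] = {}
    keys: dict[str, int] = {}
    for r in records:
        root = find(r["id"])
        keys[r["id"]] = roots.setdefault(root, len(roots) + 1)
    return keys

records = [
    {"id": "crm-1", "name": "Stuart MacDonald", "dob": "1970-01-01"},
    {"id": "erp-9", "name": "STUART MACDONALD", "dob": "1970-01-01"},
    {"id": "web-3", "name": "Marcus Jones",     "dob": "1985-06-30"},
]
print(assign_entity_keys(records))  # crm-1 and erp-9 share one entity key
```

The entity key is purely logical: no source record is altered, which is exactly what makes the registry approach described below workable.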
For analytics we have found that a single view of truth (single view of our entities) works best, especially when different use cases may need to exclude or include data at a record or source level. With a single physical record this becomes much more complex to manage and use, as all the data has been pre-merged together into one record. With a single view, however, the original records still exist in their own right, but are logically grouped together into a view. The user can then simply include or exclude records at any level allowing them to use the canonical layer for analysis and to answer a range of questions.
In addition to establishing a view of these grouped records, we also store the match confidence at which the MDM has joined the records together. This gives the user another dimension by which to filter their view (e.g. they may only be interested in extremely high-likelihood matches for their work).
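A registry-style single view with stored confidences might look like the sketch below. The structure and the scores are assumptions for illustration; the point is that the original records survive intact, so each use case can filter the group to the likelihood band it needs.

```python
# Hypothetical single view: original records grouped logically, each with
# the confidence at which the matching engine joined it to the entity.
single_view = {
    "entity_id": 1,
    "members": [
        {"record_id": "crm-1", "match_score": 1.00},  # anchor record
        {"record_id": "erp-9", "match_score": 0.97},
        {"record_id": "web-3", "match_score": 0.72},
    ],
}

def view_at(view: dict, threshold: float) -> list[str]:
    """Return only the member records matched at or above the threshold."""
    return [m["record_id"] for m in view["members"]
            if m["match_score"] >= threshold]

print(view_at(single_view, 0.95))  # strict use case: two records
print(view_at(single_view, 0.70))  # relaxed use case: all three records
```

With a pre-merged single physical record, this per-use-case filtering would not be possible, because the information about which source contributed what has been collapsed away.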
So how do we do the matching? Well this is where I am going to have to reference a particular MDM vendor (other vendors are available!). In order to create our matched layer we use the IBM MDM Standard Edition (IBM MDM SE) software suite. We do not make use of all of its features, but we do use the Probabilistic Matching Engine (PME). It allows us to provide back to the users a likelihood score of the match that has been made.
The matching algorithms are very powerful and tolerant of typical Data Quality (DQ) issues. They use different matching techniques to determine whether two or more records represent the same entity: for example, do two records held in different systems represent the same person?
Now just having the person’s name is not a great deal of information to go on for the matching algorithm, but it will give us a likelihood score all the same. The IBM engine takes into account the data that it is working with. If it were looking at a set of Scottish data, then it is highly likely that many records contain “Stuart”, “Stewart”, “McDonald” and “MacDonald”, and therefore, given only the name information, it is going to give a very low confidence score.
This is because the IBM engine uses algorithms to determine how close data values appear (e.g. STUART to STEWART compared with STUART to MARCUS) and statistical sampling of the data to learn how much weight each piece of information carries (e.g. STUART is a more common name than MARCUS, yet HORACE is even less common than MARCUS).
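The two ideas combine roughly as sketched below: a closeness measure on the values, multiplied by a rarity weight learned from the data. This is an assumption-laden toy, not IBM’s actual PME algorithm; the name frequencies are invented, and standard-library string similarity stands in for the engine’s phonetic and edit-distance techniques.

```python
import math
from difflib import SequenceMatcher

# Invented frequency table standing in for statistical sampling of the data.
NAME_COUNTS = {"STUART": 900, "STEWART": 850, "MARCUS": 120, "HORACE": 15}
TOTAL = sum(NAME_COUNTS.values())

def rarity_weight(name: str) -> float:
    """Rarer names carry more evidential weight when they agree."""
    return -math.log(NAME_COUNTS.get(name, 1) / TOTAL)

def name_score(a: str, b: str) -> float:
    closeness = SequenceMatcher(None, a, b).ratio()  # 0.0 .. 1.0
    weight = min(rarity_weight(a), rarity_weight(b))
    return closeness * weight

# Close but very common: the agreement carries only modest weight.
print(round(name_score("STUART", "STEWART"), 2))
# Exact and rare: a much stronger signal from a single attribute.
print(round(name_score("HORACE", "HORACE"), 2))
```

Even this toy reproduces the behaviour described above: an exact match on a rare name scores higher than a near-match on a common one, which is why common Scottish surnames alone yield low confidence.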
Obviously, in reality we require more than just one attribute to make a match decision (e.g. name, address, sex, date of birth, phone number). Once we have made a match decision, we store the score at which the match was made, allowing our users to analyse the data from their own business area’s point of view. For example, if you are in marketing you might be happy to group two borderline matches together for analysis, but if you are in healthcare you probably would not. The point is that the matching algorithms give users the ability to dial their matching threshold up and down based on their use case.
One other feature is the ability for users to import their own data into the data platform. Once imported they are able to “run” this data against any number of services to enrich it before it is analysed. Typically they “wash” their data against the matched layer to see which of their records “stick” to data within the ADS. Key to this process is using the matching algorithms again. Users are able to filter and quality check their own data before using it. It also gives them a starting point if they don’t know how this data relates to the business.
The ADS provides our users with access to the historical records, the canonical records and the matched view of those records. They can start their analysis at any point and jump between layers, and because we keep record-level references between layers, they can relate data across them. This makes our data platform extremely powerful. As new data sets are loaded into the canonical and matched layers, the data gets richer and richer. Users don’t have to learn a new schema each time a new source is added: once they have learnt their way around the canonical model, they are able to analyse all of the data, which in turn allows them to drill down to the detail in the historical layer in a targeted manner.
So back to my original question: is data science taking all the glory? Yes, I believe it is, and that’s okay, especially if data scientists are making use of data platforms like the one I have described to allow valuable insight to be realised. It means good data management practices are being applied, which is a prerequisite for analytics to work.
Data Science in my opinion is still relatively immature, and to make sure organisations get the best return-on-investment, they really need to make sure their data is ready and fit-for-purpose.
I hope you have found this interesting. If you are looking for help with your MDM and analytics platforms, consider contacting Entity Group. This blog forms part of a series run by colleagues within the Capgemini Data Science team; if you didn’t get a chance to read Vinod Verma’s blog on Agile Analytics Development you can find it here. In next week’s blog Matthew Thomson will be discussing analytics with Spark.
The views in this article are my own, and may not reflect the views of my employers at Entity Group. Connect with me on LinkedIn.
Some very good points and a perfect example of a use case for registry style MDM.
Stuart, I think you make a valid point - the better the data management and governance, the more useful the overall data platform is for our data scientists. Managing to match entities when you are faced with ever-changing, often sparsely populated, semi-structured XML-type data feeds is a real challenge if sources like this are to contribute any benefit to the data scientists’ results. Whilst we can help to explain the benefit of such an approach at the outset of the project, it is only by the data scientists’ exploitation of the platform and their familiarity with use cases that we are able to qualify the benefits. Articles like this are really helpful in putting across best practice. The bottom line IMHO is that the efficacy of such an approach rests on the quality of the matched entities. And we need both the MDM experts and the data scientists to ensure that analytical business users are able to see for themselves the truth that resides in their data.
I respect the author’s authority and experience in the field, but I didn’t get the punch line. Is data science getting more glory than MDM? I don’t think so, and I don’t even think this is the right or a relevant question. MDM has been going for as far back as I can remember, over 10 years, and it is still gaining traction with a lot of companies. Yes, Data Science (and Big Data) are the new kids on the block, and it is vast, cheap storage combined with the Internet and faster computing that has led to this explosion in data everywhere. Statistical analysis (the basis of data science) has been around for far longer, but was previously only applied in niche scientific or engineering fields. Now that data volumes and speeds have exploded, businesses want to use these statistical techniques to basically improve their financial performance. So I don’t think there’s any glory being taken. Intel don’t take the glory for the impact of their microprocessors on eCommerce, for example, nor do Facebook take any glory for chip design. Everyone has a job to do, and people know that without advances in microprocessors Twitter and Facebook would not be possible.
The first few paragraphs bored me (sorry for being blunt), but by the time I neared the halfway mark of your blog post, every word in it was invaluable. Fantastic post, enjoyed it.