On Discovery

One of the questions clients frequently ask me is: what is the key to a successful Big Data project? I tell them without hesitation: “assuming all other phases are properly handled, the discovery phase is the key to a successful deployment.”

The discovery phase is the initial stage of a big data project, in which a data scientist helps a client determine the best that can be done with the client’s data to meet their business goals. Most descriptions of discovery mention “looking for key patterns” and/or “business goals.” The problem with such descriptions is that they ignore a key fact: most clients have no idea what they are looking for. That’s where a good, creative data scientist has a key role to play: that of a tour guide to discovery!

How can you find what you are looking for if you do not know what you should be looking for? You cannot use patterns to search for key facts in your data if you do not know which patterns to use, or even which key facts you should be searching for. If Big Data allows you to find needles in a haystack, how do you know that you should be looking for needles instead of, say, nails? You must determine what you should be searching for. But first you must understand the data deeply and envision the types of questions you can ask of it that would give your client a business advantage over the competition. And to do that, the data scientist must also understand the client’s business in depth.

A couple of examples may help explain the discovery process. A soda can manufacturer was replacing a decade-old software process control system with a new one. On weekends they brought the new system online and took the old system offline, so they could debug the new system without affecting production. Their problem was that this debugging and testing approach took months, even years, but they had used it successfully in the past and saw no other way to proceed. I was brought in as an enterprise architect consultant to assess the effort, but once I understood their problem I saw that the solution lay in mining their process data, not in redesigning their enterprise architecture. I took off my architect’s hat and put on my data scientist’s hat.

We first needed to understand the patterns in the transactions between the old process control system and all the other systems it interacted with, systems that had all evolved together. Then we could use the patterns we discovered to see where they were being broken or preserved when the new system was online. Transactions between systems are like conversations between people but, unlike human exchanges, systems follow a strict protocol: any unplanned-for deviation and the conversation comes to an end.

I used a sequence mining algorithm to extract the transaction patterns from the old system, then used those patterns as a reference against which to compare the new system’s conversations. Whenever there was a difference, I checked whether the new conversation recovered or was terminated, and determined whether the new system or the ancillary systems needed to be modified before the conversation could move beyond that point. This method reduced their debugging and testing time by 70%. Now I could concentrate on the enterprise architecture side of the problem.
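For readers who want something concrete, here is a minimal sketch of that kind of comparison. It is a deliberately simplified stand-in, not the algorithm used on the project (which is not named here): it treats each conversation as a list of message types, keeps the contiguous pairs of messages that recur across old-system conversations, and reports the first point where a new-system conversation breaks every known pattern. The message names, the `min_support` threshold and the function names are all hypothetical.

```python
from collections import Counter


def mine_patterns(conversations, n=2, min_support=2):
    """Return the contiguous n-grams seen in at least min_support conversations."""
    support = Counter()
    for messages in conversations:
        # Count each n-gram at most once per conversation.
        grams = {tuple(messages[i:i + n]) for i in range(len(messages) - n + 1)}
        support.update(grams)
    return {gram for gram, count in support.items() if count >= min_support}


def first_deviation(messages, patterns, n=2):
    """Return the index of the first message that completes an unknown n-gram, or None."""
    for i in range(len(messages) - n + 1):
        if tuple(messages[i:i + n]) not in patterns:
            return i + n - 1
    return None


# Hypothetical conversations logged while the old system was online.
old_system = [
    ["REQUEST", "ACK", "DATA", "CLOSE"],
    ["REQUEST", "ACK", "DATA", "DATA", "CLOSE"],
    ["REQUEST", "ACK", "DATA", "CLOSE"],
]
reference = mine_patterns(old_system)

# Hypothetical conversation logged with the new system: the ACK is missing.
new_conversation = ["REQUEST", "DATA", "CLOSE"]
deviation_at = first_deviation(new_conversation, reference)
print(deviation_at)
```

A real sequence miner would also handle gaps between messages and patterns of varying length; the point of the sketch is only the workflow of mining a reference from the old system and then locating where a new conversation departs from it.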

More recently I had the opportunity to consult with a large corporate software company. They tended to sell certain groups of products from their catalog to their clients as part of a sales campaign. Their sales were off and they needed help improving their numbers. Their sales force also exhibited a large amount of churn, which affected their client relationships and sales. They had the industry’s typical sales hit rate of between 10% and 20%. Thus 80% to 90% of their efforts were wasted, but they were used to those odds and baked them into their sales plans.

After cleaning up and examining their last 3 years of sales data, I looked for successful sales patterns based on location, industry type, rep experience and revenue. None of those criteria panned out. Finally, I proposed to test the idea that certain combinations of products sold well to certain types of clients while other combinations didn’t. Some products tend to help sell other products, while other products may prevent those sales.

To extract those product sales patterns, I used the same sequence mining algorithm that I had used to extract transaction patterns for the first client. I took the 3 years’ worth of sales data and split it by date into training and testing datasets. Using the training dataset, I extracted the winning and losing product patterns. Next, I used those patterns to train a variety of machine learning algorithms to predict the sales success or failure rate of every pattern I found, measuring each algorithm’s precision and recall against the testing dataset. Finally, I combined the best algorithms into a final classifier ensemble.
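The evaluation steps in that workflow can be sketched as follows. This is an illustration under stated assumptions, not the project’s code: the record fields (`closed`, `won`), the cutoff date, the three hard-coded model outputs and the simple majority vote are all hypothetical, and a real ensemble could combine its members quite differently.

```python
from datetime import date


def split_by_date(records, cutoff):
    """Older records train the models; records on or after the cutoff test them."""
    train = [r for r in records if r["closed"] < cutoff]
    test = [r for r in records if r["closed"] >= cutoff]
    return train, test


def precision_recall(actual, predicted):
    """Precision and recall for the positive class (a won sale)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def majority_vote(predictions_per_model):
    """Combine several models' won/lost predictions into one ensemble prediction."""
    return [sum(votes) * 2 > len(votes) for votes in zip(*predictions_per_model)]


# Hypothetical sales records: closing date and whether the sale was won.
records = [
    {"closed": date(2015, 3, 1), "won": True},
    {"closed": date(2016, 6, 1), "won": False},
    {"closed": date(2017, 2, 1), "won": True},
    {"closed": date(2017, 5, 1), "won": False},
    {"closed": date(2017, 8, 1), "won": True},
    {"closed": date(2017, 11, 1), "won": False},
]
train, test = split_by_date(records, cutoff=date(2017, 1, 1))
actual = [r["won"] for r in test]

# Hypothetical predictions of three already-trained models on the test set.
model_outputs = [
    [True, True, True, False],
    [True, False, False, False],
    [False, True, True, False],
]
ensemble = majority_vote(model_outputs)
precision, recall = precision_recall(actual, ensemble)
print(ensemble, precision, recall)
```

Splitting by date rather than at random keeps the test honest: the models are judged only on sales that closed after everything they were trained on.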

This final classifier gave me an average score of 0.55 against the testing data. This was not a great result, a 55% chance of a successful sale, but it was still a great improvement over the 10% to 20% sales hit rate the client was experiencing.

I suspected there was implicit data that we were not seeing, and spent an additional 3 weeks looking for it before finding it: important metadata associated with the initial data. After repeating the training/testing exercise with the new metadata, I was able to raise the average ensemble score to 0.75. I could now take a list of my client’s sales leads and their associated products, score them, and assure the client that if they focused on their top-scoring leads, their new sales hit rate would average 75%! That was a 3.75- to 7.5-fold increase over their previous hit rate of 10% to 20%. This result translated into tens of millions of extra sales, cross-sales and up-sales a year!
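The fold figures follow directly from dividing the new hit rate by each end of the old 10% to 20% range. The short sketch below works that arithmetic and shows what “focus on the top-scoring leads” means in practice; the lead names and scores are invented for illustration and are not client data.

```python
def fold_increase(new_rate_pct, old_rate_pct):
    """How many times larger the new hit rate is than the old one."""
    return new_rate_pct / old_rate_pct


new_rate_pct = 75
fold_vs_low = fold_increase(new_rate_pct, 10)   # against the 10% end of the old range
fold_vs_high = fold_increase(new_rate_pct, 20)  # against the 20% end of the old range

# Hypothetical leads with the scores an ensemble might assign them.
leads = {"lead-A": 0.91, "lead-B": 0.42, "lead-C": 0.78, "lead-D": 0.15}

# Rank leads by score, highest first, and keep the top two.
top_leads = sorted(leads, key=leads.get, reverse=True)[:2]
print(fold_vs_low, fold_vs_high, top_leads)
```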

I will not go into the details of what I did during those final 3 weeks – after all, that’s how my clients and I make a living! But they represent the real value of a data scientist to a client. Those creative insights that happen mostly during the discovery phase do make a difference!

So the next time a client complains about spending more than a week in the discovery phase, please feel free to share these true anecdotes with them.


More articles by Joaquin Marques
