The Missing Link: The Art of Data Science

For the last couple of years, the term “Data Science” and its practitioners (popularly known as Data Scientists) have seemed to be the most fashionable commodity in the market. Like any other trend, it has also started to become a widely abused term. And that is exactly what raises the question of whether the bubble is going to burst faster than it deserves to.

Somehow, the aura that has been created of late is that Data Scientists are the most powerful segment of the population, second only to politicians, of course. But one thing that should not be overlooked is that with added power comes added responsibility. So the breed of data scientists like us should be more responsible, take a deep breath, and introspect on whether we are on the right track, before it is too late. One needs to understand that a lot of business value is at stake on our recommendations.

To make our discussion more interesting, let us first ask: what are some of the common traits of Data Scientists, and how does one identify them? Just going through the job descriptions for Data Scientists on any of the portals, one can find them: (1) they must be strong in 20-30 different machine learning algorithms, and (2) they must be strong in open-source tools like R and Python. The idea is to onboard a few of these creatures, dump terabytes of data on them, and they will take the organization to its next level. The most worrying part is that many data scientists have started believing this themselves.

This is where the biggest disconnect lies. Neither the algorithm, nor the tool, nor even the data scientist himself can outsmart the logic and understanding of the person actually in command of the process (or business) without the complete support of the respective personnel.

Here comes the “Art” part of data science: the art of understanding the thought process of the individual who is managing the existing system; translating it into business rules (that is definitely an art); exploring alternative algorithms that give consistently better results (on some predefined business parameters); taking a buy-in from all stakeholders (yes, this one is also an art; note that different stakeholders will have different vested interests and egos to be satisfied before they concur); implementing it in the existing framework (the art of minimal effort with maximum return from the end user's perspective); and signing off.

Well, the algorithms and the tools make up a minimal part of the whole cycle, don't they? The hard fact is: yes. We need to appreciate that the sponsor of this kind of transformation initiative is looking at a multi-fold Return on Investment. For them, it does not really matter which algorithm and which tool are used, as long as one can deliver a decent ROI.

So, let us take the discussion on the Art part of data science forward. It requires a lot of people skills and negotiation skills on the part of the data scientist to get the maximum out of a reluctant business team, and that is absolutely person-dependent. The better the consultant, the shorter the overall turnaround time. Beyond that, there are some ground rules that every data scientist should follow, specifically when he is on the ground, buried under terabytes of data. Let us set these ground rules first:

1.   Understand the domain and work very closely with the business. Acquire as much knowledge about the system, domain, and organization as possible, at least from publicly available sources.

2.   Play with the data for the maximum possible time. Data Science starts with the word “Data”; give it its due importance.

3.   Don't be obsessed with the algorithm and the programming. That is the lowest priority for the end user.

4.   Please don’t adopt any tool just because it is free. The short-term gain may lead to long-term devastation.

Let us now discuss each of these pointers in detail:

•   Domain Understanding: Most data scientists have minimal exposure to the domains they are assigned to work in. Even if they have some prior exposure to the domain, it is the bare minimum compared to the team of individuals who are managing the show day in, day out. So it is very important to gain some prior knowledge before initiating any work. This matters all the more because, the moment one enters into a discussion with the business team, both sides should be talking the same language. The acronyms they use, the KPIs they track, the way they handle anomalies (and normal outcomes): all of these need to be noted closely. One should try to add value to the KPI tracking mechanism or surface some data discrepancy. These small steps help a lot in winning over the end consumer, and they help the business open up and support the initiative. If one can do this, he/she has won half the battle even before getting the data.

•   Communicate with the Data: We tend to believe that data is only meant to be poured through the floodgates of a machine learning algorithm, and whatever result comes out gets reported. Unfortunately, that is not how it should be. One should spend the maximum time of the analysis period on understanding the data. In fact, each dataset wants to communicate with us, wants to tell us its own story. The true art of data science is to give a patient hearing to what the data wants to communicate; only then can one diagnose the problem and prescribe remedial treatment. Interesting, right? Now, how does the data communicate? It communicates by providing rules, by providing anomalies, by providing patterns. One needs to identify as many of these as possible, validate them with the SMEs in the business, and plan the next course of action. The action can be to recommend a pattern that was unknown to the business, to drill down further on certain anomalies and patterns, or to go to the next level of pattern recognition with multiple parameters together. A minimal sketch of this kind of listening is given below.
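
To make this concrete, here is a minimal sketch in Python with pandas of this kind of listening. The file name (sales.csv) and the columns (units, region) are hypothetical stand-ins for whatever the actual dataset contains:

import pandas as pd

# A minimal exploratory sketch: "listening" to a dataset before any modelling.
# The file and column names below are hypothetical.
df = pd.read_csv("sales.csv")

# 1. Let the data state its basic facts: types, ranges, missing values.
print(df.describe(include="all"))
print(df.isna().mean().sort_values(ascending=False))  # share of missing values per column

# 2. Listen for anomalies: values far outside the interquartile range.
q1, q3 = df["units"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["units"] < q1 - 1.5 * iqr) | (df["units"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential anomalies to validate with the business SMEs")

# 3. Listen for patterns worth taking back to the business.
print(df.groupby("region")["units"].agg(["mean", "median", "count"]))

Every number such a script prints is a conversation starter with the SMEs, not a conclusion.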

•   The Simplest Algorithm Gets Maximum Brownie Points: Contrary to popular belief, it has been proven beyond doubt that the algorithm or logic that is simplest in nature gets the maximum acceptance. The reason is very simple: the simpler the algorithm, the easier it is for the end users to comprehend and consume. It is also very important to always set a business benchmark (that is, the business rule and the benchmark of the KPIs of interest for that rule) before jumping into one's own algorithm development. The most interesting part is that it is very difficult to outsmart these business rules significantly. Moreover, there is hardly any difference between the performance of algorithms if they are developed with due diligence. So, the simplest algorithm that improves the existing performance benchmark with some degree of significance should be recommended. In a situation where we are not convinced whether the gap is significant enough, we should not recommend any new algorithm and should advise continuing with the existing business rule; the interpretation is that this is the best possible alternative under the current circumstances. If the algorithm is too difficult to comprehend, rest assured that despite one's best wishes, the chances of the algorithm getting a truly active life are minimal. A sketch of benchmarking a candidate algorithm against a business rule follows below.
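
To make the benchmarking idea concrete, here is a minimal sketch in Python with scikit-learn. The synthetic data, the single-feature threshold rule, and the two-percentage-point margin of significance are all assumptions for illustration; in practice the rule and the margin come from the business:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data standing in for a real business problem.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# The existing business rule, reduced here to a simple threshold on one feature.
# In practice this rule comes from the SMEs, not from the data scientist.
rule_pred = (X[:, 0] > 0).astype(int)
rule_accuracy = (rule_pred == y).mean()

# The candidate algorithm, kept deliberately simple.
model_accuracy = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Recommend the model only if it beats the business benchmark by a
# pre-agreed margin; otherwise, continue with the existing rule.
margin = 0.02  # an assumed, business-defined threshold of significance
if model_accuracy > rule_accuracy + margin:
    print(f"Recommend the model: {model_accuracy:.3f} vs rule {rule_accuracy:.3f}")
else:
    print("Continue with the business rule: the gap is not significant enough")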

•   Penny Wise, Pound Foolish: One thing that surprises me is the obsession of many data scientists with the programming language. They are really obsessed with freeware like R and Python. With due respect to them, and definitely acknowledging that such software has its own set of advantages (primarily, it saves a lot of software cost), I have my doubts about whether we should recommend large-scale implementations on these environments at all. The three major concerns are:

O Who will provide IT support for these tools? There are no dedicated support services.

O Why would one rely on a tool which, for the same technique and the same analysis, provides different answers depending on the library one chooses? Please note that every decision we take in an actual production scenario has a tangible business value attached to it. These inconsistencies can hurt the reputation of the data scientist concerned, as well as of data science as a whole. (A concrete instance is sketched in the first example after this list.)

O Why would one take chances with a tool whose newer versions do not support all previous functionality? Consider a situation: a stable system running 500-odd models, primarily built around one particular algorithm. Then a new model using a newer technique is put into production, and the existing production environment is upgraded accordingly. By the next day, all 500 existing models fail because the library their functions were mapped to does not exist in the upgraded version. So we need to rebuild all the models. Think of the business impact of such a scenario. Why let that happen? (A minimal safeguard is sketched in the second example after this list.)
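
On the second concern, one instance I am aware of (and I stand to be corrected) is that scikit-learn's LogisticRegression applies L2 regularization by default, while statsmodels' Logit fits a plain maximum-likelihood model, so "logistic regression" on the same data yields different coefficients depending on the library one picks:

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

# scikit-learn: L2-regularized by default (C=1.0).
sk_coef = LogisticRegression().fit(X, y).coef_.ravel()

# statsmodels: plain (unpenalized) maximum-likelihood logistic regression.
sm_coef = sm.Logit(y, sm.add_constant(X)).fit(disp=0).params[1:]

# The "same" logistic regression, two libraries, two sets of coefficients.
print("sklearn:    ", np.round(sk_coef, 3))
print("statsmodels:", np.round(sm_coef, 3))

Here the difference is by design and documented, but it shows how the same nominal technique can put different numbers into production depending on library defaults.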
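
On the third concern, a minimal safeguard (not a cure) is to pin exact library versions in production and smoke-test every serialized model against any proposed upgrade before rollout. The version numbers, file names, and directory layout below are hypothetical:

# requirements.txt: pin exact versions in production, so that an upgrade
# is a deliberate, tested decision rather than an accident.
#   scikit-learn==1.4.2
#   pandas==2.2.1
#   numpy==1.26.4

# A tiny smoke test to run against any proposed upgrade before rollout:
# reload every serialized model and score a held-out sample.
import glob
import joblib
import numpy as np

X_check = np.load("holdout_sample.npy")    # hypothetical held-out data

for path in glob.glob("models/*.joblib"):  # hypothetical model store
    model = joblib.load(path)              # fails loudly if the upgrade broke it
    _ = model.predict(X_check)             # verify the model still scores
    print(f"{path}: OK")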

I am open to recalibrating my thought process and accepting my ignorance on the above concerns; maybe someone with better exposure can correct me. But for the time being, let me stick to my position.

To conclude, let me address a question that has been asked consistently: how long will this craze for Data Science go on? Let me offer an analogy. LP records survived two to three decades at their peak, music cassettes one decade, CDs four to five years. The longevity keeps shrinking, not because the predecessor's quality was better, but because we get a better alternative in an ever shorter span of time. Going by the same logic, if Analytics has survived over two decades at its peak, it would not be surprising if Data Science wraps up in five to eight years. It is now up to the breed of data scientists to ensure it is closer to eight years than five.

Comments

Good one… I would like to add only one small point: not a single business house would like to share its trade secrets, hence the role of the Data Scientist is only to hunt for patterns in masked data. That is, their activities are confined to descriptive and predictive analytics; the prescriptive part lies entirely in the court of the SMEs. In that sense, Data Scientists themselves act as tools…

Nice article, well analysed. I fully agree that more time is required for the analysis part. However, I have a slightly different opinion about the algorithm part you highlighted, or rather downplayed. I believe it is not fully true; the algorithm is equally important. Analysis is the phase where the story and the patterns are extracted from the data, and it is very important to extract the correct relations and patterns within the data. Here comes the knowledge of pattern recognition. Good analysis helps one understand the features and their relations. But to classify unknown data, or even in the unsupervised case, it matters which algorithm is used. A linear algorithm may not produce correct results for a non-linear pattern; some algorithms do not work well on noisy data; many get trapped in local optima; and so on. It is also important to know the mathematics behind the algorithm. Most classifiers use Euclidean distance by default, yet the similarity/dissimilarity measure plays an important role in classification, and a particular measure does not work equally well on all kinds of data. Therefore, I think the analysis should consider the method as well. A valuable result is possible only with the combination of a detailed analysis of the data, an understanding of the pattern or story behind the data, and the correct selection of the method or algorithm. That is my personal opinion. I also agree with your view on open source: the long-term overhead is higher, as I have found in many cases when analysing the overall long-term benefit. The immediate capex is low, but three years or more down the line, the overhead, or even the cost, is higher. The decision depends entirely on the specific case and on how critical the application is.

One of the pertinent picks… well written.
