Data Scientist: Full Stack Problem Solver
I want to take the opportunity to present a new definition of a data scientist: a “full stack problem solver”. I conduct a lot of job interviews for data scientists, take a fair number of informational interviews with aspiring data scientists, read a lot of questions on Quora and elsewhere from aspiring data scientists, and I’ve read the curricula from data science programs. While I see some forms of convergence, I still find a remarkable lack of clarity regarding the discipline, and it is for this reason that I propose this new definition. By “full stack problem solver”, I mean that the data scientist is both a horizontal and a vertical multi-disciplinarian.
Most analyst positions - those outside of government or bloated behemoth corporations - require horizontal multi-disciplinarians. Whether in a business intelligence group or in vertical, functional roles, these analysts require a relatively broad view of the company. Even as relatively low-level tacticians, these individuals think and operate strategically to “look at the big picture” and “put the pieces together”. Yet from a vertical perspective, I will argue, many analyst positions are tightly restricted.
From the analyst perspective, there is data, there is interpretation, and there is effective communication to an internal customer. For the analyst, this is sufficient. What’s more, finding people who can perform this role well is extremely difficult, and these people provide enormous value. Many of these analysts also have strong skills retrieving data, using some flavor of SQL and sometimes even writing queries on top of HDFS with Hive or Apache Pig. The limitation we come to, I believe, is that for most analysts, the data that exists in a company’s data warehouses represents the sum of reality, “all that there is”, when it is in most cases anything but.
The data that analysts typically rely on tends to be highly structured and heavily curated. On the whole, this is a valuable arrangement, as analysts tend to be heavily involved in high-frequency, quick turn-around operational projects where having that structure and clarity commands a high premium. Nevertheless, what we gain is not free; in exchange for that structure and clarity we must sacrifice something, and that sacrifice is typically paid in information loss. Just as we gain a simple, easily interpretable summary statistic when we take an average, we achieve this at the expense of the variance found in the full distribution of the data from which the average was computed.
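To make the point about averages concrete, here is a minimal sketch in Python. The two series are made-up numbers (hypothetical daily order counts for two stores), chosen only so that they share a mean while differing in spread.

```python
from statistics import mean, pstdev

# Hypothetical daily order counts for two stores (made-up numbers).
steady = [100, 100, 100, 100, 100]
volatile = [20, 180, 60, 140, 100]


def summarize(values):
    """Return the mean and the population standard deviation."""
    return mean(values), pstdev(values)


steady_mean, steady_spread = summarize(steady)
volatile_mean, volatile_spread = summarize(volatile)

# Both stores report the same average...
print("means:", steady_mean, volatile_mean)
# ...but only the full distribution shows how different they are.
print("spreads:", steady_spread, round(volatile_spread, 2))
```

A table that stores only the average would treat these two stores as identical; the difference lives entirely in the detail that the summary discards.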
Beyond the aggregation performed in constructing a data warehouse, myriad routine operational decisions are made that limit the ultimate availability and fidelity of the data of interest. Some engineers decide what data to log, others decide what information needs to be passed between services “just for tracking purposes”, and yet others decide what data is valid, based on time attribution, “good IPs”, or “illegitimate users”. All of these decisions are fundamental not only to what the data is, but also to the universe of possibilities about what the data mean - especially when operational metrics begin to inexplicably misbehave.
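As a rough illustration of how one such upstream rule shapes what the analyst later sees, consider the sketch below. Everything in it is hypothetical: the event fields, the values, and the "good IP" rule are invented for the example and do not describe any particular system.

```python
# Hypothetical raw click events; every field and value here is made up.
raw_events = [
    {"user": "a", "ip": "10.0.0.1", "clicks": 3},
    {"user": "b", "ip": "10.0.0.2", "clicks": 5},
    {"user": "c", "ip": "192.0.2.9", "clicks": 40},
    {"user": "d", "ip": "10.0.0.3", "clicks": 2},
]

# An assumed upstream rule: only addresses with this prefix count as "good IPs".
GOOD_IP_PREFIX = "10."


def is_valid(event):
    """Apply the (hypothetical) validity rule chosen upstream."""
    return event["ip"].startswith(GOOD_IP_PREFIX)


# What reaches the warehouse is only what survives the rule.
warehouse_events = [e for e in raw_events if is_valid(e)]

raw_total = sum(e["clicks"] for e in raw_events)
warehouse_total = sum(e["clicks"] for e in warehouse_events)
dropped = len(raw_events) - len(warehouse_events)

print("clicks logged:", raw_total)
print("clicks in warehouse:", warehouse_total)
print("events dropped by the rule:", dropped)
```

Someone querying only the warehouse sees the smaller total and has no record that a single filtered event accounted for most of the raw activity. Whether that event was fraud or a legitimate heavy user is exactly the kind of question that cannot be asked of the curated data alone.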
In the past few years, much of the focus in data science has shifted to predictive analytics, machine learning, and statistical modeling. I do not mean to downplay those skills, but I do believe they are just another arrow in the quiver of the data scientist. Moreover, they represent a difference in degree rather than in kind: more sophisticated modeling. The ability to understand and ultimately shape that critical but unseen part of the stack, the upstream decisions about what is logged, passed along, and counted as valid, defines the crux of the data scientist’s unique contribution. By capitalizing on this particular contribution, the capstone in the full analytical stack, we distinguish ourselves - as a discipline and as individuals.
P.S. If you were wondering, yes - choosing which stack of pancakes to use as the photo for this piece took an enormous amount of time.
Very interesting article. Clarified the role of data scientists and also the distinction between data scientists and data analysts. Thanks
“Looking at the big picture” while “putting small pieces together” - Well said! :)
Full stack problem solver. I like that. I think another way to describe what a data scientist does is to compare data to oil. In many ways, data scientists turn data into something of value to people and organizations. Your thoughts?
Thank you for articulating my myriad of jumbled thoughts into something I can share and communicate!