Hiring for Data Science
What is a data scientist? Data science is a phrase that gets used an awful lot right now (late 2019) but it’s ill-defined. A few years ago, I was on the corporate data team at Hearst, which is a very large media company (one of the largest). Sometimes the work I did involved statistics, but mostly I was writing ETL jobs and owning the event pipeline, which received at least one event for every page load and ad served across hundreds of premier properties. We had ‘big data’ at Hearst: that event pipeline processed thousands of events per second, hundreds of gigabytes daily, and we had petabyte-order data in our data lake. At that scale, sometimes I needed to get creative to make things work; I was creating processes to provide real-time insights into very large volumes of data.
Did that make me a data scientist? According to my title that’s what I was. But I didn’t consider myself one; I used statistics sometimes but on the whole I was much more concerned with developing efficient algorithms and optimizing database performance. Eventually I advocated to have my title changed to Director of Data Engineering, because I was directing data engineers; the work we were doing could only be called ‘machine learning’ in a very broad sense (we used some logistic regression to predict targets sometimes), and we certainly weren’t concerned with artificial intelligence.
In fact, we worked alongside a team of data scientists, who were using machine learning and AI to do some interesting stuff. Why was my title ‘Data Scientist’ then? I believe it was because the corporate data team at Hearst was set up in order to, broadly speaking, monetize Hearst’s data through data science/machine learning/AI. That seems to be the somewhat vague but very real view that much of the world (at least the business world) has these days: data is valuable. Big data is untapped monetization. Data scientists unlock that, using machine learning and AI. You see this worldview encapsulated in the taglines for the many new graduate degree programs in data science/big data/ML/AI.
I don’t know how much flak I might get for this, but I’m going to question whether more than a handful of companies in this world need data scientists. This probably goes without saying, but I have nothing against data scientists, either personally or professionally; I have worked with a lot of data scientists over the past few years, and all of them are good at what they do. Pretty much every data scientist I have known has produced some interesting and useful work, sometimes sophisticated statistical models, sometimes detailed analyses, and sometimes fully realized products. Yet in almost every case, the analysis or product ended up being of little ultimate value to the business.
I would argue that this is because companies hiring data scientists are almost always attacking the problem from the wrong direction. There is a lot of hype around data science right now, and I think that a lot of companies hire data scientists because of the perceptions that (a) everyone else is hiring data scientists, and we need to do it too so that we don’t get left behind; (b) some combination of data science/AI/machine learning is the future, so we need data scientists so we don’t get left behind; and (c) there must be value locked into our data, so we need data scientists to unlock it (see my argument two paragraphs up).
I’d say that companies are hiring data scientists more out of a fear of missing out on the ‘data science value proposition’ than out of any concrete idea of how data science will benefit the business. It makes sense to think about this from the perspective of the people who are making these hiring decisions. A lot of startups are hiring a data scientist these days; I’ve been at a couple of them, and I have seen what happens. One way it goes is that the founder has an idea that they think is disruptive, but it involves figuring out how to do this thing better than anyone else does it: it’s the ‘nobody is using machine learning to optimize X’ idea. Another way is investors putting pressure on the CEO: couldn’t you extract more value out of that process if you had a data scientist to make it more efficient? If we can get a 10% improvement on X then we’ll be better at it than anybody else and we’ll patent our methodology and make a killing. Yet another is that as a company hits its rapid growth phase, there’s a sudden realization that we don’t have much insight into what’s driving our growth; we have some ideas because we know we’re growing, but we don’t know which customer segments we’re serving, which ones are churning out, etc.
I’m going to make separate arguments about each of these cases. In the first, the founder with ideas that sound disruptive but vaguely require ‘data science’ to execute… well, just putting it that way, I think it’s obvious that you need to do the legwork on figuring out how it’s going to work. If you think you can use cluster analysis to segment the market in a way no one else has done, you really need to demonstrate that. I suppose hiring a data scientist is one way to do your validation, but you’d best have a backup plan for when the validation doesn’t go your way.
In the second case, investor pressure, I guess it is up to the CEO to decide how best to push back, or figure out a way to do a pilot program. If there are optimizations or monetizable learnings that can be extracted by a data scientist, then a data analyst ought to be able to at least identify the opportunities. But I would advocate for hiring someone on a contingency basis, or a consultant, rather than a full-time data scientist, to run the pilot to see if you can extract real value. Another option is to hand the problem to your best data analyst or most stats-minded engineer, and see what they can do with it.
In the third case, sometimes a data scientist gets hired just because there are a lot of them floating around looking for work (if you think they are in short supply, just try posting a job listing for a data *engineer* and seeing how many applicants you get whose main ‘relevant’ experience is building deep learning models and recurrent neural networks). When this happens they don’t usually end up doing data science per se, which is maybe fine, as long as they’re happy acting as data analysts/data engineers, and as long as they have an aptitude for the work they’re actually doing (it’s a very different skillset!).
There are some companies that have made a business out of using a predictive model to do something that you can’t do otherwise, such as monitoring log data to watch for anomalies in real time. If you’re one of the companies that has used data scientists to build a business model, that’s fine. More power to you. But you shouldn’t be held up as an example of what most companies should be doing. The point is, most companies are not basing their business on a machine learning model. And in practice, most companies are not going to be able to juice their revenue by 10% by optimizing processes through machine learning. Most business processes just don’t work that way. And even if you have a process, or two, that you can squeeze more margin out of through data science, is it enough of a delta to justify a data scientist’s salary? Are there enough optimization projects to keep generating positive ROI for a year of salary, or two or three? What do you do with your data scientist when you’ve squeezed all the extra margin you can?
So to come back around and make this into something useful, here are some points to consider if you’re thinking about hiring a data scientist.
- Data science projects must be carefully aligned with the business’s goals, or they will not contribute much value
- There have to be projects lined up in advance, and there has to be a clearly-defined and realistic expectation around how each project will lift the business (and how much it will)
- There should be a proper forecast done around the expected ROI on a data scientist, taking into account realistic expectations for how long it takes to complete a project, how many projects there are to complete, and scenarios around the likely lift from each one
- For each project, consideration should also be given to the next-best alternative: can this be kind-of done by an engineer, or a data analyst with a bit of stats? Does it really need multiple weeks of research to find an appropriate model?
By the way, if that last question surprised you then you may not really want to hire a data scientist. Every data science project I have ever seen carried out begins with at least a couple of weeks of research: reading up on different statistical models, doing a little testing of techniques, researching how similar problems have been solved in the past. Data scientists are, actually, scientists. Most of them have PhDs for a reason: that’s one of the ways people learn to do really robust research. As I asked at the top of this post, what is a data scientist? I’m not going to posit a definition, but you’d be well advised to come up with your own before you start the hiring process, and then hire to fit your definition.
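As a postscript to the ROI forecast point in the list above, here is a minimal sketch of what such a forecast could look like in Python. Everything in it is an assumption made for illustration: the `Project` and `forecast_roi` names, the scenario labels, the idea that projects run back to back and start paying off only once finished, and every dollar figure in the example. It is a back-of-the-envelope model, not a substitute for a real financial forecast.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    weeks: float       # research plus build time, in weeks
    annual_lift: dict  # scenario name -> extra annual margin, in dollars


def forecast_roi(projects, annual_cost, horizon_weeks=52.0):
    """Net value per scenario over the horizon, with projects run back to back.

    Assumes a project contributes nothing until it is finished, then
    contributes its annual lift pro rata for the rest of the horizon.
    """
    scenarios = list(projects[0].annual_lift) if projects else []
    # Start from the fully loaded cost of the hire over the horizon.
    net = {s: -annual_cost * horizon_weeks / 52.0 for s in scenarios}
    elapsed = 0.0
    for p in projects:
        elapsed += p.weeks
        if elapsed >= horizon_weeks:
            break  # this project would not finish inside the horizon
        remaining = horizon_weeks - elapsed
        for s in scenarios:
            net[s] += p.annual_lift[s] * remaining / 52.0
    return net


# Hypothetical numbers, purely for illustration.
pipeline = [
    Project("churn model", weeks=12, annual_lift={"low": 0, "likely": 80_000, "high": 200_000}),
    Project("pricing test", weeks=16, annual_lift={"low": 0, "likely": 50_000, "high": 150_000}),
]

for scenario, value in forecast_roi(pipeline, annual_cost=200_000).items():
    print(f"{scenario}: {value:,.0f}")
```

The useful part is less the arithmetic than the inputs it forces you to write down: how many projects there really are, how long each one takes once the research weeks are counted, and what the low scenario looks like. If the low and likely scenarios both come out negative, that is a strong hint to try the next-best alternative first.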
Reposted from my blog at leadernode.com.