Data Scientist: the last profession to resist the machine?

There's a steady stream of reports projecting which jobs are most vulnerable to substitution by machine - or, to invoke hyper-Luddite fears, by 'artificially intelligent machine'.

Is the role of the data scientist vulnerable?

By 'data scientist', let's be generous and refer to individuals who design and build algorithms that ultimately automate cognitive processes we today associate with higher-order skills or functions in the economy (the assumption being that machine learning and big data will continue to be the principal ingredients for this).

Researchers speculate that analysts, lawyers, accountants, even doctors are at high risk. Is the data scientist really immune?

A sharper question to start: is the role of a data scientist commoditizable? By 'commoditizable', let's mean that the output of the role becomes so precisely defined that the market will purvey it with a narrow band of variation in quality. The implication: having a top-notch data scientist (whatever that comes to mean) does not bring hugely disproportionate gains.

I would claim that everything that is commoditizable in the modern economy is going to be subject to substitution by machine (I'm open to hearing counter-examples). If we can understand why and how the role of a data scientist will resist commoditization, we understand its best defences against the machine. Our question is now: can these defences hold?

Some of the warning signs:

* Much of a data scientist's time is spent cleaning and organising data: 'pre-processing'. These are not higher order skills. 

We can imagine how these tasks could be performed by lower-skill individuals, and eventually fully automated.

* Much of the heavy lifting of running advanced algorithms is performed by open-source libraries and packages (e.g., the famous scikit-learn in Python). 

An 'executing' data scientist does not have to design new methodologies or figure out a complex set of instructions to implement - many advanced algorithms run with a few lines of code, copied and pasted with a little tweaking of inputs and parameters.
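To make the point concrete, here is a minimal sketch of what 'a few lines' means in practice, assuming scikit-learn and its built-in synthetic data generator (no particular business dataset is implied):

```python
# An advanced ensemble model, trained and scored in a handful of lines.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fitting the model is one line; tweaking inputs and parameters is
# often the only 'design' work the executing data scientist does.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(f"accuracy: {clf.score(X_test, y_test):.2f}")
```

The library does the heavy lifting; the code is essentially boilerplate around it.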

* Today, data scientists operate much like software developers (i.e., coding in Python/R from an IDE). Software developers were once in short supply, but once supply caught up with demand, it's fair to say they became commoditized: the IT outsourcing wave of the 1990s was a consequence.

So what are the defences, and how will they resist the machine? Five in particular:

1) Interpreting machine-led feature engineering: feature engineering is understanding how to manipulate variables to extract the best signal for the particular problem being solved. Machines already exist (and will only get better) that can cycle through a large range of possible feature combinations to find which perform well. The intelligent observer is then left with the task of separating signals with legitimate correlation from the flukes. You could argue that the intelligent observer can come from business roles (i.e., not a trained data scientist). However, features are constructed as mathematical combinations - e.g., multiplications showing interaction, or the use of log and exponent functions - and these need communication (rather, translation) for all the implications to be understood by business leaders. As much as this is a management platitude (!), I'm forced to concede: great communication will be critical - see also point 3.
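A minimal sketch of the mechanical half of this process, assuming scikit-learn (1.0 or later, for `get_feature_names_out`): the machine enumerates interaction features and ranks them by a univariate score, and the human is left to judge which high scorers are legitimate signal rather than flukes.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic raw data; the machine generates candidate features mechanically.
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)

# Cycle through interaction terms (multiplications between variables).
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_candidates = poly.fit_transform(X)
names = poly.get_feature_names_out(["x0", "x1", "x2", "x3"])

# Rank candidates by univariate F-score. The ranking is automatic; deciding
# which high scorers reflect real-world mechanisms is the human's task.
scores, _ = f_regression(X_candidates, y)
ranked = sorted(zip(names, scores), key=lambda pair: -pair[1])
for name, score in ranked[:5]:
    print(f"{name}: F = {score:.1f}")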

2) Problem-algorithm matching: a higher order function is matching the algorithmic approach to the particular business problem being solved. 

In theory this could be done by a systematic decision-tree process - answer a set of questions on your problem, and the rule-set or machine tells you what to use. 
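Such a rule-set could look something like the following toy sketch; the questions and the problem-to-algorithm mappings here are entirely hypothetical, chosen only to illustrate the decision-tree idea:

```python
# Hypothetical rule-set: answer a few questions about your problem,
# and the rules tell you which algorithm family to try.
def suggest_algorithm(target_type: str, n_samples: int, interpretable: bool) -> str:
    """Toy decision tree mapping problem traits to an algorithm family."""
    if target_type == "category":
        if interpretable:
            return "decision tree / logistic regression"
        return "gradient boosting" if n_samples > 10_000 else "random forest"
    if target_type == "quantity":
        return "linear regression" if interpretable else "gradient boosting"
    return "clustering (no labelled target)"

print(suggest_algorithm("category", 50_000, interpretable=False))
```

The mechanics are trivially automatable; it is the exceptions to the rules, as the next paragraph argues, that resist this treatment.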

However, often some innovation is needed to get the best match; many algorithmic approaches require tweaking of the parameters or understanding how to combine a set of algorithmic steps to create a fully-functioning system. 

Rather than analogies with web or app developers, a better comparison is with the UI/UX designer - excellent UI/UX design has a non-linear impact on a product's likelihood of success, and the variation among talent remains wide.

In theory, a machine could cycle through all the potential approaches and spit out every input-output permutation, critically also listing the assumptions and implications. 

However, these assumptions and implications will be expressed in mathematical and statistical language; a role is needed to bridge the understanding of the real-world problem with the implications for the maths and stats that underlie the learning process.

3) Unpacking the box: businesses will need to understand how the algorithm has functioned, and what insights emerge from the process. The data scientist is armed with the knowledge of how the algorithm has run, and with that knowledge is best placed to communicate it to the business. Great communication resists commoditization. Finer elements include how to temper insights, and how to communicate the certainties and uncertainties of the statistical learning process.

Further, the actual interpretation of what has happened and what can be learned is a meta-process to the actual running of the algorithm. It is undoubtedly a higher-order skill; it is also not a skill for which there is any legitimate line of sight as to how machines would perform it. Knowing where and how to find insights is the craft.
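One small illustration of unpacking the box, using scikit-learn on synthetic data: a random forest will report which features drove its decisions, but translating those numbers into business meaning is the step no library performs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a business dataset.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# The model hands back raw importances; the data scientist must explain
# what 'feature 2 carries 40% of the signal' means for the business.
for i in np.argsort(clf.feature_importances_)[::-1][:3]:
    print(f"feature {i}: importance {clf.feature_importances_[i]:.2f}")
```

The output is a ranked list of numbers, not an insight; the insight is made in the translation.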

A well-known quip is that data science is torturing the data until it confesses - attributed to the Nobel Prize-winning economist Ronald Coase (author of the famous Coase theorem on transaction costs and property rights). 

And if so, then we do need good torturers to ensure we get legitimate confessions (the deterioration in the signature of the Gunpowder Plot conspirator Guy Fawkes after his interrogation comes to mind - if only machines could betray their deteriorated signatures before we trust their conclusions!)

4) The practitioner in combining: big data, small data, no-data and made-up data. See my upcoming post on this.

5) Measuring performance of the system. These systems will be judged by their ultimate performance against conventional business impact metrics (an area where an abundance of roles claim prowess). Intermediate metrics are needed to understand how the system is performing, and to diagnose what is happening.

Choosing those metrics effectively requires an understanding of the algorithms, and the data scientist is best placed here. Until machines understand the full context of business problems and combine this with a knowledge of how algorithmic systems perform (both inconceivable right now), the data scientist has her seat secure.
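A sketch of the distinction, on synthetic imbalanced data with scikit-learn: headline accuracy can look healthy while intermediate metrics (precision, recall) reveal how the system actually treats the rare class - the kind of diagnosis that requires knowing what the algorithm is doing.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced problem: roughly 90% of cases are the majority class.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Accuracy is the 'business-facing' headline; precision and recall are
# the intermediate metrics that diagnose behaviour on the rare class.
print(f"accuracy:  {model.score(X_te, y_te):.2f}")
print(f"precision: {precision_score(y_te, pred):.2f}")
print(f"recall:    {recall_score(y_te, pred):.2f}")
```

Which intermediate metric matters - and at what threshold - depends on both the algorithm and the business context, which is precisely the point.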

The quest for true human-replacing AI still has time to run. Like all technology projections, we are likely myopic, without a fair view of the medium-to-long term.

So with that disclaimer made, let me leave you assured that the defences of the data scientist are impervious to AI... for now.

(Of course, we still need to argue that all other professions do not have similar defences - let's leave that for a follow-up post.)



More articles by Avi Patchava
