Being a Data Scientist
This article is a brief summary of my own manifesto, designed to reflect on what data scientists do and some of the best practices they abide by to keep bias from creeping into the data mining process. I have structured the article around the four stages we would normally expect in a prototypical data science project. Within each stage, every point opens with a maxim, a question, or an ethical commitment. I hope you find it useful.
Problem Formulation
Data scientists decompose a business problem into subtasks. This is one of the hallmarks that distinguishes a data scientist: exercising enough prudence to understand what subtasks can be formed, and whether those subtasks can be broken down further into smaller ones. More often than not, solutions to subtasks can lead to human action through potential business applications. Decomposition also makes the data mining process more efficient, sparing the team from reverting to the problem formulation stage repeatedly and "reinventing the wheel", as Foster Provost and Tom Fawcett put it in their book, Data Science for Business.
“How can I reframe this as a prediction task?” Data science aims to produce solutions that are accurate and useful, and so data scientists favour supervised techniques over unsupervised ones wherever possible. Framed this way, a business problem becomes clearer as it is broken down further and relevant questions are asked, such as “what other attributes are important for attaining an accurate target result?” This inevitably shapes the later stages of the pipeline and determines whether the end result can be communicated in a way that is explicable to stakeholders (who may not possess the quantitative skills to comprehend models easily).
“The world doesn’t hand you models” Not only do data scientists determine how a business problem can be broken down into smaller pieces, they are also able to correctly match each of those pieces to a specific data mining algorithm. This takes creativity, but it also takes experience and some failing along the way.
Data Collection & Cleaning
“...results of large samples deserve more trust than smaller samples...” Data scientists understand the importance of being careful when a decision-making process relies on a small data set. Insights derived from a small sample might not give an accurate account of what is actually happening, and extreme results are more likely in small samples than in large ones. When analysing data, data scientists therefore proceed mindful of the bias that lurks in small samples, so as to avoid false conclusions.
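The point about extremities in small samples can be checked with a quick simulation. This is a minimal sketch with made-up numbers: it counts how often a fair coin produces an "extreme" result (70% or more heads) at two sample sizes.

```python
import random

random.seed(0)

def extreme_rate(sample_size, trials=2000, threshold=0.7):
    """Fraction of samples of a fair coin whose observed head rate
    reaches the threshold -- an 'extreme' result."""
    count = 0
    for _ in range(trials):
        flips = [random.random() < 0.5 for _ in range(sample_size)]
        if sum(flips) / sample_size >= threshold:
            count += 1
    return count / trials

small = extreme_rate(10)    # small samples: extremes are common (~17%)
large = extreme_rate(1000)  # large samples: extremes all but vanish
print(small, large)
```

The small-sample rate comes out far higher, which is exactly why a striking result from a handful of observations deserves extra scrutiny.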
“...find trustworthy raw data and do your own analyses to learn a “truer truth.”” Data scientists follow this rule religiously. They run several checks on the data, especially when it was produced by error-prone humans. This step is vital for mitigating any potential biases lurking within it. Data scientists do not underestimate exploratory data analysis: they do it often and appreciate it.
“We are pattern seekers.” Pattern seeking is innate in us, built by the processes of evolution for survival, but data scientists do not jump to causal explanations or hunt for relationships where there are none. A vital part of analysing modelling results is determining whether bi- or multivariate relationships truly exist in the data. It is, after all, an ethical responsibility to produce non-misleading results.
"For the sake of replication and future-proofing." Data scientists document the data cleaning process and modelling results at every step in a standardized manner (e.g., through a version control system). With documented steps, they can run the same scripts on another, larger dataset with the same structure easily. They understand that this inculcates a culture of quality data in an organisation's data mining projects, which leads to higher overall productivity.
Data scientists make their code comprehensible. It is impossible to remember every discussion or process in a data mining project. Moreover, a project may need to scale, or bring colleagues from non-technical departments into the process. Data scientists understand that it is always good practice to add comments and organize their code so that it is readable regardless of the reader's domain expertise.
Data Analysis and Modelling Stage
Data scientists get around the “black-box” problem. They understand that ethical data science practice entails interpreting not just the outputs of a model but its inner functioning as well. When black-box models are misapplied to sensitive data such as financial, health, or hiring records, the results and insights drawn from them can lead to unforeseen consequences that harm society. Data scientists anticipate that some models amplify biases, and they strike a balance between complexity and interpretability.
“Is there a simpler model that I can pick?” Data scientists often ask themselves this when deciding on a modelling technique. One of the main ways to reduce the likelihood of overfitting is to constrain the complexity of the model. For instance, if a neural network shows poor accuracy on the test dataset, one should consider removing layers to make the network smaller.
Early stopping is effective and simple, so do it often. As a model is trained iteratively and its performance on held-out data starts to degrade, data scientists halt the training process. With a k-fold validation set, training should stop at the point of smallest validation error: very often the error decreases at first, then rises again as the model overfits. Early stopping helps minimise generalisation error, thus decreasing the chance of overfitting.
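The rule above can be sketched in a few lines. This is a minimal, framework-free illustration with a hypothetical validation-error curve; real training loops (e.g., in deep learning frameworks) wrap the same "patience" logic around each epoch.

```python
def early_stopping(val_errors, patience=3):
    """Return the epoch to stop at: training halts once the validation
    error has failed to improve for `patience` consecutive epochs."""
    best_err = float("inf")
    best_epoch = 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop
    return best_epoch, best_err

# Typical U-shaped validation curve: falls at first, then rises as
# the model starts to overfit.
errors = [0.9, 0.6, 0.45, 0.40, 0.42, 0.47, 0.55, 0.66]
epoch, err = early_stopping(errors)
print(epoch, err)  # stops at the minimum: epoch 3, error 0.4
```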
Data scientists avoid the problem that comes with multiple testing. They make a well-informed decision about which hypotheses to test instead of testing every hypothesis they can think of. Although more tests may seem like a more thorough examination of the data, they also bring more false-positive results, which only complicate the analysis. It is easy to fool oneself and others if no adjustments are made for multiple testing. Data scientists understand that multiple comparisons are justified so long as they are disclosed as exploratory and the process is documented. They also replicate their findings on a new sample (the test data set) and include the replication in the same report, to ensure robustness. In other words, they do not report results that are just a “function of point and click statistics.”
“Data scientists do not give in to the HARKing temptation.” HARKing stands for Hypothesizing After Results Are Known, which, as the name implies, goes against the principles of scientific research. An analyst is HARKing when they present a hypothesis formed after the analysis as if it had been set out from the beginning. Data scientists understand that HARKing increases the likelihood of committing a Type I error. This unethical practice also wastes resources such as time and money, since it spawns replication studies of effects that do not exist. Even under pressure to publish, data scientists do not change their hypotheses post hoc in the hope of producing results that are eye-catching but built on wrong insights.
“What are some possible confounding variables?” Data scientists always consider whether some variable Z is causing the association between variables X and Y. If they can pin down Z, they can report that the relationship between X and Y is merely a spurious correlation. Only if, after controlling for Z, they still observe a statistically significant association between X and Y do data scientists conclude a possible causal relationship. To control for confounders like Z, they can run randomized controlled experiments, use multivariate regression analysis, or rely on causal inference assumptions backed by logic, business acumen and domain knowledge.
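Regression-based control can be demonstrated with a small simulation. In this sketch (simulated data, not a real study), Z causes both X and Y while X has no effect on Y at all; the raw correlation between X and Y looks substantial, but the partial correlation after regressing Z out of both collapses to roughly zero.

```python
import random

random.seed(7)

# Z confounds X and Y: Z causes both, X has no effect on Y.
n = 10000
zs = [random.gauss(0, 1) for _ in range(n)]
xs = [z + random.gauss(0, 1) for z in zs]
ys = [z + random.gauss(0, 1) for z in zs]

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

def residuals(y, x):
    """Residuals of y after simple least-squares regression on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    beta = (sum((u - mx) * (v - my) for u, v in zip(x, y))
            / sum((u - mx) ** 2 for u in x))
    return [v - my - beta * (u - mx) for u, v in zip(x, y)]

raw = corr(xs, ys)                                # ~0.5: X seems related to Y
adj = corr(residuals(xs, zs), residuals(ys, zs))  # ~0.0: spurious, driven by Z
print(round(raw, 2), round(adj, 2))
```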
“...it can easily bewilder the statistically naive observer.” Data scientists are armed with knowledge of statistics, and hence do not assume that statistical relationships are immutable. The strength of a statistical relationship is, more often than not, affected by controlling for a third variable (covariate). Simpson’s Paradox reminds them that causal interpretations should be made with caution: one can often resolve the paradox by stratifying the data on a third, confounding variable. Looking at aggregated data alone can obscure the true relationship between the variables being studied.
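A concrete illustration of the paradox, using hypothetical counts modelled on the classic kidney-stone treatment example: treatment A wins within every stratum of severity, yet loses in the aggregate, because severity (the confounder) is unevenly distributed between the two groups.

```python
# Hypothetical (successes, trials) counts, stratified by severity.
data = {
    "A": {"mild": (81, 87),   "severe": (192, 263)},
    "B": {"mild": (234, 270), "severe": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

for grp, strata in data.items():
    agg_s = sum(s for s, n in strata.values())
    agg_n = sum(n for s, n in strata.values())
    per_stratum = {z: round(rate(*sn), 2) for z, sn in strata.items()}
    print(grp, per_stratum, "aggregate:", round(rate(agg_s, agg_n), 2))

# A beats B in both strata (0.93 > 0.87 mild, 0.73 > 0.69 severe),
# yet B's aggregate rate is higher -- Simpson's Paradox.
```

Stratifying on the confounder, as the paragraph above recommends, is what reveals the reversal.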
Data scientists never condition on a collider. Doing so introduces bias when estimating the correlation between the variables of interest, and can lead to mistakenly concluding an association between two variables when in fact there is none.
Data scientists know how to deal with a third variable Z, whether it is a confounder, a mediator or a collider; the difference between the three lies in the direction of causal influence. A confounder Z causes both X and Y (condition on it!). A mediator Z sits on the causal path: X causes Z, which in turn influences Y (condition on it only to isolate the direct effect!). A collider Z is caused by both X and Y (per the maxim above, never condition on it!).
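Collider bias is easy to demonstrate with simulated data. In this sketch X and Y are generated independently, and Z = X + Y is their collider; conditioning on Z (here, keeping only rows with Z above a cutoff) manufactures a negative correlation between X and Y out of thin air.

```python
import random

random.seed(42)

# X and Y are independent; the collider Z is caused by both.
n = 20000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [random.gauss(0, 1) for _ in range(n)]
zs = [x + y for x, y in zip(xs, ys)]

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

unconditioned = corr(xs, ys)  # ~0: X and Y truly unrelated

# Conditioning on the collider (keeping only Z > 1) induces a
# spurious negative correlation between X and Y.
sel = [(x, y) for x, y, z in zip(xs, ys, zs) if z > 1]
conditioned = corr([x for x, _ in sel], [y for _, y in sel])
print(round(unconditioned, 2), round(conditioned, 2))
```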
Presenting and Integrating into Action
“All great truths begin as blasphemies.” Presenting data science results exposes the human biases innate in us, in particular the status-quo bias in decision-making. Even with convincing evidence backed by sound reasoning, logic and substantial data, human nature often gets in the way: people resist changing habits, and business leaders and management resist changing processes. One way data scientists can counter this is to narrate their insights as a story that sparks emotions the audience relates to, so that they follow along and become emotionally invested. Visualisations help significantly here. Understanding such cognitive limitations allows data scientists to frame their presentations differently and more effectively.
Data scientists do not bore their audience with minor details. They focus on what is important and keep it simple and straightforward. A typical audience has a short attention span, so data scientists spend their limited time on the most important messages they want to convey. As Professor Paul Resnick of UMSI puts it, "You do not have to bring your audience along the same journey you went through in a data science project."
Data scientists think probabilistically as a matter of principle. It is the bedrock of statistics that nothing can be proven with 100% certainty. Hence, data scientists communicate margins of error around a model’s output; this is not only reliable and honest, it also manages stakeholders' expectations.
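One common way to report that margin of error is the normal-approximation interval for a proportion. A minimal sketch with hypothetical numbers (a model scoring 540 of 1000 held-out cases correctly):

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """95% normal-approximation margin of error for a proportion
    p_hat observed over n independent trials."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat, n = 0.54, 1000  # hypothetical held-out accuracy
moe = margin_of_error(p_hat, n)
print(f"accuracy: {p_hat:.2f} ± {moe:.3f}")  # accuracy: 0.54 ± 0.031
```

Reporting "54% ± 3%" rather than a bare "54%" is exactly the honesty-plus-expectation-management the paragraph above calls for.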
“Play enough poker hands, and you’ll make your share of royal flushes.” This reinforces the above maxim, in that data scientists have to stay open to a range of possible values, including extreme outcomes. In other words, when presenting results they report a range of possible outcomes and do not rule out extreme predictions. Leaving room for a multitude of possibilities helps the business anticipate outcomes.
Decision-making also requires knowledge of the uncertainties. In any scientific process, and especially one as dynamic as the business world, data scientists communicate not just results with their varying degrees of certainty but also any uncertainty in the insights gathered. It is important to communicate all of the information, rather than only the pieces they feel confident about, because incomplete information can mislead the audience, and that can be detrimental.
Data scientists do not make assumptions merely to lower the level of uncertainty. Uncertainty is not always an enemy; it is better to disclose uncertain outcomes, causes and variables right at the outset than to minimise uncertainty for its own sake. Reducing uncertainty by making bolder assumptions does not always improve a model’s accuracy.
“Data scientists will always aim to make friends with the machine learning engineers.” It is often unclear how to ensure that an implementation actually works properly. Before putting a model into production, machine learning engineers and data scientists should plan together how to handle different failure scenarios. No engineer can foresee, alone, every facet of a complicated task, such as the challenges that come with efficient scaling and debugging.
Data scientists are generally humble about how much work they have done. Sitting at the intersection of data science and software engineering, machine learning engineers are the ones who use their technical expertise to put the models data scientists create into production. They feed data into the models and transform them into production-grade systems that handle real-time data. This invariably requires extensive technical skill, time and resources, and oftentimes it is much more work.
Data scientists are never quick to toss the results of an analysis “over the wall”. Even when they are in a hurry to move on to the next problem, simply handing statistics over to another department without clearly communicating the important insight can render that insight useless. Management can lose the opportunity to act, which can mean losing out on cost savings, new revenue streams, and more. Proactive communication on a data scientist's part is therefore key to a value-adding data mining solution.
Thank you for reading. Please reach out to me if you're interested in the articles that helped me derive the above maxims, questions and ethical statements.
Images: Dilbert, by Scott Adams (https://dilbert.com/)