Minimalism in data work
On my LinkedIn profile, it says that I talk about "data analysis, data science, business intelligence, ROI, and minimalism." If you're from the data world, data analysis and data science are familiar terms; if you're from the business world, so are ROI (return on investment) and business intelligence. But the last term, minimalism, might require a little explanation, both as to why I talk about it and, more importantly, how it can help your own work in applied data analysis.
Defining minimalism in data work
Minimalism means different things in different contexts. In visual art, minimalism can refer to paintings with solid colors and straight lines. (Think Agnes Martin and Ellsworth Kelly.) In sculpture, it can mean basic geometric shapes or repeated forms. (Think Sol LeWitt and Donald Judd.) In music, minimalism can involve simple, repeated phrases and rhythms. (Think Meredith Monk and Terry Riley.) Minimalism also appears in theatre, fashion, backpacking, user-interface design, and other fields.
But here, in the data context, I'm using minimalism to mean a "minimally-sufficient analysis," or the least complicated analysis that will adequately answer your stakeholders' questions. Think spreadsheets, not neural networks.
I first learned about the importance of this principle from two related articles in psychology in the 1990s. (I am, after all, a research psychologist by training.)
The first article is a 1996 draft report by the Task Force on Statistical Inference, convened by the American Psychological Association's Board of Scientific Affairs, which says the following about "The use of minimally sufficient designs and analytic strategies":
The wide array of quantitative techniques and the vast number of designs available to address research questions leave the researcher with the non-trivial task of matching analysis and design to the research question. Many forces (including reviewers of grants and papers, journal editors, and dissertation advisors) compel researchers to select increasingly complex ("state-of-the-art," "cutting edge," etc.) analytic and design strategies. Sometimes such complex designs and analytic strategies are necessary to address research questions effectively; it is also true that simpler approaches can provide elegant answers to important questions. It is the recommendation of the task force that the principle of parsimony be applied to the selection of designs and analyses. The minimally sufficient design and analysis is typically to be preferred because: (a) it is often based on the fewest and least restrictive assumptions, (b) its use is less prone to errors of application, and errors are more easily recognized, and (c) its results are easier to communicate — to both the scientific and lay communities. This is not to say that new advances in both design and analysis are not needed, but simply that newer is not necessarily better and that more complex is not necessarily preferable. (pp. 3-4)
Then, in a related 1999 article in American Psychologist entitled "Statistical Methods in Psychology Journals: Guidelines and Explanations," Leland Wilkinson and the Task Force on Statistical Inference said the following about "Choosing a minimally sufficient analysis":
The enormous variety of modern quantitative methods leaves researchers with the nontrivial task of matching analysis and design to the research question. Although complex designs and state-of-the-art methods are sometimes necessary to address research questions effectively, simpler classical approaches often can provide elegant and sufficient answers to important questions. Do not choose an analytic method to impress your readers or to deflect criticism. If the assumptions and strength of a simpler method are reasonable for your data and research problem, use it. Occam's razor applies to methods as well as to theories. We should follow the advice of Fisher (1935):
Experimenters should remember that they and their colleagues usually know more about the kind of material they are dealing with than do the authors of text-books written without such personal experience, and that a more complex, or less intelligible, test is not likely to serve their purpose better, in any sense, than those of proved value in their own subject. (p. 49)
There is nothing wrong with using state-of-the-art methods, as long as you and your readers understand how they work and what they are doing. On the other hand, don't cling to obsolete methods (e.g., Newman-Keuls or Duncan post hoc tests) out of fear of learning the new. In any case, listen to Fisher. Begin with an idea. Then pick a method. (p. 598)
It should be obvious that these same principles are important not just for academic research but for applied work in organizational settings as well. For example, the suggestion that one should "not choose an analytic method to impress your readers or to deflect criticism" also means that you should not choose a method to impress your manager or the audience at your speaking engagement. Again, the idea or the question comes first; everything should follow from that.
Minimalism in data software
When it comes to minimalism in the tools used to work with data, I'd like to share some personal recommendations. Your own choices will depend on the nature of the work you do, your organizational environment, and, of course, your personal preferences. Consequently, this is not a dogmatic list, but a set of possibilities. It's how I do my personal work, and how I do most of my collaborative work and consulting, as well. But I think these recommendations fit well with the spirit of "minimalism in data work."
My general advice is to prioritize simpler methods and only move on as necessary. So, start with spreadsheets, until it is no longer easy to do your work there; then move to data apps, until it is no longer easy to do your work there; and, finally, move to data-oriented programming languages, which should be able to do whatever you need. But, whenever possible, start with spreadsheets. Let me explain myself.
First, spreadsheets. Despite the undeniable fact that most of the marquee work in data science has involved datasets of enormous size and complexity, it is my personal (but untested) belief that the majority of the world's datasets, by headcount, live in spreadsheets. I have hundreds of spreadsheets on my own computer, and every single data project I have ever done in my academic work and my professional consulting has involved spreadsheet-compatible – if not necessarily spreadsheet-sourced – datasets. I have consulted with billion-dollar companies and global organizations that used nothing but spreadsheets.
Spreadsheets are to data what air and water are to life.
I realize that telling data analysts and data scientists to use spreadsheets is not exciting. It is the rough equivalent of your doctor telling you that walking for 30 minutes a few times a week will probably do more for your health than purchasing thousands of dollars of exercise equipment and signing up for elaborate group fitness classes, especially if that exercise equipment goes unused and those fitness classes go unattended. But it's also similar to the relationship advice you may have heard before: small but simple acts of kindness, consistently done, can have a greater accumulated impact than occasional grand gestures. Spreadsheets are the small and simple part of data work that can lead to great insights.
But I should still point out some of the advantages of starting with spreadsheets when possible: they are nearly universal, their data and formulas are visible at a glance, they require no programming to get started, and they are easy to share with colleagues and stakeholders.
[As a note, I created a free video course on Google Sheets, including exercise files, that can be accessed at datalab.cc or on YouTube. You can also watch it as a single three-hour video on freeCodeCamp's YouTube channel. If you have access to LinkedIn Learning, there are many excellent courses on Google Sheets and Microsoft Excel. If you don't have your own subscription to LinkedIn Learning, you may be able to access it through your employer, your university, or your public library, as many provide free access.]
Second, apps. For all those advantages, spreadsheets can't do everything. Some data tasks are difficult to do in spreadsheets, and some are impossible. As a result, you can reach the point of diminishing returns sooner rather than later if you're trying to do, say, a cluster analysis or a lasso regression. In those cases, you'll want a path of less resistance: a statistical application. In my field, the academic social and behavioral sciences, that often means SPSS, but it can also be SAS, Stata, JMP, and so on. My personal favorite is the free, open-source application jamovi, which resembles SPSS but runs on R. (JASP is a closely related, open-source program that is particularly strong in Bayesian methods.) All of these are data-oriented applications with graphical user interfaces, or GUIs, that make it possible to conduct analyses by clicking through menus. They are very capable applications that can accomplish nearly everything that, say, a marketing analyst or an academic researcher might need. They are often the tool of choice for people whose work includes data analysis as one element but does not focus on it exclusively, and they are excellent tools for collaborating with people who need insights from data but whose professional specialties lie elsewhere.
[Note: You can see my free video courses on SPSS and jamovi at datalab.cc or on freeCodeCamp's YouTube channel. And, in LinkedIn Learning, there are several courses on SPSS, including my own SPSS Statistics Essential Training.]
Third, languages. But, as with spreadsheets, statistical applications can also reach a point where particular data tasks become difficult or impossible. For example, SPSS can fit rudimentary neural networks based on multilayer perceptron (MLP) and radial basis function (RBF) procedures, but it can't fit the more powerful deep learning architectures. Also, most of these applications are proprietary and can be extraordinarily expensive for professional use. (As an aside, a colleague of mine once questioned whether it was professionally ethical to train students on software that they would not be able to afford when they left the university. I believe that's an important point, and it guides my recommendations here and elsewhere.) Fortunately, the most common and most powerful programming languages for data, R and Python, are both free and open-source.
People get stuck in debates about the relative merits of these two languages, which isn't very productive, because the capabilities of R and Python overlap extensively. Also, if you're going to be a professional data scientist, then you probably need to be able to work fluently with both R and Python, and many other languages, as well. If you're an applied researcher, it probably works best to simply use whichever is more common in your particular field, as that will help you get domain-specific guidance and maybe even professional opportunities.
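To make the move from a GUI application to a language concrete, here is a minimal sketch of one of the tasks mentioned above that is awkward in a spreadsheet: a lasso regression, written in Python with scikit-learn. The data are synthetic and every name is illustrative; treat this as a sketch of the technique, not a prescribed workflow.

```python
# A minimal sketch of a lasso regression in Python with scikit-learn.
# The data are synthetic; every name here is illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic data: 200 cases, 20 predictors, only 5 of which matter
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# LassoCV chooses the regularization strength by cross-validation
model = LassoCV(cv=5, random_state=42).fit(X_train, y_train)

print(f"Chosen alpha: {model.alpha_:.3f}")
print(f"Test R^2: {model.score(X_test, y_test):.3f}")
print(f"Nonzero coefficients: {np.count_nonzero(model.coef_)} of {model.coef_.size}")
```

Note that even here the minimalist principle applies: the lasso's appeal is that it shrinks unimportant coefficients to exactly zero, leaving a smaller, more communicable model.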
[Note: I also have a free video course on R at datalab.cc or on freeCodeCamp's YouTube channel. LinkedIn Learning also has a wide assortment of courses on these languages and others, including my own courses Learning R, R Essential Training: Wrangling and Visualizing Data, R Essential Training Part 2: Modeling Data, Data Science Foundations: Data Mining in R, and Data Science Foundations: Data Mining in Python, among others.]
But, remember, when it comes to data software, start small, then scale as needed.
[A personal historical note: The computer at the top of this section is an Apple Macintosh Plus, one of Apple's iconic "compact Macs." I earned three graduate degrees while doing research on the closely-related Macintosh Classic II, with its nine-inch, black-and-white, 512 × 342 screen – about the same number of pixels as an Apple Watch – along with 4 MB of RAM and an 80 MB hard drive, both of which were upgrades. My advisor worked with punch cards when she was in graduate school. Technology evolves exponentially, but excellent work has been done – and can still be done – with (apparently) limited hardware. I record videos on a professional-level Mac with multiple, large monitors, but I'm writing this article on a Chromebook because I love its simplicity. FYI.]
Minimalism in analytical methods
Perhaps you're familiar with the song "Conjunction Junction" from Schoolhouse Rock. It is a song (and cartoon) that, as one would expect, describes conjunctions: words and phrases that connect ideas. While there are many, many conjunctions in English (see, for example, the long list compiled by 7ESL.com), "Conjunction Junction" says "you've got 'and,' 'but,' and 'or,' they'll get you pretty far." That is, just three basic conjunctions will serve most of your needs.
The same idea applies in data work: a relatively small set of procedures from the very large list of possibilities can provide a surprising amount of insight, and this is what makes the minimalist approach to data work so fruitful. Let's look at some of the simplest and most productive methods in three areas: data visualization, descriptive measures, and statistical modeling.
Data visualization. In any project where humans are conducting analyses and evaluating results – that is, any project except, possibly, large machine learning projects that run automatically – the first step should be to visualize the data. That lets you, as the analyst, get a feel for the data: the common values, the exceptional cases, the possible errors, and clues for how to start.
But I'd like to give a little bit of personal context first. Before I switched to psychology and data, I studied design. Visualization is very appealing to me; I have shelves full of books that show amazing, creative ways to depict data, such as the foundational volumes by Edward Tufte and the beautiful, handmade work in Dear Data by Giorgia Lupi and Stefanie Posavec (which was then acquired by MoMA, the Museum of Modern Art in New York City). I love the 60 different categories of graphs described in The Data Visualisation Catalogue, which you can read online for free. One of my first video courses for LinkedIn Learning showed how to program custom data visualizations from scratch using the creative coding language Processing. I love the artistic side of data visualization and the amazing techniques that people have created for depicting complex data.
But, as much as I like the wildly creative side of data visualization, when it comes to my own day-to-day analytical work, I find that I use a small number of visualization options over and over: histograms, density plots, box plots, and scatterplots for exploration, plus bar charts and line charts for presentation.
I believe that this handful of visualization methods will answer about 90% of analytical questions. Once, when I was teaching a group of professional data analysts, I joked:
Bar charts and line charts are free. Anything else requires supervisor approval.
(That rule even applies to my list above, as I usually only use histograms, density plots, box plots, and scatterplots as part of the data exploration process, which guides the analysis, but is not usually shared with the stakeholders. When it comes to the final presentations, I typically include only bar charts and line charts.)
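For what that looks like in practice, here is a minimal sketch of those bread-and-butter charts in Python with pandas and matplotlib. The dataset and column names are hypothetical.

```python
# A minimal sketch of the basic chart types in pandas and matplotlib.
# The dataset and column names are hypothetical.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "region": rng.choice(["North", "South", "East", "West"], size=200),
    "revenue": rng.normal(100, 20, size=200),
    "visits": rng.poisson(30, size=200),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Exploration: histogram, box plot, scatterplot
df["revenue"].plot.hist(ax=axes[0, 0], title="Histogram of revenue")
df.boxplot(column="revenue", by="region", ax=axes[0, 1])
df.plot.scatter(x="visits", y="revenue", ax=axes[1, 0], title="Visits vs. revenue")

# Presentation: a bar chart of group means
df.groupby("region")["revenue"].mean().plot.bar(ax=axes[1, 1], title="Mean revenue")

plt.tight_layout()
plt.show()
```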
Descriptive measures. After creating a series of visualizations to get a better understanding of the data and how it can help answer the stakeholders' questions, the usual next step is to compute a variety of descriptive statistics. Basic descriptive measures, such as the mean, median, mode, range, and standard deviation, are well known to anyone in the data world, so I won't dwell on them here, except to say that you should be clear about why you chose the measures you did.
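As a quick illustration, here is how those measures look in Python with pandas, on a small, made-up series of scores:

```python
# A minimal sketch of basic descriptive measures with pandas,
# using a small, made-up series of scores.
import pandas as pd

scores = pd.Series([4, 5, 5, 6, 7, 7, 7, 8, 9, 12], name="score")

print(scores.describe())                 # count, mean, std, min, quartiles, max
print("median:", scores.median())
print("mode:", scores.mode().tolist())   # mode() can return more than one value
print("range:", scores.max() - scores.min())
```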
In addition, first-order associations between variables can be explored through correlation coefficients. However, if you're planning on looking at a large table of correlations, make sure that your dataset is sufficiently large to keep the risk of false positives in an acceptable range. This is usually not a problem when dealing with data from online sources like social media, where you may have millions of data points, but it is a chronic concern in labor-intensive projects, such as those frequently found in medicine and the social sciences.
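Here is a hedged sketch of that concern in Python with scipy and statsmodels, on hypothetical columns of pure noise: with ten pairwise tests on a small sample, a raw p-value can dip below .05 by chance alone, and a multiple-comparison correction such as Bonferroni guards against exactly those false positives.

```python
# A minimal sketch: pairwise correlations with p-values and a Bonferroni
# correction. The columns are hypothetical random noise, so any "significant"
# raw p-value here would be a false positive.
from itertools import combinations
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(60, 5)), columns=list("abcde"))

pairs = list(combinations(df.columns, 2))              # 10 tests
pvals = [stats.pearsonr(df[x], df[y])[1] for x, y in pairs]

reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
for (x, y), p, pa in zip(pairs, pvals, p_adj):
    print(f"{x}-{y}: raw p = {p:.3f}, corrected p = {pa:.3f}")
```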
A good way to reduce the risk of false positives, and to reduce cognitive load overall, is to reduce the dimensionality of your dataset. That is, after conducting, for example, a principal component analysis (PCA) or factor analysis, you may be able to combine groups of variables into a smaller number of composite scores. Not only are these composite scores easier to deal with, since there are fewer of them, but they also tend to focus more on the concepts of interest by averaging out the idiosyncratic errors of the individual variables. Dimensionality reduction is a somewhat advanced topic, so it's a good idea to get some specialized training, but it can go a long way towards the goal of minimalism in your data work.
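Here is a minimal sketch of that idea in Python with scikit-learn. The "survey items" are simulated from a single underlying concept, so the first principal component recovers a usable composite score.

```python
# A minimal sketch of dimensionality reduction with PCA.
# Four noisy "survey items" are simulated from one underlying concept.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
latent = rng.normal(size=(100, 1))                     # the true concept
items = pd.DataFrame(latent + rng.normal(scale=0.5, size=(100, 4)),
                     columns=["item1", "item2", "item3", "item4"])

# Standardize, then see how much variance the leading components capture
X = StandardScaler().fit_transform(items)
pca = PCA(n_components=2).fit(X)
print("Variance explained:", pca.explained_variance_ratio_.round(2))

# The first component serves as a single composite score for the four items
items["composite"] = pca.transform(X)[:, 0]
print(items.head())
```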
Statistical models. Finally, when you start the work of building statistical models of your data, you have an overwhelming number of choices. I want to recommend, however, that you always include linear regression as one of your choices. Linear regression has been around for a long time, as far as statistical modeling procedures go, and it has several important strengths: it rests on relatively few and well-understood assumptions, it is less prone to errors of application, and its results are easy to communicate to both technical and lay audiences.
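As a baseline, a linear regression is a near one-liner in most data languages. Here is a minimal sketch with Python's statsmodels on synthetic data; the variable names are hypothetical, and the point is how directly the fitted coefficient can be read.

```python
# A minimal sketch of a linear regression baseline with statsmodels.
# The data are synthetic and the variable names hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
df = pd.DataFrame({"hours": rng.uniform(0, 10, size=80)})
df["score"] = 50 + 3.0 * df["hours"] + rng.normal(scale=5, size=80)

# The coefficient on "hours" reads directly as "points per hour studied"
model = smf.ols("score ~ hours", data=df).fit()
print(model.summary())
```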
Taken together, these suggestions for data visualization, descriptive measures, and statistical modeling can help simplify your analyses and make it easier to identify the key points of value. Your work will be easier, and its relevance to your stakeholders' concerns will be clearer. That's a win-win.
[If you have access to LinkedIn Learning, you can see my recently-updated course called "Data Fluency: Exploring and Describing Data," which covers several of these methods, as well as many other excellent courses that cover related topics.]
Next steps
Minimalism in data work is not a goal in and of itself, but a way of approaching data work that can make it easier for you to find the important, actionable insights in your data without getting distracted by issues that don't add direct value. A few specific suggestions for incorporating productive minimalism into your own work: begin with your stakeholders' questions, choose the least complicated analysis that answers them adequately, start with simple tools and scale up only as needed, and favor methods whose results you can explain clearly.
Remember that data analysis is always an exercise in both simplification and translation. Embracing a minimalist approach to your data work will make both of those tasks more efficient and effective for you, and more relevant and meaningful for your stakeholders.
Thanks for joining me here. And remember, sharing is caring! Follow me on LinkedIn and share this newsletter with a friend who you think would benefit from it.