Minimalism in data work
Image by Isaac Smith from Unsplash (https://unsplash.com/photos/AT77Q0Njnt0)


On my LinkedIn profile, it says that I talk about "data analysis, data science, business intelligence, ROI, and minimalism." If you're from the data world, data analysis and data science are familiar terms, and if you're from the business world, ROI (return on investment) and business intelligence are well-known. But the last term, minimalism, might require a little explanation: why I talk about it and, more importantly, how it can help your own work in applied data analysis.

Defining minimalism in data work

A single plant stem in a cup of water against a white background
Image by Sarah Dorweiler from Unsplash (https://unsplash.com/photos/2s9aHF4eCjI)

Minimalism means different things in different contexts. In visual art, minimalism can refer to paintings with solid colors and straight lines. (Think Agnes Martin and Ellsworth Kelly.) In sculpture, it can mean basic geometric shapes or repeated forms. (Think Sol LeWitt and Donald Judd). In music, minimalism can involve simple, repeated phrases and rhythms. (Think Meredith Monk and Terry Riley.) Minimalism also appears in theatre, fashion, backpacking, user-interface design, and other fields.

But here, in the data context, I'm using minimalism to mean a "minimally-sufficient analysis," or the least complicated analysis that will adequately answer your stakeholders' questions. Think spreadsheets, not neural networks.

I first learned about the importance of this principle from two related articles in psychology in the 1990s. (I am, after all, a research psychologist by training.)

The first article is a 1996 draft report by the Task Force on Statistical Inference Board of Scientific Affairs for the American Psychological Association, which says the following about "The use of minimally sufficient designs and analytic strategies":

The wide array of quantitative techniques and the vast number of designs available to address research questions leave the researcher with the non-trivial task of matching analysis and design to the research question. Many forces (including reviewers of grants and papers, journal editors, and dissertation advisors) compel researchers to select increasingly complex ("state-of-the-art," "cutting edge," etc.) analytic and design strategies. Sometimes such complex designs and analytic strategies are necessary to address research questions effectively; it is also true that simpler approaches can provide elegant answers to important questions. It is the recommendation of the task force that the principle of parsimony be applied to the selection of designs and analyses. The minimally sufficient design and analysis is typically to be preferred because: (a) it is often based on the fewest and least restrictive assumptions, (b) its use is less prone to errors of application, and errors are more easily recognized, and (c) its results are easier to communicate — to both the scientific and lay communities. This is not to say that new advances in both design and analysis are not needed, but simply that newer is not necessarily better and that more complex is not necessarily preferable. (pp. 3-4)

Then, in a related 1999 article in American Psychologist entitled "Statistical Methods in Psychology Journals: Guidelines and Explanations," Leland Wilkinson and the Task Force on Statistical Inference said the following about "Choosing a minimally sufficient analysis":

The enormous variety of modern quantitative methods leaves researchers with the nontrivial task of matching analysis and design to the research question. Although complex designs and state-of-the-art methods are sometimes necessary to address research questions effectively, simpler classical approaches often can provide elegant and sufficient answers to important questions. Do not choose an analytic method to impress your readers or to deflect criticism. If the assumptions and strength of a simpler method are reasonable for your data and research problem, use it. Occam's razor applies to methods as well as to theories. We should follow the advice of Fisher (1935):
Experimenters should remember that they and their colleagues usually know more about the kind of material they are dealing with than do the authors of text-books written without such personal experience, and that a more complex, or less intelligible, test is not likely to serve their purpose better, in any sense, than those of proved value in their own subject. (p. 49)
There is nothing wrong with using state-of-the-art methods, as long as you and your readers understand how they work and what they are doing. On the other hand, don't cling to obsolete methods (e.g., Newman-Keuls or Duncan post hoc tests) out of fear of learning the new. In any case, listen to Fisher. Begin with an idea. Then pick a method. (p. 598)

It should be obvious that these same principles are important not just for academic research, but for applied work in organizational settings, as well. For example, the suggestion that one should "not choose an analytic method to impress your readers or to deflect criticism" also means that one should not choose a method to impress your manager or the audience at your speaking engagement. Again, the idea or the question comes first; everything should follow from that.

Minimalism in data software

Apple Macintosh Plus computer against a white wall
Image by Federica Galli from Unsplash (https://unsplash.com/photos/aiqKc07b5PA)

When it comes to minimalism in the tools used to work with data, I'd like to share some personal recommendations. Your own choices will depend on the nature of the work you do, your organizational environment, and, of course, your personal preferences. Consequently, this is not a dogmatic list, but a set of possibilities. It's how I do my personal work, and how I do most of my collaborative work and consulting, as well. But I think these recommendations fit well with the spirit of "minimalism in data work."

My general advice is to prioritize simpler methods and only move on as necessary. So, start with spreadsheets, until it is no longer easy to do your work there; then move to data apps, until it is no longer easy to do your work there; and, finally, move to data-oriented programming languages, which should be able to do whatever you need. But, whenever possible, start with spreadsheets. Let me explain myself.

First, spreadsheets. Despite the undeniable fact that most of the marquee work in data science has involved datasets that are enormous in both size and complexity, it is my personal (but untested) belief that the majority of the world's datasets, counted individually, exist in spreadsheets. I have hundreds of spreadsheets on my own computer, and every single data project I have ever done in my own academic work and my professional consulting has involved spreadsheet-compatible – if not necessarily spreadsheet-sourced – datasets. I have consulted with billion-dollar companies and global organizations that used nothing but spreadsheets.

Spreadsheets are to data what air and water are to life.

I realize that telling data analysts and data scientists to use spreadsheets is not exciting. It is the rough equivalent of your doctor telling you that walking for 30 minutes a few times a week will probably do more for your health than purchasing thousands of dollars of exercise equipment and signing up for elaborate group fitness classes, especially if that exercise equipment goes unused and those fitness classes go unattended. But it's also similar to the relationship advice you may have heard before: small but simple acts of kindness, consistently done, can have a greater accumulated impact than occasional grand gestures. Spreadsheets are the small and simple part of data work that can lead to great insights.

But I should still point out some of the advantages of starting with spreadsheets when possible:

  • Ubiquity. Spreadsheets (as data) are everywhere, as are spreadsheet applications. Micro Biz Mag claims that over a billion people have Excel on their computers. Acuity Training conducted their own poll of office workers and found that two-thirds of them use Excel at least once per hour, and that over one-third of their time at the office was spent using Excel. And this is to say nothing of the nearly universal availability of Google Sheets and other online spreadsheet applications.
  • Flexibility. Based on my thirty years of professional experience, I believe that the vast majority of operational datasets are either created in spreadsheets or can easily be translated to them. The rows and columns of a spreadsheet can bring order to almost anything.
  • Capability. Spreadsheets make it very simple to organize data, sort and filter values, fix errors, and make basic graphs like bar charts, line charts, and scatterplots. They are great for browsing data and getting an intuitive feel for what's there and where you should start your analyses.
  • Transportability. Basically any data application or language can read data from a spreadsheet. Chances are that your clients will give you data in a spreadsheet, and they may want the results in a spreadsheet, as well. Spreadsheets function as the lingua franca of the data world.

[As a note, I created a free video course on Google Sheets, including exercise files, that can be accessed at datalab.cc or on YouTube. You can also see it as a single, three-hour video on freeCodeCamp's YouTube channel. If you have access to LinkedIn Learning, there are many excellent courses on Google Sheets and Microsoft Excel. If you don't have your own subscription to LinkedIn Learning, you may be able to access it through your employer, your university, or your public library, as many provide free access to LinkedIn Learning.]

Second, apps. With those advantages in mind, spreadsheets still can't do everything. Some data tasks are difficult to do in spreadsheets, and some are impossible. As a result, you can reach the point of diminishing returns sooner rather than later if you're trying to do, say, a cluster analysis or a lasso regression. In those cases, you'll want to take another approach that provides less resistance, and use a statistical application instead. In my field, the academic social and behavioral sciences, that often means SPSS, but it can also be SAS, Stata, JMP, and so on. My personal favorite is the free, open-source application jamovi, which resembles SPSS but runs on R. (JASP is a closely related, open-source program that is particularly strong in Bayesian methods.) All of these are data-oriented applications that include graphical user interfaces, or GUIs, which make it possible to conduct analyses by clicking on menus. These are very capable applications that can accomplish nearly everything that, say, a marketing analyst or an academic researcher might need. They are often the tool of choice for people whose work includes data analysis as one element but does not focus exclusively on data work. They are also excellent tools for collaborating with people who need insights from data but whose professional specialties lie elsewhere.

[Note: You can see my free video courses on SPSS and jamovi at datalab.cc or on freeCodeCamp's YouTube channel. And, in LinkedIn Learning, there are several courses on SPSS, including my own SPSS Statistics Essential Training.]

Third, languages. But, as with spreadsheets, statistical applications can also reach a point where the particular data tasks needed are no longer easy or no longer possible. For example, SPSS is able to do rudimentary neural networks based on multilayer perceptron (MLP) and radial basis function (RBF) procedures, but it can't do the more powerful deep learning neural networks. Also, most of the applications are proprietary and can be extraordinarily expensive for professional use. (As an aside, a colleague of mine once questioned whether it was professionally ethical to train students on software that they would not be able to afford when they left the university. I believe that's an important point, and it guides my recommendations here and elsewhere.) Fortunately, the most common and most powerful programming languages for data, R and Python, are both free and open-source:

  • R is a language developed specifically for data work, and is common among applied researchers, such as economists, medical researchers, and marketing researchers. (And, as a person with a background in Psychology, it's the language I use most often for projects that require more than spreadsheets or jamovi.)
  • Python is a general-purpose language that has been well adapted to data work. It's common in computer science and machine learning engineering. If you're doing natural language processing or neural networks on a regular basis, chances are that you're doing those in Python.

People get stuck in debates about the relative merits of these two languages, which isn't very productive, because the capabilities of R and Python overlap extensively. Also, if you're going to be a professional data scientist, then you probably need to be able to work fluently with both R and Python, and many other languages, as well. If you're an applied researcher, it probably works best to simply use whichever is more common in your particular field, as that will help you get domain-specific guidance and maybe even professional opportunities.

[Note: I also have a free video course on R at datalab.cc or on freeCodeCamp's YouTube channel. LinkedIn Learning also has a wide assortment of courses on these languages and others, including my own courses Learning R, R Essential Training: Wrangling and Visualizing Data, R Essential Training Part 2: Modeling Data, Data Science Foundations: Data Mining in R, and Data Science Foundations: Data Mining in Python, among others.]

But, remember, when it comes to data software, start small, then scale as needed:

  • First, spreadsheets
  • Second, apps
  • Third, languages

[A personal historical note: The computer at the top of this section is an Apple Macintosh Plus, one of Apple's iconic "compact Macs." I earned three graduate degrees while doing research on the closely-related Macintosh Classic II, with its nine-inch, black-and-white, 512 × 342 screen – about the same number of pixels as an Apple Watch – along with 4 MB of RAM and an 80 MB hard drive, both of which were upgrades. My advisor worked with punch cards when she was in graduate school. Technology evolves exponentially, but excellent work has been done – and can still be done – with (apparently) limited hardware. I record videos on a professional-level Mac with multiple, large monitors, but I'm writing this article on a Chromebook because I love its simplicity. FYI.]

Minimalism in analytical methods

Bar chart on printed page
Image by Giorgio Tomassetti from Unsplash (https://unsplash.com/photos/QCbZ4ASLhM8)

Perhaps you're familiar with the Schoolhouse Rock! song "Conjunction Junction." It is a song (and cartoon) that, as one would expect, describes conjunctions, or the words and phrases that connect ideas. While there are many, many conjunctions used in English (see, for example, the long list compiled by 7ESL.com), "Conjunction Junction" says "you've got 'and,' 'but,' and 'or,' they'll get you pretty far." That is, just three basic conjunctions will serve most of your needs.

The same idea applies in data work: a relatively small set of procedures from the very large list of possibilities can provide a surprising amount of insight, and this is what makes the minimalist approach to data work so fruitful. Let's look at some of the simplest and most productive methods in three areas: data visualization, descriptive measures, and statistical modeling.

Data visualization. In any projects where humans are conducting analyses and evaluating results – that is, any project except, possibly, large machine learning projects that are automatically implemented – the first step should be to visualize the data, so you as the analyst can get a feel for the data, including the common values, the exceptional cases, the possible errors, and clues for how to start.

But I'd like to give a little bit of personal context first. Before I switched to psychology and data, I studied design. Visualization is very appealing to me; I have shelves full of books that show amazing, creative ways to depict data, such as the foundational volumes by Edward Tufte and the beautiful, handmade work in Dear Data by Giorgia Lupi and Stefanie Posavec (which was then acquired by MoMA, the Museum of Modern Art in New York City). I love the 60 different categories of graphs described in The Data Visualisation Catalogue, which you can read online for free. One of my first video courses for LinkedIn Learning showed how to program custom data visualizations from scratch using the creative coding language Processing. I love the artistic side of data visualization and the amazing techniques that people have created for depicting complex data.

But, as much as I like the wildly creative side of data visualization, when it comes to my own day-to-day analytical work, I find that I use a small number of visualization options over and over:

  • Bar charts, which are great for showing the number of cases in different categories, as well as single descriptive statistics like the mean or percentage of positive cases for different groups.
  • Histograms, density plots, and box plots, which are useful for checking the shape of a distribution for a quantitative variable, as well as checking for outliers.
  • Scatterplots, which can show the relationship between two quantitative variables, such as years of education and quality of life.
  • Line charts, which are the first choice for looking at changes in a value over time, such as new customers or likes on social media.

I believe that this handful of visualization methods will answer about 90% of analytical questions. Once, when I was teaching a group of professional data analysts, I joked that:

Bar charts and line charts are free. Anything else requires supervisor approval.

(That rule even applies to my list above, as I usually only use histograms, density plots, box plots, and scatterplots as part of the data exploration process, which guides the analysis, but is not usually shared with the stakeholders. When it comes to the final presentations, I typically include only bar charts and line charts.)

Descriptive measures. After creating a series of visualizations to get a better understanding of the data and how it can help answer the stakeholders' questions, the usual next step is to compute a variety of descriptive statistics. Basic descriptive measures, such as the mean, median, mode, the range, and the standard deviation, are well-known to anyone in the data world, so I won't dwell on those here, except to say that you need to be clear why you chose the measures you did.
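
As a quick sketch of these basic measures, here they are computed with nothing but Python's standard library (the response-time values are made up for illustration):

```python
# Basic descriptive measures with Python's standard statistics module.
# The response-time values below are invented for illustration only.
import statistics

response_times = [12, 15, 15, 18, 21, 21, 21, 30, 45]

mean = statistics.mean(response_times)        # arithmetic average
median = statistics.median(response_times)    # middle value; robust to outliers
mode = statistics.mode(response_times)        # most frequent value
value_range = max(response_times) - min(response_times)
stdev = statistics.stdev(response_times)      # sample standard deviation

print(mean, median, mode, value_range)  # 22.0 21 21 33
```

Note how the single large value (45) pulls the mean above the median; that gap is exactly the kind of thing you should be able to explain when justifying your choice of measure.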

In addition, first-order associations between variables can be explored through correlation coefficients. However, if you're planning on looking at a large table of correlations, make sure that your dataset is sufficiently large to keep the risk of false positives in an acceptable range. This is usually not a problem when dealing with data from online sources like social media, where you may have millions of data points, but it is a chronic concern in labor-intensive projects, such as those frequently found in medicine and the social sciences.
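
The arithmetic behind that concern is simple enough to sketch: with k variables there are k(k-1)/2 pairwise correlations, and at a significance threshold of alpha, roughly alpha times that many will look "significant" even when no true associations exist. The variable count below is illustrative:

```python
# Expected spurious "significant" correlations in a correlation table.
# With k variables there are k*(k-1)/2 pairwise correlations; at a
# threshold of alpha, about alpha * n_pairs will appear significant
# purely by chance, even if no true associations exist.
k = 20          # number of variables (illustrative)
alpha = 0.05    # conventional significance threshold

n_pairs = k * (k - 1) // 2
expected_false_positives = alpha * n_pairs

print(n_pairs, expected_false_positives)  # 190 9.5
```

So a modest 20-variable dataset already yields a table where nine or ten correlations may be chance artifacts, which is why the sample size needs to be large enough to justify scanning it.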

A good way to reduce the risk of false positives, and to reduce cognitive load overall, is to reduce the dimensionality of your dataset. That is, after conducting, for example, a principal component analysis (PCA) or factor analysis, you may be able to combine groups of variables into a smaller number of composite scores. Not only are these composite scores easier to deal with, since there are fewer of them, but they also tend to focus more on the concepts of interest by averaging out the idiosyncratic errors of the individual variables. Dimensionality reduction is a somewhat advanced topic, so it's a good idea to get some specialized training, but it can go a long way toward the goal of minimalism in your data work.
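
The simplest version of a composite score can be sketched in a few lines: standardize each variable to z-scores, then average them per case. (A PCA or factor analysis would weight the variables rather than averaging them equally; this is only the minimal form of the idea, and the survey items below are invented.)

```python
# Minimal composite score: z-score each variable, then average per case.
# A PCA or factor analysis would derive weights instead of averaging
# equally; this is the simplest form of the idea. Data are invented.
import statistics

# Three related survey items, five respondents (illustrative values).
items = {
    "item_a": [3, 4, 2, 5, 1],
    "item_b": [30, 42, 25, 48, 15],
    "item_c": [7, 8, 5, 9, 4],
}

def z_scores(values):
    """Standardize a variable to mean 0, standard deviation 1."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

standardized = {name: z_scores(vals) for name, vals in items.items()}

# Composite score per respondent: the average of their z-scores.
n_cases = len(items["item_a"])
composite = [
    statistics.mean(standardized[name][i] for name in items)
    for i in range(n_cases)
]
```

Three correlated items collapse into one score per respondent, which is both easier to report and less noisy than any single item.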

Statistical models. Finally, when you start the work of building statistical models of your data, you have an overwhelming number of choices. I want to recommend, however, that you always include linear regression as one of your choices. Linear regression has been around for a long time, as far as statistical modeling procedures go, and has several important strengths, such as:

  • Flexibility. Linear regression can handle almost any kind of data: quantitative variables, rank-order variables, and nominal categories, as well as data measured at different levels of aggregation (e.g., multilevel or hierarchical models that look at employees within companies, companies within markets, markets within geographical regions, and so on).
  • Diagnosability. Linear regression is a well-understood procedure, so when something goes wrong – which can happen with any analytical approach – it's often easier to diagnose and remedy the problem with regression than with other approaches.
  • Interpretability. Simple linear models, where the data are all at the same level and there are no interaction coefficients in the model, are extremely easy to interpret. Part of this is because standardized regression coefficients can be directly compared to each other, as well as to intuitive standards of smaller and larger effects. And, depending on the source of your data – in particular, whether it is observational or experimental – you may be able to derive actionable next steps from your regression analyses.
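
The interpretability point is easiest to see from the closed-form solution itself: the slope is just the covariance of x and y divided by the variance of x. Here is a from-first-principles sketch, with invented values that follow y = 2x + 1 exactly so the fit recovers those coefficients:

```python
# Simple least-squares linear regression from first principles:
# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x).
# The values are invented and follow y = 2x + 1 exactly, so the fitted
# coefficients recover those numbers.
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]  # y = 2x + 1

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

def predict(xi):
    """Predicted y for a new x value under the fitted line."""
    return intercept + slope * xi

print(slope, intercept)  # 2.0 1.0
```

In practice you would use a statistical application or a library routine, which adds standard errors and diagnostics, but the model itself reduces to this transparent arithmetic, and that transparency is exactly what makes regression results easy to explain to stakeholders.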

Taken together, these suggestions for data visualization, descriptive measures, and statistical modeling can help simplify your analyses and make it easier to identify the key points of value. Your work will be easier and the relevance to your stakeholders' concerns will be clearer. That's a win-win.

[If you have access to LinkedIn Learning, you can see my recently-updated course called "Data Fluency: Exploring and Describing Data," which covers several of these methods, as well as many other excellent courses that cover related topics.]

Next steps

Data analyst presenting at whiteboard
Image by ThisisEngineering RAEng from Unsplash (https://unsplash.com/photos/TXxiFuQLBKQ)

Minimalism in data work is not a goal in and of itself, but a way of approaching data work that can make it easier for you to find the important, actionable insights in your data without getting distracted by issues that don't add direct value. I can make a few specific suggestions on how you can incorporate productive minimalism in your own work:

  1. Make sure you are sufficiently comfortable working with spreadsheets so that you can quickly address simple questions right where the data is already located.
  2. Emphasize data visualizations, with a special focus on bar charts and line charts, particularly for your final presentations. Make sure to direct attention through your charts one step at a time as you explain your findings.
  3. When appropriate, include linear regression models as one way of modeling your data, with special attention to the interpretation and application of your results.

Remember that data analysis is always an exercise in both simplification and translation. Embracing a minimalist approach to your data work will make both of those tasks more efficient and effective for you, and more relevant and meaningful for your stakeholders.


Thanks for joining me here. And remember, sharing is caring! Follow me on LinkedIn and share this newsletter with a friend who you think would benefit from it.

