Data 101: Hammers and Nails
This is the first in a series of articles I intend to write distilling my (somewhat) recent academic history with data and analytics into a tractable format for both data professionals and data consumers. The goal is to share how I, coming from a primarily academic background, approach problem solving using the tools of data science and analytics, and how those tools can be applied to the real world. Some of the content will be concepts that I have shared in classes I have taught, while the rest will be mental models and methods that I have relied upon in my own work. I hope that some will find this content useful and informative, and that through this journey I will be able to learn as well.
A Running Analogy: Building a house
When asked by students (or potential employers) how I would describe what data science is, I like to fall back on the analogy of building a house. I will refer back to this analogy (and at times stretch it a bit) throughout these articles. Here I will build the foundation (see, I already started) of my story of data analytics.
A home or a hovel
The initial stage is figuring out what question or problem you are trying to solve: are you building a house, an apartment building, or a store? Each of these requires a different set of materials, plans, and tools. In the world of data, knowing what type of question you are asking is the primary motivator for everything that follows. Are you trying to predict customer behavior, identify choke points in production, or understand risk factors in new markets? The methods and data required to answer these disparate questions are as different as choosing to build a structure out of steel or wood, and fastening it with rivets or nails.
One ring doesn’t rule them all
Enter my favorite quip: to a man with a hammer, every problem looks like a nail. When you boil down data science, all of the approaches, methods, and algorithms are simply tools. Yes, you theoretically could build a whole house with just a sledgehammer, nails, and wood, but it would be ugly, inefficient, and difficult. A saw, for example, is useful for cutting boards down to size (instead of trying to beat them until they break). Data analytics is much the same. You could use a machine-learning algorithm and terabytes of data to figure out how many widgets you need to sell to break even this quarter, but the same question can be answered with a simple statistical model and a small sample of your data. That answer will be quicker, cheaper, and sufficiently accurate. Knowing which set of tools and data is appropriate to answer the question at hand is one of the most important aspects of data analytics. There is no magic method that will accurately and efficiently answer every question or problem your company might face.
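To make the widget example concrete, here is a minimal sketch of the "simple tool" approach. It assumes a plain break-even model (fixed costs divided by the margin per unit) and estimates the per-unit cost from a small sample. The function name, parameters, and all of the numbers are hypothetical and purely for illustration.

```python
import math
from statistics import mean


def break_even_units(fixed_costs, unit_price, unit_cost_sample):
    """Estimate how many units must be sold to cover fixed costs.

    Assumption: a simple linear break-even model, with the per-unit
    cost estimated as the mean of a small sample of observed costs.
    """
    avg_unit_cost = mean(unit_cost_sample)
    margin = unit_price - avg_unit_cost
    if margin <= 0:
        raise ValueError("unit price does not cover the average unit cost")
    # Round up: you cannot sell a fraction of a widget.
    return math.ceil(fixed_costs / margin)


# Hypothetical figures: four sampled unit costs, not real data.
sample_costs = [6.0, 6.5, 5.5, 6.0]
units = break_even_units(fixed_costs=12000.0, unit_price=10.0,
                         unit_cost_sample=sample_costs)
print(units)  # 3000
```

A handful of observations and one line of arithmetic answer the question here; the terabytes and the machine-learning model can stay in the toolbox until a problem actually calls for them.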
It takes a village to raise a [barn]
Crossing some metaphors and sayings gets us to the next crucial step in solving a problem: identifying the members of your team. In academia, I did much of my work flying solo, but when I was running or involved in group projects, the results were much better, because every individual has a particular set of skills they are most adept at. Looking at the slew of data-centric jobs out there, I am often struck by the lack of communal agreement on what each title means and what each role does. My perception and explanation here will reflect my bias coming from academia, and may not represent the actual designations, but who knows, perhaps they should.
Building a house requires a group of people who have distinct and particular roles, backgrounds, and training. The plans for the building are designed by an architect (a team effort in the data world). The materials (data) need to be chosen and sourced (Data Analysts), then warehoused and transported to the site (Data Architects and Engineers), whereupon construction can begin (Data Scientists). All the while, the General Contractor (Project Manager) is in the background making sure that things are on spec and on budget. Once construction is finished, the sales team (Data Analysts) takes the product, polishes it up, puts in pretty furniture and flowers (data visualization), shows it off to the new homeowner (CXO/client), and proves to them that this house is the answer to their problem.
Calling a spade a spade, or sometimes a shovel, maybe a manual ground penetration and removal device…
Can all of this be accomplished by a single individual/role? Absolutely, given the right individual. However, I would argue that the process is accelerated and improved by having a team whose members can focus their attention on what they do best. One of the big differences between the “accepted” distinction between Data Scientists and Data Analysts and the way I use those titles is where each fits in the pipeline. I have heard Data Analysts described as “junior Data Scientists” doing small-sample statistics, while Data Scientists do Big Data and create new algorithms. While this may be true in practice, I think that the role of what I call a Data Analyst is to guide the methodological approach used by the Data Scientists, interpret the evidence they provide, and translate it into actionable insights. I would argue that an Analyst thinks in a different way than a Scientist, and that both roles are necessary to take a problem and provide a solution.
It’s all Geek to me
The whole point of this gobbledygook is that someone who has a problem, and doesn’t know how to get the answer themselves, has asked a data professional to figure it out and get back to them. They don’t care how you got the answer, as long as it works (they care a lot if it doesn’t work, though). What they care about is being given information that they can do something with. This is where two pearls of wisdom that were imparted to me come in, and I will share them here.
The first comes from my policy analytics background. Often, a policy analyst will have a 30-minute briefing with a politician/policy maker cancelled due to whatever it is politicians/policy makers have to do that is more important. We were trained, therefore, to give the so-called elevator speech. The idea is that, in lieu of your briefing, you have 45 seconds (approximately the duration of an elevator ride) to convey your analysis clearly and concisely to the client. Aside from the practical use of this technique when a meeting is significantly curtailed, the inherent wisdom is that if you can’t make a convincing case for your analysis in under a minute, you either don’t fully understand it or you lack clarity and concision. In both cases, drawing it out for a full 30 minutes wastes about 29 of them. Take time to practice your elevator speech to help ensure that not only your client understands your message, but that you do too.
The second comes from one of my professors who has a real job outside of academia in the consulting world. His message was simple: explain it to me in words my 87-year-old grandmother would understand. The idea is similar to the elevator speech. Data science is loaded with jargon, both from the industry itself and from the reasonably high-level statistical and mathematical methods we employ. By and large, the individual who is consuming the information either doesn’t know all the math (otherwise they could do it themselves, so why pay you?) or doesn’t have the time to listen to all your equations, significance tests, and so on. Translating the “Geek Speak” into plain English ensures that you are able to communicate your message clearly and understandably to the client, and that you yourself better understand what it is you are claiming.
An answer in search of a question
Let’s say that you own a construction company and, in the process of building a bunch of houses, you have managed to stockpile a surplus of materials. This might be a good opportunity, during the off-season, to invest some of your profits into a side project. Dig through your materials and see if you’ve got enough to build a new type of house. It may sell, it may fall down, or it may languish for years before an interested buyer comes along. Building on spec is akin to jumping into your data without a defined problem ahead of time. It may pay off, it may even pay off big, but it carries inherent risk: the sunk cost of storing potentially unnecessary data, the time and labor invested in an exploratory project, and the chance of coming up with answers to problems that are impractical to solve.
Students of mine (as well as myself on occasion) often fall into this trap: instead of forming a hypothesis and then collecting and analyzing the data to support (or not support) it, they find a handy data set and squeeze it until something looks significant enough to write a paper on. Sometimes, however, all the squeezing in the world won’t net anything of value, and you find yourself three-quarters of the way through your project timeline with no progress made, no clear question asked, and the very real prospect of having to start from scratch with only a quarter of the time you originally had. If you have a big enough budget and team, then it may be worth the risk. But if you are on a tight budget, investing in blind data mining may not be a good initial approach to using your data (and money) wisely.
The takeaway
- Identify the question/problem first.
- Select the appropriate data and methods/models to answer the question at hand.
- Build a team that will be able to carry the project from planning to delivery. It’s not necessarily about getting the most perfect answer; it’s about getting the most practical and achievable answer for the given problem.
- Be able to clearly and concisely communicate that answer in a format the client can understand and act upon.
- Work within the constraints of reality with regard to risk vs. reward. Hoarding data and searching for answers to imaginary problems is a potentially costly endeavor with no guaranteed payoff.
Peter, I also find myself constantly making house analogies to coworkers and customers. Not sure if any of our professors ever made them, but there seems to be a CGU connection. I find that your analysis rings true even for those who don't hold the titles or perform the full time roles you've described. There are two ways in business to roll with the new possibilities of data production, collection, analysis, and presentation: devote full time resources or ask capable individuals to use their own toolboxes to address these problems inside their current role. Regardless of industry, title, or responsibility, we are all attempting to do more with everything we are collecting to tell a better story/make our case/argue for resources/etc. and your post here speaks to all, not just the professionals. p.s., I too rely upon the 87 year old grandma approach given to us. Works like a charm.
I enjoyed your article. Thank you.