Data Science: Getting a Feel for the Data

I wanted to call this "The Meaning of Statistical Distributions," but that was an ugly title and didn't convey the lessons I have learned as a data scientist.

I thought that many aspiring data scientists, and other curious people, would like a peek into how data science is done. This post is about the part of statistics that is not covered in class: how to develop a "feel" for the data and how to understand the "meaning" of an analytic result. Most people can only hold a couple of things in their head, so what are the most important statistical distributions and how do you use them? What do they mean? And almost as important, what are the common pitfalls?

Statistics is the science of moving from a set of observations to a rule, as opposed to probability, which predicts a result based on a rule. To be effective at teasing out the meaning of data, it is important to be able to examine a data set and guess its underlying meaning.

One of the most common curves is the sigmoid or "S" curve, discovered by Pierre-François Verhulst in 1838 after reading Malthus' works on populations. Here is the classical form:

f(x) = 1 / (1 + e^(-x))

It's important for two reasons: first, it describes things in nature, and second, it is non-linear. That means that whenever we have a "naturally occurring" distribution and someone describes it with a linear regression, we must always ask why, and on what interval that is valid. Without those answers the linear regression is just a correlation between two sets of numbers and will almost always fail given enough time. That is, we can approximate a curve with a straight line if the interval is short enough, but the line does not describe the curve, and outside that interval we will end up with erroneous results.
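To put a number on "short enough," here is a minimal sketch that fits a least-squares line to the S-curve on wider and wider intervals and reports the worst gap between line and curve. It assumes the classical form f(x) = 1 / (1 + e^(-x)); the helper names (logistic, fit_line, worst_line_error) and the interval widths are my own choices for illustration.

```python
import math

def logistic(x):
    """Classical S-curve: f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def worst_line_error(half_width, n=201):
    """Largest gap between the S-curve and its best-fit line on [-w, w]."""
    xs = [-half_width + 2 * half_width * i / (n - 1) for i in range(n)]
    ys = [logistic(x) for x in xs]
    slope, intercept = fit_line(xs, ys)
    return max(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))

# Illustrative interval half-widths; any increasing sequence would do.
widths = (0.5, 1.0, 2.0, 4.0, 8.0)
errors = [worst_line_error(w) for w in widths]
for w, err in zip(widths, errors):
    print(f"interval [-{w}, {w}]: worst error = {err:.4f}")
```

On a narrow interval the line is nearly indistinguishable from the curve; as the interval widens the worst error grows, which is exactly why "on what interval?" is the question to ask.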

But where does this curve come from? What is its meaning? It's a solution to the following differential equation:

df/dx = f(x) (1 - f(x))

If we think about it for a while, we will see that this is an equation that asks: how does the growth of a function of x, f(x), vary with both f(x) and one minus f(x)? What happens when we have a function whose growth depends on one minus itself? The population is self-constrained, that's what. So in this example a population grows when f(x) is small and stops growing when f(x) is large. That gives us meaning and, more importantly, by recognizing where on the curve our dataset lies, it gives us a much better idea of the future than thinking, "What we are experiencing now will continue forever." It also allows us to recognize a business opportunity.
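One way to convince yourself of this is to step the equation forward numerically and watch the S-curve appear. The sketch below assumes the differential equation is df/dx = f(x) (1 - f(x)) with f(0) = 0.5, and uses simple forward-Euler steps; the function names and step size are illustrative choices of mine.

```python
import math

def logistic(x):
    """Closed-form S-curve, used here only as a reference."""
    return 1.0 / (1.0 + math.exp(-x))

def growth_rate(f):
    """Right-hand side of df/dx = f * (1 - f): growth is self-constrained."""
    return f * (1.0 - f)

def integrate(f0, x_end, step=0.001):
    """Forward-Euler integration from f(0) = f0 up to x = x_end."""
    f = f0
    for _ in range(round(x_end / step)):
        f += step * growth_rate(f)
    return f

# Growth is slow when f is small, fastest at the midpoint, slow again near 1.
for level in (0.01, 0.5, 0.99):
    print(f"f = {level:.2f}: growth rate = {growth_rate(level):.4f}")

# The stepped solution tracks the closed-form curve.
for x_end in (1.0, 3.0, 6.0):
    print(f"x = {x_end}: stepped = {integrate(0.5, x_end):.4f}, "
          f"closed form = {logistic(x_end):.4f}")
```

The growth rate peaks at the midpoint of the curve, which is the turning point between the "new market" and "commoditized market" regimes.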

Early on, on the left side of the curve, growth is exponential, which is an indicator of a new market. In the middle part of the curve, business markets grow in a predictable fashion. Finally, on the right side, growth slows as the market becomes commoditized and products compete on price alone. But what happens in this late market when we remove the constraint? We create new markets. Using this technique a business can identify a path and timetable for switching products and can create value for the investor. A data scientist's job is to find a way to remove the constraint. A business person's job is to make that happen in the real world.

Doing nonlinear regression is hard, but there are clever mathematical tricks we can use to fix up our data so that linear regression can be used. For instance, what if we apply the inverse of the S-curve function to it?

ln( f(x) / (1 - f(x)) ) = x

That looks like a straight line, which we could analyze using linear regression. So if our data points can be transformed via the inverse equation, we can map back and forth between the linear and non-linear regimes. This straightening trick is a very nice tool to have in your bag of tricks. Often it is quicker than writing a new program to do non-linear regression.
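Here is a rough sketch of the straightening trick, assuming the inverse in question is the log-odds transform ln(y / (1 - y)). The data are synthetic: the "true" rate and offset, the noise level, and all names are hypothetical values I picked for illustration, not anything measured.

```python
import math
import random

def logistic(z):
    """S-curve: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(y):
    """Inverse of the S-curve: maps (0, 1) back onto a straight line."""
    return math.log(y / (1.0 - y))

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical "true" parameters, chosen only for illustration.
TRUE_RATE, TRUE_OFFSET = 0.8, -2.0

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [0.25 * i for i in range(21)]  # x from 0 to 5
ys = []
for x in xs:
    noisy = logistic(TRUE_RATE * x + TRUE_OFFSET) + rng.gauss(0.0, 0.01)
    ys.append(min(max(noisy, 0.001), 0.999))  # keep values strictly inside (0, 1)

# Straighten the data, run plain linear regression, then map back.
rate, offset = fit_line(xs, [logit(y) for y in ys])

def predict(x):
    """Forecast on the original S-curve scale using the fitted line."""
    return logistic(rate * x + offset)

print(f"fitted rate = {rate:.3f}, fitted offset = {offset:.3f}")
print(f"forecast at x = 10: {predict(10):.3f}")
```

One caveat with this trick: the transform stretches noise near 0 and 1, so points at the extremes of the curve carry more uncertainty than points in the middle.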

In the case below we have made the data dependent on the S-curve, but on the interval 0-1 it looks totally linear. If your local data scientist had run a linear regression and you had used it as a business plan, you would be in for a nasty surprise when the market changes: you have spent a lot of money tooling up and now have to sell into a contracting market!

The second plot shows what would have happened if we had more data.
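The same scenario can be sketched in a few lines. I am assuming the data follow the classical S-curve f(x) = 1 / (1 + e^(-x)) and that the "early" observations cover the interval 0-1; the sample count and forecast horizons are hypothetical.

```python
import math

def logistic(x):
    """Classical S-curve: f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# "Early" data: 11 observations on the interval 0-1.
early_x = [i / 10 for i in range(11)]
early_y = [logistic(x) for x in early_x]
slope, intercept = fit_line(early_x, early_y)

def line_forecast(x):
    """What the linear business plan predicts."""
    return slope * x + intercept

# Inside the fitted interval the line looks like a perfect description.
worst_early_gap = max(abs(y - line_forecast(x))
                      for x, y in zip(early_x, early_y))
print(f"worst gap on 0-1: {worst_early_gap:.4f}")

# Outside it, the line keeps climbing while the curve levels off.
for x in (2, 3, 4, 5):
    print(f"x = {x}: line says {line_forecast(x):.3f}, "
          f"curve says {logistic(x):.3f}")
```

The line fits the early data almost perfectly and then sails past the ceiling that the S-curve can never exceed.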

Uh-oh, the market is going to slow down in two years.

Time to talk to R&D about the next product! Let me know what you think. Are these technical posts helpful?

John Ryan nice job. Would love to read more articles like this geared towards the layman description on data science. I enjoyed this one.
