Demystifying Statistics: Random Variables and Distribution Functions


Computation and analysis of data rely increasingly on statistical software. Machine Learning has revolutionized the way we understand data, which is great. However, in the process, we often lose sight of what is happening behind the scenes. While most people are familiar with running a complicated regression model in STATA, for example, very few understand the actual mechanics behind it. On the other hand, theory tends to get extremely mathematical, which leads to the same problem. In this post and future posts, I intend to demystify certain statistical, econometric and economic concepts in very simple language, as I understand them. This is an attempt to brush up my own concepts, and what better way to do it than to lecture an imaginary audience!

In this post, I will answer the question ‘What is Statistics about?’. In very simple terms, we have a given sample and a related statistic. The objective in Statistics is to infer something, or test a hypothesis, about the population parameter corresponding to that statistic. To explain this, we need to understand the basic building blocks of Statistics – random variables and distributions.

What is a random variable?

A random variable is a function that assigns a unique numerical value to each possible outcome of a random experiment under fixed conditions. Despite the name, a random variable is not a variable but a function that maps outcomes to numbers.

Example:

Suppose that a coin is tossed three times and the sequence of heads and tails is noted. The sample space for this experiment is: S={HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}. Now, let the random variable X be the number of heads in three coin tosses. X assigns each outcome in S a number from the set Sx = {0, 1, 2, 3}. The table below lists the eight outcomes of S and the corresponding values of X.

Outcome:  HHH  HHT  HTH  THH  HTT  THT  TTH  TTT
X:        3    2    2    2    1    1    1    0

X is then a random variable taking on values in the set Sx = {0, 1, 2, 3}.
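The mapping above can be sketched in a few lines of Python (an illustrative sketch, not code from the post): we enumerate the sample space S and apply X as a function that counts heads.

```python
from itertools import product

# Sample space: all sequences of three coin tosses
S = ["".join(seq) for seq in product("HT", repeat=3)]

# The random variable X maps each outcome to its number of heads
X = {outcome: outcome.count("H") for outcome in S}

print(X)
```

Running this shows each of the eight outcomes paired with a value from {0, 1, 2, 3}, e.g. 'HHH' → 3 and 'TTT' → 0, matching the table above.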

Distribution Functions

The probability function of a random variable is known as its distribution. In the above example, the random variable X is defined as the number of heads, and it can take the values 0, 1, 2 or 3. The probability of X=0, X=1, X=2, X=3, or in general X=x, can be calculated. In this case,

X      0    1    2    3
f(X)   1/8  3/8  3/8  1/8

The value that f(X) takes for each x can be calculated using a rule or function called the probability function. For a discrete random variable such as X, this function f(x) is known as the probability mass function (pmf); for a continuous random variable, its analogue is the probability density function (pdf). If we plot the above data with the random variable X on the X-axis and f(X) on the Y-axis, we get a bar graph representing the pmf. If X were continuous, the probabilities over small intervals could be shown as a histogram, and as the intervals become finer we would get a smooth curve – the pdf.
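The pmf in the table can be derived directly from the sample space (again a minimal sketch, assuming all eight outcomes are equally likely, so P(X = x) is simply the count of matching outcomes divided by 8):

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Sample space: all sequences of three coin tosses
S = ["".join(seq) for seq in product("HT", repeat=3)]

# Count how many of the 8 equally likely outcomes give each value of X,
# then divide by 8 to get exact probabilities
counts = Counter(outcome.count("H") for outcome in S)
pmf = {x: Fraction(c, len(S)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)
```

This reproduces the table: f(0) = 1/8, f(1) = 3/8, f(2) = 3/8, f(3) = 1/8, and the probabilities sum to 1 as any pmf must.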

Now, there are three types of distributions:

(i) Population distribution – theoretical distribution of the expected frequencies of population values

(ii) Sample Distribution – observed distribution of observed frequencies of sample values

(iii) Sampling Distribution – theoretical distribution of the values that a specified statistic of a sample takes on in all of the possible samples of a specific size that can be made from a given population

Corresponding to a population, there is a set of outcomes that a particular variable can take. In a large population, we can form an expectation of the frequency with which each outcome occurs. The variable and the expected frequencies together can be plotted as a density function (the population distribution – how the values of X are expected to be distributed). For example, consider the variable income in a very large population. Typically, we do not have the actual income of each and every person, but we do have an expectation of how income would be distributed in a large population. This is a theoretical distribution. It is a whole array of numbers and can be condensed into a few numerical properties (descriptive measures) such as the mean and standard deviation. Such numerical properties of a population distribution are called parameters. For a given population distribution, there is one unique value for a given parameter.

If we take a sample from the population, we actually observe the values that the variable takes for the sampled individuals. We can find the range of values and the actual frequency with which each value occurs. The plot of the outcomes the variable has taken in the sample, along with their actual frequencies, is the sample distribution. This distribution is also an array of numbers (though a smaller one) and can be described using numerical properties called statistics. For a given sample, there is only one value for a given statistic.

Suppose there is a population of size 1000 (say, all students of a college), and each student is given an exercise: select a sample of 100 students and calculate the mean mark of that sample. The sample that one person chooses will almost certainly differ from the sample that another person chooses, so we end up with 1000 means, one for each sample. If, instead, all possible samples of size 100 were drawn from the population and their means enumerated, the distribution of these means would be the sampling distribution of the statistic – in this case, the mean. Similarly, there are sampling distributions for other statistics such as the standard deviation. With two variables, we can have sampling distributions for the correlation coefficient, the regression coefficient, etc.; with two samples, for the difference in means.

Note that the population distribution and the sampling distribution are both theoretical distributions – we have an expectation of how the values would be distributed, not the actual distribution. The only actual distribution we observe is that of the sample, from which we calculate our statistic. Our objective in Statistics is to infer something, or test a hypothesis, about the population parameter given our sample and statistic. Returning to the example above: from the average mark of the 100 students in my sample, can I infer something about the average mark of all 1000 students in the population?

So, that is really it! Statistics is all about trying to make an informed guess and all the fat books teach us how to make this guess as precise as possible. In the next post, I will explain how these distributions come to play in regression analysis. I hope this post is useful for those who do not have a background in Statistics. For others who found this post rudimentary but still managed to read it, thanks a lot! Comments and suggestions are welcome!
