Understanding probabilistic reasoning and statistical modeling in a non-technical way

While training people on Analytics and Data Science, I have found that many of them struggle to understand the underlying concepts. Hence I have tried to write about them in the most non-technical way possible, in the hope of inculcating an attitude of statistical and probabilistic thinking.

If we look at the classical definition of probability, it goes like this: if a random experiment can produce 'n' mutually exclusive and equally likely outcomes, and if 'm' of these outcomes are considered favorable to the occurrence of a certain event A, then the probability of the event A, denoted by P(A), is defined as the ratio m/n. If we recall our high school mathematics, we solved all the coin-flipping, dice-throwing and playing-card problems using this definition. Knowingly or unknowingly, we assumed that all the outcomes are equally likely, even though that may not be the case. For example, in real life a die might have been manipulated so that getting a 5 is more likely than getting any other number. We can only discover this by actually performing the experiment many times. Technically, we can say we are completely uncertain about the prior distribution of the outcomes, hence the uniform assumption is the best choice. But let us set that aside for the time being. What we can say is that this definition is NOT applicable when the assumption of equally likely outcomes does not hold, and it also becomes vague when the possible outcomes are infinite, or even uncountable, in number.
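
To make the ratio m/n concrete, here is a minimal Python sketch (my own illustration, not from the original text) that counts favorable outcomes for the event "rolling an even number" on a fair six-sided die:

```python
# Classical probability: P(A) = m / n, assuming all outcomes are equally likely.
outcomes = [1, 2, 3, 4, 5, 6]                    # n = 6 mutually exclusive outcomes
favorable = [o for o in outcomes if o % 2 == 0]  # event A: an even number

m, n = len(favorable), len(outcomes)
print(f"P(even) = {m}/{n} = {m / n}")            # P(even) = 3/6 = 0.5
```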

Now let us use the relative frequency approach to define probability. The essence of this definition is that if an experiment is repeated a large number of times under (more or less) identical conditions, and if the event of our interest occurs a certain number of times, then the proportion of repetitions in which this event occurs is regarded as the probability of that event. For example, we know that a large number of students sit for the matriculation examination every year, and that a certain proportion of these students obtain the first division, a certain proportion obtain the second division, and a certain proportion fail. Since the total number of students appearing for the examination is very large, the proportion of students who obtain the first division can be regarded as the probability of obtaining the first division, the proportion who obtain the second division as the probability of obtaining the second division, and so on. As such, this definition is very useful in those practical situations where we want to compute a probability numerically but the classical definition cannot be applied, since in numerous real-life situations the various possible outcomes of an experiment are NOT equally likely.
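
As an illustration (again my own sketch, with assumed numbers), here is how the relative frequency of an event on a biased die settles down as the number of trials grows, assuming the die is loaded towards 5:

```python
import random

random.seed(42)

# A biased die: 5 comes up with probability 0.30, the rest share the remainder.
faces   = [1, 2, 3, 4, 5, 6]
weights = [0.14, 0.14, 0.14, 0.14, 0.30, 0.14]

for trials in (100, 10_000, 1_000_000):
    rolls = random.choices(faces, weights=weights, k=trials)
    freq = rolls.count(5) / trials     # relative frequency of the event "roll a 5"
    print(f"{trials:>9} trials: P(5) ~ {freq:.4f}")
# The estimates converge towards the true 0.30 as the number of trials grows.
```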

Hence it is pretty clear now that, as far as quantifiable probability is concerned, in those situations where the various possible outcomes of our experiment are equally likely, we can compute the probability prior to actually conducting the experiment (an 'a priori' probability). Otherwise, as is generally the case, we can compute a probability only after the experiment has been conducted (which is why it is also called an 'a posteriori' probability).

Now things become very interesting. Let us take an example where a student is asked how confident he or she is about passing an examination that he or she will appear for the first time. If we go by the equally likely approach, the answer is 50%, but then our question becomes vague: we are not taking the student's belief into account, and we have no past evidence with which to apply the relative frequency definition before the student has ever taken the exam. This kind of non-quantifiable probability is called inductive probability. An important point to note is that inductive probabilities are difficult to express numerically, but that should not prevent the student from expressing his or her belief on a scale, even though we have no evidence to prove or disprove it. This brings us to the most formal definition of probability: the axiomatic definition.

This definition, introduced in 1933 by the Russian mathematician Andrei N. Kolmogorov, is based on a set of AXIOMS. Let S be a sample space with the sample points E1, E2, …, Ei, …, En. To each sample point we assign a real number, denoted by the symbol P(Ei) and called the probability of Ei, that must satisfy the following basic axioms:

Axiom 1: For any event Ei, 0 ≤ P(Ei) ≤ 1.

Axiom 2: P(S) = 1 for the sure event S.

Axiom 3: If A and B are mutually exclusive events (subsets of S), then P(A or B) = P(A) + P(B).
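
As a quick sanity check (my own sketch, with a hypothetical probability assignment), we can verify that all three axioms hold for a small discrete sample space:

```python
# A hypothetical probability assignment over a sample space S = {E1, ..., E4}.
P = {"E1": 0.1, "E2": 0.4, "E3": 0.3, "E4": 0.2}

# Axiom 1: every probability lies between 0 and 1.
assert all(0 <= p <= 1 for p in P.values())

# Axiom 2: the probabilities over the whole sample space sum to 1.
assert abs(sum(P.values()) - 1.0) < 1e-9

# Axiom 3: for mutually exclusive events A and B, P(A or B) = P(A) + P(B).
A, B = {"E1", "E2"}, {"E3"}            # disjoint subsets of S
p_A_or_B = sum(P[e] for e in A | B)
assert abs(p_A_or_B - (sum(P[e] for e in A) + sum(P[e] for e in B))) < 1e-9

print("All three axioms hold for this assignment.")
```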

At the very least, under this definition we can assign a number between 0 and 1 to an event, even when it represents a belief and we have no supporting evidence to prove or disprove it. We may call this the probability of that event as long as no evidence is available. In data analysis under the Bayesian philosophy, we can take this as the prior probability, as I said earlier, and once evidence is received we can obtain the posterior probability by combining the prior with the evidence. It is up to us, though, to decide whether to take this subjective belief into account in our data analysis. Anyway, we will now talk about this in detail.
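
To show how a prior belief is combined with evidence, here is a minimal sketch of my own using Bayes' theorem with made-up numbers: suppose the student's prior belief of passing is 0.6, and we then observe evidence, say a good mock-test score, that is more likely if the student is going to pass:

```python
# Bayes' theorem: P(pass | evidence) is proportional to P(evidence | pass) * P(pass).
prior_pass = 0.60                      # the student's subjective prior belief
prior_fail = 1 - prior_pass

lik_evidence_given_pass = 0.80         # assumed: P(good mock score | pass)
lik_evidence_given_fail = 0.30         # assumed: P(good mock score | fail)

# Unnormalized posteriors, then normalize so they sum to 1.
unnorm_pass = lik_evidence_given_pass * prior_pass
unnorm_fail = lik_evidence_given_fail * prior_fail
posterior_pass = unnorm_pass / (unnorm_pass + unnorm_fail)

print(f"Posterior P(pass | evidence) = {posterior_pass:.3f}")  # = 0.800
```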

Now it is time for me to explain what the Bayesian philosophy is. It involves a completely different approach to statistics. Here we consider the Bayesian version of estimation for the basic situation of estimating a parameter given a random sample from a particular distribution. Classical estimation uses the method of maximum likelihood. The fundamental difference is that in Bayesian methods the parameters of the data's distribution are themselves treated as random variables. I will explain this below.

Let us take the example of a student appearing for an examination in a particular subject. As long as the student has not yet appeared for the examination, we can take his belief (as a percentage, or a value between 0 and 1) about scoring different marks as the distribution of his mark, treated as a random variable. Now assume the student has appeared for the exam several times and we keep a note of the marks he scored in the different examinations. This is our evidence. If we consider the distribution to be normal, then its mean and variance are called its parameters. For those who do not know what the normal distribution is, I can simply say that it is a distribution in which random values near the mean are more likely than values far from the mean (though other distributions, such as the t-distribution, share this characteristic), and if we plot the frequency of the values against the values themselves, we get a nice shape with a peak at the mean and a roughly downward slope towards both ends. (Most of the time the normal assumption works, unless we try to model events with very few discrete outcomes.)

If we ignore the student's prior confidence and fit the distribution by calculating the mean and variance directly from the evidence, that is called the maximum likelihood estimate (MLE). If we instead assume a distribution for the prior belief and calculate the posterior parameters, obtaining the posterior distribution as the product of the likelihood and the prior, this is called the maximum a posteriori (MAP) estimate. Sometimes we may take a fully Bayesian approach: keep an entire distribution over the parameters and compute the outcome as a weighted sum of the outputs of different models, with the posterior probabilities as the weights. Based on any of these estimates, we can obtain the distribution of the student's marks and predict, with a certain degree of certainty or probability, what marks the student will get if he appears for the exam again and again. Of course, this whole exercise is required only if we want to make an inference about the student's performance in the exam and use it for prediction; different scenarios will have different kinds of requirements.
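
Here is a minimal sketch (my own, with made-up marks) contrasting the MLE of the mean with a MAP estimate under a normal prior, assuming the exam-to-exam variance of the marks is known:

```python
from statistics import mean

# Evidence: marks (out of 100) the student scored in past exams (made-up data).
marks = [62, 68, 71, 65, 70]
n = len(marks)

# Maximum likelihood estimate of the mean: just the sample average.
mle_mean = mean(marks)

# Prior belief about the mean mark, expressed as a normal distribution:
# the student believes his "true" average is around 75, with some uncertainty.
prior_mean, prior_var = 75.0, 25.0     # assumed prior N(75, 25)
noise_var = 36.0                       # assumed known variance of marks across exams

# MAP estimate for a normal mean with known variance and a normal (conjugate)
# prior: a precision-weighted average of the prior mean and the sample evidence.
posterior_precision = 1 / prior_var + n / noise_var
map_mean = (prior_mean / prior_var + sum(marks) / noise_var) / posterior_precision

print(f"MLE mean = {mle_mean:.2f}")    # driven purely by the evidence
print(f"MAP mean = {map_mean:.2f}")    # pulled towards the prior belief of 75
```

With more and more evidence, the n / noise_var term dominates and the MAP estimate converges towards the MLE, which is exactly the Bayesian intuition: the prior matters most when data is scarce.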

Anyway, this is a very simple explanation. We need a cost function or error function to test the validity of our assumption, i.e. to find out how accurately our model predicts reality. Then we have to get into other topics, such as optimization techniques for different kinds of models, to minimize the error and obtain the best estimates for our model parameters.

We will have to make an assumption and build a model. Based on the model, we then validate our assumption by matching the predicted output against reality; this comparison serves as a kind of quality metric. If our assumption is far from reality, we may have to try different assumptions and build different models based on them, until we get closer to reality, i.e. until the cost or error function reaches its least value and we get the most suitable parameters for that model. This applies to Big Data analysis as well, though the tools and applications may differ. And we must always keep one thing in mind: as the statistician George E. P. Box famously put it, all models are wrong, but some are useful.
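
As a final illustration (again my own sketch, with toy numbers), here is how a simple error function, the mean squared error, lets us compare two candidate models and keep the one closer to reality:

```python
# Toy data: actual marks observed in five exams (made-up numbers).
actual = [62, 68, 71, 65, 70]

def mse(predicted, observed):
    """Mean squared error: the average squared gap between model and reality."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)

# Model A assumes the student always scores around 75 (prior belief alone).
model_a = [75] * len(actual)
# Model B predicts the running sample mean of the marks seen so far.
model_b = []
for i in range(len(actual)):
    seen = actual[:i] or [75]          # fall back to the prior before any evidence
    model_b.append(sum(seen) / len(seen))

for name, preds in (("A (prior only)", model_a), ("B (sample mean)", model_b)):
    print(f"Model {name}: MSE = {mse(preds, actual):.2f}")
# The model with the lower MSE is the one whose assumptions sit closer to reality.
```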
