Machine Learning and the curse of randomness
Before I get to the point, I would like to touch upon two basic and relevant concepts from statistics: the Central Limit Theorem (CLT) and Maximum Likelihood Estimation (MLE).
In simple terms, the Central Limit Theorem says that in most situations, when independent random variables are sampled, their properly normalized sum tends toward a normal distribution (informally, a "bell curve") even if the original variables themselves are not normally distributed. It is best explained by the coin-tossing experiment: if we toss a coin many times and record the number of heads over a large number of trials, these counts follow a bell-curve distribution centred on their mean.
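The coin-tossing experiment above can be sketched in a few lines. This is a minimal simulation, assuming only NumPy: we repeat a 100-toss experiment many times and check that the head counts cluster around the mean predicted by the bell curve.

```python
# Minimal sketch of the CLT via the coin-tossing experiment described above.
# Each experiment tosses a fair coin 100 times; over many experiments the
# head counts approximate a normal distribution.
import numpy as np

rng = np.random.default_rng(seed=42)

n_tosses = 100          # tosses per experiment
n_experiments = 10_000  # number of repeated experiments

# Each row is one experiment; summing along a row counts its heads.
heads = rng.integers(0, 2, size=(n_experiments, n_tosses)).sum(axis=1)

# For a fair coin, the bell curve is centred at n*p = 50 with
# standard deviation sqrt(n*p*(1-p)) = 5.
print(heads.mean())  # close to 50
print(heads.std())   # close to 5
```

Plotting a histogram of `heads` would show the familiar bell shape emerging even though each individual toss is just a 0/1 variable.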
OK, so now we have something called the Central Limit Theorem; as an ML enthusiast, what does it have to do with my model? Here comes another method that leans on the properties of the CLT: Maximum Likelihood Estimation (MLE). In statistics, MLE is a method of estimating the parameters of a statistical model given observations. It attempts to find the parameter values that maximize the likelihood function; the resulting estimate is called a maximum likelihood estimate, which is also abbreviated MLE.
MLE is a very useful tool for estimating the hypothesis function in ML. For example, if we have a regression model, how do we find the parameters that minimize the error function? An error function measures the difference between estimated and observed values; the closer they are, the better the parameter estimates. MLE commonly assumes that, for a large sample size, the distribution of the errors follows the CLT, i.e. the sampling distribution eventually approaches a bell curve. Here lies the key: such a normal distribution allows a statistician to estimate or predict the values of the dependent variable. This works because once the distribution is known mathematically, it can be readily translated into a hypothesis for the given data samples.
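A minimal sketch of MLE, assuming normally distributed observations as the text describes: for a Gaussian, the likelihood-maximizing parameters have a closed form, namely the sample mean and the (biased) sample standard deviation, so we can check them directly against the true values used to generate the data.

```python
# Minimal MLE sketch for a normal distribution (illustrative data only).
import numpy as np

rng = np.random.default_rng(seed=0)
true_mu, true_sigma = 3.0, 1.5
data = rng.normal(loc=true_mu, scale=true_sigma, size=5_000)  # "observations"

# Closed-form maximum likelihood estimates for a Gaussian:
mu_hat = data.mean()                                 # MLE of the mean
sigma_hat = np.sqrt(((data - mu_hat) ** 2).mean())   # MLE of the std (biased)

print(mu_hat, sigma_hat)  # close to 3.0 and 1.5
```

For models without a closed form, the same idea applies: write down the likelihood of the observations and search numerically for the parameters that maximize it.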
Even Bayesian statistics uses MLE, in a slightly different form. From the point of view of Bayesian inference, MLE is a special case of maximum a posteriori (MAP) estimation that assumes a uniform prior distribution over the parameters; under that uniform prior, the maximum likelihood estimator coincides with the most probable Bayesian estimator.
Enough statistics; let's come back to the actual issue. Assume we are trying to build a model for credit card fraud detection. Theoretically, its implementation is straightforward: collect a large set of transactions, apply feature engineering, and then use this curated data to train the model. Model training uses both MLE and the CLT to find the parameter estimates that best fit the model. Will it work so easily? Unfortunately, life is not so easy. Why? Because the transaction data, although a large sample, contains very few transactions that indicate fraud! In statistical terms, the data is imbalanced, its distribution is skewed, and what we are looking for is essentially an outlier. So what is the solution? The ML specialist's way of fixing this issue is to balance the data before training the model. Huh! Isn't that like dressing up the culprit in the crowd before the investigation has even started?
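The imbalance problem above is easy to demonstrate with made-up numbers (the fraud rate here is illustrative, not real data): a model that never predicts fraud scores almost perfect accuracy, and the common remedy the text mentions, balancing the classes by oversampling the minority, visibly changes the data the model is trained on.

```python
# Sketch of class imbalance in fraud detection, with synthetic labels.
import numpy as np

rng = np.random.default_rng(seed=1)
n = 100_000
labels = (rng.random(n) < 0.002).astype(int)  # ~0.2% of transactions are fraud

# A "model" that always predicts "not fraud" looks excellent on accuracy...
accuracy = (labels == 0).mean()
print(accuracy)  # ~0.998, yet it catches zero frauds

# One common fix: oversample the minority class until the classes balance.
fraud_idx = np.flatnonzero(labels == 1)
normal_idx = np.flatnonzero(labels == 0)
oversampled = rng.choice(fraud_idx, size=normal_idx.size, replace=True)
balanced = np.concatenate([normal_idx, oversampled])
print(labels[balanced].mean())  # 0.5: half the training rows are now "fraud"
```

Accuracy on the raw data is therefore a misleading metric here; this is exactly why the resampling step feels like tampering with the evidence before the investigation starts.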
The above is precisely why the great author Nassim Taleb hates the bell curve. In his famous book The Black Swan, there is a chapter devoted to this topic. What he beautifully explains is that outliers could be anywhere, but in any normal distribution the odds of a deviation decline faster and faster ("exponentially") as you move away from the average!
Another puzzle: if a random variable over a large sample size follows a bell-curve pattern, and that pattern is used as the basis of prediction, can the variable really be said to be random? But that is how things are in reality; coin tossing is one such example, and even the great scientist Galileo sensed this long ago: he noticed that although the measurement errors in his experiments were random, their distribution followed the bell curve.
In short, don't think that machine learning is a silver bullet, especially when your intention is to find outliers in random samples, as in credit card fraud or anomaly detection in invoice payments. Better to be careful now than sorry later.
The CLT is very powerful, but it is no cure for the curse of randomness.