The Intuition behind Bayesian Optimisation

Thomas Bayes was a brilliant statistician, philosopher and Presbyterian minister. His name is associated with several interpretations of probability, particularly those concerning confidence in the strength of beliefs and hypotheses. Of specific interest here is his theorem on probability, which provides us with a way to update our beliefs as new and relevant pieces of evidence arrive.

P(A|B) = P(B|A) * P(A) / P(B)

This representation of his theorem is taken from Wikipedia. Let's understand its elements.

P(A|B) - the posterior, i.e. the probability we are trying to estimate

P(B|A) - the likelihood, based on the new evidence

P(A) - the prior, and P(B) - the marginal likelihood, or normalising factor

To tie these elements together, I refer to a very beautiful explanation put forth by Devin Soni in his article 'What is Bayes Rule?' posted on towardsdatascience.com.

So let's take a few theoretical numbers and put Bayes' rule into practice. Say we are going about detecting cancer. Current data shows the prevalence of cancer as 5% in the general population, so if 100 people were tested, we would expect about 95 of them to be without cancer. However, it is also known that smoking increases the risk of cancer. So if we further assume that 10% of the population smokes and that 20% of people with cancer also smoke, can we use Bayes' rule to estimate an individual's increased risk of cancer given that the person is a smoker?

P(A) = 0.05 (prevalence of cancer - the prior, which we typically know)

P(B) = 0.1 (prevalence of smoking - the marginal likelihood)

P(B|A) = 0.2 (the likelihood - the proportion of cancer patients who are also habitual smokers)

Therefore the posterior P(A|B) = P(B|A) * P(A) / P(B) = 0.2 * 0.05 / 0.1 = 0.1. In other words, the probability that the person has cancer doubles to 10% once we learn that the person smokes. This sequential enhancement of confidence and belief is what Bayes' theorem provides, and it is applied as Bayesian Optimisation in machine learning and artificial neural networks when searching for a global minimum or maximum.
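The arithmetic above takes only a few lines of Python to check (the numbers are the illustrative ones from the example, not real clinical data):

```python
# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
p_cancer = 0.05               # P(A): prior prevalence of cancer
p_smoker = 0.10               # P(B): marginal likelihood (prevalence of smoking)
p_smoker_given_cancer = 0.20  # P(B|A): proportion of cancer patients who smoke

p_cancer_given_smoker = p_smoker_given_cancer * p_cancer / p_smoker
print(round(p_cancer_given_smoker, 4))  # 0.1, i.e. double the 5% prior
```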

While applying Bayesian Optimisation there are multiple elements to consider. I have created a notebook containing Python code to operationalise it; the link to the code is provided at the end of this article.

The Objective Function -

Random search (random parameter combinations) and grid search attempt to maximise or minimise an objective function f(x). However, the optimisation remains largely a matter of chance: neither method takes previous results - the 'new evidence' illustrated above for Bayes' rule - into account to improve its next choice.
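As a contrast with what follows, here is a minimal random-search sketch over a hypothetical one-dimensional objective (the quadratic objective and search range are assumptions chosen for illustration):

```python
import random

def objective(x):
    """A hypothetical expensive objective; true minimum at x = 2."""
    return (x - 2.0) ** 2

random.seed(0)  # for reproducibility
best_x, best_y = None, float("inf")
for _ in range(100):
    x = random.uniform(-10.0, 10.0)  # each trial ignores all previous results
    y = objective(x)
    if y < best_y:
        best_x, best_y = x, y

print(best_x, best_y)  # close to x = 2 only by sheer volume of samples
```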

The Surrogate Function -

Thus, Bayesian Optimisation (BO) takes previous results into account to construct a probability model of the objective function, called the surrogate function. The surrogate is updated sequentially as new results arrive, and the algorithm uses it to select the next values to evaluate. The surrogate function is computationally much cheaper to evaluate than the objective function itself.

The goal (as will be seen in the GitHub notebook linked below) is to minimise the number of times we evaluate the objective function directly, spending our limited resources on the surrogate function instead to choose better values for the next evaluation.
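A bare-bones sketch of that loop, with a simple polynomial fit standing in for the surrogate (real libraries use Gaussian Processes or TPE instead; the objective function and search range are assumptions for illustration):

```python
import numpy as np

def objective(x):
    """The expensive function we want to evaluate as rarely as possible."""
    return np.sin(3 * x) + 0.5 * x ** 2

# A few initial (expensive) evaluations
xs = list(np.linspace(-2.0, 2.0, 4))
ys = [objective(x) for x in xs]

candidates = np.linspace(-2.0, 2.0, 401)
for _ in range(6):
    # Fit a cheap surrogate to every result seen so far
    coeffs = np.polyfit(xs, ys, deg=min(4, len(xs) - 1))
    surrogate = np.poly1d(coeffs)
    # Pick the candidate the surrogate predicts to be best...
    x_next = candidates[np.argmin(surrogate(candidates))]
    # ...and spend one real evaluation there
    xs.append(float(x_next))
    ys.append(objective(x_next))

print(min(zip(ys, xs)))  # the best (value, location) pair found
```

Only 10 objective evaluations are spent in total; the hundreds of surrogate evaluations over the candidate grid are cheap.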

Probability Model -

The surrogate function thus leads to the probability model, which can be built with several methods, e.g. a Gaussian Process, a Tree-structured Parzen Estimator or Random Forest regression; this in turn leads to the domain.
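As one concrete option from that list, the Gaussian Process posterior has a closed form; a minimal numpy sketch with an RBF kernel follows (the kernel length scale, noise level and training data are assumptions for illustration):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_test."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha
    v = np.linalg.solve(K, K_s)
    cov = K_ss - K_s.T @ v
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

x_train = np.array([-1.0, 0.0, 1.0])
y_train = np.sin(x_train)
mean, std = gp_posterior(x_train, y_train, np.array([0.0, 2.0]))
print(mean, std)  # near-certain at an observed point, uncertain far away
```

That uncertainty estimate is exactly what the acquisition functions below consume.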

Domain -

To illustrate these concepts I have used the Hyperopt parameter domain. Hyperopt is an open-source Python library for Bayesian Optimisation (BO) that implements SMBO (Sequential Model-Based Optimisation). It brings together the search space to sample from, the objective function, and the surrogate and selection functions, all neatly packaged in the Tree-structured Parzen Estimator. These effectively work in conjunction with gradient boosting techniques to select the values with the highest Probability of Improvement (PI) or Expected Improvement (EI).
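Expected Improvement itself is a short formula. For a minimisation problem, assuming the surrogate returns a Gaussian predictive mean and standard deviation at a candidate point, a standard-library-only sketch looks like this (the input numbers are made up for illustration):

```python
import math

def expected_improvement(mu, sigma, f_best):
    """EI for minimisation: the expected amount by which a candidate,
    predicted as N(mu, sigma^2) by the surrogate, beats the best-so-far
    value f_best."""
    if sigma <= 0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (f_best - mu) * cdf + sigma * pdf

# A point predicted well below the current best scores higher...
print(expected_improvement(mu=0.2, sigma=0.1, f_best=0.5))
# ...than one predicted right at it, which is valued only for its uncertainty
print(expected_improvement(mu=0.5, sigma=0.1, f_best=0.5))
```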

So let's look at the GitHub notebook, which can be accessed by clicking here, wherein I first construct a probability model using a surrogate function of the objective function itself. The model gets updated as new information is collected at the values where PI/EI is highest; over time, this leads to a more accurate representation of the objective function.

A few of the reference links I used for understanding this:

https://papers.nips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf

http://proceedings.mlr.press/v28/bergstra13.pdf

https://static.sigopt.com/b/20a144d208ef255d3b981ce419667ec25d8412e2/static/pdf/SigOpt_Bayesian_Optimization_Primer.pdf

https://www.cs.ox.ac.uk/people/nando.defreitas/publications/BayesOptLoop.pdf

