Deep Probabilistic Programming
This week, one of our Data Science professors, Thomas Hamelryck, introduced us to Deep Probabilistic Programming (Deep PP). Until recently, I had not heard much about this topic, but it has caught my interest now. Deep PP aims to combine the advantages of Deep Learning and Probabilistic Programming (PP). So what is it about?
Bayesian and frequentist statistics
First, let us consider two different approaches to probability in statistics: frequentist and Bayesian. For frequentists, the probability of an event is its long-run proportion. For instance, when you throw a fair coin, the proportion of heads will approach 50% if you just throw it often enough. We use this approach when we compute confidence intervals and conduct hypothesis tests. However, we are often more interested in the probability of a particular event in a more concrete setting, and this is where Bayesian statistics becomes interesting. A Bayesian statistician begins with a prior distribution, that is, a probability distribution reflecting the state of knowledge before collecting any data. "In Bayesian statistics, probability expresses a degree of belief in an event, which can change as new information is gathered, rather than a fixed value based upon frequency or propensity" (Wikipedia). We often have such prior knowledge from previous experience. We then combine the prior and the likelihood of the data to obtain the posterior probability of the event.
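As a concrete (made-up) example of this prior-to-posterior updating, here is a minimal coin-flip sketch in Python. The numbers are invented for illustration: a Beta(2, 2) prior encodes a mild belief that the coin is roughly fair, and because the Beta distribution is conjugate to the binomial likelihood, the posterior follows in closed form by simply adding the observed counts.

```python
# Prior belief Beta(a, b) over the heads probability p.
a, b = 2, 2            # mild prior belief that the coin is roughly fair
heads, tails = 7, 3    # observed data: 7 heads in 10 flips

# Beta is conjugate to the binomial, so the posterior is again a Beta
# whose parameters just add the observed counts.
a_post, b_post = a + heads, b + tails      # posterior: Beta(9, 5)

prior_mean = a / (a + b)                   # 0.5
post_mean = a_post / (a_post + b_post)     # 9/14, about 0.643
print(prior_mean, post_mean)
```

Note how the data pull our belief about the heads probability away from the prior's 0.5 toward the observed frequency of 0.7, without ever discarding the prior entirely.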
Machine Learning & Probabilistic Programming
When it comes to Machine Learning algorithms, this view of probability becomes crucial. In traditional ML approaches, our objective is to find the model that best describes the given data: we feed the data to many models, learn the parameters, and then use the best model to make predictions. However, we do not include any domain knowledge and learn only from the available data.
PP is a tool for statistical modeling that can help with ML tasks because it includes domain knowledge and relies on Bayesian statistics. PP offers a mathematical way to input prior beliefs and assumptions about the data dynamics you are trying to model: assumptions are encoded as prior distributions over the variables of the model. "PP makes it easy for a developer to define probability models and then 'solve' these models automatically. Now, it is a matter of programming that enables a clean separation between modeling and inference" (Cronin, B. 2013), where inference means applying the model to unseen data to assess its performance. All this might sound complicated, but it actually makes many things easier: it can vastly reduce the time and effort associated with implementing new models and understanding data. "Just as high-level programming languages transformed developers' productivity by abstracting away the details of the processor and memory architecture, probabilistic languages promise to free the developer from the complexities of high-performance probabilistic inference" (Cronin, B. 2013). Such a high level of abstraction is a competitive advantage and hence very important in industry.
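To illustrate this separation between modeling and inference, here is a small self-contained sketch in Python. Everything in it is hypothetical: the scenario (estimating a website's average visits per minute from Poisson counts), the exponential prior, and the data are all invented. The point is the structure: the modeler writes down only a prior and a likelihood, while a generic `infer` routine, here a simple grid approximation rather than the sophisticated engines real PP systems use, produces the posterior.

```python
import math

# Model: lam = average visits per minute. Domain knowledge enters through
# the prior; the observations enter through the likelihood.
def prior(lam):
    return math.exp(-lam / 5) / 5        # Exponential(mean 5): rates are modest

def likelihood(lam, counts):
    # Product of Poisson probabilities for the observed counts.
    return math.prod(lam**k * math.exp(-lam) / math.factorial(k) for k in counts)

def infer(prior, likelihood, counts, hi=20.0, n=2000):
    # Generic inference routine: knows nothing about this particular model.
    grid = [(i + 0.5) * hi / n for i in range(n)]
    weights = [prior(lam) * likelihood(lam, counts) for lam in grid]
    z = sum(weights)                     # normalizing constant
    return grid, [w / z for w in weights]

counts = [3, 4, 2, 5, 3]                 # invented per-minute counts
grid, post = infer(prior, likelihood, counts)
post_mean = sum(lam * w for lam, w in zip(grid, post))
print(round(post_mean, 2))               # close to the exact answer 18/5.2 ≈ 3.46
```

Swapping in a different prior or likelihood requires no change to `infer`; that clean split is exactly what probabilistic programming languages industrialize.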
Deep Probabilistic Programming
However, PP based on Bayesian modeling has a major disadvantage compared with traditional ML approaches: it often ends up being (too) computationally intensive. When we apply the model to unseen data (inference), we need the posterior distribution, which we typically approximate by sampling. But sampling does not scale to massive data sets, and this is where Deep Learning becomes important. Researchers have developed automatic differentiation variational inference (ADVI). Using this method, the scientist only needs to provide a probabilistic model and a dataset, nothing else. "ADVI automatically derives an efficient variational inference algorithm, freeing the scientist to refine and explore many models" (Kucukelbir, A. et al. 2017). "Instead of drawing samples from the posterior, these algorithms instead fit a distribution (e.g. normal) to the posterior turning a sampling problem into an optimization problem" (PyData). The Python library Edward, for instance, implements ADVI and makes it easy for data scientists to use. The neural networks in deep learning are extremely good non-linear function approximators and representation learners; they help run the inference algorithm and produce meaningful results, even on big data.
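To make the "sampling problem into an optimization problem" idea concrete, here is a toy variational-inference sketch in plain Python. It is only an illustration of the principle, not how Edward or any real ADVI implementation works: the model and data are invented, and finite-difference gradients stand in for automatic differentiation. We fit a normal q(theta) = N(m, s^2) to the posterior by maximizing the evidence lower bound (ELBO), using the reparameterization theta = m + s * eps.

```python
import math, random

# Toy model: theta ~ N(0, 1) prior, observations y_i ~ N(theta, 1).
# The exact posterior is Gaussian, so we can check the result.
data = [1.2, 0.8, 1.5, 0.9, 1.1]

def log_joint(theta):
    # log prior + log likelihood, additive constants dropped
    return -0.5 * theta**2 + sum(-0.5 * (y - theta)**2 for y in data)

# Variational family q(theta) = N(m, s^2), reparameterized as m + s*eps.
random.seed(0)
eps = [random.gauss(0, 1) for _ in range(200)]   # fixed noise draws

def elbo(m, log_s):
    s = math.exp(log_s)
    expected_log_joint = sum(log_joint(m + s * e) for e in eps) / len(eps)
    entropy = log_s          # entropy of N(m, s^2), up to a constant
    return expected_log_joint + entropy

# Maximize the ELBO by gradient ascent; finite differences stand in for
# the automatic differentiation a real ADVI implementation would use.
m, log_s, lr, h = 0.0, 0.0, 0.05, 1e-4
for _ in range(500):
    gm = (elbo(m + h, log_s) - elbo(m - h, log_s)) / (2 * h)
    gs = (elbo(m, log_s + h) - elbo(m, log_s - h)) / (2 * h)
    m, log_s = m + lr * gm, log_s + lr * gs

# The exact posterior is N(sum(y)/(n+1), 1/(n+1)); m and exp(log_s)
# should land close to its mean and standard deviation.
print(m, math.exp(log_s))
```

No posterior samples are ever drawn; we just optimize two numbers, m and log_s, which is what lets variational methods scale where sampling cannot.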
I think bridging the gap between Probabilistic Programming and Deep Learning is a very exciting field of study and I cannot wait to learn more!
References
Cronin, B. (2013), "What is probabilistic programming?", Available at: https://www.oreilly.com/ideas/probabilistic-programming, Accessed on 23.02.2019
Hamelryck, T. (2019), "Probabilistic programming: A new paradigm in machine learning", lecture of "Introduction to Data Science" at University of Copenhagen
Kucukelbir, A. et al. (2017), "Automatic Differentiation Variational Inference", Journal of Machine Learning Research 18 (2017) 1-45, Available at: http://www.jmlr.org/papers/volume18/16-107/16-107.pdf, Accessed on 23.02.2019
PyData London (2017), Available at: https://pydata.org/london2017/schedule/presentation/15/, Accessed on 23.02.2019
Wikipedia, "Bayesian statistics", Available at: https://en.wikipedia.org/wiki/Bayesian_statistics, Accessed on 23.02.2019