Bayesian Likelihoods as Directional Derivatives (Gradient)


In my upcoming presentation about Algebird for the Scala meetup I am going to talk about Aggregators. They are the quintessential stateful component: they start with an initial state that gets updated by each incoming event, message, or action. Counters are the simplest example, but pretty much any kind of metric or statistical measure can be represented by an aggregator (including probabilistic data structures like Bloom filters, HyperLogLog, etc.). Aggregators can also represent business components that need to maintain state and process events or requests, which makes them a perfect fit for streaming or reactive architectures. In functional programming they are called folds, and we have all used some version of them. Apache Beam has its own flavor called CombineFn, nicely visualized in this diagram:
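Though the talk itself is about Scala and Algebird, a minimal sketch of the aggregator idea in Python (all names here are illustrative, not the Algebird or Beam API) could look like:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, TypeVar

S = TypeVar("S")  # state type
E = TypeVar("E")  # event type

@dataclass
class Aggregator:
    """A fold: an initial state plus an update applied per incoming event."""
    init: S
    update: Callable[[S, E], S]

    def fold(self, events: Iterable[E]) -> S:
        state = self.init
        for e in events:
            state = self.update(state, e)
        return state

# The simplest example from the text: a counter.
counter = Aggregator(init=0, update=lambda n, _: n + 1)
print(counter.fold(["click", "view", "click"]))  # 3
```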

Bayes' Rule is just another example of such an aggregator:

The prior is the current state, and it gets updated into the posterior whenever new data is processed. The precise math for the update is:

P(θ | Data) = P(Data | θ) · P(θ) / P(Data)

where θ is one of multiple hypotheses. P(Data | θ) is the probability of the data assuming the hypothesis θ, and is called the likelihood. Note that Bayes' Rule is ideal for streaming data, not only for batch processing: data points can be processed as they arrive, one or more at a time, and the posterior of one iteration becomes the prior of the next.
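As a hedged sketch of this streaming update (the two coin hypotheses and their head-probabilities are made up for illustration), the posterior over hypotheses is the aggregator state and each flip folds in a prior × likelihood step, renormalized by P(Data):

```python
hypotheses = {"fair": 0.5, "biased": 0.9}   # P(heads | θ), illustrative values
prior = {"fair": 0.5, "biased": 0.5}        # uniform prior over θ

def bayes_update(posterior, flip):
    # multiply each hypothesis by the likelihood of the observed flip
    unnorm = {
        theta: p * (hypotheses[theta] if flip == "H" else 1 - hypotheses[theta])
        for theta, p in posterior.items()
    }
    z = sum(unnorm.values())  # P(Data), the normalizing constant
    return {theta: p / z for theta, p in unnorm.items()}

posterior = prior
for flip in "HHHH":  # the posterior of one step is the prior of the next
    posterior = bayes_update(posterior, flip)
print(posterior)     # mass shifts toward the "biased" hypothesis
```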

Another scenario that fits an aggregator is updating the value of a function when its input changes: if the current value is f(a) and we move the input by a small quantity h, we end up with f(a + h). Such incremental updates are used in the Gradient Descent algorithm, Backpropagation, and many other optimization algorithms. If our function is differentiable and we know the derivative at a, we can simplify the computation by using a Taylor approximation for the updated value:

f(a + h) ≈ f(a) + f′(a)h
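A quick numerical check of the approximation, using f = exp as a stand-in example (chosen because it is its own derivative, so f′(a) is free to compute):

```python
import math

a, h = 1.0, 0.01
exact = math.exp(a + h)
approx = math.exp(a) + math.exp(a) * h  # f(a) + f'(a)h
print(exact, approx)  # the two agree up to a term of order h²
```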

For a given aggregator it is ideal to have a computationally efficient way to update the state whenever a new event needs to be processed. Bayes' Rule and the Taylor approximation both give simple and efficient update mechanisms. But is there a deeper connection between the two? The likelihood and the derivative are both factors used in updates. Can we interpret the likelihood as a derivative? We will show that, in fact, it can be done in a natural way.

When processing multiple trial results using Bayes' Rule it is common to represent the sequence as a tree, like in this example for a coin flip:

The tree can have any depth and at each node we have one edge for each possible data point (we are limiting the discussion to the discrete case):

As we travel from the root to one of the outputs we multiply the likelihoods on the edges.
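That root-to-leaf multiplication can be sketched as follows (the tree is reduced here to a single repeated edge table, for a coin whose likelihoods are made up for illustration):

```python
from functools import reduce

# edge likelihoods for each possible data point at a node
edge = {"H": 0.6, "T": 0.4}  # illustrative biased coin

# travelling root → leaf along the path H, H, T multiplies the edge likelihoods
path = ["H", "H", "T"]
p_path = reduce(lambda p, step: p * edge[step], path, 1.0)
print(p_path)  # product 0.6 · 0.6 · 0.4
```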

Now we are going to bring in some calculus on graphs. If we have a real-valued function f defined on the vertices of a graph, then the directional derivative at a vertex a is defined for each edge (a, b) as f(b) − f(a). Oliver Knill from Harvard has a series of articles on the subject, and a site dedicated to it (dubbed quantum calculus). You can in fact transfer most calculus and differential geometry results into the graph realm. One of my favorite articles, which also contains the directional derivative definition, is this, while this is a nice introduction that is more accessible to non-mathematicians.
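The definition is simple enough to state directly in code; the graph and the values of f below are made up for illustration:

```python
# a real-valued function on the vertices of a graph
f = {"a": 1.0, "b": 3.0, "c": 0.5}
edges = [("a", "b"), ("a", "c")]

def directional_derivative(f, a, b):
    """Directional derivative of f at vertex a along the edge (a, b)."""
    return f[b] - f[a]

for a, b in edges:
    print((a, b), directional_derivative(f, a, b))
```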

But the directional derivative along an edge is the difference of the values at its two ends, while the likelihoods in a tree diagram get multiplied. Can we reconcile the difference? Yes, we can. If we take the log (or, even better, the negative of the log) of the likelihood, we get to add along the edges instead of multiplying. The log-likelihood and the negative log-likelihood are already used instead of the straight likelihood for computational reasons. Once we do that, we have a directional derivative for each of the data alternatives, and Bayes' Rule becomes similar to the Taylor approximation.
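A small numeric check of that reconciliation: summing the negative logs edge by edge gives the same number as taking the negative log of the multiplied path probability, and each partial sum is the value at the parent plus one additive edge step, mirroring f(b) = f(a) + (f(b) − f(a)). The likelihood values are made up for illustration:

```python
import math

likelihoods = [0.5, 0.9, 0.25]  # edge likelihoods along a root-to-leaf path

product = math.prod(likelihoods)                  # multiplied along edges
nll_sum = sum(-math.log(p) for p in likelihoods)  # added along edges

# the additive and multiplicative views agree
assert math.isclose(nll_sum, -math.log(product))

# walking the tree: value at a node = value at its parent + the edge step
cumulative = 0.0
for p in likelihoods:
    cumulative += -math.log(p)  # f(b) = f(a) + Df(a, b)
print(cumulative)
```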

We can say that:

−log(Bayes) ∼ Taylor

Gradient Descent (the stochastic version is more suitable for streaming than the batch version) and Bayes' Rule both use derivatives to update outputs or inputs, and they are closer than they appear. Some code proof based on aggregators would be nice, wouldn't it?
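As a sketch of that closeness on the Gradient Descent side, stochastic gradient descent is itself an aggregator whose state is the parameter, with each data point folding in one derivative-driven step (the loss (θ − x)²/2 and the learning rate are illustrative):

```python
def sgd_update(theta, x, lr=0.1):
    grad = theta - x       # derivative of the loss (θ - x)² / 2 w.r.t. θ
    return theta - lr * grad

theta = 0.0
for x in [1.0, 1.0, 1.0, 1.0]:  # stream of data points
    theta = sgd_update(theta, x)
print(theta)  # moves from 0.0 toward the data at 1.0
```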

Note that the typical interpretation of the likelihood in Maximum Likelihood Estimation (MLE) or the Maximum A Posteriori (MAP) approximation is that of a fitness measure, so the negative log-likelihood plays the role of a cost function. As is the case with Bayes' Rule, there are often multiple points of view for the quantities involved.


More articles by Constantin G
