Bayesian Likelihoods as Directional Derivatives (Gradient)


In my upcoming presentation about Algebird for the Scala meetup I am going to talk about Aggregators. They are the quintessential stateful component: they start with an initial state that gets updated by each incoming event, message, or action. Counters are the simplest example, but pretty much any kind of metric or statistical measure can be represented by an aggregator (including probabilistic data structures like Bloom filters, HyperLogLog, etc.). Aggregators can also represent business components that need to maintain state and process events or requests, which makes them a perfect fit for streaming or reactive architectures. In functional programming they are called folds, and we have all used some version of them. Apache Beam has its own flavor called CombineFn, nicely visualized in this diagram:
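Though the talk itself is about Scala and Algebird, a minimal sketch of the aggregator idea in Python (all names here are illustrative, not the Algebird or Beam API) could look like:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, TypeVar

S = TypeVar("S")  # state type
E = TypeVar("E")  # event type

@dataclass
class Aggregator:
    """A fold: an initial state plus an update applied per incoming event."""
    init: S
    update: Callable[[S, E], S]

    def fold(self, events: Iterable[E]) -> S:
        state = self.init
        for e in events:
            state = self.update(state, e)
        return state

# The simplest example from the text: a counter.
counter = Aggregator(init=0, update=lambda n, _: n + 1)
print(counter.fold(["click", "view", "click"]))  # 3
```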

Bayes' Rule is just another example of such an aggregator:

The prior is the current state, and it gets updated into the posterior whenever new data is processed. The precise math for the update is:

P(θ | Data) = P(Data | θ) · P(θ) / P(Data)

where θ is one of multiple hypotheses. P(Data | θ) is the probability of the data assuming the hypothesis θ, and is called the likelihood. Note that Bayes' Rule is ideal for streaming data, not only for batch processing: data points can be processed as they arrive, one or more at a time, and the posterior of one iteration becomes the prior of the next.
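As a hedged sketch of this streaming update (the two coin hypotheses and their head-probabilities are made up for illustration), the posterior over hypotheses is the aggregator state and each flip folds in a prior × likelihood step, renormalized by P(Data):

```python
hypotheses = {"fair": 0.5, "biased": 0.9}   # P(heads | θ), illustrative values
prior = {"fair": 0.5, "biased": 0.5}        # uniform prior over θ

def bayes_update(posterior, flip):
    # multiply each hypothesis by the likelihood of the observed flip
    unnorm = {
        theta: p * (hypotheses[theta] if flip == "H" else 1 - hypotheses[theta])
        for theta, p in posterior.items()
    }
    z = sum(unnorm.values())  # P(Data), the normalizing constant
    return {theta: p / z for theta, p in unnorm.items()}

posterior = prior
for flip in "HHHH":  # the posterior of one step is the prior of the next
    posterior = bayes_update(posterior, flip)
print(posterior)     # mass shifts toward the "biased" hypothesis
```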

Another scenario that fits an aggregator is updating the value of a function when its input changes: if the current value is f(a) and we move the input by a small quantity h, we end up with f(a + h). Such incremental updates are used in the Gradient Descent algorithm, Backpropagation, and many other optimization algorithms. If our function is differentiable and we know the derivative at a, we can simplify the computation by using a Taylor approximation for the updated value:

f(a + h) ≈ f(a) + f′(a)h
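A quick numerical check of the approximation, using f = exp as a stand-in example (chosen because it is its own derivative, so f′(a) is free to compute):

```python
import math

a, h = 1.0, 0.01
exact = math.exp(a + h)
approx = math.exp(a) + math.exp(a) * h  # f(a) + f'(a)h
print(exact, approx)  # the two agree up to a term of order h²
```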

For a given aggregator it is ideal to have a computationally efficient way to update the state whenever a new event needs to be processed. Bayes' Rule and the Taylor approximation both give simple and efficient update mechanisms. But is there a deeper connection between the two? The likelihood and the derivative are both factors used in updates. Can we interpret the likelihood as a derivative? We will show that, in fact, it can be done in a natural way.

When processing multiple trial results using Bayes' Rule it is common to represent the sequence as a tree, like in this example for a coin flip:

The tree can have any depth and at each node we have one edge for each possible data point (we are limiting the discussion to the discrete case):

As we travel from the root to one of the outputs we multiply the likelihoods on the edges.
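That root-to-leaf multiplication can be sketched as follows (the tree is reduced here to a single repeated edge table, for a coin whose likelihoods are made up for illustration):

```python
from functools import reduce

# edge likelihoods for each possible data point at a node
edge = {"H": 0.6, "T": 0.4}  # illustrative biased coin

# travelling root → leaf along the path H, H, T multiplies the edge likelihoods
path = ["H", "H", "T"]
p_path = reduce(lambda p, step: p * edge[step], path, 1.0)
print(p_path)  # product 0.6 · 0.6 · 0.4
```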

Now we are going to bring in some calculus on graphs. If we have a real-valued function f defined on the vertices of a graph, then the directional derivative at a vertex a is defined for each edge (a, b) as f(b) − f(a). Oliver Knill from Harvard has a series of articles on the subject, and a site dedicated to it (dubbed quantum calculus). You can in fact transfer most calculus and differential geometry results into the graph realm. One of my favorite articles, which also contains the directional derivative definition, is this, while this is a nice introduction that is more accessible to non-mathematicians.
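The definition is simple enough to state directly in code; the graph and the values of f below are made up for illustration:

```python
# a real-valued function on the vertices of a graph
f = {"a": 1.0, "b": 3.0, "c": 0.5}
edges = [("a", "b"), ("a", "c")]

def directional_derivative(f, a, b):
    """Directional derivative of f at vertex a along the edge (a, b)."""
    return f[b] - f[a]

for a, b in edges:
    print((a, b), directional_derivative(f, a, b))
```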

But the directional derivative along an edge is the difference of the values at its two ends, while the likelihoods in a tree diagram get multiplied. Can we reconcile the difference? Yes, we can. If we take the log (or, even better, the negative of the log) of the likelihood, we get to add along the edges instead of multiplying. The log-likelihood and the negative log-likelihood are already used instead of the straight likelihood for computational reasons. Once we do that, we have a directional derivative for each of the data alternatives, and Bayes' Rule becomes similar to the Taylor approximation.
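A small numeric check of that reconciliation: summing the negative logs edge by edge gives the same number as taking the negative log of the multiplied path probability, and each partial sum is the value at the parent plus one additive edge step, mirroring f(b) = f(a) + (f(b) − f(a)). The likelihood values are made up for illustration:

```python
import math

likelihoods = [0.5, 0.9, 0.25]  # edge likelihoods along a root-to-leaf path

product = math.prod(likelihoods)                  # multiplied along edges
nll_sum = sum(-math.log(p) for p in likelihoods)  # added along edges

# the additive and multiplicative views agree
assert math.isclose(nll_sum, -math.log(product))

# walking the tree: value at a node = value at its parent + the edge step
cumulative = 0.0
for p in likelihoods:
    cumulative += -math.log(p)  # f(b) = f(a) + Df(a, b)
print(cumulative)
```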

We can say that:

−log(Bayes) ∼ Taylor

Gradient Descent (the stochastic version is more suitable for streaming than the batch version) and Bayes' Rule both use derivatives to update outputs or inputs, and they are closer than they appear. Some code proof based on aggregators would be nice, wouldn't it?
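As a sketch of that closeness on the Gradient Descent side, stochastic gradient descent is itself an aggregator whose state is the parameter, with each data point folding in one derivative-driven step (the loss (θ − x)²/2 and the learning rate are illustrative):

```python
def sgd_update(theta, x, lr=0.1):
    grad = theta - x       # derivative of the loss (θ - x)² / 2 w.r.t. θ
    return theta - lr * grad

theta = 0.0
for x in [1.0, 1.0, 1.0, 1.0]:  # stream of data points
    theta = sgd_update(theta, x)
print(theta)  # moves from 0.0 toward the data at 1.0
```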

Note that the typical interpretation of the likelihood in Maximum Likelihood Estimation (MLE) or the Maximum A Posteriori (MAP) approximation is that of a fitness measure, so the negative log-likelihood plays the role of a cost function. As is the case with Bayes' Rule, there are often multiple points of view for the quantities involved.


More articles by Constantin G
