Generative Learning Algorithms
Hey everyone! I’m Pratyush Singh. I’ve put together this article as a study exercise to both solidify my understanding of Generative Learning Algorithms and share what I learn along the way. It briefly touches upon discriminative vs. generative learning, Gaussian Discriminant Analysis, Naïve Bayes, and the multivariate Gaussian distribution.
Algorithms that try to learn P(Y|X), the probability of an output Y given features X, are called discriminative learning algorithms. (If you're unfamiliar with conditional probabilities, this short video explains them well: https://www.youtube.com/watch?v=_IgyaD7vOOA.) In classification problems, discriminative algorithms aim to find a decision boundary between classes in a d-dimensional space (where d is the number of features), and a new data point is classified based on which side of the boundary it falls on.

Generative learning algorithms, on the other hand, try to model P(X|Y) and P(Y). They learn the distribution of features for each class, then classify a new data point by comparing how well its features match the learned distribution of each class. After modeling P(X|Y) and P(Y) (called the prior probability), the algorithm uses Bayes' rule to derive the posterior probability of Y given X.
If you're not familiar with Bayes' rule, this video provides a simple and intuitive explanation:https://www.youtube.com/watch?v=HZGCoVF3YvM&t=8s
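In symbols, Bayes' rule combines exactly the two quantities a generative model learns:

$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}, \qquad P(X) = \sum_{y} P(X \mid Y = y)\,P(Y = y)$$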
Gaussian Discriminant Analysis
We can use Gaussian Discriminant Analysis when our features X are continuous and real-valued. GDA assumes the features X follow a multivariate normal distribution (explained at the end of the article; go through that section first if you're not familiar with the term). The model, written here for a binary label y ∈ {0, 1}, is:
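$$y \sim \mathrm{Bernoulli}(\varphi)$$
$$x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma_0)$$
$$x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma_1)$$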
Here, φ is the prior probability P(y = 1), μ₀ and μ₁ are the mean vectors of the two classes, and Σ₀ and Σ₁ are their covariance matrices. Depending on how the covariance matrices are treated, we get two variants:
1. Linear Discriminant Analysis (LDA): all output classes share the same covariance matrix Σ.
Implication: the decision boundary between any two classes is linear (a hyperplane), since the quadratic terms cancel when the covariances are equal.
When to use it: when the classes have roughly similar spreads and shapes, or when data is limited, since fewer parameters need to be estimated.
2. Quadratic Discriminant Analysis (QDA): each output class has its own covariance matrix Σₖ.
Implication: the decision boundary is quadratic in the features, which is more flexible than LDA's linear boundary.
When to use it: when the classes clearly differ in their covariance structure and there is enough data to reliably estimate a separate covariance matrix per class. See the sketch after this list for how such a model can be fit in practice.
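As a rough sketch of how the pieces fit together, here is a minimal NumPy implementation of binary GDA with a shared covariance matrix (the LDA case). The function names `fit_gda` and `predict_gda` are my own, not from any library:

```python
import numpy as np

def fit_gda(X, y):
    """Fit a binary GDA model with a shared covariance (the LDA case).
    X: (n, d) feature matrix; y: (n,) labels in {0, 1}."""
    phi = y.mean()                              # prior P(y = 1)
    mu0 = X[y == 0].mean(axis=0)                # class-0 mean vector
    mu1 = X[y == 1].mean(axis=0)                # class-1 mean vector
    # Pooled covariance: average outer products of class-centered points
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    sigma = centered.T @ centered / len(y)
    return phi, mu0, mu1, sigma

def predict_gda(X, phi, mu0, mu1, sigma):
    """Classify by comparing class-conditional log-densities plus log-priors."""
    inv = np.linalg.inv(sigma)
    def log_score(mu, prior):
        diff = X - mu
        # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed row-wise
        return -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff) + np.log(prior)
    return (log_score(mu1, phi) > log_score(mu0, 1 - phi)).astype(int)
```

To get QDA instead, one would estimate a separate covariance matrix per class from that class's centered points and use it in the corresponding log-density.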
GDA and Logistic Regression
When would we prefer one model over another? GDA and logistic regression will, in general, give different decision boundaries when trained on the same dataset.
GDA assumes that P(X|Y) follows a multivariate Gaussian distribution. When this assumption is correct, or approximately so, GDA tends to outperform logistic regression and often requires less data to achieve good performance (i.e., it is more data-efficient). In fact, the GDA assumptions imply that the posterior P(Y=1|X) has the form of a logistic function of X, but the converse does not hold, so logistic regression makes strictly weaker assumptions. Because it does not assume any specific distribution for P(X|Y), logistic regression is more robust in practice: when the true class-conditional distributions are not Gaussian, it typically performs better.
Naïve Bayes
When our features X take discrete values and we assume they are conditionally independent of each other given Y, we arrive at the Naïve Bayes classifier. The assumption that the features are conditionally independent given Y is called the Naïve Bayes assumption. Despite this strong assumption (which is rarely exactly true), the classifier performs well on many real-world problems.
We can even use this same algorithm on continuous features by discretizing them into bins.
Naïve Bayes is based on Bayes' Theorem:
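Writing the feature vector as x = (x₁, …, x_d):

$$P(Y \mid x_1, \ldots, x_d) = \frac{P(x_1, \ldots, x_d \mid Y)\, P(Y)}{P(x_1, \ldots, x_d)}$$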
Under the Naïve Bayes assumption (feature conditional independence given the class label Y):
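$$P(x_1, \ldots, x_d \mid Y) = \prod_{i=1}^{d} P(x_i \mid Y)$$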
So the posterior becomes:
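$$P(Y \mid x_1, \ldots, x_d) \propto P(Y) \prod_{i=1}^{d} P(x_i \mid Y)$$

(The evidence term P(x₁, …, x_d) in the denominator is the same for every class, so it can be dropped when comparing classes.)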
This is the core formula used in the classifier. You typically pick the class Y having the maximum probability:
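$$\hat{y} = \arg\max_{y}\; P(Y = y) \prod_{i=1}^{d} P(x_i \mid Y = y)$$

As a rough sketch of how this looks in code, here is a minimal Naïve Bayes classifier for binary 0/1 features with Laplace (add-one) smoothing; the function names are my own, not from any library:

```python
import numpy as np

def fit_naive_bayes(X, y, n_classes):
    """Estimate class priors P(Y = c) and conditionals P(x_i = 1 | Y = c).
    X: (n, d) binary feature matrix; y: (n,) integer class labels."""
    priors = np.array([(y == c).mean() for c in range(n_classes)])
    # Add-one (Laplace) smoothing avoids zero probabilities for unseen values
    cond = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
                     for c in range(n_classes)])
    return priors, cond

def predict_naive_bayes(X, priors, cond):
    """Pick the class maximizing log P(Y = c) + sum_i log P(x_i | Y = c)."""
    log_like = X @ np.log(cond).T + (1 - X) @ np.log(1 - cond).T
    return np.argmax(np.log(priors) + log_like, axis=1)
```

Working in log space avoids numerical underflow from multiplying many small probabilities. For continuous features, one could first bin them (e.g. with np.digitize) and then treat the bin indices as discrete values, as mentioned above.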
Multivariate Gaussian Distribution
The multivariate Gaussian distribution is the generalization of the normal (Gaussian) distribution to multiple variables. Instead of a single variable, we now have a vector of variables (features, in the above case). The multivariate Gaussian describes the joint distribution of all these variables together, i.e., they form a normal distribution in d dimensions (where d is the number of variables).
This distribution is characterized by two parameters:
I. Mean Vector (d-dimensional): represents the expected value (mean) of each variable.
II. Covariance Matrix (d × d): describes how the variables vary individually (variances on the diagonal) and how they relate to each other (covariances off the diagonal). The shape and size of the distribution are determined by this covariance matrix.
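In terms of these two parameters, the density of a d-dimensional Gaussian is:

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu)\right)$$

where |Σ| denotes the determinant of Σ.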
Correlation vs Covariance
Both correlation and covariance measure the linear relationship between two variables. Covariance measures the direction of the linear relationship, whereas correlation measures both its strength and its direction.
A positive covariance means the variables move in the same direction (if one increases, so does the other), whereas a negative covariance means they move in opposite directions. Covariance ranges from −∞ to ∞, whereas correlation lies between −1 and 1, where 1 means a perfect positive relationship, −1 a perfect negative relationship, and 0 no linear relationship. Correlation can also be thought of as a normalized version of covariance.
I. Covariance between two variables X and Y:
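$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$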
For each data point, we multiply its deviation from the mean of X by its deviation from the mean of Y, and then average over all such products.
II. Correlation between two variables X and Y:
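$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\, \sigma_Y}$$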
Correlation standardizes covariance by dividing it by the product of the two variables' standard deviations.
III. Suppose we have d random variables (features); then the covariance matrix is:
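$$\Sigma = \begin{pmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_d) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2, X_d) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_d, X_1) & \mathrm{Cov}(X_d, X_2) & \cdots & \mathrm{Var}(X_d) \end{pmatrix}$$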
The covariance matrix captures the variances of each variable along its diagonal and the covariances between every pair of variables in the off-diagonal entries, summarizing how all variables vary together.
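To see this concretely, here is a small NumPy sketch on synthetic data (the data here is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))        # 500 samples of d = 3 features
X[:, 1] += 0.8 * X[:, 0]             # make features 0 and 1 correlated

# rowvar=False tells NumPy that columns (not rows) are the variables
cov = np.cov(X, rowvar=False)        # (3, 3) covariance matrix
corr = np.corrcoef(X, rowvar=False)  # (3, 3) correlation matrix
```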