Generative Learning Algorithms

Hey everyone! I’m Pratyush Singh. I’ve put together this article as a study exercise, both to solidify my understanding of Generative Learning Algorithms and to share what I learn along the way. This article briefly touches upon:

  • Comparison between Discriminative and Generative Learning Algorithms
  • Correlation and Covariance
  • The Multivariate Gaussian Distribution
  • Gaussian Discriminant Analysis (including Linear Discriminant Analysis and Quadratic Discriminant Analysis)
  • The Naïve Bayes Classifier

Algorithms that try to learn P(Y|X), the probability of an output Y given features X, are called discriminative learning algorithms. If you're unfamiliar with conditional probabilities, this short video explains them well: https://www.youtube.com/watch?v=_IgyaD7vOOA. In classification problems, discriminative algorithms aim to find a decision boundary between classes in a d-dimensional space (where d is the number of features). A new data point is classified based on which side of the boundary it falls on.

Generative learning algorithms, on the other hand, try to model P(X|Y) and P(Y). They learn the distribution of features for each class, then classify a new data point by comparing how well its features match the learned distribution of each class. After modeling P(X|Y) and P(Y) (called the prior probability), the algorithm uses Bayes' rule to derive the posterior probability of Y given X:

P(Y | X) = P(X | Y) · P(Y) / P(X)

If you're not familiar with Bayes' rule, this video provides a simple and intuitive explanation: https://www.youtube.com/watch?v=HZGCoVF3YvM&t=8s
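To make Bayes' rule concrete, here is a quick numeric sketch in Python; every probability below is invented purely for illustration:

```python
p_y1 = 0.3                       # prior P(Y = 1), made up
p_x_given_y1 = 0.8               # likelihood P(X | Y = 1), made up
p_x_given_y0 = 0.1               # likelihood P(X | Y = 0), made up

# Total probability: P(X) = P(X|Y=1) P(Y=1) + P(X|Y=0) P(Y=0)
p_x = p_x_given_y1 * p_y1 + p_x_given_y0 * (1 - p_y1)

# Bayes' rule: P(Y=1 | X) = P(X | Y=1) P(Y=1) / P(X)
print(p_x_given_y1 * p_y1 / p_x)  # ~0.774
```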

Gaussian Discriminant Analysis

We can use Gaussian Discriminant Analysis (GDA) when our features X are continuous and real-valued. GDA assumes the features X follow a multivariate normal distribution (this is explained at the end of the article; go through that section first if you're not familiar with the term). The model is:

Y ~ Bernoulli(ϕ)
X | Y = 0 ~ N(μ₀, Σ)
X | Y = 1 ~ N(μ₁, Σ)

Here:

  • ~ means "is distributed as."
  • ϕ (phi) is the probability parameter of the Bernoulli distribution, i.e. the class prior probability (e.g. P(Y = 1) in a binary classification problem).
  • μ₀, μ₁ are the mean vectors of the Gaussian distributions for each class.
  • Σ is the covariance matrix (or matrices, depending on the model type; I have explained the concept of covariance at the end of the article).
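To make these parameters concrete, here is a minimal NumPy sketch of the maximum-likelihood estimates for binary GDA with a shared covariance matrix (the function name fit_gda and the array shapes are my own choices for illustration):

```python
import numpy as np

def fit_gda(X, y):
    # X: (n, d) feature matrix; y: (n,) array of 0/1 labels.
    phi = y.mean()                        # class prior P(Y = 1)
    mu0 = X[y == 0].mean(axis=0)          # mean vector of class 0
    mu1 = X[y == 1].mean(axis=0)          # mean vector of class 1
    # Shared (pooled) covariance: center each sample by its own class mean.
    centered = np.where((y == 1)[:, None], X - mu1, X - mu0)
    sigma = centered.T @ centered / len(y)
    return phi, mu0, mu1, sigma
```

To classify a new point, you would evaluate the two Gaussian densities, weight them by ϕ and 1 − ϕ, and pick the class with the larger posterior.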

1. Linear Discriminant Analysis:

All output classes share the same covariance matrix.

Implication:

  • The decision boundary is linear in the feature space.
  • This simplifies the computation and reduces the number of parameters.
  • More stable but less flexible.

When to use it:

  • When you have limited data.
  • When you suspect the classes have similar spread and orientation in feature space.

2. Quadratic Discriminant Analysis:

Each output class has its own covariance matrix.

Implication:

  • The decision boundary is quadratic in the feature space, and the model requires more parameters to estimate than LDA.
  • More flexible and can capture curved boundaries, but with a higher risk of overfitting.

When to use it:

  • When you have enough data to estimate separate covariance matrices.
  • When classes have very different spreads or orientations in the feature space.
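For a hands-on comparison of the two variants, here is a short sketch using scikit-learn's LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis on a synthetic dataset (scikit-learn and the synthetic data are my additions, not part of the theory above):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)      # shared covariance
qda = QuadraticDiscriminantAnalysis().fit(X_tr, y_tr)   # per-class covariance
print("LDA accuracy:", lda.score(X_te, y_te))
print("QDA accuracy:", qda.score(X_te, y_te))
```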

GDA and Logistic Regression:

When would we prefer one model over another? GDA and logistic regression will, in general, give different decision boundaries when trained on the same dataset.

GDA assumes the distribution of P(X|Y) to be a multivariate Gaussian. When this assumption is correct, or approximately so, GDA tends to outperform logistic regression, often requiring less data to achieve good performance (i.e., it is more data-efficient). In contrast, logistic regression is more robust in practice because it does not assume any specific distribution for P(X|Y). Therefore, when the true underlying distributions are not Gaussian, logistic regression typically performs better.
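As a rough illustration of the data-efficiency claim, here is a small experiment sketch on synthetic data that deliberately satisfies the GDA assumption (Gaussian classes with a shared covariance), so GDA, here in its LDA form, should have the advantage; exact numbers will vary with the seed:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 40  # deliberately small training set
X0 = rng.multivariate_normal([0, 0], np.eye(2), n)
X1 = rng.multivariate_normal([2, 2], np.eye(2), n)
X = np.vstack([X0, X1]); y = np.repeat([0, 1], n)

# Large test set drawn from the same Gaussians.
X0t = rng.multivariate_normal([0, 0], np.eye(2), 5000)
X1t = rng.multivariate_normal([2, 2], np.eye(2), 5000)
Xt = np.vstack([X0t, X1t]); yt = np.repeat([0, 1], 5000)

print("GDA/LDA:", LinearDiscriminantAnalysis().fit(X, y).score(Xt, yt))
print("LogReg :", LogisticRegression().fit(X, y).score(Xt, yt))
```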

Naïve Bayes

When our features X consist of discrete values, we can assume the features are conditionally independent of each other given Y. This leads us to the Naïve Bayes classifier, and the assumption that the features are conditionally independent of each other given Y is called the Naïve Bayes assumption. Despite making strong assumptions (which are not necessarily true), this classifier performs well on many real-world problems.

We can even use this same algorithm on continuous features by discretizing them into bins.

Naïve Bayes is based on Bayes' Theorem:

P(Y | X) = P(X | Y) · P(Y) / P(X)

Under the Naïve Bayes assumption (feature conditional independence given the class label Y):

P(X₁, X₂, …, X_d | Y) = P(X₁ | Y) · P(X₂ | Y) · … · P(X_d | Y)

So the posterior becomes:

P(Y | X₁, …, X_d) = [P(Y) · P(X₁ | Y) · P(X₂ | Y) · … · P(X_d | Y)] / P(X₁, …, X_d)

This is the core formula used in the classifier. You typically pick the class Y having the maximum probability:

ŷ = argmax_y P(Y = y) · P(X₁ | Y = y) · … · P(X_d | Y = y)
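Since the denominator P(X₁, …, X_d) is the same for every class, it can be dropped when taking the argmax. Below is a minimal from-scratch sketch of the resulting count-based classifier for discrete features; the function names, Laplace smoothing, and log-space arithmetic are my own additions for numerical stability:

```python
import numpy as np

def fit_nb(X, y, alpha=1.0):
    # X: (n, d) array of non-negative integer features; y: (n,) labels.
    classes = np.unique(y)
    log_prior = {c: np.log(np.mean(y == c)) for c in classes}  # log P(Y=c)
    n_vals = X.max(axis=0) + 1  # number of distinct values per feature
    log_cond = {}
    for c in classes:
        Xc = X[y == c]
        # log P(X_j = v | Y = c), with Laplace smoothing alpha.
        log_cond[c] = [
            np.log((np.bincount(Xc[:, j], minlength=n_vals[j]) + alpha)
                   / (len(Xc) + alpha * n_vals[j]))
            for j in range(X.shape[1])
        ]
    return log_prior, log_cond

def predict_nb(x, log_prior, log_cond):
    # Pick the class maximizing log P(Y=c) + sum_j log P(x_j | Y=c).
    scores = {c: log_prior[c] + sum(log_cond[c][j][v] for j, v in enumerate(x))
              for c in log_prior}
    return max(scores, key=scores.get)

# Tiny usage example on made-up data.
X = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])
y = np.array([0, 0, 1, 1])
lp, lc = fit_nb(X, y)
print(predict_nb(np.array([0, 1]), lp, lc))  # -> 0
```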

Multivariate Gaussian Distribution

The multivariate Gaussian distribution is the generalization of the normal (Gaussian) distribution to multiple variables. Instead of a single variable, we now have a vector of variables (features, in the above case). The multivariate Gaussian describes the joint distribution of all these variables together, i.e. they form a normal distribution in d dimensions (where d equals the number of variables we have).

This distribution is characterized by two parameters:

I. Mean Vector: Represents the expected value (mean) of each variable. [d-dimensional]

II. Covariance Matrix: Describes how variables vary individually (variances on the diagonal) and how they relate to each other (covariances off-diagonal). The shape and size of the distribution are determined by this covariance matrix. [d × d-dimensional]

p(x; μ, Σ) = 1 / ((2π)^(d/2) |Σ|^(1/2)) · exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))
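As a small numerical sketch, SciPy's multivariate_normal lets you evaluate this density and draw samples; the mean and covariance values below are arbitrary, chosen just for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])  # variances on the diagonal, covariance off it
dist = multivariate_normal(mean=mu, cov=sigma)

print(dist.pdf([0.5, -0.5]))              # density at a point
print(dist.rvs(size=3, random_state=0))   # draw a few samples
```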

Correlation vs Covariance

Both correlation and covariance measure the linear relationship between two variables. Covariance measures only the direction of the linear relationship, whereas correlation measures both the strength and direction of the linear relationship.

A positive covariance means both variables move in the same direction (if one increases, so does the other), whereas a negative covariance means the variables move in opposite directions. Covariance lies between −∞ and ∞, whereas correlation lies between −1 and 1, where 1 means a perfect positive relationship, −1 means a perfect negative relationship, and 0 signifies no linear relationship. Correlation can also be thought of as a normalized version of covariance.

I. Covariance between two variables X and Y:

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]

For each observation, we multiply the deviations of X and Y from their respective means, and then average over all such products.

II. Correlation between two variables X and Y:

Corr(X, Y) = Cov(X, Y) / (σ_X · σ_Y)

Correlation standardizes covariance by dividing it by the product of the standard deviations of X and Y.

III. Suppose we have d random variables (features); then the covariance matrix is:

Σ = [ Var(X₁)       Cov(X₁, X₂)   …   Cov(X₁, X_d)
      Cov(X₂, X₁)   Var(X₂)       …   Cov(X₂, X_d)
      ⋮             ⋮             ⋱   ⋮
      Cov(X_d, X₁)  Cov(X_d, X₂)  …   Var(X_d)    ]

The covariance matrix captures the variances of each variable along its diagonal and the covariances between every pair of variables in the off-diagonal entries, summarizing how all variables vary together.
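A quick NumPy sketch ties these ideas together; the data is synthetic, generated with an arbitrary positive relationship between the two variables:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.5, size=200)  # positively related to x

data = np.vstack([x, y])   # rows are variables, columns are observations
print(np.cov(data))        # 2x2 covariance matrix (variances on the diagonal)
print(np.corrcoef(data))   # 2x2 correlation matrix, entries in [-1, 1]
```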
