Probability and Statistics for Machine Learning
Key Probability Concepts
Random Variables
A random variable assigns a numerical value to the outcome of a random process. Random variables can be:
Discrete: Countable outcomes (e.g., the result of rolling a die).
Continuous: Values anywhere within a range, so the possible outcomes are uncountable (e.g., temperature readings).
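The two kinds of random variable can be illustrated with Python's standard random module; the 15–25 °C temperature range below is a hypothetical choice for the example.

```python
import random

random.seed(0)

# Discrete: a die roll takes one of six countable values.
die_roll = random.randint(1, 6)

# Continuous: a temperature reading can fall anywhere in a range
# (hypothetical 15-25 degrees Celsius here).
temperature = random.uniform(15.0, 25.0)

print(die_roll, round(temperature, 2))
```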
Probability Distributions
Probability distributions describe how likely a random variable is to take on a particular value. Key distributions include:
Bernoulli Distribution: Models binary outcomes (e.g., coin flips).
Normal Distribution: A bell-shaped curve, central to many ML models.
Poisson Distribution: Models the number of events in a fixed interval.
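All three distributions can be sampled with the standard library alone; since random has no Poisson sampler, the sketch below uses Knuth's algorithm as one simple option.

```python
import math
import random

random.seed(42)

def bernoulli(p):
    """One binary trial: 1 with probability p, else 0."""
    return 1 if random.random() < p else 0

def poisson(lam):
    """Knuth's algorithm: count events until the running product
    of uniforms drops below e^(-lam)."""
    threshold, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= threshold:
            return k
        k += 1

coin = bernoulli(0.5)           # Bernoulli: a coin flip
height = random.gauss(170, 10)  # Normal: bell curve around a mean
arrivals = poisson(3.0)         # Poisson: event count in an interval
```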
Bayes' Theorem
Bayes' theorem is a cornerstone of probabilistic reasoning. It allows us to update probabilities as new evidence becomes available: P(A | B) = P(B | A) · P(A) / P(B), where P(A) is the prior, P(B | A) the likelihood, and P(A | B) the posterior.
This principle underlies models like Naive Bayes classifiers.
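Applying the rule P(A | B) = P(B | A) · P(A) / P(B) numerically makes the update concrete; the spam-filter probabilities below are hypothetical numbers chosen for illustration.

```python
# Hypothetical numbers: updating the probability an email is spam
# after observing that it contains the word "offer".
p_spam = 0.2
p_word_given_spam = 0.6
p_word_given_ham = 0.05

# Total probability of seeing the word at all (law of total probability).
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)  # approximately 0.75
```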
Expectation and Variance
Expectation: The probability-weighted average value of a random variable; for a discrete variable, E[X] = Σ x · P(X = x).
Variance: The spread of a random variable around its mean, Var(X) = E[(X − E[X])²].
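For a fair six-sided die, both quantities can be computed directly from the definitions:

```python
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6  # fair die: each face equally likely

# Expectation: probability-weighted average of the outcomes.
expectation = sum(x * p for x in outcomes)  # 3.5

# Variance: expected squared deviation from the mean (35/12, about 2.92).
variance = sum((x - expectation) ** 2 * p for x in outcomes)
```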
Key Statistical Concepts
Descriptive Statistics
Descriptive statistics summarize and describe data:
Mean: Average value.
Median: Middle value.
Mode: Most frequent value.
Standard Deviation: Measure of data dispersion.
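Python's standard statistics module covers all four summaries; the data list below is a small hypothetical sample.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

mean = statistics.mean(data)      # 5.0
median = statistics.median(data)  # 4.5
mode = statistics.mode(data)      # 4 (appears most often)
stdev = statistics.pstdev(data)   # population standard deviation: 2.0
```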
Inferential Statistics
Inferential statistics allow us to draw conclusions about a population based on a sample:
Hypothesis Testing: Testing assumptions about a population (e.g., t-tests).
Confidence Intervals: A range of values that, at a stated confidence level, is expected to contain the true parameter.
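A confidence interval for a mean can be sketched with the standard library; the measurements are hypothetical, and the sketch uses the normal approximation (z ≈ 1.96) for simplicity, though a t critical value would be more appropriate for a sample this small.

```python
import math
import statistics

sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]  # hypothetical measurements

n = len(sample)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

# Approximate 95% confidence interval: mean +/- 1.96 standard errors.
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem
```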
Correlation and Causation
Correlation measures the relationship between two variables (e.g., Pearson’s correlation coefficient).
Causation indicates that one variable causes changes in another. Machine learning models typically capture correlation; establishing causation requires additional assumptions or controlled experiments, since correlation alone does not imply causation.
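Pearson's correlation coefficient can be computed from scratch as the covariance divided by the product of the standard deviations; the paired data below is a hypothetical example.

```python
import math

# Hypothetical paired observations (e.g., study hours vs. exam score).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Pearson's r: covariance divided by the product of standard deviations.
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
```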
Applications of Probability and Statistics in Machine Learning
1. Naive Bayes Classifier
Based on Bayes' theorem.
Assumes conditional independence between features given the class.
Effective for text classification and spam filtering.
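A toy version of the idea can be sketched directly from word counts; the counts, priors, and message below are all hypothetical, and add-one (Laplace) smoothing handles words unseen in a class.

```python
import math

# Toy word counts from a hypothetical labelled corpus.
spam_counts = {"offer": 8, "win": 6, "meeting": 1}
ham_counts = {"offer": 1, "win": 1, "meeting": 9}
p_spam, p_ham = 0.4, 0.6  # hypothetical class priors

def log_score(words, counts, prior):
    """Log posterior score: log prior plus summed log word likelihoods."""
    total = sum(counts.values())
    # Log probabilities avoid numerical underflow on long messages;
    # add-one smoothing keeps unseen words from zeroing the product.
    score = math.log(prior)
    for w in words:
        score += math.log((counts.get(w, 0) + 1) / (total + len(counts)))
    return score

msg = ["win", "offer"]
label = "spam" if log_score(msg, spam_counts, p_spam) > log_score(msg, ham_counts, p_ham) else "ham"
```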
2. Regression Analysis
Linear regression uses statistical techniques to predict a continuous outcome.
Logistic regression estimates probabilities for binary classification.
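For a single feature, linear regression has a closed-form least-squares solution; the data below is hypothetical, generated to follow roughly y = 2x + 1 with noise.

```python
# Hypothetical data with a roughly linear trend: y is about 2x + 1.
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 9.0, 11.1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Ordinary least squares for one feature:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx

def predict(v):
    """Predict a continuous outcome from the fitted line."""
    return intercept + slope * v
```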
3. Evaluation Metrics
Confusion Matrix: Tabulates true positives, false positives, false negatives, and true negatives.
ROC-AUC: Area under the receiver operating characteristic curve; summarizes classification performance across decision thresholds.
p-values: Help assess the statistical significance of features or model coefficients.
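The confusion-matrix cells, and metrics derived from them, follow directly from comparing predictions to labels; the label vectors below are hypothetical.

```python
# Hypothetical binary predictions vs. ground truth.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

precision = tp / (tp + fp)  # of predicted positives, how many are right
recall = tp / (tp + fn)     # of actual positives, how many were found
```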
4. Sampling Techniques
Bootstrap sampling for model validation.
Stratified sampling to handle imbalanced datasets.
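Bootstrap sampling can be sketched in a few lines: resample with replacement many times and examine the spread of a statistic across resamples. The data here is a hypothetical sample, and the mean is used as the statistic.

```python
import random

random.seed(0)
data = [4.2, 5.1, 3.9, 6.0, 5.5, 4.8, 5.2, 4.4]  # hypothetical sample

# Bootstrap: resample with replacement, recompute the statistic each time.
boot_means = []
for _ in range(1000):
    resample = [random.choice(data) for _ in data]
    boot_means.append(sum(resample) / len(resample))

# Empirical 95% interval from the 2.5% and 97.5% quantiles.
boot_means.sort()
ci_95 = (boot_means[25], boot_means[974])
```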
5. Generative Models
Probabilistic models like Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs) rely heavily on probability.
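The probabilistic core of a GMM is just a weighted sum of component densities; the two-component univariate mixture below uses hypothetical weights, means, and standard deviations (fitting them, e.g. via EM, is a separate step).

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical two-component mixture; weights must sum to 1.
weights = [0.3, 0.7]
means = [0.0, 5.0]
sigmas = [1.0, 2.0]

def gmm_pdf(x):
    """Mixture density: weighted sum of the component densities."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas))
```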