What is Binary Classification?
Binary classification is a supervised learning task in which a model predicts one of two possible classes, for example whether a message is spam or not spam.
The goal is to train a model that correctly separates these two categories based on input features.
How Binary Classification Works
1. Data Preparation
2. Choosing an Algorithm
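Common algorithms for binary classification include logistic regression (used in the mini-challenge below), decision trees, random forests, support vector machines, and naive Bayes.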
3. Model Training
The model learns patterns from the training data to distinguish between the two classes.
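To make "learning patterns" concrete, here is a minimal sketch of logistic regression trained with plain batch gradient descent on a made-up, one-feature dataset. The data, the learning rate, the epoch count, and the function names are all illustrative assumptions. Library implementations such as scikit-learn's LogisticRegression use a different, regularized solver, so treat this as a picture of the idea rather than of any library's internals.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_1d(xs, ys, lr=0.3, epochs=5000):
    """Fit a weight w and bias b by batch gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted probability minus true label
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Hypothetical toy data: small feature values belong to class 0, large ones to class 1.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = train_logistic_1d(xs, ys)
probs = [sigmoid(w * x + b) for x in xs]
preds = [1 if p >= 0.5 else 0 for p in probs]
print("weight:", w, "bias:", b)
print("predictions:", preds)
```

Each pass nudges the weight and bias in the direction that lowers the average log loss, which is what pushes the predicted probabilities for the two classes apart.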
4. Making Predictions
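The model outputs a probability score between 0 and 1. A threshold (often 0.5) turns that score into a class: scores at or above the threshold are assigned to the positive class (1), and scores below it to the negative class (0).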
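The thresholding step can be sketched in a few lines. The scores below are hypothetical model outputs, and `classify` is an illustrative helper rather than a library function. Assigning a score that falls exactly on the threshold to the positive class is a convention chosen here; some libraries break that tie the other way.

```python
def classify(probabilities, threshold=0.5):
    """Return 1 for scores at or above the threshold, otherwise 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

scores = [0.08, 0.35, 0.50, 0.72, 0.91]  # hypothetical model outputs

default_labels = classify(scores)                # threshold of 0.5
strict_labels = classify(scores, threshold=0.8)  # fewer positives, flagged with more confidence

print(default_labels)
print(strict_labels)
```

Raising the threshold trades missed positives for fewer false alarms, so the right value depends on which kind of error is more costly for the task.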
5. Model Evaluation
To measure performance, we use metrics such as accuracy, precision, recall, the F1 score, and the confusion matrix.
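As a rough illustration of how these metrics relate to one another, the sketch below computes them by hand from a small set of hypothetical labels and predictions. The helper names are made up for this example; in practice, scikit-learn's `accuracy_score`, `confusion_matrix`, and `classification_report` report the same quantities.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating 1 as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def summarize(y_true, y_pred):
    """Derive accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted positives, how many were right
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual positives, how many were found
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical ground truth and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = summarize(y_true, y_pred)
print("TP, FP, FN, TN:", counts)
print(metrics)
```

Accuracy alone can look good on imbalanced data such as spam detection, which is why precision and recall are usually reported alongside it.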
Hands-On Mini-Challenge: Binary Classification with Logistic Regression
Let’s classify messages as spam or not spam using a simple logistic regression model.
Starter Code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Load dataset (assumes the file has 'label' and 'message' columns; swap in your own data if needed)
emails = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms-spam.csv')
# Prepare features and labels
X = emails['message']
y = (emails['label'] == 'spam').astype(int)  # Convert 'spam' to 1, 'ham' to 0
# Train-test split on the raw text, so the test messages stay unseen during vectorization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Convert text into word-count vectors; learn the vocabulary from the training set only
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
# Train logistic regression model (max_iter raised to give the solver room to converge on word counts)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Predictions and evaluation
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
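For readers wondering what "converting text into numerical vectors" involves, here is a simplified, pure-Python sketch of a bag-of-words transform. It mimics the general idea behind CountVectorizer's defaults (lowercasing, tokens of two or more word characters, an alphabetically ordered vocabulary) but is not the library's implementation. The example messages and helper names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocab):
    """Turn each document into a row of token counts, one column per vocabulary entry."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

# Hypothetical messages used to build the vocabulary.
train_docs = ["Win a free prize now", "Are we free for lunch", "free free prize"]
vocab = fit_vocabulary(train_docs)
train_rows = transform(train_docs, vocab)

# A new message: "money" was never seen, so it has no column and is dropped.
new_rows = transform(["free money"], vocab)

print(list(vocab))
print(train_rows)
print(new_rows)
```

The sketch also shows that a word absent when the vocabulary was built is ignored at transform time, so the model can only weigh words it encountered while the vocabulary was being fitted.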
Best Practices for Binary Classification