Getting Started with Support Vector Machines: Theory and Hands-On Applications
Welcome to the 35th edition of the Engineering Exploration Series!
Continuing our machine learning (ML) journey, in this edition we’ll look into the Support Vector Machine (SVM). SVM is a powerful and versatile ML model capable of performing both linear and nonlinear classification and regression tasks. At its core, it works by finding the optimal hyperplane: the decision boundary that best separates different classes in the feature space while maximising the margin between the closest points of each class, known as support vectors.
In this article, we’ll first explore the theoretical foundation of SVM, and then see how we can practically implement it using scikit-learn (sklearn). To solidify the concepts, we’ll also walk through three hands-on examples. I hope you find this article insightful and applicable!
Introduction
A Support Vector Machine (SVM) is a powerful supervised machine learning algorithm designed for both classification and regression tasks. It is particularly effective in high-dimensional spaces, and is widely used in sentiment analysis tasks (e.g., classifying movie reviews as positive or negative), image recognition, and bioinformatics.
Originally introduced by Vladimir Vapnik in the 1990s, SVMs aim to minimise misclassification errors by finding the best possible decision boundary, known as a hyperplane in higher-dimensional data, that separates different classes with the maximum possible margin.
You might wonder about the term Machine in the name. Let’s explore that briefly.
What is “Machine” in Support Vector Machine?
The word ‘Machine’ in Support Vector Machine dates back to early computer science (1950s-1980s), when the term ‘machine’ commonly referred to an abstract algorithm or system that performed a specific task, especially classification or decision-making. So, machine in SVM simply means ‘an algorithm that uses vectors to learn and make decisions’.
Decision Rule and Decision Boundaries
Imagine a two-dimensional space where we have a mix of positive and negative data points. The question becomes:
How can we separate the positive examples from the negative examples?
One simple approach would be to draw a straight line. But since many straight lines can separate the two classes, which one should we choose?
The figure below shows several of the many possible lines that can separate the positive and negative classes.
Some lines come closer to the negative examples, others to the positive examples. However, the best straight line is the one that runs through the middle, maximising the distance between the two classes. In SVM, we aim to find this optimal boundary.
Let’s redraw the points with two dashed lines to create the widest “street” between the positive and negative samples.
Now suppose we have:
- a vector w of any length (perpendicular to the median line of the street)
- and an unknown vector x.
We are interested in determining whether vector x lies on the right side of the street or on the left. To determine that, we project x onto the vector w. The dot product is given by:

w · x = ||w|| ||x|| cos(θ)

Where,
- ||w|| and ||x|| are the magnitudes of the vectors
- θ (theta) is the angle between them

This gives us the distance in the direction of vector w. If the dot product w · x ≥ c (where c is a constant), it tells us that x lies on the right side of the decision boundary.
For ease of calculation, we often use the unit vector along w.
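To make the projection idea concrete, here is a tiny numeric sketch (the values of w, x, and c below are made up purely for illustration):

import numpy as np

# Made-up values for illustration only
w = np.array([2.0, 1.0])        # vector perpendicular to the street's median line
x = np.array([3.0, 2.0])        # unknown vector we want to classify
c = 2.0                         # threshold constant

w_unit = w / np.linalg.norm(w)  # unit vector along w
projection = np.dot(x, w_unit)  # distance of x in the direction of w

# If the projection is large enough, x lies on the positive side of the street
print(projection, projection >= c)  # prints ~3.578 True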
The distance between the two dashed lines (i.e., the width of the “street”) can be expressed as the projection of the vector between a positive and a negative support vector onto the unit vector w/||w||; as derived in the maximum margin section below, this works out to 2/||w||.
Without loss of generality (by adjusting the constant c and bias b), the classification conditions become:
- w · x + b ≥ 0 for +ve samples, and
- w · x + b < 0 for -ve samples.
What is a hyperplane?
Let’s consider a fresh 2D scatter plot as shown in Figure 4, featuring two types of data points: positive (+) and negative (-). Here, the blue dots represent the +ve class while the green dots represent the -ve class.
As ML engineers, our task is to find a line (in 2D space) that separates the two classes cleanly and with the highest confidence. The perpendicular distance from the decision boundary to the closest data points is known as the margin. The dashed lines on either side of the decision boundary show the extent of the margin and are also known as the margin extents.
In the context of the above figure, the symbol * implies a dot product; w*x means wTx, where wT is the transpose of the weight vector w.
In the figure:
- The red line in the centre of the margin wTx – b = 0 is the decision hyperplane that separates the two classes (blue and green points).
- The blue dashed line wTx – b = 1 and green dashed line wTx – b = -1 are called the supporting hyperplanes.
Maximum margin intuition
Why do we aim to maximise the margin?
The intuition is simple: by maximising the margin, the SVM creates a more reliable and robust separation between classes, leading to better generalisation to unseen data. In the figure, the margin is the yellow region between the two dashed supporting hyperplanes. Mathematically, the margin width is 2/||w||, where ||w|| is the norm (or length) of the weight vector w.
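Where does 2/||w|| come from? A short sketch of the standard derivation, using the supporting hyperplanes from the figure:

% x_+ is a positive support vector (on wTx - b = 1) and
% x_- is a negative support vector (on wTx - b = -1).
% Subtracting the two constraints eliminates b:
\[ w^{\top}(x_{+} - x_{-}) = 2 \]
% Projecting (x_+ - x_-) onto the unit normal w/||w|| gives the street width:
\[ \text{width} = (x_{+} - x_{-}) \cdot \frac{w}{\lVert w \rVert} = \frac{2}{\lVert w \rVert} \]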
Thus, SVM tries to maximise this margin, in effect minimising ||w||, to make the classifier more confident and resilient to new data.
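For a linear kernel, scikit-learn exposes the learned weight vector as clf.coef_, so the margin width can be checked directly. A minimal sketch, assuming clf is a fitted binary SVC(kernel='linear') as in Example 1 below:

import numpy as np

# clf.coef_[0] is the learned weight vector w (linear-kernel SVC only)
w = clf.coef_[0]
margin_width = 2 / np.linalg.norm(w)
print("Margin width:", margin_width)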
What are Support Vectors?
The support vectors are the key data points closest to the hyperplane (or decision boundary) that the SVM uses to separate the classes. In our example:
- The blue points touching the line wTx – b = 1 and
- The green points touching wTx – b = -1 are the support vectors.
The support vectors define the margin; if you move these points, the margin (and thus the classifier) would change.
When training a support vector machine, the algorithm searches for the decision boundary that leads to maximum margin, guided specifically by these support vectors. This approach ensures that the model doesn’t just separate the classes but does so with the maximum confidence.
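A fitted SVC stores the support vectors it found, so they are easy to inspect. A quick sketch, assuming clf has been fitted as in the examples below:

# Attributes of a fitted sklearn.svm.SVC
print(clf.support_vectors_)  # coordinates of the support vectors
print(clf.support_)          # indices of the support vectors in the training set
print(clf.n_support_)        # number of support vectors per class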
Kernel Function
Up to now, we’ve seen SVMs working in relatively simple scenarios where the data can be separated by a straight line (or hyperplane in higher dimensions). However, real-world datasets are often not linearly separable in their original feature space. In such cases, an additional step is needed: transforming the data into a higher-dimensional space where it becomes linearly separable.
This is where kernel functions come into play.
A kernel function is a mathematical tool that allows the SVM to implicitly map the input features into a higher-dimensional space without explicitly computing the transformation. By using kernels, SVMs can find nonlinear boundaries in the original input space while still solving a linear problem in the transformed space.
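To see this “implicit mapping” concretely, here is a small numeric check (my own illustration, not part of any library) that the degree-2 polynomial kernel (x·z + 1)² equals an ordinary dot product in an explicitly expanded six-dimensional feature space:

import numpy as np

def phi(x):
    # Explicit degree-2 polynomial feature map for a 2D input
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

kernel_value = (np.dot(x, z) + 1) ** 2    # computed in the original 2D space
explicit_value = np.dot(phi(x), phi(z))   # computed in the expanded 6D space

print(kernel_value, explicit_value)       # both print 25.0

The kernel computes the same quantity without ever constructing the six-dimensional vectors, which is exactly what makes kernelised SVMs efficient.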
Common Types of Kernel in Scikit-learn
The SVC class from sklearn.svm provides several built-in kernel options for training Support Vector Machines. Besides the built-in ones, scikit-learn also allows users to define custom kernels if needed. For linearly separable datasets, we typically use the linear kernel, but for more complex datasets, we have several other kernels in our arsenal:
from sklearn.svm import SVC
# Train SVM (Linear)
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
The polynomial kernel, K(x, x') = (gamma · x·x' + r)^d, has a scale factor (gamma), a constant term (r), and a degree of polynomial (d). It is suitable for data where interactions between features matter (e.g., XOR-type problems):
# Train SVM (Polynomial Kernel; degree: default=3, gamma: default='scale', coef0 (the r term): default=0)
clf = SVC(kernel='poly')
clf.fit(X_train, y_train)
The RBF kernel, K(x, x') = exp(-gamma · ||x - x'||²), uses gamma to control the spread of the curve (higher gamma → tighter curves). It is the most popular choice for nonlinear problems and finds circular or oval-shaped decision boundaries.
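A sketch in the same style as the snippets above; note that 'rbf' is in fact the default kernel for SVC:

# Train SVM (RBF Kernel; gamma: default='scale')
clf = SVC(kernel='rbf')
clf.fit(X_train, y_train)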
The sigmoid kernel, K(x, x') = tanh(gamma · x·x' + r), is inspired by neural networks (activation functions). It is rarely used today, as it is often less effective than the RBF or polynomial kernels.
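For completeness, a sketch of the sigmoid kernel in the same style:

# Train SVM (Sigmoid Kernel; gamma: default='scale', coef0 (the r term): default=0)
clf = SVC(kernel='sigmoid')
clf.fit(X_train, y_train)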
Example Problems
To solidify our understanding, we’ll walk through three practical examples:
- Linearly separable dataset using a simple linear kernel
- Nonlinearly separable dataset that cannot be separated by a single hyperplane
- Comparative study showing how different kernels perform across various datasets.
These examples will give you a hands-on insight into how different kernels behave and the importance of choosing the right one for your task.
Example-1: Iris Dataset (Linearly Separable)
Let’s start with the classical Iris dataset from sklearn.datasets. This is a multiclass dataset with three classes labelled 0, 1, and 2, representing three iris species (Setosa, Versicolor, Virginica). There are 150 samples, and each sample has 4 features (sepal length, sepal width, petal length, petal width).
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import os
def plot_decision_boundary(clf, X, y, title):
    h = .02  # step size in the mesh
    # Create a mesh to plot the decision boundary
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolors='k')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title)
    plt.grid(True)
    plt.tight_layout()
    return plt
# Load dataset and filter for binary classification
iris = datasets.load_iris()
X = iris.data[iris.target != 2, :2]  # Drop samples of class 2 and keep only the first two features (columns 0 and 1)
y = iris.target[iris.target != 2]  # Only classes 0 and 1 (i.e., where target is NOT EQUAL to 2)
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Scale data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train SVM (Linear)
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
plt1 = plot_decision_boundary(clf, np.vstack([X_train, X_test]),
                              np.hstack([y_train, y_test]),
                              "SVM with Linear Kernel (Iris Binary Classification)")
# Score
print("Accuracy:", clf.score(X_test, y_test))
# Saving Figure
fig = os.path.join(os.getcwd(), 'Iris_plots.png')
plt.savefig(fig, dpi=300)
plt.show()
Example-2: Nonlinear Dataset - Concentric Circles
For our second example, we will use the make_circles() function from sklearn.datasets, which generates a 2D binary classification dataset where the data points are arranged in two concentric circles: a smaller inner circle and a larger outer circle. This dataset is well suited for testing algorithms that can handle nonlinear boundaries, so we will tackle it with the Radial Basis Function (RBF) kernel. The function takes four parameters (n_samples, noise, factor, random_state); if not specified, the default n_samples is 100. Higher noise makes the circles messy and overlapping, and therefore harder to classify. A factor close to 1 makes the two circles almost overlap, whereas a factor close to 0 shrinks the inner circle to almost a point.
from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import os
def plot_decision_boundary(clf, X, y, title):
    h = .02  # step size in the mesh
    # Create a mesh to plot the decision boundary
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolors='k')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title)
    plt.grid(True)
    plt.tight_layout()
    return plt
# Generate circular data
X, y = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train SVM with RBF kernel
clf = SVC(kernel='rbf', C=1, gamma='scale')
clf.fit(X_train, y_train)
plt2 = plot_decision_boundary(clf, np.vstack([X_train, X_test]),
                              np.hstack([y_train, y_test]),
                              "SVM with RBF Kernel (make_circles Dataset)")
# Score
print("Accuracy:", clf.score(X_test, y_test))
# Saving Figure
fig = os.path.join(os.getcwd(), 'Circles_plots.png')
plt.savefig(fig, dpi=300)
plt.show()
Example-3: Comparing Kernels Across Different Datasets
In the third and final example for this article, we will explore three different datasets, all generated using scikit-learn: make_moons(), make_circles() and make_classification(). The details of each dataset are as follows:
- make_moons(noise=0.3, random_state=0)
- make_circles(noise=0.2, factor=0.5, random_state=1)
- make_classification(n_features=2, n_redundant=0, n_informative=2, random_state=1, n_clusters_per_class=1)
We’ll apply various SVM kernels and compare their performance.
# Original Source Modified from: https://gist.github.com/WittmannF/60680723ed8dd0cb993051a7448f7805
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification # Data generators provided by scikit-learn to create synthetic datasets
from sklearn.svm import SVC
import os
# Step size for the mesh grid used in plotting decision boundaries
h = .02
# Define names and corresponding SVM classifiers with different kernels
names = ["Linear SVM", "RBF SVM", "Poly SVM", "Sigmoid SVM"]
classifiers = [
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),  # RBF kernel (SVC's default)
    SVC(kernel="poly", C=0.025),
    SVC(kernel="sigmoid", gamma=2)]
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)
linearly_separable = (X, y)
# The default number of samples in each dataset is 100 (n_samples=100)
datasets = [make_moons(noise=0.3, random_state=0),  # Moon-shaped dataset
            make_circles(noise=0.2, factor=0.5, random_state=1),  # Circle-shaped dataset
            linearly_separable  # Linearly separable dataset
            ]
# Set up the overall figure
figure = plt.figure(figsize=(27, 9))
i = 1 # index for sub-plotting figures
# iterate over datasets
for ds in datasets:
    X, y = ds
    X = StandardScaler().fit_transform(X)  # Standardize features
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)  # Train-test split
    # Define mesh boundaries for plotting
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Plot raw dataset (first column)
    cm = plt.cm.RdBu  # Use predefined colormap
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])  # Custom colormap with pure red and pure blue
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1
    # Iterate through classifiers and visualize results
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)  # Train classifier
        score = clf.score(X_test, y_test)  # Accuracy score on test set
        # Plot the decision boundary. For that, we will assign a color to each
        # point in the mesh [x_min, x_max]x[y_min, y_max].
        if hasattr(clf, "decision_function"):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)
        # Plot also the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
        # Plot testing points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                   alpha=0.6)
        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        ax.set_title(name)
        # Add a text label showing the classifier's accuracy score (bottom-right corner)
        ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score), size=15, horizontalalignment='right')
        i += 1
# Adjust subplot layout and save figure
figure.subplots_adjust(left=.02, right=.98)
# Saving Figure
fig = os.path.join(os.getcwd(), 'Combined_plots.png')
plt.savefig(fig, dpi=300)
plt.show()
Conclusion
In this article, we explored the fundamentals of Support Vector Machines (SVMs), from understanding hyperplanes and margins to the role of support vectors and kernel functions. We discussed the different types of kernels available in scikit-learn and how they help tackle both linear and nonlinear classification problems. Through practical examples, we saw how the choice of kernel can dramatically impact model performance depending on the dataset’s characteristics.
Mastering SVMs and selecting the right kernel is a valuable skill for any machine learning practitioner, especially when dealing with complex or high-dimensional data. With a strong foundation in these concepts, you’ll be better equipped to apply SVMs effectively in real-world tasks.
References
1. Lecture by Professor Patrick Winston, MIT 6.034 Artificial Intelligence, Fall 2010, https://www.youtube.com/watch?v=_PwhiWxHK8o&t=29s
2. Aurélien Géron, “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow”, O’Reilly, 2nd Edition
3. John D. Kelleher, Brian Mac Namee, Aoife D’Arcy, “Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies”, MIT Press, 2015
4. Sebastian Raschka, “Python Machine Learning”, Packt Publishing, 2015
5. Scikit-learn API, https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html