Getting Started with Support Vector Machines: Theory and Hands-On Applications

Welcome to the 35th edition of the Engineering Exploration Series!

Continuing our machine learning (ML) journey, in this edition we'll look into the Support Vector Machine (SVM). SVM is a powerful and versatile ML model capable of performing both linear and nonlinear classification and regression tasks. At its core, it works by finding the optimal hyperplane, the decision boundary that best separates different classes in the feature space, while maximising the margin between the closest points of each class, known as support vectors.

In this article, we’ll first explore the theoretical foundation of SVM, and then see how we can practically implement it using scikit-learn (sklearn). To solidify the concepts, we’ll also walk through three hands-on examples. I hope you find this article insightful and applicable!

Introduction

A Support Vector Machine (SVM) is a powerful supervised machine learning algorithm designed for both classification and regression tasks. It is particularly effective in high-dimensional spaces and is widely used in tasks such as sentiment analysis (e.g., classifying movie reviews as positive or negative), image recognition, and bioinformatics.

Originally introduced by Vladimir Vapnik in the 1990s, SVMs aim to minimise misclassification errors by finding the best possible decision boundary, known as a hyperplane in higher-dimensional data, that separates different classes with the maximum possible margin.

You might wonder about the term Machine in the name. Let’s explore that briefly.

What is “Machine” in Support Vector Machine?

The word 'Machine' in Support Vector Machine dates back to early computer science (1950s-1980s), when the term 'machine' commonly referred to an abstract algorithm or system that performed a specific task, especially classification or decision-making. So, machine in SVM simply means 'an algorithm that uses vectors to learn and make decisions'.

Decision Rule and Decision Boundaries

Imagine a two-dimensional space containing a mix of positive and negative data points. The question becomes:

How can we separate the positive examples from the negative examples?

One simple approach would be to draw a straight line. But since many straight lines can separate the two classes, which one should we choose?

In the figure below, multiple lines (and many more) can separate the positive and negative classes.


[Figure: several candidate lines separating the positive and negative classes]

Some lines come closer to the negative examples, others to the positive examples. However, the best straight line is the one that runs through the middle, maximising the distance between the two classes. In SVM, we aim to find this optimal boundary.


[Figure: the optimal line running through the middle, maximising the distance to both classes]

Let’s redraw the points with two dashed lines to create the widest “street” between the positive and negative samples.

Now suppose we have:

- a vector w of any length (perpendicular to the median line of the street)

- an unknown vector x.

We are interested in determining whether vector x lies on the right side of the street or on the left. To determine that, we project x onto the vector w. The dot product is given by:

w . x = ||w|| ||x|| cos(theta)

where ||w|| and ||x|| are the magnitudes of the vectors and theta is the angle between them.

This gives us the distance of x in the direction of the vector w. If the dot product w . x >= c (where c is a constant), then x lies on the right side of the decision boundary.
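To make the decision rule concrete, here is a tiny numeric sketch; the vector w, the constant c, and the sample points are all made up for illustration:

```python
import numpy as np

# Hypothetical values: w points across the street, c is the threshold constant
w = np.array([1.0, 1.0])
c = 2.0

x_right = np.array([3.0, 2.0])  # a point we expect on the right side
x_left = np.array([-1.0, 0.0])  # a point we expect on the left side

# Decision rule: x is on the right side of the street if w . x >= c
print(np.dot(w, x_right) >= c)  # True  (5.0 >= 2.0)
print(np.dot(w, x_left) >= c)   # False (-1.0 < 2.0)
```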

For ease of calculation, we often use the unit vector along w.

(w / ||w||) . x >= c

The difference between the two dashed lines (i.e., the width of the “street”) can be expressed as:

width = (x+ - x-) . (w / ||w||) = 2 / ||w||

where x+ and x- are support vectors lying on the positive and negative dashed lines, respectively.

Without loss of generality (by adjusting the constant c and bias b), the classification conditions become

- w . x + b >= 1 for +ve samples, and

- w . x + b <= -1 for -ve samples

What is a hyperplane?

Let’s consider a fresh 2D scatter plot as shown in Figure 4, featuring two types of data points: positive (+) and negative (-). Here, the blue dots represent the +ve class while the green dots represent the -ve class.

As ML engineers, our task is to find a line (in 2D space) that separates the two classes cleanly and with the highest confidence. The perpendicular distance from the decision boundary to the closest data points is known as the margin. The dashed lines on either side of the decision boundary show the extent of the margin, also known as the margin extents.


[Figure 4: decision hyperplane wTx – b = 0 with supporting hyperplanes wTx – b = 1 and wTx – b = -1]

In the context of the above figure, the symbol * implies a dot product; w*x means wTx, where wT is the transpose of the weight vector w.

In the figure:

- The red line in the centre of the margin, wTx – b = 0, is the decision hyperplane that separates the two classes (blue and green points).

- The blue dashed line wTx – b = 1 and the green dashed line wTx – b = -1 are called the supporting hyperplanes.

Maximum margin intuition

Why do we aim to maximise the margin?

The intuition is simple: by maximising the margin, the SVM creates a more reliable and robust separation between classes, leading to better generalisation to unseen data. In the figure, the margin is the yellow region between the two dashed supporting hyperplanes. Mathematically, the margin width is 2 / ||w||, where ||w|| is the norm (or length) of the weight vector w.

Thus, SVM tries to maximise this margin, in effect minimising ||w||, to make the classifier more confident and resilient to new data.
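For a fitted linear-kernel SVC, scikit-learn exposes w through the coef_ attribute, so the margin width 2 / ||w|| can be read off directly. A minimal sketch on made-up, linearly separable toy points:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable toy data
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel='linear', C=1e6)  # large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                   # learned weight vector (linear kernel only)
width = 2 / np.linalg.norm(w)      # margin width = 2 / ||w||
print(f"Margin width: {width:.2f}")
```

Note that coef_ is only defined for the linear kernel; for other kernels, w lives in the transformed feature space and is not materialised.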

What are Support Vectors?

The support vectors are the key data points that are closest to the hyperplane (or decision boundary) that the SVM uses to separate different classes. In our example:

- the blue points touching the line wTx – b = 1, and

- the green points touching the line wTx – b = -1

are the support vectors.

The support vectors define the margin; if you move these points, the margin (and thus the classifier) would change.

When training a support vector machine, the algorithm searches for the decision boundary that leads to maximum margin, guided specifically by these support vectors. This approach ensures that the model doesn’t just separate the classes but does so with the maximum confidence.
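After fitting, scikit-learn exposes these points directly through the support_vectors_ attribute. A small sketch on hypothetical toy data, where the middle points (1, 1) and (4, 4) are closest to the boundary and should come out as the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two classes along a line
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel='linear', C=1e6)  # large C approximates a hard margin
clf.fit(X, y)

print(clf.support_vectors_)  # the support vectors themselves
print(clf.n_support_)        # number of support vectors per class
```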

Kernel Function

Up to now, we've seen SVMs working in relatively simple scenarios where the data can be separated by a straight line (or a hyperplane in higher dimensions). However, real-world datasets are often not linearly separable in their original feature space. In such cases, an additional step is needed: transforming the data into a higher-dimensional space where it becomes linearly separable.

This is where kernel functions come into play.

A kernel function is a mathematical tool that allows the SVM to implicitly map the input features into a higher-dimensional space without explicitly computing the transformation. By using kernels, SVMs can find nonlinear boundaries in the original input space while still solving a linear problem in the transformed space.
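A small numeric check of this idea: for the degree-2 polynomial kernel K(x, z) = (x . z)^2, the implicit feature map in 2D is phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), and the kernel value equals the dot product in that transformed space. The sample vectors below are arbitrary:

```python
import numpy as np

def phi(v):
    # Explicit degree-2 feature map for a 2D input vector
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])

kernel_value = np.dot(x, z) ** 2         # computed in the original 2D space
explicit_value = np.dot(phi(x), phi(z))  # computed in the 3D transformed space

print(kernel_value, explicit_value)  # both equal 25.0
```

The kernel computes the same quantity without ever constructing phi(x), which is the whole point: the transformed space can be very high-dimensional (or infinite-dimensional, for RBF) at no extra cost.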

Common Types of Kernel in Scikit-learn

The SVC class from sklearn.svm provides several built-in kernel options for training Support Vector Machines. Besides the built-in ones, scikit-learn also allows users to define custom kernels if needed. For linearly separable datasets, we typically use the linear kernel, but for more complex datasets, we have several other kernels in our arsenal:

  • Linear Kernel: Used when data is linearly separable.

K(x, x') = x . x'
from sklearn.svm import SVC
# Train SVM (Linear)
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)        

  • Polynomial Kernel

K(x, x') = (gamma * (x . x') + r)^d

It has a scale factor (gamma), a constant term (r), and a polynomial degree (d). This kernel is suitable for data where interactions between features matter (e.g., XOR-type problems).

# Train SVM (Polynomial Kernel, degree: default=3, gamma: (default=’scale’), coef0: r term, default=0)
clf = SVC(kernel='poly')
clf.fit(X_train, y_train)        

  • Radial Basis Function (RBF)/ Gaussian Kernel


K(x, x') = exp(-gamma * ||x - x'||^2)

gamma controls the spread of the curve (higher gamma gives tighter curves). It is the most popular kernel for nonlinear problems, and finds circular or oval-shaped decision boundaries.
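For symmetry with the snippets above, a minimal RBF training example; instead of assuming an existing X_train, it generates nonlinear toy data with make_circles (previewing Example 2):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Nonlinear toy data: two concentric circles
X_train, y_train = make_circles(n_samples=100, factor=0.5, noise=0.05,
                                random_state=0)

# Train SVM (RBF kernel; gamma: default='scale', C: default=1.0)
clf = SVC(kernel='rbf', gamma='scale')
clf.fit(X_train, y_train)
print("Training accuracy:", clf.score(X_train, y_train))
```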

  • Sigmoid Kernel


K(x, x') = tanh(gamma * (x . x') + r)

Inspired by neural networks (activation functions). It is rarely used today, as it is often less effective than the RBF or polynomial kernels.
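As noted earlier, scikit-learn also accepts a custom kernel: any callable that takes two data matrices and returns the Gram matrix. A hand-rolled linear kernel, sketched on made-up toy points:

```python
import numpy as np
from sklearn.svm import SVC

def my_linear_kernel(A, B):
    # Gram matrix of pairwise dot products; equivalent to kernel='linear'
    return A @ B.T

# Hypothetical linearly separable toy data
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel=my_linear_kernel)
clf.fit(X, y)
print(clf.predict([[0.5, 0.2], [3.5, 3.0]]))
```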

Example Problems

To solidify our understanding, we’ll walk through three practical examples:

- Linearly separable dataset using a simple linear kernel

- Nonlinearly separable dataset that cannot be separated by a simple hyperplane

- Comparative study showing how different kernels perform across various datasets.

These examples will give you a hands-on insight into how different kernels behave and the importance of choosing the right one for your task.

Example-1: Iris Dataset (Linearly Separable)

Let's start with the classical Iris dataset from sklearn.datasets. This is a multiclass dataset with three classes labelled 0, 1, and 2, representing three iris species (Setosa, Versicolor, Virginica). There are 150 samples, and each sample has 4 features (sepal length, sepal width, petal length, petal width).


[Figure: the three iris species with their sepal and petal measurements]
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import os


def plot_decision_boundary(clf, X, y, title):
    h = .02  # step size in the mesh

    # Create a mesh to plot the decision boundary
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolors='k')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title)
    plt.grid(True)
    plt.tight_layout()
    return plt

# Load dataset and filter for binary classification
iris = datasets.load_iris()
X = iris.data[iris.target != 2, :2]  # Filter out class 2 and select only the first two features (columns 0 and 1)
y = iris.target[iris.target != 2]    # Keep only classes 0 and 1

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train SVM (Linear)
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)

plt1 = plot_decision_boundary(clf, np.vstack([X_train, X_test]),
                              np.hstack([y_train, y_test]),
                              "SVM with Linear Kernel (Iris Binary Classification)")

# Score
print("Accuracy:", clf.score(X_test, y_test))

# Saving Figure
fig = os.path.join(os.getcwd(), 'Iris_plots.png')
plt.savefig(fig, dpi=300)
plt.show()        


[Figure: linear-kernel decision boundary on the Iris binary classification task]

Example-2: Nonlinear Dataset - Concentric Circles

For our second example, we will use the make_circles() function from sklearn.datasets, which generates a 2D binary classification dataset where the data points are arranged in two concentric circles: a smaller inner circle and a larger outer circle. This dataset is well suited for testing algorithms that can handle nonlinear boundaries, so we will try it with the Radial Basis Function (RBF) kernel. The function takes four parameters (n_samples, noise, factor, random_state); if not specified, the default n_samples is 100. Higher noise makes the circles messy and overlapping, and thus harder to classify. A higher factor (~1) makes the two circles almost overlap, whereas a lower factor (~0) shrinks the inner circle to almost a point.

from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import os


def plot_decision_boundary(clf, X, y, title):
    h = .02  # step size in the mesh

    # Create a mesh to plot the decision boundary
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolors='k')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title)
    plt.grid(True)
    plt.tight_layout()
    return plt

# Generate circular data
X, y = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train SVM with RBF kernel
clf = SVC(kernel='rbf', C=1, gamma='scale')
clf.fit(X_train, y_train)

plt2 = plot_decision_boundary(clf, np.vstack([X_train, X_test]),
                              np.hstack([y_train, y_test]),
                              "SVM with RBF Kernel (make_circles Dataset)")

# Score
print("Accuracy:", clf.score(X_test, y_test))
# Saving Figure
fig = os.path.join(os.getcwd(), 'Circles_plots.png')
plt.savefig(fig, dpi=300)
plt.show()        


[Figure: RBF-kernel decision boundary on the make_circles dataset]

Example-3: Comparing Kernels Across Different Datasets

In the third and final example for this article, we will explore three different datasets, all generated using scikit-learn: make_moons(), make_circles() and make_classification(). The details of each dataset are as follows:

make_moons(noise=0.3, random_state=0)

  • Generates a 2D (non-linearly separable) binary classification dataset with two interleaving half circles (like crescent moons)
  • noise=0.3 adds Gaussian noise to make the classification problem harder

make_circles(noise=0.2, factor=0.5, random_state=1)

  • Generates a large circle containing a smaller circle in 2D (binary classification task) to test concentric circular patterns
  • factor=0.5 controls the distance between the inner and outer circles

make_classification(n_features=2, n_redundant=0, n_informative=2, random_state=1, n_clusters_per_class=1)

  • Generates a synthetic, linearly separable classification dataset
  • n_features=2: Only two features, for easy plotting
  • n_informative=2: Use both features for classification
  • n_clusters_per_class=1: Keeps the class distribution simple

We’ll apply various SVM kernels and compare their performance.

# Original Source Modified from: https://gist.github.com/WittmannF/60680723ed8dd0cb993051a7448f7805

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification # Data generators provided by scikit-learn to create synthetic datasets
from sklearn.svm import SVC
import os

# Step size for the mesh grid used in plotting decision boundaries
h = .02

# Define names and corresponding SVM classifiers with different kernels
names = ["Linear SVM", "RBF SVM", "Poly SVM", "Sigmoid SVM"]
classifiers = [
    SVC(kernel="linear", C=0.025),
    SVC(kernel="rbf", gamma=2, C=1),
    SVC(kernel="poly", C=0.025),
    SVC(kernel="sigmoid", gamma=2)]

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)

linearly_separable = (X, y)

# The default number of samples in each dataset is 100 (n_samples=100)
datasets = [make_moons(noise=0.3, random_state=0),                  # Moon shaped dataset
            make_circles(noise=0.2, factor=0.5, random_state=1),    # Circle shaped dataset
            linearly_separable                                      # Linearly separable dataset
            ]

# Set up the overall figure
figure = plt.figure(figsize=(27, 9))
i = 1 # index for sub-plotting figures

# iterate over datasets
for ds in datasets:
    X, y = ds
    X = StandardScaler().fit_transform(X) # Standardize features
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4) # Train-test split

    # Define mesh boundaries for plotting
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Plot raw dataset (first column)
    cm = plt.cm.RdBu # Use predefined colormap
    cm_bright = ListedColormap(['#FF0000', '#0000FF']) # Custom color map with Pure RED and Pure BLUE
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    # iterate through classifiers and visualize results
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train) # Train classifier
        score = clf.score(X_test, y_test) # Accuracy score on test set

        # Plot the decision boundary. For that, we will assign a color to each
        # point in the mesh [x_min, x_max] x [y_min, y_max].
        if hasattr(clf, "decision_function"):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

        # Plot also the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
        # Plot testing points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                   alpha=0.6)

        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        ax.set_title(name)
        # Add a text label on the plot showing classifier's accuracy score (on bottom right corner)
        ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score), size=15, horizontalalignment='right')
        i += 1

# Adjust subplot layout and save figure
figure.subplots_adjust(left=.02, right=.98)
# Saving Figure
fig = os.path.join(os.getcwd(), 'Combined_plots.png')
plt.savefig(fig, dpi=300)
plt.show()        


[Figure: comparison of the four SVM kernels across the three datasets]

Conclusion

In this article, we explored the fundamentals of Support Vector Machines (SVMs), from understanding hyperplanes and margins to the role of support vectors and kernel functions. We discussed the different types of kernels available in scikit-learn and how they help tackle both linear and nonlinear classification problems. Through practical examples, we saw how the choice of kernel can dramatically impact model performance depending on the dataset’s characteristics.

Mastering SVMs and selecting the right kernel is a valuable skill for any machine learning practitioner, especially when dealing with complex or high-dimensional data. With a strong foundation in these concepts, you'll be better equipped to apply SVMs effectively in real-world tasks.

References

1. Lecture by Professor Patrick Winston, MIT 6.034 Artificial Intelligence, Fall 2010, https://www.youtube.com/watch?v=_PwhiWxHK8o&t=29s

2. Aurelien Geron, Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, O'Reilly, 2nd Edition

3. John D. Kelleher, Brian Mac Namee, Aoife D'Arcy, Fundamentals of Machine Learning for Predictive Data Analytics, MIT Press, 2015

4. Sebastian Raschka, Python Machine Learning, Packt Publishing, 2015

5. Scikit-learn API, https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

6. https://gist.github.com/WittmannF/60680723ed8dd0cb993051a7448f7805

7. http://suruchifialoke.com/2016-10-13-machine-learning-tutorial-iris-classification/


More articles by Binayak Bhandari, Ph.D., PE
