K-Means Clustering Algorithm
What is clustering?
Clustering is the process of grouping objects so that those in the same group (cluster) are more similar to each other than to objects in other groups.
Why would we want to cluster?
Clustering is a core task in data analysis and data mining. Because it groups objects without needing predefined labels, it lets you explore and summarize a data set, revealing structure that would otherwise stay hidden.
How would you determine clusters?
Probably the best-known method is the elbow method: the within-cluster sum of squares is computed for each candidate number of clusters and plotted, and the user looks for the point where the slope changes from steep to shallow (the "elbow") to choose the optimal number of clusters.
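As a quick sketch of the elbow method, the loop below computes the within-cluster sum of squares (which scikit-learn exposes as `inertia_`) for a range of k values. The data set, seed, and k range here are all illustrative, not taken from the article:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: two well-separated blobs (not the article's data set)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Within-cluster sum of squares (sklearn calls it inertia) for k = 1..10
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Plotting inertias against k shows a sharp drop up to the "elbow"
# (here around k = 2), then only a shallow decline afterwards.
print(inertias)
```

Plotting `inertias` against `range(1, 11)` gives the elbow curve described above.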
How the K-means algorithm works
To process the learning data, the K-means algorithm starts with a first group of randomly selected centroids, which serve as the starting points for every cluster, and then performs iterative (repeated) calculations to optimize the positions of the centroids.
It stops creating and optimizing clusters when either:
- The centroids have stabilized: their values no longer change between iterations, so the clustering has converged.
- The defined number of iterations has been reached.
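The loop and both stopping conditions above can be seen in a minimal NumPy sketch (an illustration of the idea only, not scikit-learn's actual implementation; the function name and defaults are made up):

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal Lloyd-style K-means loop, for illustration only."""
    rng = np.random.default_rng(seed)
    # Start from k randomly selected data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):  # cap on the number of iterations
        # Assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop early once the centroids have stabilized
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

In practice you would use scikit-learn's `KMeans`, which adds smarter initialization and handles edge cases such as empty clusters.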
Wikipedia says!
k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both k-means and Gaussian mixture modeling. They both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the Gaussian mixture model allows clusters to have different shapes.
Uses!
K-means is used to find groups that have not been explicitly labeled in the data. This can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets. Once the algorithm has been run and the groups are defined, any new data can be easily assigned to the correct group.
This is a versatile algorithm that can be used for any type of grouping. Some examples of use cases are:
- Behavioral segmentation:
  - Segment by purchase history
  - Segment by activities on application, website, or platform
  - Define personas based on interests
  - Create profiles based on activity monitoring
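Once a model has been fitted to segmentation data like this, new records are assigned to the nearest centroid with `predict`. A minimal sketch on made-up two-feature "customer" data (the features and values are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical two-feature customer data forming two groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# A new, unseen record is assigned to the cluster with the nearest centroid
new_customer = np.array([[5.5, 6.2]])
print(kmeans.predict(new_customer))
```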
The Most Awaited Part: The K-means Algorithm in Code
Step 1: Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline
Step 2: Generate random data
We will generate random numbers (data) with the help of NumPy in Python.

X = -2 * np.random.rand(100, 2)
X1 = 1 + 2 * np.random.rand(50, 2)
X[50:100, :] = X1
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.show()
This produces a scatter plot of the generated data, similar to the sample output shown here.
Step 3: Use Scikit-Learn and find the centroid
from sklearn.cluster import KMeans
Kmean = KMeans(n_clusters=2)
Kmean.fit(X)
Kmean.cluster_centers_
Step 4: Display
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
# Centroid coordinates taken from Kmean.cluster_centers_ (they will differ per run)
plt.scatter(-0.94665068, -0.97138368, s=200, c='g', marker='s')
plt.scatter(2.01559419, 2.02597093, s=200, c='r', marker='s')
plt.show()

Here is some sample output.
Here is the full code for future reference

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline

X = -2 * np.random.rand(100, 2)
X1 = 1 + 2 * np.random.rand(50, 2)
X[50:100, :] = X1
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.show()

Kmean = KMeans(n_clusters=2)
Kmean.fit(X)
Kmean.cluster_centers_

plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
# Centroid coordinates taken from Kmean.cluster_centers_ (they will differ per run)
plt.scatter(-0.94665068, -0.97138368, s=200, c='g', marker='s')
plt.scatter(2.01559419, 2.02597093, s=200, c='r', marker='s')
plt.show()