How to transform a supervised problem into an unsupervised one?
Supervised learning uses labeled data, i.e. inputs paired with known outputs, while unsupervised learning works with the inputs only. Supervised problems are generally easier to tackle than unsupervised ones, so this is just an example to show the relationship between the two.
Here we use MNIST, a dataset of 28x28-pixel images of handwritten digits; the goal is to predict the target, i.e. the digit, by looking at the image. For a quick preview, the first snippet displays a few samples from scikit-learn's smaller built-in digits dataset (8x8 pixels).
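To make the distinction concrete, here is a minimal sketch of how the two settings differ in scikit-learn. It assumes the small built-in digits dataset, and the choice of LogisticRegression as the supervised model is only an illustration, not something the rest of this article depends on.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)

# Supervised: the estimator is given both the inputs X and the labels y.
classifier = LogisticRegression(max_iter=5000)
classifier.fit(X, y)

# Unsupervised: the estimator is given only X and has to find structure itself.
clusterer = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(X)

print(classifier.predict(X[:5]))  # predicted digits
print(cluster_ids[:5])            # cluster IDs, numbered arbitrarily from 0 to 9
```

The only difference in the calls is whether `y` is passed in; everything that follows in the article builds on that difference.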
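import matplotlib.pyplot as plt
from sklearn import datasets

# Load scikit-learn's built-in 8x8 digits and show the first ten images with their labels.
digits = datasets.load_digits()
_, axes = plt.subplots(nrows=1, ncols=10, figsize=(15, 6))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"Label: {label}")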
Let's reduce the dimensionality with t-distributed Stochastic Neighbor Embedding (t-SNE), projecting the first 4000 digits down to two dimensions, and plot the result as a scatter plot.
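import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import manifold

# Download MNIST from OpenML; with pandas installed this returns a DataFrame and a Series.
data = datasets.fetch_openml('mnist_784', version=1, return_X_y=True)
pixel_values, targets = data
targets = targets.astype('int')
# Embed the first 4000 images into two dimensions.
tsne = manifold.TSNE(init='random', learning_rate='auto', n_components=2, random_state=42)
transformed_data = tsne.fit_transform(pixel_values.iloc[:4000, :].values)
# Use the same 4000 rows for the labels so both arrays have matching lengths.
tsne_df = pd.DataFrame(np.column_stack((transformed_data, targets.iloc[:4000].values)),
                       columns=["x", "y", "targets"])
tsne_df["targets"] = tsne_df["targets"].astype(int)
tsne_df.head(10)
grid = sns.FacetGrid(tsne_df, hue="targets", height=8)
grid.map(plt.scatter, "x", "y").add_legend()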
The plot above shows the data colored by label, i.e. the supervised view.
# Same embedding, drawn without coloring the points by label.
grid = sns.FacetGrid(tsne_df, height=8)
grid.map(plt.scatter, "x", "y")
This second plot shows the same data without labels.
Some clusters are still visible in the data. Let's apply an unsupervised clustering algorithm to it and see which clusters it finds.
from sklearn.cluster import KMeans

# Keep only the two t-SNE coordinates; the true labels are deliberately left out.
df = pd.DataFrame(
    np.column_stack((transformed_data,)),
    columns=["x", "y"]
)
# Ask k-means for ten clusters, one per digit.
kmeans = KMeans(n_clusters=10)
label = kmeans.fit_predict(df)
df_with_target = pd.DataFrame(
    np.column_stack((transformed_data, label)),
    columns=["x", "y", "targets"]
)
df_with_target["targets"] = df_with_target["targets"].astype(int)
grid = sns.FacetGrid(df_with_target, hue="targets", height=8)
grid.map(plt.scatter, "x", "y").add_legend()
There is one drawback: the cluster IDs assigned by k-means are arbitrary and don't correspond to the actual digits, so each cluster has to be relabeled after analyzing some of its samples.
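One possible way to do that relabeling is a majority vote: give each cluster the digit that occurs most often among its members. The sketch below assumes true labels are available for at least some samples. The helper name `map_clusters_to_labels` and the toy arrays are made up for illustration.

```python
import numpy as np

def map_clusters_to_labels(cluster_ids, true_labels):
    """Return a dict mapping each cluster ID to the most common true label in it."""
    mapping = {}
    for cluster in np.unique(cluster_ids):
        members = true_labels[cluster_ids == cluster]
        # bincount counts how often each non-negative integer label occurs.
        mapping[int(cluster)] = int(np.bincount(members).argmax())
    return mapping

# Toy data (hypothetical): three clusters whose IDs do not match the digits.
cluster_ids = np.array([2, 2, 2, 0, 0, 1, 1, 1])
true_labels = np.array([7, 7, 1, 3, 3, 1, 1, 7])

mapping = map_clusters_to_labels(cluster_ids, true_labels)
relabeled = np.array([mapping[c] for c in cluster_ids])
print(mapping)    # {0: 3, 1: 1, 2: 7}
print(relabeled)  # [7 7 7 3 3 1 1 1]
```

With the variables from this article, the same helper could be called on the k-means output `label` and the first 4000 entries of `targets` converted to a NumPy array. Note that two clusters can end up mapped to the same digit if k-means splits one digit across several clusters.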
The k-means clustering algorithm is able to find the clusters, but its accuracy may be lower than that of a supervised algorithm, because a supervised model is trained on labeled data. This is how a supervised learning problem can be approached with an unsupervised algorithm.
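To put a number on how well the clusters agree with the real digits, one option is the adjusted Rand index, which does not care how the clusters are numbered. This is a minimal sketch on scikit-learn's small built-in digits dataset rather than on the t-SNE embedding above. The `n_init` and `random_state` values are arbitrary choices for reproducibility.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score

X, y = load_digits(return_X_y=True)
cluster_ids = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

# ARI is 1.0 for a perfect match and close to 0.0 for random assignments;
# renumbering the clusters leaves the score unchanged.
score = adjusted_rand_score(y, cluster_ids)
print(f"Adjusted Rand index: {score:.3f}")
```

A score well below 1.0 here would be consistent with the point above: without labels during training, clustering typically recovers the digits only approximately.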