MACHINE LEARNING ALGORITHMS - Instance Based (CASE BASED, MEMORY BASED) Part 3 of 12

In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

  • In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
  • In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.

k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.

Both for classification and regression, it can be useful to assign weight to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor.

The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.
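As a minimal illustration of these ideas (the function names are just illustrative, not a prescribed API), k-NN classification, the optional 1/d weighting, and k-NN regression can be sketched in a few lines of NumPy:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3, weighted=False):
    """Classify x_query by a (optionally 1/d-weighted) majority vote of its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    if not weighted:
        return Counter(y_train[nearest]).most_common(1)[0][0]
    votes = Counter()
    for i in nearest:
        votes[y_train[i]] += 1.0 / (dists[i] + 1e-12)   # nearer neighbors contribute more
    return votes.most_common(1)[0][0]

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict the average of the target values of the k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    return y_train[np.argsort(dists)[:k]].mean()
```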

A shortcoming of the k-NN algorithm is that it is sensitive to the local structure of the data. The algorithm has nothing to do with and is not to be confused with k-means, another popular machine learning technique.

We want to know whether the query point can be classified as a plus or a minus sign.

Let's consider the outcome of k-NN based on 1-nearest neighbor. It is clear that in this case k-NN will predict the outcome of the query point with a plus (since the closest point carries a plus sign). Now let's increase the number of nearest neighbors to 2, i.e., 2-nearest neighbors. This time k-NN will not be able to classify the outcome of the query point, since the second closest point is a minus, and so both the plus and the minus signs achieve the same score (i.e., win the same number of votes). For the next step, let's increase the number of nearest neighbors to 5 (5-nearest neighbors). This defines a nearest-neighbor region, indicated by the circle shown in the figure above. Since there are 2 plus signs and 3 minus signs in this circle, k-NN will assign a minus sign to the query point. A small vote-counting sketch of these three cases is shown below.
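To make the vote counting concrete, here is a tiny sketch that only assumes the five nearest neighbors of the query point, ordered from closest to farthest, carry the labels described above (nearest is a plus, second nearest is a minus, and the circle of five contains 2 pluses and 3 minuses):

```python
from collections import Counter

# Labels of the 5 nearest neighbors, ordered from closest to farthest,
# consistent with the example in the figure.
neighbor_labels = ['+', '-', '-', '+', '-']

for k in (1, 2, 5):
    votes = Counter(neighbor_labels[:k]).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        print(f"k={k}: tie, no clear decision")      # one '+' and one '-' at k=2
    else:
        print(f"k={k}: predict {votes[0][0]}")       # '+' at k=1, '-' at k=5
```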

LEARNING VECTOR QUANTIZATION (LVQ)

• Recall that a Kohonen SOM is a clustering technique, which can be used to provide insight into the nature of data. We can transform this unsupervised neural network into a supervised LVQ neural network.


• The network architecture is just like a SOM, but without a topological structure.


• Each output neuron represents a known category (e.g. apple, pear, orange).

• Input vector: x = (x_1, x_2, ..., x_n)

• Weight vector for the jth output neuron: w_j = (w_1j, w_2j, ..., w_nj)
• C_j = Category represented by the jth output neuron. This is pre-assigned.


• T = Correct category for the input vector x

• Define the Euclidean distance between the input vector and the weight vector of the jth neuron as:

D(j) = ∥ x − w_j ∥ = sqrt( Σ_i (x_i − w_ij)² )

LVQ Training Algorithm:

1. Initialize the weight vectors w_j (for example, from the first training vectors of each category) and choose a learning rate α (0 < α < 1).

2. For each training vector x with correct category T, compute the distance D(j) for every output neuron and select the winning neuron J with the smallest distance.

3. If C_J = T (the winner represents the correct category), move the winner toward the input: w_J(new) = w_J(old) + α [x − w_J(old)].

4. If C_J ≠ T (the winner represents the wrong category), move the winner away from the input: w_J(new) = w_J(old) − α [x − w_J(old)].

5. Reduce the learning rate α and repeat steps 2-4 until α becomes negligibly small or a fixed number of epochs has been completed.
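A minimal NumPy sketch of this procedure (illustrative only; the prototype initialization and the learning-rate decay are simple assumed choices):

```python
import numpy as np

def train_lvq1(X, T, prototypes, categories, alpha=0.1, epochs=20, decay=0.9):
    """LVQ1: pull the winning prototype toward correctly classified inputs, push it away otherwise.

    X          -- training inputs, shape (num_samples, n)
    T          -- correct category for each training input
    prototypes -- initial weight vectors w_j, shape (num_neurons, n)
    categories -- pre-assigned category C_j of each output neuron
    """
    W = prototypes.astype(float).copy()
    for _ in range(epochs):
        for x, t in zip(X, T):
            d = np.linalg.norm(W - x, axis=1)      # Euclidean distance D(j) to every neuron
            j = int(np.argmin(d))                  # winning neuron J
            if categories[j] == t:
                W[j] += alpha * (x - W[j])         # correct category: move toward the input
            else:
                W[j] -= alpha * (x - W[j])         # wrong category: move away from the input
        alpha *= decay                             # reduce the learning rate each epoch
    return W

def classify_lvq(x, W, categories):
    """Assign x the category of its nearest prototype."""
    return categories[int(np.argmin(np.linalg.norm(W - x, axis=1)))]
```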

Self-Organizing Map (SOM)

The Self-Organizing Map is one of the most popular neural network models. It belongs to the category of competitive learning networks. The Self-Organizing Map is based on unsupervised learning, which means that no human intervention is needed during the learning and that little needs to be known about the characteristics of the input data. We could, for example, use the SOM for clustering data without knowing the class memberships of the input data. The SOM can be used to detect features inherent to the problem and thus has also been called SOFM, the Self-Organizing Feature Map.

The Self-Organizing Map was developed by Professor Teuvo Kohonen. The SOM has proven useful in many applications.

The SOM algorithm is based on unsupervised, competitive learning. It provides a topology-preserving mapping from the high-dimensional input space to map units. Map units, or neurons, usually form a two-dimensional lattice, so the mapping is a mapping from a high-dimensional space onto a plane. Topology preservation means that the mapping preserves the relative distances between points: points that are near each other in the input space are mapped to nearby map units in the SOM. The SOM can thus serve as a cluster-analysis tool for high-dimensional data. The SOM also has the capability to generalize, meaning that the network can recognize or characterize inputs it has never encountered before: a new input is assimilated with the map unit it is mapped to.

The Self-Organizing Map is a two-dimensional array of neurons:

M = { m_1, m_2, ..., m_(p x q) }

This is illustrated in the Figure. One neuron is a vector called the codebook vector:

m_i = [ m_i1, m_i2, ..., m_in ]

This has the same dimension as the input vectors (n-dimensional). The neurons are connected to adjacent neurons by a neighborhood relation. This dictates the topology, or the structure, of the map. Usually, the neurons are connected to each other via a rectangular or hexagonal topology. In the Figure the topological relations are shown by lines between the neurons.

     Different topologies

One can also define a distance between the map units according to their topology relations. Immediate neighbors (the neurons that are adjacent) belong to the neighborhood N_c of the neuron m_c. The neighborhood function should be a decreasing function of time: N_c = N_c(t). Neighborhoods of different sizes in a hexagonal lattice are illustrated in the Figure below. The smallest hexagon contains all the neighbors belonging to the smallest neighborhood of the neuron in the middle of the hexagonal lattice. The topological relations between the neurons are left out for clarity.

In the basic SOM algorithm, the topological relations and the number of neurons are fixed from the beginning. This number of neurons determines the scale or the granularity of the resulting model. Scale selection affects the accuracy and the generalization capability of the model. It must be taken into account that the generalization and accuracy are contradictory goals. By improving the first, we lose on the second, and vice versa.


Neighborhood of a given winner unit
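A minimal sketch of SOM training on a rectangular lattice (the Gaussian neighborhood function and the exponential decay of the radius and learning rate are common choices assumed here, not prescribed by the text):

```python
import numpy as np

def train_som(X, p=10, q=10, epochs=50, sigma0=3.0, alpha0=0.5):
    """Fit a p x q map of codebook vectors to inputs X of shape (num_samples, n)."""
    rng = np.random.default_rng(0)
    codebook = rng.uniform(X.min(axis=0), X.max(axis=0), size=(p, q, X.shape[1]))
    # Lattice coordinates of every map unit, used for the topological neighborhood
    grid = np.stack(np.meshgrid(np.arange(p), np.arange(q), indexing="ij"), axis=-1)
    for t in range(epochs):
        sigma = sigma0 * np.exp(-t / epochs)       # neighborhood radius N_c(t) shrinks over time
        alpha = alpha0 * np.exp(-t / epochs)       # learning rate shrinks over time
        for x in X:
            # Winner (best matching unit): the codebook vector closest to the input
            d = np.linalg.norm(codebook - x, axis=-1)
            winner = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighborhood centred on the winner, measured on the lattice
            grid_dist2 = np.sum((grid - np.array(winner)) ** 2, axis=-1)
            h = np.exp(-grid_dist2 / (2 * sigma ** 2))
            # Pull every unit toward the input, weighted by its neighborhood value,
            # so nearby map units end up representing nearby points in input space
            codebook += alpha * h[..., None] * (x - codebook)
    return codebook
```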

Locally Weighted Regression

 

Locally weighted regression (LWR) tries to alleviate the shortcomings of fitting a single global line by assigning weights to the training data. Weights are larger for the data points that are closer to the point you are trying to predict, hence "local" in the name. Another thing to note: LWR requires the entire data set every time you make a prediction, which makes it much more computationally expensive than simple linear regression. We have to do this because every time we make a prediction we construct a regression line that is local to the data point of interest.

 

The LWR model is very similar to the simple regression model; the only difference is that we introduce a weight matrix W. Once we have the weight matrix, we can find the model parameters as follows:

β = (X′WX)⁻¹ X′WY

To get the prediction we multiply the betas with our input X0: ŷ = X0 β.
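Assuming W has already been built (see the kernel section below), the closed-form solution can be written directly in NumPy; the function names here are just illustrative:

```python
import numpy as np

def lwr_beta(X, Y, W):
    """Solve beta = (X'WX)^-1 X'WY for a single query point, given its diagonal weight matrix W."""
    XtW = X.T @ W
    return np.linalg.solve(XtW @ X, XtW @ Y)   # solve the normal equations instead of inverting explicitly

def lwr_predict(x0, X, Y, W):
    """Prediction at query row x0: y_hat = x0 @ beta."""
    return x0 @ lwr_beta(X, Y, W)
```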

Kernel Smoothing To Find Weights

Kernel smoothing allows us to smooth out our point of interest (X0) using nearby data points (X). The kernel is usually defined by a function D whose values decrease as the distance between X0 and X increases:

K_λ(X0, X) = D( ∥X − X0∥ / h_λ(X0) )

where h_λ(X0) is a bandwidth controlling how quickly the weights fall off. To construct the matrix W we need to evaluate the kernel for each training input X and the value we are trying to predict, X0. The weight matrix is diagonal; all non-diagonal cells are 0. After running the kernel on our data set we will essentially have a weight matrix whose weights decrease as the distance between X0 and X increases. This kernel property makes nearby data points more important when solving the linear regression.

Gaussian Kernel

I am going to use the Gaussian kernel function for illustration purposes:

D = a e^( −∥X − X0∥² / c )
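A small sketch of how the diagonal weight matrix W could be built from this kernel (a = 1 by default; the bandwidth c is a free choice, playing the role of h_λ above):

```python
import numpy as np

def gaussian_weights(X, x0, c=1.0, a=1.0):
    """Diagonal weight matrix: one Gaussian kernel value per training row, centred on the query x0."""
    d2 = np.sum((X - x0) ** 2, axis=1)      # squared distances ||X - x0||^2 to every training point
    return np.diag(a * np.exp(-d2 / c))     # nearby points get weights near a, distant points near 0
```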

Data Set

To demonstrate LWR we are going to use the same Iris data set we used in my previous post on linear regression. We will try to estimate sepal length from its width.
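Putting the pieces together on the Iris data (using scikit-learn's bundled copy; the helpers are the illustrative ones sketched above, and the query width of 3.2 cm is an arbitrary example):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
width = iris.data[:, 1]                                   # sepal width (cm)  -> input feature
length = iris.data[:, 0]                                  # sepal length (cm) -> target
X = np.column_stack([np.ones_like(width), width])         # add an intercept column

x0 = np.array([1.0, 3.2])                                 # query point: sepal width of 3.2 cm
W = gaussian_weights(X, x0, c=0.5)                        # local weights from the Gaussian kernel
print("predicted sepal length:", lwr_predict(x0, X, length, W))
```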

Ref : http://homepages.inf.ed.ac.uk/svijayak/publications/schaal-ICRA2000.pdf

