Machine Learning Algorithm - Deep Learning (Part 5 of 12)

Deep Boltzmann Machine (DBM):

We present a new learning algorithm for Boltzmann machines that contain many layers of hidden variables. Data-dependent statistics are estimated using a variational approximation that tends to focus on a single mode, and data-independent statistics are estimated using persistent Markov chains. The use of two quite different techniques for estimating the two types of statistic that enter into the gradient of the log likelihood makes it practical to learn Boltzmann machines with multiple hidden layers and millions of parameters. The learning can be made more efficient by using a layer-by-layer pretraining phase that initializes the weights sensibly. The pretraining also allows the variational inference to be initialized sensibly with a single bottom-up pass. We present results on the MNIST and NORB data sets showing that deep Boltzmann machines learn very good generative models of handwritten digits and 3D objects.

We also show that the features discovered by deep Boltzmann machines are a very effective way to initialize the hidden layers of feed forward neural nets, which are then discriminatively fine-tuned.

We then show how to make our learning procedure for general Boltzmann machines considerably more efficient for deep Boltzmann machines (DBMs) that have many hidden layers but no connections within each layer and no connections between nonadjacent layers. The weights of a DBM can be initialized by training a stack of RBMs, but with a modification that ensures that the resulting composite model is a Boltzmann machine rather than a deep belief net (DBN). This pretraining method has the added advantage that it provides a fast, bottom-up inference procedure for initializing the mean-field inference. We use the MNIST and NORB data sets to demonstrate that DBMs learn very good generative models of images of handwritten digits and 3D objects. Although this article is primarily about learning generative models, we also show that the weights learned by these models can be used to initialize deep feed forward neural networks.

These feed forward networks can then be fine-tuned using back propagation to give much better discriminative performance than randomly initialized networks.
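
As a rough illustration of the mean-field (variational) inference described above, the sketch below runs the fixed-point updates for a DBM with two hidden layers, initialized with a single bottom-up pass. This is a minimal NumPy sketch, not the authors' implementation; biases are omitted and the function names are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbm_mean_field(v, W1, W2, n_steps=10):
    """Mean-field (variational) inference for a DBM with two hidden layers.

    v  : (n_visible,) input vector
    W1 : (n_visible, n_h1) visible-to-hidden-1 weights
    W2 : (n_h1, n_h2) hidden-1-to-hidden-2 weights
    Returns the variational posteriors q(h1), q(h2). Biases are omitted.
    """
    # Single bottom-up pass to initialize the variational parameters
    q_h1 = sigmoid(v @ W1)
    q_h2 = sigmoid(q_h1 @ W2)
    # Coordinate-ascent (fixed-point) updates
    for _ in range(n_steps):
        q_h1 = sigmoid(v @ W1 + q_h2 @ W2.T)  # h1 receives input from v and h2
        q_h2 = sigmoid(q_h1 @ W2)             # h2 receives input from h1 only
    return q_h1, q_h2
```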

 Ref. : http://www.jmlr.org/proceedings/papers/v5/salakhutdinov09a/salakhutdinov09a.pdf

Deep Belief Networks (DBN):

In machine learning, a deep belief network (DBN) is a generative graphical model, or alternatively a type of deep neural network, composed of multiple layers of latent variables ("hidden units"), with connections between the layers but not between units within each layer.

When trained on a set of examples in an unsupervised way, a DBN can learn to probabilistically reconstruct its inputs. The layers then act as feature detectors on inputs. After this learning step, a DBN can be further trained in a supervised way to perform classification.

DBNs can be viewed as a composition of simple, unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders, where each sub-network's hidden layer serves as the visible layer for the next. This also leads to a fast, layer-by-layer unsupervised training procedure, where contrastive divergence is applied to each sub-network in turn, starting from the "lowest" pair of layers (the lowest visible layer being a training set).

The observation, due to Yee-Whye Teh, Geoffrey Hinton's student, that DBNs can be trained greedily, one layer at a time, led to one of the first effective deep learning algorithms.

Training algorithm

The training algorithm for DBNs proceeds as follows (a minimal code sketch follows the list). Let X be a matrix of inputs, regarded as a set of feature vectors.

  1. Train a restricted Boltzmann machine on X to obtain its weight matrix, W. Use this as the weight matrix between the lower two layers of the network.
  2. Transform X by the RBM to produce new data X', either by sampling or by computing the mean activation of the hidden units.
  3. Repeat this procedure with X' for the next pair of layers, until the top two layers of the network are reached.
  4. Fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-likelihood, or with respect to a supervised training criterion (after adding extra learning machinery to convert the learned representation into supervised predictions, e.g. a linear classifier).
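
Below is a minimal NumPy sketch of steps 1-3 (greedy layer-wise pretraining with CD-1 as the RBM training rule). It omits momentum, weight decay, mini-batching, and the supervised fine-tuning of step 4, and the function names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(X, n_hidden, lr=0.1, epochs=10):
    """Step 1: train an RBM on X with one step of contrastive divergence (CD-1)."""
    n_samples, n_visible = X.shape
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_visible)
    for _ in range(epochs):
        # Positive phase: hidden probabilities and a sample given the data
        p_h = sigmoid(X @ W + b_h)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        # Negative phase: one Gibbs step back to the visibles and up again
        p_v = sigmoid(h @ W.T + b_v)
        p_h_neg = sigmoid(p_v @ W + b_h)
        # CD-1 parameter updates
        W += lr * (X.T @ p_h - p_v.T @ p_h_neg) / n_samples
        b_h += lr * (p_h - p_h_neg).mean(axis=0)
        b_v += lr * (X - p_v).mean(axis=0)
    # Step 2: transform X using the mean activation of the hidden units
    return W, sigmoid(X @ W + b_h)

def pretrain_dbn(X, hidden_sizes):
    """Step 3: repeat the procedure with X' for each successive pair of layers."""
    weights, data = [], X
    for n_hidden in hidden_sizes:
        W, data = train_rbm(data, n_hidden)
        weights.append(W)
    return weights

# Example: weights = pretrain_dbn(X, [500, 500, 2000])
```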

Convolutional Neural Network

A Convolutional Neural Network (CNN) is comprised of one or more convolutional layers (often with a subsampling step) followed by one or more fully connected layers as in a standard multilayer neural network. The architecture of a CNN is designed to take advantage of the 2D structure of an input image (or other 2D input such as a speech signal). This is achieved with local connections and tied weights followed by some form of pooling, which results in translation-invariant features. Another benefit of CNNs is that they are easier to train and have many fewer parameters than fully connected networks with the same number of hidden units. In this article we will discuss the architecture of a CNN and the back propagation algorithm used to compute the gradient with respect to the parameters of the model in order to use gradient-based optimization. See the respective tutorials on convolution and pooling for more details on those specific operations.

Architecture

A CNN consists of a number of convolutional and subsampling layers optionally followed by fully connected layers. The input to a convolutional layer is an m x m x r image, where m is the height and width of the image and r is the number of channels; e.g., an RGB image has r = 3. The convolutional layer will have k filters (or kernels) of size n x n x q, where n is smaller than the dimension of the image and q can either be the same as the number of channels r or smaller, and may vary for each kernel. The size of the filters gives rise to the locally connected structure; each filter is convolved with the image, producing k feature maps of size (m - n + 1) x (m - n + 1). Each map is then subsampled, typically with mean or max pooling over p x p contiguous regions, where p ranges between 2 for small images (e.g. MNIST) and is usually not more than 5 for larger inputs. Either before or after the subsampling layer, an additive bias and a sigmoidal nonlinearity are applied to each feature map. The figure below illustrates a full layer in a CNN consisting of convolutional and subsampling sublayers. Units of the same color have tied weights.

Figure: First layer of a convolutional neural network with pooling. Units of the same color have tied weights and units of a different color represent different filter maps.

After the convolutional layers there may be any number of fully connected layers. The densely connected layers are identical to the layers in a standard multilayer neural network.
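
To make the sizes above concrete, here is a minimal NumPy sketch of a single convolutional + mean-pooling layer for a one-channel (r = 1) image. It uses naive loops rather than an optimized convolution, applies the bias and sigmoid after the convolution, and the function names are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_pool_layer(image, filters, bias, p=2):
    """One convolutional + mean-pooling layer for a single-channel image.

    image   : (m, m) input
    filters : (k, n, n) bank of k filters
    bias    : (k,) additive bias, one per feature map
    Returns k pooled maps of size ((m - n + 1) // p, (m - n + 1) // p).
    """
    m = image.shape[0]
    k, n, _ = filters.shape
    fm = m - n + 1                               # feature map size: m - n + 1
    maps = np.zeros((k, fm, fm))
    for f in range(k):                           # naive "valid" convolution
        for i in range(fm):
            for j in range(fm):
                maps[f, i, j] = np.sum(image[i:i + n, j:j + n] * filters[f])
    maps = sigmoid(maps + bias[:, None, None])   # additive bias + sigmoid nonlinearity
    pm = fm // p                                 # mean pooling over p x p regions
    pooled = maps[:, :pm * p, :pm * p].reshape(k, pm, p, pm, p).mean(axis=(2, 4))
    return pooled
```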

Back Propagation

Let δ^(l+1) be the error term for the (l+1)-st layer in the network, with a cost function J(W,b;x,y) where (W,b) are the parameters and (x,y) are the training data and label pairs. If the l-th layer is densely connected to the (l+1)-st layer, then the error for the l-th layer is computed as

δ^(l) = ((W^(l))^T δ^(l+1)) ∙ f′(z^(l))

and the gradients are

∇_{W^(l)} J(W,b;x,y) = δ^(l+1) (a^(l))^T

∇_{b^(l)} J(W,b;x,y) = δ^(l+1)
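
A small NumPy sketch of these two equations for a single training example follows; the helper name and the parameter layout (forward pass z^(l+1) = W^(l) a^(l) + b^(l)) are assumptions for illustration.

```python
import numpy as np

def dense_layer_backprop(W_l, delta_next, z_l, a_l, fprime):
    """Backpropagation through a densely connected layer (one example).

    Assumes the forward pass is z^(l+1) = W^(l) a^(l) + b^(l).
    delta_next : error term delta^(l+1) from the layer above
    z_l, a_l   : pre-activation and activation of layer l
    fprime     : derivative of the activation function
    """
    delta_l = (W_l.T @ delta_next) * fprime(z_l)   # delta^(l)
    grad_W = np.outer(delta_next, a_l)             # grad wrt W^(l): delta^(l+1) (a^(l))^T
    grad_b = delta_next                            # grad wrt b^(l): delta^(l+1)
    return delta_l, grad_W, grad_b
```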

If the l-th layer is a convolutional and subsampling layer then the error is propagated through as

δ_k^(l) = upsample((W_k^(l))^T δ_k^(l+1)) ∙ f′(z_k^(l))

where k indexes the filter number and f′(z_k^(l)) is the derivative of the activation function. The upsample operation has to propagate the error through the pooling layer by calculating the error with respect to each unit incoming to the pooling layer. For example, if we have mean pooling, then upsample simply distributes the error for a single pooling unit uniformly among the units which feed into it in the previous layer. In max pooling, the unit which was chosen as the max receives all the error, since very small changes in the input would perturb the result only through that unit.
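
The sketch below shows one plausible NumPy implementation of upsample for both pooling types: for mean pooling the error is divided evenly over each p x p block, and for max pooling it is routed to the argmax unit found in the pre-pooling feature maps. Function names and array layouts are assumptions.

```python
import numpy as np

def upsample_mean(delta_pooled, p):
    """Spread each pooled unit's error evenly over its p x p input region."""
    up = np.repeat(np.repeat(delta_pooled, p, axis=1), p, axis=2)
    return up / (p * p)   # divide so the total error per region is preserved

def upsample_max(delta_pooled, pre_pool_maps, p):
    """Route each pooled unit's error entirely to the unit that was the max."""
    k, h, w = delta_pooled.shape
    up = np.zeros_like(pre_pool_maps)
    for f in range(k):
        for i in range(h):
            for j in range(w):
                patch = pre_pool_maps[f, i * p:(i + 1) * p, j * p:(j + 1) * p]
                r, c = np.unravel_index(np.argmax(patch), patch.shape)
                up[f, i * p + r, j * p + c] = delta_pooled[f, i, j]
    return up
```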

Finally, to calculate the gradient with respect to the filter maps, we rely on the border-handling ("valid") convolution operation again and flip the error matrix δ_k^(l) the same way we flip the filters in the convolutional layer:

∇_{W_k^(l)} J(W,b;x,y) = Σ_i a_i^(l) ∗ rot90(δ_k^(l+1), 2)

∇_{b_k^(l)} J(W,b;x,y) = Σ_{a,b} (δ_k^(l+1))_{a,b}

where a_i^(l) is the i-th input to the l-th layer, and a^(1) is the input image. The operation a_i^(l) ∗ rot90(δ_k^(l+1), 2) is the "valid" convolution between the i-th input in the l-th layer and the error with respect to the k-th filter.
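
Assuming SciPy is available, the filter and bias gradients for one filter k can be sketched as follows; scipy.signal.convolve2d with mode='valid' plays the role of the border-handling convolution, and the function name is illustrative.

```python
import numpy as np
from scipy.signal import convolve2d

def filter_gradients(a_prev_maps, delta_k):
    """Gradients of J with respect to the k-th filter and its bias.

    a_prev_maps : list of input maps a_i^(l) feeding the convolutional layer
    delta_k     : error map delta_k^(l+1) for the k-th filter
    """
    # grad wrt W_k^(l): sum_i a_i^(l) * rot90(delta_k^(l+1), 2), a "valid" convolution
    grad_W = sum(convolve2d(a_i, np.rot90(delta_k, 2), mode='valid')
                 for a_i in a_prev_maps)
    # grad wrt b_k^(l): sum of the error map over all positions (a, b)
    grad_b = delta_k.sum()
    return grad_W, grad_b
```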

Stacked Auto-Encoders:

The greedy layer-wise approach for pretraining a deep network works by training each layer in turn. On this page, you will find out how autoencoders can be "stacked" in a greedy layer-wise fashion for pretraining (initializing) the weights of a deep network.

A stacked autoencoder is a neural network consisting of multiple layers of sparse autoencoders in which the outputs of each layer are wired to the inputs of the successive layer. Formally, consider a stacked autoencoder with n layers. Using notation from the autoencoder section, let W^(k,1), W^(k,2), b^(k,1), b^(k,2) denote the parameters W^(1), W^(2), b^(1), b^(2) for the k-th autoencoder. Then the encoding step for the stacked autoencoder is given by running the encoding step of each layer in forward order:

a^(l) = f(z^(l))

z^(l+1) = W^(l,1) a^(l) + b^(l,1)

The decoding step is given by running the decoding stack of each autoencoder in reverse order:

a^(n+l) = f(z^(n+l))

z^(n+l+1) = W^(n-l,2) a^(n+l) + b^(n-l,2)

The information of interest is contained within a^(n), which is the activation of the deepest layer of hidden units. This vector gives us a representation of the input in terms of higher-order features.

The features from the stacked autoencoder can be used for classification problems by feeding a^(n) to a softmax classifier.
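
As a minimal sketch (assuming sigmoid activations and encoding parameters taken from the pretrained stack), the forward encoding pass and the softmax readout on a^(n) might look like this; the function names are illustrative, not from the reference.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(x, encoder_params):
    """Run the encoding step of each autoencoder in forward order.

    encoder_params : list of (W1, b1) pairs, the encoding weights of the
                     pretrained stack; returns a^(n), the deepest activation.
    """
    a = x
    for W1, b1 in encoder_params:
        a = sigmoid(W1 @ a + b1)   # z^(l+1) = W^(l,1) a^(l) + b^(l,1)
    return a

def softmax_predict(x, encoder_params, W_soft, b_soft):
    """Feed a^(n) to a softmax classifier to obtain class probabilities."""
    a_n = encode(x, encoder_params)
    scores = W_soft @ a_n + b_soft
    scores -= scores.max()                        # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs
```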

Ref : http://ir.lib.uwo.ca/cgi/viewcontent.cgi?article=3503&context=etd
