Image Captioning

Hamza Gabajiwala

Published Jul 21, 2019

Author: Hamza Gabajiwala, BTech Integrated, MPSTME, NMIMS

Under the guidance of Dr. Seema Shah

Deep learning is a widely used field, and as the technology progresses it keeps making tasks such as the following easier:

Colorization of black-and-white images.
Adding sounds to silent movies.
Automatic machine translation.
Object classification in photographs.
Automatic handwriting generation.
Character text generation.
Image caption generation.
Automatic game playing.

In this article, we focus on automatic image caption generation: given an image, the system must generate a caption that describes its contents. Once the system has detected the set of objects that define the image, the next step is to combine their labels into a coherent sentence. Captions of this kind can also stand in for an image on platforms such as Instagram and Facebook when the image cannot be displayed, for example because of a slow internet connection.

Why do we use image captioning?

Image captions are used in a variety of applications:

In web development, it’s a good practice to provide a description for any image that appears on the page so that an image can be read or heard as opposed to just being seen. This makes the web content accessible.
It can be used to describe video in real time.
It can be used to describe images to people who are blind or have low vision and who rely on sounds and texts to describe a scene.

An image captioning model relies on two main components:

A convolutional neural network (CNN), which excels at preserving spatial information in images.
A recurrent neural network (RNN), which works well with any kind of sequential data, such as a sequence of words.

By merging the two, you get a model that can find patterns in images and then use that information to generate descriptions of those images.

Convolutional Neural Network

Image captioning builds on the convolutional neural network (CNN), a class of deep neural networks most commonly applied to analyzing visual imagery. A CNN consists of an input layer and an output layer, with multiple hidden layers in between. The hidden layers typically include a series of convolutional layers, each of which convolves its input with a kernel using multiplication or another dot product. When programming a CNN, each convolutional layer should have the following attributes:

An input tensor with shape [number of images] x [image width] x [image height] x [image depth].
Convolutional kernels whose width and height are hyperparameters and whose depth must equal that of the image.
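
To make the two attributes above concrete, here is a minimal sketch in plain Python of a single convolutional kernel sliding over a batch of images. The function name conv2d_single, the toy sizes, and the constant pixel and weight values are all assumptions chosen for illustration. It computes the sliding dot product that deep learning libraries call convolution, with no padding and a stride of 1.

```python
def conv2d_single(image, kernel):
    """Slide one kernel over one image (no padding, stride 1).

    image:  nested list with shape [width][height][depth]
    kernel: nested list with shape [k_width][k_height][depth]
    Returns a 2-D feature map with shape
    [width - k_width + 1][height - k_height + 1].
    """
    width, height, depth = len(image), len(image[0]), len(image[0][0])
    k_width, k_height, k_depth = len(kernel), len(kernel[0]), len(kernel[0][0])
    if k_depth != depth:
        raise ValueError("kernel depth must equal image depth")

    out_width = width - k_width + 1
    out_height = height - k_height + 1
    feature_map = [[0.0] * out_height for _ in range(out_width)]
    for x in range(out_width):
        for y in range(out_height):
            # Dot product between the kernel and the image patch under it.
            total = 0.0
            for i in range(k_width):
                for j in range(k_height):
                    for c in range(depth):
                        total += image[x + i][y + j][c] * kernel[i][j][c]
            feature_map[x][y] = total
    return feature_map


# Input tensor: 2 images x width 5 x height 4 x depth 3, every value 1.0.
batch = [[[[1.0] * 3 for _ in range(4)] for _ in range(5)] for _ in range(2)]
# One kernel of width 2, height 2, depth 3, every weight 0.5.
kernel = [[[0.5] * 3 for _ in range(2)] for _ in range(2)]

feature_maps = [conv2d_single(image, kernel) for image in batch]
```

Each image yields a 4 x 3 feature map here, because a kernel of width 2 and height 2 fits into a 5 x 4 image in 4 x 3 positions. A real convolutional layer would apply many such kernels and learn their weights during training.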

What is image captioning?

Image captioning is the process of generating a short textual description of an image. It draws on both natural language processing and computer vision to generate these captions.

The data set for image captioning takes the form image -> captions: it consists of input images and their corresponding output captions.
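
As a rough illustration of the image -> captions form, the sketch below stores a tiny data set as a Python dictionary and turns each caption into a list of word ids. The file names, the captions, and the helper names build_vocabulary and encode_caption are hypothetical. The [START] and [END] markers are the ones used in the Training section of this article.

```python
# Hypothetical file names and captions, made up purely for illustration.
dataset = {
    "dog.jpg": ["a dog runs on the grass", "a brown dog is running"],
    "beach.jpg": ["people walk on the beach"],
}

START, END = "[START]", "[END]"


def build_vocabulary(captions_by_image):
    """Map every distinct word (plus the two markers) to an integer id."""
    words = {
        word
        for captions in captions_by_image.values()
        for caption in captions
        for word in caption.split()
    }
    tokens = [START, END] + sorted(words)
    return {token: index for index, token in enumerate(tokens)}


def encode_caption(caption, vocabulary):
    """Wrap a caption in [START] and [END] and convert it to a list of ids."""
    tokens = [START] + caption.split() + [END]
    return [vocabulary[token] for token in tokens]


vocabulary = build_vocabulary(dataset)

# One (image, encoded caption) pair per caption, so an image that has
# several captions contributes several training examples.
pairs = [
    (image, encode_caption(caption, vocabulary))
    for image, captions in dataset.items()
    for caption in captions
]
```

Splitting on whitespace is the simplest possible tokenizer. A real pipeline would usually also lowercase the text, strip punctuation, and handle rare words.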

Encoding

A convolutional neural network can be thought of as an encoder. The input image is passed through the network to extract its features, and the output of the network's last hidden layer is connected to the decoder.

Decoding

The decoder is a recurrent neural network that performs language modeling at the word level. Its first time step receives the encoded output from the encoder together with the [START] vector.
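
The first decoder step can be sketched with a plain (vanilla) RNN cell. This is a toy sketch under stated assumptions, not the exact model: it assumes the encoder output is used as the initial hidden state, which is one common way to wire the two networks together, and every size and weight value below is invented for illustration.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, bias):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_prev + b)."""
    h_next = []
    for i in range(len(h_prev)):
        total = bias[i]
        total += sum(w_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(w_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_next.append(math.tanh(total))
    return h_next


# Toy values, all assumptions for illustration.
image_features = [0.2, -0.1, 0.4]  # stands in for the encoder output
start_vector = [1.0, 0.0]          # stands in for the [START] word vector
w_xh = [[0.5, -0.5], [0.1, 0.3], [0.0, 0.2]]  # hidden size x input size
w_hh = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
bias = [0.0, 0.0, 0.0]

# First time step: the hidden state starts from the image features and the
# input is the [START] vector.
h1 = rnn_step(start_vector, image_features, w_xh, w_hh, bias)
```

In a full decoder, h1 would be passed through an output layer that scores every word in the vocabulary, and then carried forward as the hidden state for the next time step.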

Training

The output of the last hidden layer of the convolutional neural network is given to the first time step of the decoder. We set x1 = [START] vector and the desired label y1 = first word in the sequence. Analogously, we set x2 = word vector of the first word and expect the network to predict the second word. Finally, on the last step, xT = word vector of the last word and the target label yT = [END] token. During testing, the output of the decoder at time t is fed back and becomes the input of the decoder at time t+1.
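
The input and target alignment described above, and the feedback loop used at test time, can be sketched as follows. The helper names training_pairs and generate are hypothetical, and toy_decoder is a fixed lookup table that stands in for a trained decoder. Unlike a real decoder, it ignores the image and the hidden state and looks only at the previous word.

```python
START, END = "[START]", "[END]"


def training_pairs(caption_words):
    """Pair each input with its target: x1 = [START] -> y1 = first word,
    ..., xT = last word -> yT = [END]."""
    inputs = [START] + caption_words
    targets = caption_words + [END]
    return list(zip(inputs, targets))


def generate(next_word, max_length=20):
    """Test-time loop: the output at time t becomes the input at time t+1."""
    words, current = [], START
    for _ in range(max_length):
        current = next_word(current)
        if current == END:
            break
        words.append(current)
    return words


# Hypothetical stand-in for a trained decoder: previous word -> next word.
toy_decoder = {START: "a", "a": "dog", "dog": "runs", "runs": END}

pairs = training_pairs(["a", "dog", "runs"])
caption = generate(toy_decoder.get)
```

Here pairs holds ([START], a), (a, dog), (dog, runs), (runs, [END]), and caption is the word list a, dog, runs. The max_length cap is a safeguard for the case where the decoder never emits [END].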

