Large-Scale Video Classification
Convolutional neural networks represent a powerful model class for image recognition because of the prevalence of images and videos on the Internet. Many algorithms have been developed to analyse the semantic content of image and video for various applications, including for search and summary. Associated research objectives in this area include evaluating convolutional neural networks for large-scale video classification. Against a new data set of 1 million YouTube videos belonging to over 400 classes.
Previous related work in this field involved a three-stage standard approach to video classification. In the first stage, local visual features describing a region of video are extracted. Then in the second stage, features are combined into a fixed-sized video level description. And then finally, in the third stage, a classifier, for example, a support vector machine, is trained on the resulting bag of words to distinguish between the relevant visual classes. Convolution neural networks are biologically inspired classes of deep learning models that replace the three stages with a single neural network.
But there had been, to this point, relatively limited work on using convolution neural networks for video classification. In consideration of the types of research models used for this research, each video is treated as a collection of short fixed-size clips. There exist several options for extended connectivity and in particular, the research models use three general connectivity pattern groupings. Early, late, and slow fusion of information.
Recommended by LinkedIn
Multiresolution convolutional neural networks are used and achieve a compromise between training time and performance by using two streams of processing, the fovea and context streams. The fovea generally for visual acuity and color discrimination, and a context stream for the context. This results in manageable learning optimization while taking advantage of data augmentation and pre-processing, which reduced the effect of overfitting.
Results showed that convolutional neural networks architectures can learn powerful features from data that is weakly-labelled. And this greatly exceeds the performance of typical feature-based methods. It was also found that the slow fusion model consistently performed best.
And in the end, the results demonstrate that mixed resolution architecture consisting of a low-resolution context and a high-resolution fovea stream effectively sped up convolutional neural networks without sacrificing accuracy.
Looks very good Satesh. Do you do frame by frame object Identification and combine it with audio to text conversion and then follow up with subject meta data creation? If so then this seems to combine both picture and text ML using Natural Language Processing algorithms.