Classification algorithm for official documents based on computer vision

Abstract

Artificial intelligence is the branch of computer science that deals with machines capable of performing tasks that usually require human intelligence. Its main purposes are analyzing large quantities of data, identifying patterns, and making predictions, all in order to simplify human work. It has several branches, including machine learning, intelligent agents, and computer vision. The computer vision branch is especially important because of image recognition and classification. Classification depends heavily on image resolution and quality, and the main challenges for a classification algorithm are:

  • Accuracy: the percentage of images classified correctly.
  • Speed: how quickly images are recognized.

That is why in this research we will explore several algorithms and techniques for image classification and representation, determining the optimal one for classifying official documents.

Background

First of all, we have to clarify the difference between a classification algorithm and a classification technique in this paper. We will use classification techniques to mean all the pre-processing operations applied to the images to improve accuracy or speed, changing image properties without changing the classification program itself. Classification algorithms will be what the program itself does in order to classify the images into different categories.

After that, we have to talk about images and how they are represented in the computer. An image is represented as a matrix: it is composed of rows and columns, and every pixel has a color value. With this information the computer can obtain color percentages, row and column arrangements, and other properties.

Representation of an image as a matrix
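As a small sketch of this idea (the tiny 2x3 image and the red-pixel percentage are just illustrative), an image held as a matrix can be queried for color properties:

```python
import numpy as np

# A tiny 2x3 RGB "image": each pixel holds a (red, green, blue) value.
img = np.array([
    [[255, 0, 0], [255, 0, 0], [0, 0, 255]],
    [[0, 255, 0], [255, 0, 0], [0, 0, 255]],
], dtype=np.uint8)

rows, cols, channels = img.shape  # 2 rows, 3 columns, 3 color channels

# Example property mentioned in the text: the percentage of pure-red pixels.
red_mask = (img[:, :, 0] == 255) & (img[:, :, 1] == 0) & (img[:, :, 2] == 0)
red_pct = red_mask.sum() / (rows * cols) * 100
print(red_pct)  # 50.0
```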

It is also possible to apply mathematical operations to the image, such as multiplications that simulate color filters, convert it to black and white, or even standardize its resolution.

Next we have to know about concurrent programming. This paradigm is well suited to matrix operations because the work can be divided among multiple threads, each independent of the others, with a main thread that collects the processed results. A single core can only run one thread at a time, so this paradigm may not seem useful, but most modern processors are multi-core, so the matrix operations can be at least 4 times faster than in a sequential paradigm. It is also important to know the two kinds of processing units, the CPU and the GPU: a GPU usually has at least 3,000 cores, while a CPU usually has 4, so GPU operations are faster, but the GPU is also a more expensive resource. The next image shows how concurrent programming would look on a single-core and a multi-core processor; on a multi-core machine the concurrent program is in fact parallel up to the number of cores, resembling the parallel part of the image.

Parallel vs Concurrent

In the next example we can see how an 8x4 matrix would be processed by four threads (1 to 4). First the number of rows is divided by the number of threads, and then each chunk of rows is assigned to a thread.

4 threads in a matrix
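A rough sketch of this row split, assuming a simple per-pixel operation (inverting values) as the work each thread performs; the function and variable names are illustrative, not the prototype's:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

NUM_THREADS = 4

def process_rows(block):
    # Each worker transforms its own slice of rows independently,
    # here by inverting the pixel values.
    return 255 - block

matrix = np.arange(8 * 4, dtype=np.uint8).reshape(8, 4)  # an 8x4 "image"

# Split the 8 rows into 4 chunks of 2 rows and hand one chunk per thread.
chunks = np.array_split(matrix, NUM_THREADS, axis=0)
with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
    results = list(pool.map(process_rows, chunks))

# The main thread stitches the processed chunks back together.
processed = np.vstack(results)
print(processed.shape)  # (8, 4)
```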

Finally, we have to get familiar with the Euclidean distance: the distance between two points, either in the plane or in three-dimensional space. It can also be used with vectors of the same length to represent the difference between them; the greater the distance, the greater the difference. With this it is possible to measure the difference between images, because a matrix can also be represented as a vector.
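For instance, a minimal Euclidean-distance helper over flattened image matrices (a sketch; the names are illustrative):

```python
import numpy as np

def euclidean_distance(img_a, img_b):
    # Flatten each matrix into a vector and measure the straight-line
    # distance between them: sqrt of the sum of squared pixel differences.
    return np.linalg.norm(img_a.astype(float).ravel() - img_b.astype(float).ravel())

a = np.zeros((2, 2))      # a 2x2 all-black image
b = np.ones((2, 2)) * 3   # a 2x2 image with every pixel at value 3
print(euclidean_distance(a, b))  # 6.0, i.e. sqrt(4 * 3^2)
```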

Problem statement

In this technological era pictures carry a lot of information, and the more we can learn from an image the better, but the bigger the image sample, the harder it is to analyze each one individually. When a company has clients, it usually asks for many documents: identity, proof of address, and sometimes bank documents. If the company is small, it is easy to handle and search them, but as the company grows, its client base can grow from tens to thousands of clients, which means thousands of documents. Sometimes this acts as a barrier to further growth, because it is very difficult to rearrange all these past documents, which may have been there longer than many of the employees. Sometimes an organization cannot afford to spend the time needed to fix this, even if that means keeping outdated systems that are not optimized.

With this in mind the classification algorithm project was born. Using computer vision, the idea is to train an algorithm, at first, to distinguish between different identity documents and, if that is accomplished in time, to make it trainable on any given set. With these algorithms, companies could save a lot of money: instead of paying people to carry out this repetitive task, they could run the program on a single computer and achieve the same goals in less time. With the documents organized, companies can renew their systems without the risk of losing them, or of spending a huge amount of time out of service because of document migration.

The main challenge in this project is the size of the data sample: we are experimenting with official documents that people do not post online, so we have to build an efficient algorithm that does not need to train on huge datasets.

Related works

Pixel re-representations for better classification of images

The main purpose of this paper is to analyze how pixel re-representation may lead to better classification results. It explores the possibility that higher-resolution images do not lead to better classification because they contain too much information, especially if the training sample consisted of lower-resolution images. It also explains how this technique shortens the range of the original image, lowering its resolution, which can lead to faster results due to the smaller number of pixels to analyze. Given these advantages, I think it is a very promising pre-processing technique because it improves both the accuracy and the speed of the final algorithm.

Junqian Wang, Hanyu Zhang, Peiyi Han, Chuanyi Liu, Yong Xu, Pixel re-representations for better classification of images, Pattern Recognition Letters (2020), doi: https://doi.org/10.1016/j.patrec.2020.04.027

Discriminative block-diagonal covariance descriptors for image set classification

This paper is also about image pre-processing to achieve better image classification. The authors try to obtain an effective and efficient representation of an image and then define how to measure the similarity between representations. They explore a method for representing image sets based on block-diagonal Covariance Descriptors (CovDs). The technique consists of dividing each image into square blocks of the same size and then computing the corresponding block CovDs instead of the global one. Taking the relative discriminative power of these block CovDs into account, a block-diagonal matrix can be constructed to achieve better discriminative capability. With this information, I think it is a solid option for improving speed and accuracy in this research, because all the resulting blocks are the same size, which can lead to faster and better matrix comparisons.

Jieyi Ren, Xiao-jun Wu, Josef Kittler, Discriminative block-diagonal covariance descriptors for image set classification, Pattern Recognition Letters (2020)

Comparative Analysis of Image Classification Algorithms Based on Traditional Machine Learning and Deep Learning

This paper compares and analyzes machine learning and deep learning image classification algorithms. Machine learning is the branch that simulates human learning behavior, learning from historical data and samples to make decisions. Deep learning, on the other hand, simulates a human brain for analysis, learning and decision making, usually working with several layers for data interpretation. The results of this experiment show that machine learning achieves better accuracy than deep learning on smaller datasets, but with bigger datasets the opposite is true.

With this information I learned that the optimal technique depends on the amount of data analyzed, but another important factor to consider is speed. In my research I will need to include speed comparisons and data sample sizes to make the optimal decision.

Pin Wang, En Fan, Peng Wang, Comparative Analysis of Image Classification Algorithms Based on Traditional Machine Learning and Deep Learning, Pattern Recognition Letters (2020), doi: https://doi.org/10.1016/j.patrec.2020.07.042

Google teachable machine 

Teachable Machine is a web-based tool that makes creating machine learning models fast, easy, and accessible to everyone. It offers models for images, poses and sounds. It uses TensorFlow.js, a JavaScript library that allows machine learning in the browser, and its models can be exported to TensorFlow (the Python library) and to TensorFlow for Android. These models use a technique called transfer learning: there is a pre-trained neural network, and when you create your own classes, you can picture your classes becoming the last layer or step of the neural net. Specifically, both the image and pose models learn on top of pre-trained MobileNet models, and the sound model is built on the Speech Command Recognizer.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li and Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. Dept. of Computer Science, Princeton University, USA.

Google. (2019). Teachable machine. Google: https://teachablemachine.withgoogle.com

Definition of objectives

General objective

Analyze advantages and disadvantages of different classification algorithms and techniques to implement the optimal algorithm for official document classification based on artificial vision.

Specific objectives 

  1. Identify 2 image classification algorithms based on artificial vision.
  2. Identify 2 classification techniques to improve accuracy or speed.
  3. Design and implement the optimal solution with the applicable techniques.
  4. Perform the necessary tests.
  5. Report on the identified algorithms and techniques to explain the benefits of each over the others.

Prototype design

Classification techniques

The first technique that we will explore is image normalization, mentioned in the investigation "Pixel re-representations for better classification of images". If all the images in the data set have the same resolution, they look more alike to the computer because of the pixel distribution. The main purpose of this technique is a standardized resolution, with images neither really big nor really small, which helps the computer classify them correctly even if the data set is not large. This technique is easy to implement alongside the classification algorithm as a prerequisite.
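A minimal sketch of such a normalization step, using a nearest-neighbour resize (the 224x224 target size and the function name are my own illustrative choices, not the paper's):

```python
import numpy as np

def normalize_resolution(image, size=(224, 224)):
    # Nearest-neighbour resize: map every target pixel back to the
    # closest source pixel so all images end up the same resolution.
    rows, cols = image.shape[:2]
    r_idx = (np.arange(size[0]) * rows / size[0]).astype(int)
    c_idx = (np.arange(size[1]) * cols / size[1]).astype(int)
    return image[r_idx][:, c_idx]

# A random 480x640 RGB image stands in for a scanned document.
img = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
print(normalize_resolution(img).shape)  # (224, 224, 3)
```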

The second technique that we will be exploring is pixel re-representation, also mentioned in "Pixel re-representations for better classification of images". This technique shows that re-representing the pixels of an image can lower the difference between images of the same class and widen the difference between images of different classes. To produce the studied re-representation, suppose that G is an original image and Gij is the pixel value at the i-th row and j-th column. The re-representation is given by the formula Hij = (Gij)^a, where 0 < a ≤ 1. After that, the Euclidean distance is calculated, and I expect the distance between samples of the same class to be shorter than the original one. This technique is mostly used with objects of changeable appearance, like the faces in documents: when they are re-represented, the faces start looking more alike and the algorithm can focus on the document itself. For this technique I will use a concurrent paradigm, for faster processing and the capacity to run on any computer.
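The re-representation formula can be sketched as follows (a Python illustration of what the Java prototype computes; normalizing pixel values to [0, 1] before applying the exponent is my own assumption to keep the result in a valid range):

```python
import numpy as np

def re_represent(image, a):
    # Hij = (Gij)^a with 0 < a <= 1, applied to pixel values
    # normalized to the range [0, 1].
    g = image.astype(float) / 255.0
    return g ** a

# Two tiny grayscale samples standing in for two images of the same class.
sample1 = np.array([[200, 50], [120, 30]], dtype=np.uint8)
sample2 = np.array([[180, 70], [100, 45]], dtype=np.uint8)

# Compare the Euclidean distance between the samples for several alphas.
for a in (1.0, 0.9, 0.8):
    dist = np.linalg.norm(re_represent(sample1, a) - re_represent(sample2, a))
    print(a, round(dist, 4))
```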

Classification algorithms

For the classification algorithms we will explore Google Teachable Machine implementations. As explained in the problem statement section, the main challenge in this investigation is gathering the data sample of official documents, because it is not open information. This deep learning algorithm is pre-trained, and the documentation states that a sample of about 400 images per class is usually enough for the resulting graph to reach an accuracy of 98% or higher. For the data I will take several pictures of my own documents with different cameras, lighting conditions and perspectives, to vary the quality and obtain a large data sample. Then I will check the effectiveness with volunteers' documents. Google Teachable Machine can export the graphs in many formats; we will explore them to give the user installation and hardware options.

Prototype implementation

Pixel re representation

I will implement the pixel re-representation in Java, because it can create and use threads very effectively, it is easy to install on any computer, and only default libraries are needed. Java runs in a virtual machine, so it can run in any environment.

The tests will be done with images of different sizes and colors to know how the threads react to more or less charge.

The java version that will be used for the project is:

openjdk 11.0.9 2020-10-20

OpenJDK Runtime Environment (build 11.0.9+11-Ubuntu-0ubuntu1.18.04.1)

OpenJDK 64-Bit Server VM (build 11.0.9+11-Ubuntu-0ubuntu1.18.04.1, mixed mode, sharing)

The program will receive the name of the image as input on the command line, process it, and then output a new image, writing it to the same directory.

Teachable machine 

For the export format of the trained model I selected TensorFlow Keras. This library is native to Python and is described as a high-level API for machine learning that runs on TensorFlow; it is approachable, highly productive and scalable, and the model can be downloaded locally. The main advantages of Keras are:

  • Efficiently executing low-level tensor operations on CPU, GPU, or TPU.
  • Computing the gradient of arbitrary differentiable expressions.
  • Scaling computation to many devices
  • Exporting programs ("graphs") to external runtimes such as servers, browsers, mobile and embedded devices.

Keras offers two types of models. The simplest is the Sequential model, a linear stack of layers. For more complex architectures, Keras provides the functional API, which allows building arbitrary graphs of layers, or writing models entirely from scratch via sub-classing. The graph that will be used is a Keras functional API model, pre-trained with 12 subtrees that consist of a total of 3.2 million cleanly annotated images spread over 5,247 categories, with an average of 600 images collected for each subset. At the end of this tree our new nodes will be created, obtaining an .h5 file with two results, INE and Passport.

To run the project I used the following Python libraries:

Tensorflow version: 2.3.1

Numpy version: 1.19.2

Keras version: 2.4.3

First of all, the program will transform each input image into a normalized vector of 224x224 pixel values; the classification will be performed on this vector, but the resolution of the output image will be the same as the input.

The program will then receive as a command line parameter the directory containing all the JPG images to classify, and list all of them. The result of this process is an array of numbers between 0 and 1, representing the probability of belonging to each class. For the algorithm to accept an image into a class, the result for that class has to be greater than .9, representing 90% confidence; if the image does not reach this percentage in any class, it will be classified as unknown. The results will be reported by changing the image name to class+name, where class is the result of the classification. For the tests I will use blurred, rotated and cropped images to measure the precision with low information.
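The acceptance rule described above can be sketched like this (the model's probability output is passed in directly; the class names, threshold constant and function names are illustrative, and the actual Keras prediction call is omitted):

```python
CLASS_NAMES = ["INE", "Passport"]
THRESHOLD = 0.9  # minimum probability needed to accept a class

def label_for(probabilities):
    # probabilities: one value in [0, 1] per class, as produced by the
    # model for a single image.
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return CLASS_NAMES[best] if probabilities[best] > THRESHOLD else "unknown"

def classified_name(filename, probabilities):
    # The result is reported by prefixing the class to the file name.
    return f"{label_for(probabilities)}+{filename}"

print(classified_name("doc1.jpg", [0.97, 0.03]))  # INE+doc1.jpg
print(classified_name("doc2.jpg", [0.55, 0.45]))  # unknown+doc2.jpg
```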

Tests and results

Pixel re representation

For the testing environment I used a machine with Ubuntu 18.04, an Intel Core i7 6th generation with 8 available cores (Java will create one thread per available core) and this version of Java:

openjdk "11.0.8" 2020-07-14

OpenJDK Runtime Environment (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1)

OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1, mixed mode, sharing).

The results were as expected. For this technique I tested two classes, IFE and Passport, with two different JPEG images of each class.

Time improvement

The first test consisted of checking the improvement of the multi-threaded algorithm against the single-threaded one. The improvement applied only to the section of the code that processed the image, since the I/O was performed by the main thread. Even so, a great improvement was achieved, especially with bigger images.

In this first example an image with a resolution of 6000x4000 was used for the test; the processing time was cut almost in half, an improvement of 45%.

Graphic of improvement

For this second example, an image with a resolution of 3024x3024 was used, achieving an improvement of 42% in time.

[Graph: time improvement for the 3024x3024 image]

This first experiment led me to conclude that bigger images benefit more from multi-threading, because the I/O operations represent a smaller percentage of the total time.

Euclidean distance

The second experiment with this technique was also successful: the Euclidean distance between images of the same class was reduced considerably, achieving less variation between images and therefore good performance with a smaller data sample.

The next examples are representations of an INE: the first one is the original, the second is a re-representation with a = .90, and the third is a re-representation with a = .80.

[Image: original and re-represented INE samples]

With these three representations the following comparison table was built:

[Table: Euclidean distances between representations]

As we can see in this table, both the distance within a class and the distance between classes shorten with each re-representation, leading us to conclude that the smaller the a, the more similar the images get. But the ratio of the difference starts getting bigger, because the difference between classes begins to decrease. This leads us to conclude that this technique is very effective on our data set for predicting whether objects are the same, but not very reliable for differentiating between objects.

Teachable machine 

Accuracy of classification graph

For the accuracy testing environment we will use the web implementation of Teachable Machine, testing the resulting algorithm with different inputs and recording the result. The optimal trained algorithm will be exported to Keras for the implementation in Python.

The passport image needs to show the document opened at the photograph page, and the IFE needs to be an image of both sides, as follows:

[Image: expected passport and IFE input format]

For privacy purposes I will only show sample documents in the report, even though the verification was made with several people's documents. The algorithm was trained with Mexican passports, but the only sample image I was able to find was of a Canadian passport.

The results were better than expected: with just 400 samples of each class the algorithm was able to distinguish perfectly between IFEs and Passports, with a certainty of 99% in most cases. To probe the limits of the algorithm, re-representations of an IFE image were made to simulate lower resolution, with an alpha going from .4 to .9; the next table shows the results.

[Table: certainty by alpha value]

The algorithm was capable of recognizing images with 100% certainty with just 60% of the data, lowered its certainty to 82%, still an acceptable output, with 50% of the data, and was unable to recognize below 40% of the data. The next images correspond to alphas of .6, .5 and .4.

[Image: re-representations with alpha .6, .5 and .4]

As we can see, these images carry almost no information; it is really difficult even for humans to be certain of their class.

The experiment was also run using tilted images, IFEs with the sides further apart, black-and-white images, INEs, and documents of both genders, all obtaining a certainty above 98%.

To try the algorithm yourself: https://teachablemachine.withgoogle.com/models/03WEerH9b/

Python implementation

For the testing environment I used a machine with Ubuntu 18.04 and an Intel Core i7 6th generation.

Python version 3.6.9

Tensorflow version: 2.3.1

Numpy version: 1.19.2

Keras version: 2.4.3

Loading speed

The first step was to analyze the time the program takes to load the graph. I tested it with 3 different graphs with different numbers of output classes.

[Graph: graph loading times]

The official documents graph has two possible results, the Pokemon 3 graph has three and the Pokemon 4 graph has four. The graph shows that the loading time is linear, with an average of 3.75 seconds. This leads us to conclude that all the graphs are really similar: the pre-trained model only adds nodes at the last layer, which is why the loading times of the models look linear.

Classification speed

The second step was to analyze the speed of the classification itself. Theoretically the classification time will be linear, O(n) for n images, because image normalization is applied as a pre-processing technique, producing a square image of 224x224 pixels.

[Graph: classification time by sample size]

In this graph we can observe the linearity in time: each column doubles the previous one, both in sample size and in time. The next table shows the time per image, which is almost the same in all three cases.

[Table: time per image]

Classification with volunteers data sample

The last step was to test the final program with a data sample built from some volunteers' documents. These documents were altered to obtain bigger and more varied samples: in the case of passports, the samples were rotated from 0° to 360° in 45° steps, and a black-and-white filter was also applied, obtaining 16 edited images from just 1. For the IFEs the rotation process was similar but in 90° intervals, also applying the black-and-white filter and, in addition, rotating the top and/or bottom part of the image, obtaining a total of 20 images from just 1 sample.
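The IFE-style augmentation (90° rotations plus a black-and-white copy of each) can be sketched like this; the 45° passport rotations need interpolation and are omitted here, and all the names are illustrative:

```python
import numpy as np

def augment(image):
    # Rotate the sample 0, 90, 180 and 270 degrees; for each rotation
    # also keep a black-and-white (grayscale) copy.
    variants = []
    for k in range(4):
        rotated = np.rot90(image, k)
        variants.append(rotated)
        gray = rotated.mean(axis=2, keepdims=True).repeat(3, axis=2)
        variants.append(gray.astype(image.dtype))
    return variants

# A random square RGB image stands in for one IFE sample.
sample = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)
print(len(augment(sample)))  # 8 edited images from 1 sample
```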

The final test sample consisted of 22 passport images and 25 IFE images. The results are shown in the next table.

[Table: classification results for the volunteer sample]

The algorithm worked really well, failing only on a low-resolution, rotated, black-and-white passport. These results show that the algorithm is ready for production. In the future I would like to gather a more extensive data sample to reach 100% accuracy.

Conclusions

The research was a success: with just 400 samples of the same IFE and 400 samples of the same passport, the algorithm was capable of correctly classifying all the volunteers' data samples, including the cropped, rotated and blurred ones. With this program, businesses can migrate to new systems without the risk of spending too much time checking documents. The best part is that anyone can get an .h5 graph from Google Teachable Machine and then use it with the project to obtain a classification algorithm for any given set. The next step is to optimize the program to use the GPU correctly, to get even faster results for people in a hurry.

References

Junqian Wang, Hanyu Zhang, Peiyi Han, Chuanyi Liu, Yong Xu, Pixel re-representations for better classification of images, Pattern Recognition Letters (2020), doi: https://doi.org/10.1016/j.patrec.2020.04.027

Jieyi Ren, Xiao-jun Wu, Josef Kittler, Discriminative block-diagonal covariance descriptors for image set classification, Pattern Recognition Letters (2020)

Pin Wang, En Fan, Peng Wang, Comparative Analysis of Image Classification Algorithms Based on Traditional Machine Learning and Deep Learning, Pattern Recognition Letters (2020), doi: https://doi.org/10.1016/j.patrec.2020.07.042

Eduardo A.B. da Silva, Gelson V. Mendonça, in The Electrical Engineering Handbook, 2005

Lijun Sun, in Structural Behavior of Asphalt Pavements, 2016

Albert Wong, S.L. Lou, in Handbook of Medical Image Processing and Analysis (Second Edition), 2009

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li and Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. Dept. of Computer Science, Princeton University, USA.

Google. (2019). Teachable machine. Google: https://teachablemachine.withgoogle.com

IFE example: https://www.elsiglodetorreon.com.mx/noticia/956368.produce-ife-credenciales-con-domicilio-opcional.html

Passport example: https://care.paytm.ca/hc/en-us/articles/360039233774-Canadian-government-issued-ID-example

Implementation: https://github.com/MartinAntonio123/ImageClassificator










