Using Computer Vision to Prepare Images for Text Extraction

They say that a picture is worth a thousand words. That may be true, provided you can see.

On the other hand, if your job is data entry and you have a stack of reports full of tables, all of which need to be transferred to Excel, you might prefer the words. As crisp and electronic as possible.

We would like to show you how pictures are transformed into text that can be used for machine learning and more.

In this process the input is a picture of a document, scanned in or taken by a mobile phone. The contents can be a business card, an invoice or a financial report.

The overall process of transforming a picture to text involves two key elements:

  1. Computer Vision or CV
  2. Optical Character Recognition or OCR

We use methods developed in computer vision to prepare the picture. This includes color transformation, contrast enhancement, noise reduction, rectification and binarization.

If all goes well, the output of pre-processing, which is also the input to optical character recognition, is a cleaned-up, monochrome picture where the white foreground is the text we want to extract and everything else is black background.

Such a processed image is then fed into OCR. The OCR algorithm uses additional computer vision methods to segment the foreground into letters and uses pattern matching to decide which letters they are.

The output of OCR is a list of rectangles and the detected text. Depending on the recognition granularity, a rectangle can cover a letter, a word, a line, or the whole page. The finer the granularity, the more information about the layout is retained. If the content is a table, the granularity must be at the level of words or letters to be able to extract columns and rows.
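To make this concrete, the word-level output can be modeled as a list of boxes that are then grouped back into lines by their vertical position. The class below is a hypothetical sketch, not the actual Tesseract API; the `OcrBox` type and `groupIntoLines` helper are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class OcrSketch {
    // Hypothetical word-level OCR result: a bounding box plus the detected text.
    static class OcrBox {
        final int x, y, width, height;
        final String text;
        OcrBox(int x, int y, int width, int height, String text) {
            this.x = x; this.y = y; this.width = width; this.height = height; this.text = text;
        }
    }

    // Group word boxes into lines: boxes whose top edges are within a small
    // vertical tolerance are assumed to belong to the same text line.
    static List<List<OcrBox>> groupIntoLines(List<OcrBox> boxes, int tolerance) {
        List<OcrBox> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingInt(b -> b.y));
        List<List<OcrBox>> lines = new ArrayList<>();
        for (OcrBox b : sorted) {
            if (!lines.isEmpty()) {
                List<OcrBox> last = lines.get(lines.size() - 1);
                if (Math.abs(last.get(0).y - b.y) <= tolerance) {
                    last.add(b);
                    continue;
                }
            }
            lines.add(new ArrayList<>(List.of(b)));
        }
        // Within each line, order the words left to right.
        for (List<OcrBox> line : lines) line.sort(Comparator.comparingInt(b -> b.x));
        return lines;
    }

    public static void main(String... args) {
        List<OcrBox> boxes = List.of(
            new OcrBox(120, 10, 60, 20, "Doe"),
            new OcrBox(50, 12, 60, 20, "Jane"),
            new OcrBox(50, 60, 90, 20, "Acme"));
        System.out.println(groupIntoLines(boxes, 5).size() + " lines"); // 2 lines
    }
}
```

This kind of regrouping is exactly what makes word-level granularity useful for tables: rows fall out of the line grouping, and columns fall out of the x coordinates.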

In the first post of this series we will tackle pre-processing of document images using Computer Vision. The OCR will be treated in the second part.

Rectify: Using Computer Vision

In the original post you can find detailed instructions on setting up the software environment used in this tutorial.

Once the environment is set up we are ready to use OpenCV and Tesseract methods in Java. Let us start with a sample image:

[Image: photo of a business card taken at an angle]

Our first piece of code loads the image and converts it to grayscale. We do this so that gray and color inputs are handled uniformly and to simplify further steps.

import java.io.IOException;

import org.opencv.core.*;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;
 
class Textify{
    static {
        // load the OpenCV native library; throws UnsatisfiedLinkError on failure
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
    }
    // main code goes here
    public static void main(String...arg) throws IOException{
        String inputFile="data/bcard.jpg";
        // load input image
        Mat in=Imgcodecs.imread(inputFile);
        // convert to grayscale; imread() returns color images in BGR channel order
        Mat gray = new Mat(in.size(),CvType.CV_8UC1);
        if(in.channels()==3){
            Imgproc.cvtColor(in, gray, Imgproc.COLOR_BGR2GRAY);
        }else if(in.channels()==1){
            in.copyTo(gray);
        }else{
            throw new IOException("Invalid image type:"+in.type());
        }
    }
}

Our code so far calls "Imgcodecs.imread()" to load an image into a variable of type "Mat". OpenCV variable type "Mat" represents 2D data such as images and matrices. It has a size expressed in rows and columns and it has a type that encodes data format. Common formats are single channel images, RGB images, floating point matrices and a variety of other formats that a 2D matrix can represent.

After the image is loaded we proceed to allocate an additional matrix named "gray" of the same size and in the format "CvType.CV_8UC1" which encodes single channel, 8-bit per channel, unsigned data.

If the input image has three channels, we assume it is a color image (which OpenCV loads in BGR channel order) and call "Imgproc.cvtColor()" to convert it to grayscale. If the number of channels is one, we just copy the input into the "gray" variable.

At this point the resulting image looks like this:

[Image: grayscale version of the business card photo]

So far we have introduced the most important OpenCV data structure the matrix "Mat" and a few important methods to work with it.

We will now take it up a notch. The image is of a business card and it was taken at an angle. As it is, the text is distorted and the OCR system will have trouble detecting it.

We need to undo the perspective distortion or as I like to call it we need to "rectify" the image.

To rectify the image we need to locate the corners of the business card and warp the pixels so that document corners end up in the corners of the resulting image.

The first step is to find the contours. To that end we produce a binary image that separates the card from the background and is more suitable for the contour finding algorithm.

// we first blur the grayscale image
Mat blurred = new Mat(gray.size(),gray.type());
Mat binary4c = new Mat(gray.size(),gray.type());
// Gaussian kernel dimensions must be odd
Imgproc.GaussianBlur(gray,blurred,new Size(21,21),0);
// next we threshold the blurred image
float th=128;
Imgproc.threshold(blurred,binary4c,th,255,Imgproc.THRESH_BINARY);

The resulting binary image might be slightly different depending on the threshold value but looks approximately like the following:

[Image: black-and-white binary image separating the card from the background]

Because of the blurring, the smaller details were washed out, though the contours came out slightly rounded.

The final step is to find the contours.

Mat mHierarchy = new Mat(); 
List<MatOfPoint> contours = new ArrayList<MatOfPoint>(); 
Imgproc.findContours(binary4c, contours, mHierarchy, Imgproc.RETR_LIST,Imgproc.CHAIN_APPROX_SIMPLE);        

The method "Imgproc.findContours()" will return a list of contours with each contour a polygon of 2D points. Out of these contours we need to pick the one that belongs to the document.

Once the contour is located, the corners need to be detected. Accomplishing that is not trivial, but not too complex either. It all depends on how robust you want the method to be.
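One common approach, which we sketch here as an assumption rather than the exact method used in the original code, is to pick the contour with the largest area via "Imgproc.contourArea", simplify it to a quadrilateral with "Imgproc.approxPolyDP", and then put the four points into a consistent order. The ordering step is pure geometry and needs no OpenCV: the top-left corner has the smallest x+y sum, the bottom-right the largest, while y-x separates the top-right (smallest) from the bottom-left (largest).

```java
class CornerOrder {
    // Order four quadrilateral corners as top-left, bottom-left, bottom-right,
    // top-right (the same order the warp step expects later). Each point is
    // {x, y}; assumes a roughly upright, convex quadrilateral.
    static double[][] orderCorners(double[][] pts) {
        double[] tl = pts[0], bl = pts[0], br = pts[0], tr = pts[0];
        for (double[] p : pts) {
            double sum = p[0] + p[1], diff = p[1] - p[0];
            if (sum < tl[0] + tl[1]) tl = p;   // smallest x+y -> top-left
            if (sum > br[0] + br[1]) br = p;   // largest x+y  -> bottom-right
            if (diff < tr[1] - tr[0]) tr = p;  // smallest y-x -> top-right
            if (diff > bl[1] - bl[0]) bl = p;  // largest y-x  -> bottom-left
        }
        return new double[][]{tl, bl, br, tr};
    }

    public static void main(String... args) {
        // corners of a tilted card, in arbitrary order
        double[][] shuffled = {{900, 80}, {110, 620}, {100, 90}, {920, 600}};
        double[][] ordered = orderCorners(shuffled);
        System.out.println(ordered[0][0] + "," + ordered[0][1]); // 100.0,90.0 (top-left)
    }
}
```

A more robust variant would also verify that the simplified polygon really has four vertices and reject degenerate contours.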

When the corner detection is done we get the following end result. The contour and the corners are highlighted in color.

[Image: the detected contour and corners highlighted in color]

With the corners established we can now proceed to undo the projective distortion.

int dpi=300;    // select dpi, 300 is optimal for OCR
List<Point> corners=getOuterQuad(binary4c); // find corners from contour
Point inchesDim=getFormatDim();  // select business card dimensions and compute pixels
double inchesWide=inchesDim.x;
double inchesHigh=inchesDim.y;
// width and height of business card at given dpi
int pixelsWide=(int)(inchesWide*dpi);
int pixelsHigh=(int)(inchesHigh*dpi);
// now establish from and to parameters for warpPerspective
Point[] fromPts = {corners.get(0),corners.get(1),corners.get(2),corners.get(3)};
Point[] toPts = {new Point(0,0), new Point(0,pixelsHigh), new Point(pixelsWide,pixelsHigh), new Point(pixelsWide,0)};
MatOfPoint2f srcPts = new MatOfPoint2f(); srcPts.fromArray(fromPts);
MatOfPoint2f dstPts = new MatOfPoint2f(); dstPts.fromArray(toPts);
Mat hh=Calib3d.findHomography(srcPts,dstPts);
Mat rectified=Mat.zeros(pixelsHigh,pixelsWide,gray.type());
Imgproc.warpPerspective(gray,rectified, hh,rectified.size());
// condition the output image a little
Core.normalize(rectified,rectified,0,255,Core.NORM_MINMAX,CvType.CV_8UC1);
double meanA=Core.mean(rectified).val[0];
if(meanA>128) Core.bitwise_not(rectified,rectified);

This is a lot of code, but it does several things. First we call a method that returns the list of corners. Next we establish the desired width and height of the rectified image, using the known dimensions of a business card and the desired DPI (dots per inch) resolution; for OCR systems this should be somewhere around 300 DPI. The following step is the most crucial: we use the detected corners and the desired corners of the card to compute a homography matrix, which "Imgproc.warpPerspective()" then uses to undo the distortion. The last step is not strictly needed but is nevertheless useful. We normalize the result so that the pixels span the full 0-255 range, then measure the mean value and invert the image if more pixels are white than black. We want the text in white, and we expect white pixels to be in the minority.
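As a concrete check of the sizing arithmetic, assuming the standard US business card dimensions of 3.5 by 2.0 inches (which the original "getFormatDim()" presumably returns, though we have not seen its body):

```java
class DpiMath {
    // Target pixel dimensions = physical size in inches times dots per inch.
    static int[] pixelDims(double inchesWide, double inchesHigh, int dpi) {
        return new int[]{(int) (inchesWide * dpi), (int) (inchesHigh * dpi)};
    }

    public static void main(String... args) {
        // A standard US business card is 3.5 x 2.0 inches (assumed here).
        int[] dims = pixelDims(3.5, 2.0, 300);
        System.out.println(dims[0] + " x " + dims[1] + " pixels"); // 1050 x 600 pixels
    }
}
```

So at 300 DPI the warp target is a 1050 by 600 pixel image, which is plenty of resolution for the character shapes on a card.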

The final result is deposited in a new image called "rectified" and looks like the following:

[Image: the rectified business card with white text on a dark background]

This concludes our demonstration of computer vision and OpenCV methods. We started with an image of a business card, distorted and embedded in its surroundings, and ended up with a rectified image of just the card. From here, applying optical character recognition should be much easier and more accurate.
