Using Computer Vision to Prepare Images for Text Extraction

They say that a picture is worth a thousand words. That may be true, provided you can see.

On the other hand, if your job is data entry and you have a stack of reports full of tables, all of which need to be transferred to Excel, you might prefer the words. As crisp and electronic as possible.

We would like to show you how pictures are transformed into text that can be used for machine learning and more.

In this process the input is a picture of a document, scanned in or taken by a mobile phone. The contents can be a business card, an invoice or a financial report.

The overall process of transforming a picture to text involves two key elements:

  1. Computer Vision or CV
  2. Optical Character Recognition or OCR

We use methods developed in computer vision to prepare the picture. This includes color transformation, contrast enhancement, noise reduction, rectification and binarization.

If all goes well, the output of pre-processing, which is also the input to optical character recognition, is a cleaned-up, monochrome picture where the white foreground is the text we want to extract and everything else is black background.

Such a processed image is then fed into OCR. The OCR algorithm uses additional computer vision methods to segment the foreground into letters and uses pattern matching to decide which letters they are.

The output of OCR is a list of rectangles and the detected text. Depending on the recognition granularity, a rectangle can cover a letter, a word, a line, or the whole page. The finer the granularity, the more information about the layout is retained. If the content is a table, the granularity must be at the level of words or letters to be able to extract columns and rows.
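To make this concrete, the word-level output can be modeled as a list of boxes that are then grouped back into lines by their vertical position. The class below is a hypothetical sketch, not the actual Tesseract API; the `OcrBox` type and `groupIntoLines` helper are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class OcrSketch {
    // Hypothetical word-level OCR result: a bounding box plus the detected text.
    static class OcrBox {
        final int x, y, width, height;
        final String text;
        OcrBox(int x, int y, int width, int height, String text) {
            this.x = x; this.y = y; this.width = width; this.height = height; this.text = text;
        }
    }

    // Group word boxes into lines: boxes whose top edges are within a small
    // vertical tolerance are assumed to belong to the same text line.
    static List<List<OcrBox>> groupIntoLines(List<OcrBox> boxes, int tolerance) {
        List<OcrBox> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingInt(b -> b.y));
        List<List<OcrBox>> lines = new ArrayList<>();
        for (OcrBox b : sorted) {
            if (!lines.isEmpty()) {
                List<OcrBox> last = lines.get(lines.size() - 1);
                if (Math.abs(last.get(0).y - b.y) <= tolerance) {
                    last.add(b);
                    continue;
                }
            }
            lines.add(new ArrayList<>(List.of(b)));
        }
        // Within each line, order the words left to right.
        for (List<OcrBox> line : lines) line.sort(Comparator.comparingInt(b -> b.x));
        return lines;
    }

    public static void main(String... args) {
        List<OcrBox> boxes = List.of(
            new OcrBox(120, 10, 60, 20, "Doe"),
            new OcrBox(50, 12, 60, 20, "Jane"),
            new OcrBox(50, 60, 90, 20, "Acme"));
        System.out.println(groupIntoLines(boxes, 5).size() + " lines"); // 2 lines
    }
}
```

This kind of regrouping is exactly what makes word-level granularity useful for tables: rows fall out of the line grouping, and columns fall out of the x coordinates.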

In the first post of this series we will tackle pre-processing of document images using Computer Vision. The OCR will be treated in the second part.

Rectify: Using Computer Vision

In the original post you can find detailed instructions on setting up the software environment used in this tutorial.

Once the environment is set up we are ready to use OpenCV and Tesseract methods in Java. Let us start with a sample image:

[Image: photo of a business card taken at an angle]

Our first piece of code loads the image and converts it to grayscale. We do this so that gray and color inputs are handled uniformly and to simplify further steps.

import java.io.IOException;

import org.opencv.core.*;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;
 
class Textify{
    static {
        // load the OpenCV native library; throws UnsatisfiedLinkError on failure
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
    }
    // main code goes here
    public static void main(String...arg) throws IOException{
        String inputFile="data/bcard.jpg";
        // load input image
        Mat in=Imgcodecs.imread(inputFile);
        // convert to grayscale; imread() returns color images in BGR channel order
        Mat gray = new Mat(in.size(),CvType.CV_8UC1);
        if(in.channels()==3){
            Imgproc.cvtColor(in, gray, Imgproc.COLOR_BGR2GRAY);
        }else if(in.channels()==1){
            in.copyTo(gray);
        }else{
            throw new IOException("Invalid image type:"+in.type());
        }
    }
}

Our code so far calls "Imgcodecs.imread()" to load an image into a variable of type "Mat". OpenCV variable type "Mat" represents 2D data such as images and matrices. It has a size expressed in rows and columns and it has a type that encodes data format. Common formats are single channel images, RGB images, floating point matrices and a variety of other formats that a 2D matrix can represent.

After the image is loaded we proceed to allocate an additional matrix named "gray" of the same size and in the format "CvType.CV_8UC1" which encodes single channel, 8-bit per channel, unsigned data.

If the input image has three channels, we assume it is a color image (which OpenCV loads in BGR channel order) and call "Imgproc.cvtColor()" to convert it to grayscale. If the number of channels is one, we just copy the input into the "gray" variable.

At this point the resulting image looks like this:

[Image: grayscale version of the business card photo]

So far we have introduced the most important OpenCV data structure the matrix "Mat" and a few important methods to work with it.

We will now take it up a notch. The image is of a business card and it was taken at an angle. As it is, the text is distorted and the OCR system will have trouble detecting it.

We need to undo the perspective distortion or as I like to call it we need to "rectify" the image.

To rectify the image we need to locate the corners of the business card and warp the pixels so that document corners end up in the corners of the resulting image.

The first step is to find the contours. To that end we produce a binary image that separates the card from the background and is more suitable for the contour finding algorithm.

// we first blur the grayscale image
Mat blurred = new Mat(gray.size(),gray.type());
Mat binary4c = new Mat(gray.size(),gray.type());
// Gaussian kernel dimensions must be odd
Imgproc.GaussianBlur(gray,blurred,new Size(21,21),0);
// next we threshold the blurred image
float th=128;
Imgproc.threshold(blurred,binary4c,th,255,Imgproc.THRESH_BINARY);

The resulting binary image might be slightly different depending on the threshold value but looks approximately like the following:

[Image: black-and-white binary image separating the card from the background]

Because of the blurring, the smaller details were washed out, though the contours came out slightly rounded.

The final step is to find the contours.

Mat mHierarchy = new Mat(); 
List<MatOfPoint> contours = new ArrayList<MatOfPoint>(); 
Imgproc.findContours(binary4c, contours, mHierarchy, Imgproc.RETR_LIST,Imgproc.CHAIN_APPROX_SIMPLE);        

The method "Imgproc.findContours()" will return a list of contours with each contour a polygon of 2D points. Out of these contours we need to pick the one that belongs to the document.

Once the contour is located, the corners need to be detected. Accomplishing that is not trivial, but not too complex either. It all depends on how robust you want the method to be.
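One common approach, which we sketch here as an assumption rather than the exact method used in the original code, is to pick the contour with the largest area via "Imgproc.contourArea", simplify it to a quadrilateral with "Imgproc.approxPolyDP", and then put the four points into a consistent order. The ordering step is pure geometry and needs no OpenCV: the top-left corner has the smallest x+y sum, the bottom-right the largest, while y-x separates the top-right (smallest) from the bottom-left (largest).

```java
class CornerOrder {
    // Order four quadrilateral corners as top-left, bottom-left, bottom-right,
    // top-right (the same order the warp step expects later). Each point is
    // {x, y}; assumes a roughly upright, convex quadrilateral.
    static double[][] orderCorners(double[][] pts) {
        double[] tl = pts[0], bl = pts[0], br = pts[0], tr = pts[0];
        for (double[] p : pts) {
            double sum = p[0] + p[1], diff = p[1] - p[0];
            if (sum < tl[0] + tl[1]) tl = p;   // smallest x+y -> top-left
            if (sum > br[0] + br[1]) br = p;   // largest x+y  -> bottom-right
            if (diff < tr[1] - tr[0]) tr = p;  // smallest y-x -> top-right
            if (diff > bl[1] - bl[0]) bl = p;  // largest y-x  -> bottom-left
        }
        return new double[][]{tl, bl, br, tr};
    }

    public static void main(String... args) {
        // corners of a tilted card, in arbitrary order
        double[][] shuffled = {{900, 80}, {110, 620}, {100, 90}, {920, 600}};
        double[][] ordered = orderCorners(shuffled);
        System.out.println(ordered[0][0] + "," + ordered[0][1]); // 100.0,90.0 (top-left)
    }
}
```

A more robust variant would also verify that the simplified polygon really has four vertices and reject degenerate contours.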

When the corner detection is done we get the following end result. The contour and the corners are highlighted in color.

[Image: the detected contour and corners highlighted in color]

With the corners established we can now proceed to undo the projective distortion.

int dpi=300;    // select dpi, 300 is optimal for OCR
List<Point> corners=getOuterQuad(binary4c); // find corners from contour
Point inchesDim=getFormatDim();  // select business card dimensions and compute pixels
double inchesWide=inchesDim.x;
double inchesHigh=inchesDim.y;
// width and height of business card at given dpi
int pixelsWide=(int)(inchesWide*dpi);
int pixelsHigh=(int)(inchesHigh*dpi);
// now establish from and to parameters for warpPerspective
Point[] fromPts = {corners.get(0),corners.get(1),corners.get(2),corners.get(3)};
Point[] toPts = {new Point(0,0), new Point(0,pixelsHigh), new Point(pixelsWide,pixelsHigh), new Point(pixelsWide,0)};
MatOfPoint2f srcPts = new MatOfPoint2f(); srcPts.fromArray(fromPts);
MatOfPoint2f dstPts = new MatOfPoint2f(); dstPts.fromArray(toPts);
Mat hh=Calib3d.findHomography(srcPts,dstPts);
Mat rectified=Mat.zeros(pixelsHigh,pixelsWide,gray.type());
Imgproc.warpPerspective(gray,rectified, hh,rectified.size());
// condition the output image a little
Core.normalize(rectified,rectified,0,255,Core.NORM_MINMAX,CvType.CV_8UC1);
double meanA=Core.mean(rectified).val[0];
if(meanA>128) Core.bitwise_not(rectified,rectified);

This is a lot of code, but it does several things. First we call a method that returns the list of corners. Next we establish the desired width and height of the rectified image, using the known dimensions of a business card and the desired DPI (dots per inch) resolution; for OCR systems this should be somewhere around 300 DPI. The following step is the most crucial: we use the detected corners and the desired corners of the card to compute a homography matrix, which "Imgproc.warpPerspective()" then uses to undo the distortion. The last step is not strictly needed but is nevertheless useful. We normalize the result so that the pixels span the full 0-255 range, then measure the mean value and invert the image if more pixels are white than black. We want the text in white, and we expect white pixels to be in the minority.
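As a concrete check of the sizing arithmetic, assuming the standard US business card dimensions of 3.5 by 2.0 inches (which the original "getFormatDim()" presumably returns, though we have not seen its body):

```java
class DpiMath {
    // Target pixel dimensions = physical size in inches times dots per inch.
    static int[] pixelDims(double inchesWide, double inchesHigh, int dpi) {
        return new int[]{(int) (inchesWide * dpi), (int) (inchesHigh * dpi)};
    }

    public static void main(String... args) {
        // A standard US business card is 3.5 x 2.0 inches (assumed here).
        int[] dims = pixelDims(3.5, 2.0, 300);
        System.out.println(dims[0] + " x " + dims[1] + " pixels"); // 1050 x 600 pixels
    }
}
```

So at 300 DPI the warp target is a 1050 by 600 pixel image, which is plenty of resolution for the character shapes on a card.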

The final result is deposited in a new image called "rectified" and looks like the following:

[Image: the rectified business card with white text on a dark background]

This concludes our demonstration of computer vision and OpenCV methods. We started with an image of a business card, distorted and embedded in its surroundings, and ended up with a rectified image of just the card. From here, applying optical character recognition should be much easier and more accurate.
