Challenges in Automating Business Processes Using Computer Vision

For a while, I have been experimenting with Azure Cognitive Services and similar offerings from competing products. One significant advantage I found with Azure Cognitive Services was its ability to recognize handwritten text with better accuracy than other providers. Encouraged by these results, we explored the feasibility of using Azure Cognitive Services, namely the Computer Vision service and the Translator service, to automate the translation of documents involved in insurance processing.

How Azure Cognitive Services work

The Computer Vision service offers two APIs for recognizing text in images:

1. OCR

2. Handwritten text recognition

The scenarios in which to use each of them are intuitive. The handwritten text API can also recognize printed characters in an image (with a small error percentage). However, there are technical differences in the way the two APIs behave:

 

1. While the OCR API supports text recognition in many different languages, the handwritten text API does not.

2. The output JSON structure differs between the two APIs. The handwritten text API returns output as lines and the words within those lines, whereas OCR splits the document into regions and returns output per region. The handwritten text API gives the coordinates of the four corners of a line or word, while OCR gives only the starting coordinate plus the width and height of the region, line or word.

3. Both APIs give the coordinates of every word. However, for the same input (a scanned image containing printed text), the coordinates returned for a word differ between the two APIs.
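To make the structural difference concrete, below is a minimal Python sketch of how such responses can be flattened into a common word list. The JSON field names ("regions", "lines", "words", "boundingBox", "recognitionResult") follow the structures described above, but they are assumptions and may vary across API versions.

```python
# A minimal sketch (not production code) that flattens both response shapes
# into a common list of (text, x, y, width, height) tuples.
# Field names are based on the response structures described above and are
# assumptions that may differ by API version.

def words_from_ocr(ocr_json):
    """OCR: regions -> lines -> words, boundingBox is a 'x,y,width,height' string."""
    words = []
    for region in ocr_json.get("regions", []):
        for line in region.get("lines", []):
            for word in line.get("words", []):
                x, y, w, h = (int(v) for v in word["boundingBox"].split(","))
                words.append((word["text"], x, y, w, h))
    return words

def words_from_handwriting(hw_json):
    """Handwritten text: lines -> words, boundingBox is 8 numbers (4 corner points)."""
    words = []
    for line in hw_json.get("recognitionResult", {}).get("lines", []):
        for word in line.get("words", []):
            xs = word["boundingBox"][0::2]   # x coordinates of the 4 corners
            ys = word["boundingBox"][1::2]   # y coordinates of the 4 corners
            x, y = min(xs), min(ys)
            words.append((word["text"], x, y, max(xs) - x, max(ys) - y))
    return words
```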

 

Our scope and challenges

Our requirement was to identify the text in a document (composed of scanned images of paper documents) and then translate it.

A common scenario in insurance claim processing and similar use cases is that the documents are a mix of handwritten text and printed characters. A significant percentage of images have part of the content typed and part written by hand (e.g. prescriptions and claim forms). In many cases the handwriting is not legible. Given that the documents come from multiple sources (e.g. hospitals, pharmacies, labs, ambulance services), there is no standard template either.

Many documents also have check boxes and similar elements, which users select on the paper form (which is then scanned and sent to the system).

For business users (e.g. underwriters or claim reviewers sitting in a different geography) to make decisions based on the translated document, it should be in the same format as the original.

Challenges

1. Due to the nature of the APIs, it was very difficult to re-create the document with the same dimensions as the original. The APIs give each word its pixel coordinates. Though we wrote logic to re-create a line based on the x and y coordinates (see the sketch after this list), it is not foolproof because:

a. If different words in the same line have a different size or height, they are returned in different regions or lines. While we could make some decisions based on coordinates, there are cases where this fails, and tweaking it further results in more false positives.

2. The width of text in the original document depends on its size and font. The OCR and handwritten APIs do not return the font and size of the original text, so when we re-create the document we use a standard size and font (say Arial, 12pt). Even if we keep the same gap between two words, the end result does not look the same as the original.

3. The APIs cannot detect check boxes or which ones were selected. If sensitive information is captured in them, there is a high chance that the translated or recognized content misguides the underwriter into making a wrong decision.

4. Given that most documents contain a mix of printed and handwritten text, and there is no single API that recognizes both with a good confidence percentage across all languages, we ended up using both APIs. (We cannot ask business users to choose the document type, as a claim will contain documents of different types. Also, if an AI system requires human intervention, it can hardly be deemed intelligent.) In many cases both the handwritten and OCR APIs recognize the same words, so we had to write extra logic to remove duplicates. This is prone to errors because:

a. Both APIs give different coordinates for the same word (they are more or less the same, differing by 5-10 pixels/points). There is no reliable way to standardize the acceptable pixel difference and derive a deduplication rule from it.

b. While the handwritten API can recognize typed words in an image, if the language is not English the words identified by the OCR and handwritten APIs for the same position differ. We cannot compare their values to check whether they are the same, and we do not know whether the text is printed or handwritten so as to give priority to one API.

5. Another minor aspect is that the handwritten text API cannot detect orientation, while OCR can.

6. When a word is recognized, the system cannot differentiate proper nouns from other words, so it ends up translating the names of persons, businesses, etc. For the end business user, this makes decision making more human dependent.
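For illustration, the sketch below shows roughly the kind of line-reconstruction heuristic referred to in challenge 1: group words whose vertical positions are close, then order each group left to right. The tolerance value is an assumed number, and it is exactly the parameter that is hard to tune across documents.

```python
# A rough sketch of the line-reconstruction heuristic described in challenge 1.
# Words are (text, x, y, width, height) tuples. Y_TOLERANCE is an assumed value
# and is precisely the parameter that is hard to get right for all documents.

Y_TOLERANCE = 10  # pixels; words whose top edges differ by less are treated as one line

def group_into_lines(words):
    lines = []
    for word in sorted(words, key=lambda w: (w[2], w[1])):   # sort by y, then x
        for line in lines:
            if abs(line[0][2] - word[2]) <= Y_TOLERANCE:     # roughly the same y position
                line.append(word)
                break
        else:
            lines.append([word])                             # start a new line
    # order each reconstructed line left to right and join the words
    return [" ".join(w[0] for w in sorted(line, key=lambda w: w[1])) for line in lines]
```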


Since the handwritten API can recognize both printed and handwritten text, our system first recognizes the text using the handwritten API, then calls OCR, identifies the words in each line, removes duplicates and re-creates the content.
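Put together, the flow looks roughly like the sketch below: parse the handwritten result first, add any OCR words that are not near an already recognized word, then rebuild the lines. The 10-pixel threshold is an assumption reflecting the 5-10 pixel drift mentioned in challenge 4; there is no single value that works for every document.

```python
# A simplified sketch of the overall flow described above. words_from_ocr,
# words_from_handwriting and group_into_lines are the helpers sketched earlier;
# the 10-pixel threshold is an assumed value, not a recommendation.

def deduplicate(hw_words, ocr_words, tolerance=10):
    """Drop OCR words that sit (almost) on top of a word already found by the handwritten API."""
    merged = list(hw_words)
    for text, x, y, w, h in ocr_words:
        duplicate = any(abs(x - hx) <= tolerance and abs(y - hy) <= tolerance
                        for _, hx, hy, _, _ in hw_words)
        if not duplicate:
            merged.append((text, x, y, w, h))
    return merged

def recognize_document(hw_json, ocr_json):
    hw_words = words_from_handwriting(hw_json)   # handwritten API result, parsed first
    ocr_words = words_from_ocr(ocr_json)         # OCR result, used to fill the gaps
    return group_into_lines(deduplicate(hw_words, ocr_words))
```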

Conclusion

In the long run, unless there is a single API that recognizes both printed and handwritten text and also returns more metadata about the content (e.g. probable font and size, and special characters such as check boxes along with whether they are selected), the probability of Azure or similar cognitive services gaining traction in real-world business cases in a big way is remote.

Due to the sensitivity of the data we processed, I could not post the documents we used for testing our system.

If anybody has used these services (specifically OCR) in their business use case at scale, I would like to hear their views and inputs.
