Document Processing Automation using Azure Form Recognizer
Document processing automation has improved markedly in recent years, largely thanks to advances in Natural Language Processing (NLP). These include more accurate entity extraction, better OCR, and more reliable identification of information in tabular format. These enhancements have enabled Robotic Process Automation (RPA) products to handle a wider variety of use cases and removed earlier limitations, such as strict requirements on template standardization. They have also opened up new application areas for document processing solutions, such as invoice processing, legal contract preparation, and risk assessment.
Azure Form Recognizer, a Cognitive Service offered by Microsoft, enables developers of document processing solutions to quickly build applications on top of core capabilities such as the extraction of key-value pairs and tables, so that they can focus on other aspects like data source connectors or rich UI design.
This article highlights some of the challenges faced while building Form Recognizer models and suggests approaches to overcome them. It also introduces a solution accelerator that we built to automate an essential part of many of these processes: grouping unsorted documents by their characteristics or type.
About Form Recognizer
Azure Form Recognizer’s functionality and models can be broadly categorized into 3 groups, described below.
Layout Model:
This model is used when developers are interested in extracting only tables from documents, e.g., extracting financial data from annual results reports.
Pre-built models:
These models are used when the documents being processed belong to one of the built-in categories, e.g., invoices, receipts, or identity documents such as driving licenses and passports.
Custom models:
These models are used when your document doesn’t fit any of the pre-built categories and you need to extract custom key-value pairs and tables. Custom models have 2 sub-categories:
These options are summarized graphically below:
Getting started with a Form Recognizer Project
When you start working on a Form Recognizer project, the first step is to check if your document processing/entity extraction requirements can be met with Layout or Pre-Built models. These models have a lower cost than custom models and don’t require building and maintaining models.
If Layout or Pre-Built models don’t meet the requirements, custom models will be needed to extract the data accurately. The approach we recommend for these more complex scenarios is described below.
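The decision described above can be sketched as a small helper. This is only an illustration: the function name, its parameters, and the category labels are hypothetical, and the set of built-in categories is limited to the examples mentioned in this article.

```python
# Built-in categories mentioned in this article (an illustrative subset).
PREBUILT_CATEGORIES = {"invoice", "receipt", "identity_document"}


def choose_model_family(
    document_category: str,
    tables_only: bool = False,
    needs_custom_fields: bool = False,
) -> str:
    """Return "layout", "prebuilt" or "custom" for a document type."""
    if needs_custom_fields:
        # Custom key-value pairs cannot be served by Layout or Pre-Built models.
        return "custom"
    if tables_only:
        return "layout"
    if document_category in PREBUILT_CATEGORIES:
        return "prebuilt"
    return "custom"


print(choose_model_family("invoice"))
print(choose_model_family("annual_report", tables_only=True))
print(choose_model_family("tax_form"))
```

Checking the cheaper options first mirrors the recommendation above: custom models are the fallback, not the starting point.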
The table below describes the main features and limitations of Custom Models categories.
*Compose model is a feature of Form Recognizer that can combine up to 100 “Train with labels” models into a single model.
*When a composed model is used for inference on new data, it identifies the closest match among all its component models and generates results using that model.
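As a rough illustration of the two notes above, the sketch below checks the 100-model limit before composing and mimics the “closest match” routing with a toy score table. The function names and the scores are hypothetical; the real matching happens inside the service and is not exposed in this form.

```python
from typing import Dict, List

MAX_COMPOSED_MODELS = 100  # limit stated in the note above


def validate_compose_request(model_ids: List[str]) -> List[str]:
    """Return de-duplicated model IDs, or raise if the compose limit is exceeded."""
    unique_ids = list(dict.fromkeys(model_ids))  # keeps first-seen order
    if not unique_ids:
        raise ValueError("at least one model ID is required")
    if len(unique_ids) > MAX_COMPOSED_MODELS:
        raise ValueError(
            f"{len(unique_ids)} models given, limit is {MAX_COMPOSED_MODELS}"
        )
    return unique_ids


def pick_closest_model(match_scores: Dict[str, float]) -> str:
    """Toy stand-in for the service's routing: choose the best-scoring model."""
    return max(match_scores, key=match_scores.get)


# Hypothetical model IDs and scores, for illustration only.
ids = validate_compose_request(["invoice_v1", "invoice_v2", "invoice_v1"])
print(ids)
print(pick_closest_model({"invoice_v1": 0.41, "invoice_v2": 0.87}))
```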
To summarize the above table, one can maximize the accuracy of Azure Form Recognizer if
Manually grouping documents by their structure is a tedious task, particularly for high-volume data such as invoices, correspondence, and tax forms. To address this problem, we have developed a Solution Accelerator that performs this task for you; it is described in the following section. Once the pages are properly grouped, developers can label the required entities as key-value pairs or tables and build “Train with labels” models.
Form Recognizer Document Grouping Solution Accelerator:
Manually grouping documents by their layouts is a tedious task. The challenge gets worse when data volumes are high, as is often the case.
To address this problem, we have developed a document grouping Solution Accelerator, which we make freely available on GitHub with all the source code. This can be used to find the count of different templates, and group the raw documents into templates. The steps are as follows:
Note: the current version of the solution accelerator was developed for version 2.1 of Azure Form Recognizer.
Design
The accelerator uses a combination of functionalities from “Train without labels” and “Train with labels” models to overcome the challenges with data selection for model training, data grouping and template count identification.
Under the hood, when you build a “Train without labels” model on Azure Form Recognizer, the service will sort the pages into templates based on the text content of the pages. It then assigns a unique ID to each template type (the page layout) before training the model.
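To make the grouping idea concrete, here is a minimal sketch that buckets pages by template ID. It assumes each page has already been assigned an ID; the `page_cluster_ids` mapping and the file names are made up for illustration and are not the accelerator's actual data structures.

```python
from collections import defaultdict
from typing import Dict, List


def group_pages_by_template(page_cluster_ids: Dict[str, int]) -> Dict[int, List[str]]:
    """Bucket page names by the template (cluster) ID assigned to each page."""
    groups = defaultdict(list)
    # Sort by page name so the order inside each group is deterministic.
    for page_name, cluster_id in sorted(page_cluster_ids.items()):
        groups[cluster_id].append(page_name)
    return dict(groups)


# Hypothetical input: page file name -> template ID.
example = {"inv_001_p1.pdf": 0, "inv_002_p1.pdf": 1, "inv_003_p1.pdf": 0}
for cluster_id, pages in sorted(group_pages_by_template(example).items()):
    print(f"template {cluster_id}: {len(pages)} page(s)")
```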
The accelerator uses this feature to group the documents. Below is the step-by-step process followed:
Repeat from step 3 for the remaining pages. The process terminates when one of the following conditions is met:
The following image shows the overall approach visually.
See animated version here
How to use the accelerator
The Solution Accelerator is available on GitHub in this repository: https://github.com/Azure/form-recognizer-accelerator .
You need the following Azure resources to run this accelerator:
Steps:
The run time depends on your data size (more precisely, the page count). As a rough guide, the code takes about 2 hours for every 1000 files.
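For planning purposes, the figure above can be turned into a back-of-the-envelope estimate. The helper below is a hypothetical convenience, not part of the accelerator, and it assumes the run time scales linearly with the file count, which may not hold for files with very different page counts.

```python
HOURS_PER_1000_FILES = 2.0  # rough figure quoted above


def estimate_runtime_hours(file_count: int) -> float:
    """Linear estimate of the accelerator's run time, in hours."""
    if file_count < 0:
        raise ValueError("file_count must be non-negative")
    return file_count / 1000 * HOURS_PER_1000_FILES


print(estimate_runtime_hours(2500))
```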
The results will be saved in the destination container under a clusters folder, as shown below. Each folder will contain pages with the same layout/template.
For example, I1-1 will contain pages of one template, and I1-13 will contain pages of another template.
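One way to get a quick overview of the output is to count the pages in each template folder. The sketch below works on a plain list of blob paths; the `clusters/<folder>/<file>` layout and the sample names are assumptions based on the description above, so adjust the parsing to the paths you actually see in your container.

```python
from collections import Counter
from typing import Iterable


def count_pages_per_template(blob_names: Iterable[str], root: str = "clusters") -> Counter:
    """Count files per template folder for paths shaped like root/<folder>/<file>."""
    counts = Counter()
    for name in blob_names:
        parts = name.strip("/").split("/")
        # Ignore anything that is not at least root/<template folder>/<file>.
        if len(parts) >= 3 and parts[0] == root:
            counts[parts[1]] += 1
    return counts


# Hypothetical blob paths, for illustration only.
sample = [
    "clusters/I1-1/page_001.pdf",
    "clusters/I1-1/page_002.pdf",
    "clusters/I1-13/page_003.pdf",
    "raw/page_004.pdf",
]
for folder, page_count in sorted(count_pages_per_template(sample).items()):
    print(f"{folder}: {page_count} page(s)")
```

Folders with very few pages are worth a manual look, since they may be rare templates or pages that did not match any larger group.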
Here’s how you can interpret the results obtained from running the code:
How to build robust Form Recognizer models
Once you have the grouped files that the accelerator produces, you have what’s required to build “Train with labels” models. It is now essential that the variations within the pages of the same template are represented in the training (labelled) data, so that you can achieve good accuracy in a production system. We recommend the following approach to achieve this.
And that’s it! Please let us know what you think of the code and report any issues you find on GitHub.
For more information on the Azure Form Recognizer, please check the following pages: