Document Processing Automation using Azure Form Recognizer


Document Processing Automation has seen promising improvements in recent times, to a large extent due to advances in Natural Language Processing (NLP). These include better accuracy in entity extraction, in OCR itself, and in the correct identification of information in tabular format. These enhancements enabled Robotic Process Automation (RPA) products to handle a wider variety of use cases and removed previous limitations, such as strict requirements on template standardization. They also enabled document processing solutions to be used in new areas such as invoice processing, legal contract preparation, and risk assessment.

Azure Form Recognizer, a Cognitive Services offering by Microsoft, enables document processing solution developers to quickly build applications leveraging core capabilities such as the extraction of key-value pairs and tables, allowing them to focus on other aspects like data source connectors or rich UI design.

This article highlights some of the challenges faced while building Form Recognizer models and suggests approaches to overcome them. We also present a solution accelerator we built to automate an essential part of many of these processes: grouping unsorted documents by their characteristics/type.


About Form Recognizer

Azure Form Recognizer’s functionality and models can be broadly categorized into 3 groups, described below.

Layout Model:

This model is used when developers are interested in extracting only tables from documents, e.g., extracting financial data from annual results reports.

Pre-built models:

These models are used when the documents being processed belong to one of the built-in categories, e.g., invoices, receipts, or identity documents like driving licenses and passports.

Custom models:

These models are used when your documents don’t fit a pre-built category and you need to extract custom key-value pairs and tables. Custom models have 2 sub-categories:

  • Train without labels: These models only require input data and will automatically detect key-value pairs and tables during training. An example of a key-value pair would be a field “Name” followed by the name of a person.
  • Train with labels: These models require input data plus key-value and table annotations to train.

These options are summarized graphically below:

[Image: summary of the Form Recognizer model categories]



Getting started with a Form Recognizer Project

When you start working on a Form Recognizer project, the first step is to check if your document processing/entity extraction requirements can be met with Layout or Pre-Built models. These models have a lower cost than custom models and don’t require building and maintaining models.

If the previous approach doesn’t fit the requirements, custom models will be required to accurately extract data. The approach we recommend for these more complex scenarios is described below.

The table below describes the main features and limitations of Custom Models categories.

[Image: table comparing the features and limitations of the Custom Model categories]

*Compose model is a functionality in Form Recognizer that can combine up to 100 “Train with labels” models into a single model.

*When inferencing with a composed model on new data, the composed model identifies the closest match among its component models and generates results using that model.
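The routing behaviour of a composed model can be pictured as picking, among its component models, the one whose template best matches the incoming page. A minimal pure-Python sketch of that idea (the `route_to_model` helper and the score values are illustrative, not part of the SDK):

```python
def route_to_model(page_scores):
    """Given per-model match confidences for a page, return the id of the
    best-matching component model, mimicking composed-model routing."""
    if not page_scores:
        raise ValueError("no candidate models")
    # Pick the component model with the highest match confidence.
    return max(page_scores, key=page_scores.get)

# Example: three component models scored against one incoming page.
scores = {"invoice-v1": 0.42, "invoice-v2": 0.91, "receipt-v1": 0.10}
best = route_to_model(scores)  # -> "invoice-v2"
```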

To summarize the above table, one can maximize the accuracy of Azure Form Recognizer if:

  • You identify the templates (i.e., document formats/structures) in your data and group the pages by template into folders
  • You build a model for each template and compose all the models

Manually grouping documents by structure is a tedious task, particularly for high-volume data such as invoices, correspondence, tax forms, etc. To address this problem, we have developed a Solution Accelerator that performs this task for you, described in the following section. Once the pages are properly grouped, developers can label the required entities as key-value pairs or tables and build “Train with labels” models.
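Conceptually, once every page has been assigned a template id, grouping is a simple bucketing operation. A pure-Python sketch of the idea (the accelerator itself works against Azure Blob Storage; `group_by_template` is an illustrative helper, not part of the accelerator's API):

```python
from collections import defaultdict

def group_by_template(pages):
    """Bucket (page_name, template_id) pairs into per-template groups,
    mirroring the per-template folders the accelerator creates."""
    groups = defaultdict(list)
    for page_name, template_id in pages:
        groups[template_id].append(page_name)
    return dict(groups)

pages = [("p1.pdf", "I1-1"), ("p2.pdf", "I1-13"), ("p3.pdf", "I1-1")]
groups = group_by_template(pages)
# groups -> {"I1-1": ["p1.pdf", "p3.pdf"], "I1-13": ["p2.pdf"]}
```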


Form Recognizer Document Grouping Solution Accelerator

Grouping documents together according to their layouts manually is a tedious task. The challenge is made worse when data volumes are high, as is often the case.

To address this problem, we have developed a document grouping Solution Accelerator, which we make freely available on GitHub with all the source code. It can be used to find the number of different templates and to group the raw documents by template. The steps are as follows:

  1. Identify unique page layouts from raw input documents
  2. Group together document pages with a common layout
  3. Provide a high-level view on the percentage of automation possible for your dataset
  4. Also separate pages that don’t have a structure, i.e., pages not suitable for extraction with Form Recognizer and better suited to services such as the Azure Cognitive Service for Language

Note: the current version of the solution accelerator was developed for version 2.1 of Azure Form Recognizer.

Design

The accelerator uses a combination of functionalities from “Train without labels” and “Train with labels” models to overcome the challenges with data selection for model training, data grouping and template count identification.

Under the hood, when you build a “Train without labels” model in Azure Form Recognizer, the service sorts the pages into templates based on their text content. It then assigns a unique ID to each template type (the page layout) before training the model.
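For reference, in the `azure-ai-formrecognizer` Python SDK targeting service v2.1, this cluster id typically surfaces in each recognized form’s `form_type` as a string such as `form-0` (this format reflects our observation for unlabeled models; verify against your SDK version). A small helper to pull the numeric id out of that string:

```python
def cluster_id_from_form_type(form_type):
    """Extract the numeric cluster id from an unlabeled custom model's
    form_type string (e.g. "form-3" -> 3). Returns None if no id is found."""
    prefix = "form-"
    if form_type and form_type.startswith(prefix):
        suffix = form_type[len(prefix):]
        if suffix.isdigit():
            return int(suffix)
    return None

cluster_id_from_form_type("form-3")  # -> 3
```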

[Image: screenshot of a “Train without labels” training result showing the cluster IDs]

The accelerator uses this feature to group the documents. Below is the step-by-step process followed:

  1. Data is placed on an Azure Storage blob (Formats supported are PDF, PNG, JPEG and TIFF)
  2. All documents are converted to PDF and split into single page documents (“1p-docs”)
  3. 500 pages of 1p-docs are then sampled
  4. A “Train without labels” model is built -- we refer to it as “m1”
  5. All the pages in “1p-docs” are inferred with model m1
  6. If a page result has a cluster Id (as per the screenshot above), create a folder in storage, and move the document to that location

Repeat from step 3 for the remaining pages. The process terminates when one of the following conditions is met:

  • All data from “1p-docs” was already considered for training the model
  • The count of documents grouped into clusters in the current iteration is < 5% of the initial page count (when this happens, most of the documents left in “1p-docs” are noisy, i.e., have no structure or template)
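The two stop conditions can be captured in a small predicate. A pure-Python sketch (the function name and parameters are illustrative; the accelerator implements this logic against blob storage):

```python
def should_stop(all_data_trained, grouped_this_iteration, initial_page_count,
                threshold=0.05):
    """True when the grouping loop should end: either all remaining pages
    were already used for training, or the current iteration grouped fewer
    than `threshold` (default 5%) of the initial page count."""
    low_yield = grouped_this_iteration < threshold * initial_page_count
    return all_data_trained or low_yield

# 40 pages grouped out of 1000 initial pages -> 4% < 5%, so stop.
should_stop(False, 40, 1000)  # -> True
```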

The following image shows the overall approach visually.

[Image: overview diagram of the document grouping approach]


How to use the accelerator

The Solution Accelerator is available on GitHub in this repository: https://github.com/Azure/form-recognizer-accelerator

You need the following Azure resources to run this accelerator:

  • Azure Blob Storage
  • Azure Form Recognizer
  • Optionally, an Azure Machine Learning Workspace with a Compute Instance (or a Data Science Virtual Machine)

Steps:

  1. Create a Container in your blob storage account and place your files in it
  2. Create a second Container to store the results
  3. Clone the contents of the GitHub repository to the Compute Instance, Data Science Virtual Machine, VS Code, or any other way you have to run Jupyter Notebooks
  4. Update the “code\main.ipynb” notebook with the names of the source & results containers, plus the connection string to the Storage Account:

[Image: notebook cell showing the configuration parameters]

  5. Run the “main.ipynb” Notebook

Depending on your data size (more precisely, the page count), the code takes about 2 hours for every 1,000 files.

The results will be saved in the destination container under a “clusters” folder, as shown below. Each folder will contain pages with the same layout/template.

E.g., I1-1 will contain pages of one template, while I1-13 will contain pages of another template.

[Image: destination container showing the cluster folders]

Here’s how you can interpret the results obtained from running the code:

  1. The folder with the most files in it represents the most frequently seen template/page layout (e.g., I1-9)
  2. The total count of pages across all cluster folders divided by the count of pages in “1p-docs” gives a good indication of the automation percentage possible with Azure Form Recognizer for this set of documents. If a given cluster folder has a good number of files, they all share the same layout, and creating a “Train with labels” model for it will be advantageous.
  3. The number of cluster folders is the maximum number of custom models that will potentially need to be built
  4. The files in each cluster folder can be used as the inputs for building “Train with labels” models in Form Recognizer
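The automation estimate in point 2 above is just the fraction of pages that landed in a cluster. A sketch of the arithmetic, assuming a dict of cluster folder sizes (the folder names and counts are illustrative):

```python
def automation_percentage(cluster_sizes, total_pages):
    """Estimate the share of pages Form Recognizer could automate:
    pages grouped into clusters divided by all pages in '1p-docs'."""
    clustered = sum(cluster_sizes.values())
    return 100.0 * clustered / total_pages

clusters = {"I1-1": 420, "I1-9": 300, "I1-13": 80}
pct = automation_percentage(clusters, 1000)  # -> 80.0
```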


How to build robust Form Recognizer models

Once you have the grouped files that are the output of the accelerator, you have what’s required to build “Train with labels” models. It is now essential that variations within the pages of the same template are part of the training (labelled) data, so that you can achieve good accuracy in a production system. We recommend the following approach to achieve this.

  1. Randomly pick 5 to 10 pages from a cluster, label the fields, and train the model - let’s call it v1
  2. Infer all documents in the cluster with model v1
  3. Record the docTypeConfidence score for each page. This is a number between 0 and 1 indicating how confident Form Recognizer is of having detected the right page layout.
  4. Randomly pick 10 pages with low confidence scores, add them to the input data, and build the model again (now v2)
  5. Infer all documents with v2 model
  6. Repeat this process until the confidence score is above 50% for every document. At this point you have a robust model.
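Steps 3 and 4 above amount to sampling from the low-confidence tail of the docTypeConfidence scores. A sketch of that selection (hypothetical helper; `scores` maps page names to docTypeConfidence values):

```python
import random

def pick_low_confidence(scores, k=10, threshold=0.5, seed=None):
    """Randomly pick up to k pages whose docTypeConfidence is below
    the threshold, to add to the next round of labelled training data."""
    low = [page for page, score in scores.items() if score < threshold]
    rng = random.Random(seed)
    return rng.sample(low, min(k, len(low)))

scores = {"a.pdf": 0.95, "b.pdf": 0.30, "c.pdf": 0.45, "d.pdf": 0.80}
picked = pick_low_confidence(scores, k=2, seed=0)  # two of the low-score pages
```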


And that’s it! Please let us know what you think of the code and report any issues you find on GitHub.

For more information on Azure Form Recognizer, please check the official Azure Form Recognizer documentation.
