Document Processing Automation using Azure Form Recognizer


Document Processing Automation has seen promising improvements in recent times, to a large extent due to advances in Natural Language Processing (NLP). These include better accuracy in entity extraction, in OCR itself, and in the correct identification of information in tabular format. These enhancements enabled Robotic Process Automation (RPA) products to handle a wider variety of use cases and removed previous limitations, such as strict requirements on template standardization. They also enabled document processing solutions to be used in new areas such as invoice processing, legal contract preparation, and risk assessment.

Azure Form Recognizer, a Cognitive Services offering by Microsoft, enables document processing solution developers to quickly build applications leveraging core capabilities such as the extraction of key-value pairs and tables, allowing them to focus on other aspects like data source connectors or rich UI design.

This article highlights some of the challenges faced while building Form Recognizer models and suggests approaches to overcome them. We also present a solution accelerator we built to automate an essential part of many of these processes: grouping unsorted documents by their characteristics/type.


About Form Recognizer

Azure Form Recognizer’s functionality and models can be broadly categorized into 3 groups, described below.

Layout Model:

This model is used when developers are interested in extracting only tables from documents, e.g., extracting financial data from annual results reports.

Pre-built models:

These models are used when the documents being processed belong to one of the built-in categories, e.g., invoices, receipts, or identity documents like driving licenses and passports.

Custom models:

These models are used when your documents don’t fit a pre-built category and you need to extract custom key-value pairs and tables. Custom models have 2 sub-categories:

  • Train without labels: These models only require input data and will automatically detect key-value pairs and tables during training. An example of a key-value pair would be a field “Name” followed by the name of a person.
  • Train with labels: These models require input data plus key-value and table annotations to train.

These options are summarized graphically below:

[Image: summary of the Form Recognizer model categories]



Getting started with a Form Recognizer Project

When you start working on a Form Recognizer project, the first step is to check if your document processing/entity extraction requirements can be met with Layout or Pre-Built models. These models have a lower cost than custom models and don’t require building and maintaining models.

If the previous approach doesn’t fit the requirements, custom models will be required to accurately extract data. The approach we recommend for these more complex scenarios is described below.

The table below describes the main features and limitations of Custom Models categories.

[Image: table comparing the features and limitations of the Custom Model categories]

*Compose model is a functionality in Form Recognizer that can combine up to 100 “Train with labels” models into a single model.

*When inferencing with a composed model on new data, the composed model identifies the closest match among its component models and generates results using that model.
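The routing behaviour of a composed model can be pictured as picking, among its component models, the one whose template best matches the incoming page. A minimal pure-Python sketch of that idea (the `route_to_model` helper and the score values are illustrative, not part of the SDK):

```python
def route_to_model(page_scores):
    """Given per-model match confidences for a page, return the id of the
    best-matching component model, mimicking composed-model routing."""
    if not page_scores:
        raise ValueError("no candidate models")
    # Pick the component model with the highest match confidence.
    return max(page_scores, key=page_scores.get)

# Example: three component models scored against one incoming page.
scores = {"invoice-v1": 0.42, "invoice-v2": 0.91, "receipt-v1": 0.10}
best = route_to_model(scores)  # -> "invoice-v2"
```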

To summarize the above table, one can maximize the accuracy of Azure Form Recognizer if:

  • You identify the templates (i.e., document formats/structures) in your data and group the pages by template into folders
  • You build a model for each template and compose all the models

Manually grouping documents by structure is a tedious task, particularly for high-volume data such as invoices, correspondence, tax forms, etc. To address this problem, we have developed a Solution Accelerator that performs this task for you, described in the following section. Once the pages are properly grouped, developers can label the required entities as key-value pairs or tables and build “Train with labels” models.
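Conceptually, once every page has been assigned a template id, grouping is a simple bucketing operation. A pure-Python sketch of the idea (the accelerator itself works against Azure Blob Storage; `group_by_template` is an illustrative helper, not part of the accelerator's API):

```python
from collections import defaultdict

def group_by_template(pages):
    """Bucket (page_name, template_id) pairs into per-template groups,
    mirroring the per-template folders the accelerator creates."""
    groups = defaultdict(list)
    for page_name, template_id in pages:
        groups[template_id].append(page_name)
    return dict(groups)

pages = [("p1.pdf", "I1-1"), ("p2.pdf", "I1-13"), ("p3.pdf", "I1-1")]
groups = group_by_template(pages)
# groups -> {"I1-1": ["p1.pdf", "p3.pdf"], "I1-13": ["p2.pdf"]}
```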


Form Recognizer Document Grouping Solution Accelerator

Grouping documents together according to their layouts manually is a tedious task. The challenge is made worse when data volumes are high, as is often the case.

To address this problem, we have developed a document grouping Solution Accelerator, which we make freely available on GitHub with all the source code. It can be used to find the number of different templates and to group the raw documents by template. The steps are as follows:

  1. Identify unique page layouts from raw input documents
  2. Group together document pages with a common layout
  3. Provide a high-level view on the percentage of automation possible for your dataset
  4. Also separate pages that don’t have a structure, i.e., pages not suitable for extraction with Form Recognizer and better suited to services such as the Azure Cognitive Service for Language

Note: the current version of the solution accelerator was developed for version 2.1 of Azure Form Recognizer.

Design

The accelerator uses a combination of functionalities from “Train without labels” and “Train with labels” models to overcome the challenges with data selection for model training, data grouping and template count identification.

Under the hood, when you build a “Train without labels” model in Azure Form Recognizer, the service sorts the pages into templates based on their text content. It then assigns a unique ID to each template type (the page layout) before training the model.
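For reference, in the `azure-ai-formrecognizer` Python SDK targeting service v2.1, this cluster id typically surfaces in each recognized form’s `form_type` as a string such as `form-0` (this format reflects our observation for unlabeled models; verify against your SDK version). A small helper to pull the numeric id out of that string:

```python
def cluster_id_from_form_type(form_type):
    """Extract the numeric cluster id from an unlabeled custom model's
    form_type string (e.g. "form-3" -> 3). Returns None if no id is found."""
    prefix = "form-"
    if form_type and form_type.startswith(prefix):
        suffix = form_type[len(prefix):]
        if suffix.isdigit():
            return int(suffix)
    return None

cluster_id_from_form_type("form-3")  # -> 3
```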

[Image: screenshot of a “Train without labels” training result showing the cluster IDs]

The accelerator uses this feature to group the documents. Below is the step-by-step process followed:

  1. Data is placed on an Azure Storage blob (Formats supported are PDF, PNG, JPEG and TIFF)
  2. All documents are converted to PDF and split into single page documents (“1p-docs”)
  3. 500 pages of 1p-docs are then sampled
  4. A “Train without labels” model is built -- we refer to it as “m1”
  5. All the pages in “1p-docs” are inferred with model m1
  6. If a page result has a cluster Id (as per the screenshot above), create a folder in storage, and move the document to that location

Repeat from step 3 for the remaining pages. The process terminates when one of the following conditions is met:

  • All data from “1p-docs” was already considered for training the model
  • The count of documents grouped into clusters in the current iteration is < 5% of the initial page count (when this happens, most of the documents left in “1p-docs” are noisy, i.e., have no structure or template)
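The two stop conditions can be captured in a small predicate. A pure-Python sketch (the function name and parameters are illustrative; the accelerator implements this logic against blob storage):

```python
def should_stop(all_data_trained, grouped_this_iteration, initial_page_count,
                threshold=0.05):
    """True when the grouping loop should end: either all remaining pages
    were already used for training, or the current iteration grouped fewer
    than `threshold` (default 5%) of the initial page count."""
    low_yield = grouped_this_iteration < threshold * initial_page_count
    return all_data_trained or low_yield

# 40 pages grouped out of 1000 initial pages -> 4% < 5%, so stop.
should_stop(False, 40, 1000)  # -> True
```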

The following image shows the overall approach visually.

[Image: overview diagram of the document grouping approach]


How to use the accelerator

The Solution Accelerator is available on GitHub in this repository: https://github.com/Azure/form-recognizer-accelerator

You need the following Azure resources to run this accelerator:

  • Azure Blob Storage
  • Azure Form Recognizer
  • Optionally, an Azure Machine Learning Workspace with a Compute Instance (or a Data Science Virtual Machine)

Steps:

  1. Create a Container in your blob storage account and place your files in it
  2. Create a second Container to store the results
  3. Clone the contents of the GitHub repository to the Compute Instance, Data Science Virtual Machine, VS Code, or any other way you have to run Jupyter Notebooks
  4. Update the “code\main.ipynb” notebook with the names of the source & results containers, plus the connection string to the Storage Account:

[Image: notebook cell showing the configuration parameters]

  5. Run the “main.ipynb” Notebook

Depending on your data size (more precisely, the page count), the code takes about 2 hours for every 1,000 files.

The results will be saved in the destination container under a “clusters” folder, as shown below. Each folder will contain pages with the same layout/template.

E.g., I1-1 will contain pages of one template, while I1-13 will contain pages of another template.

[Image: destination container showing the cluster folders]

Here’s how you can interpret the results obtained from running the code:

  1. The folder with the most files in it represents the most frequently seen template/page layout (e.g., I1-9)
  2. The total count of pages across all cluster folders divided by the count of pages in “1p-docs” gives a good indication of the automation percentage possible with Azure Form Recognizer for this set of documents. If a given cluster folder has a good number of files, they all share the same layout, and creating a “Train with labels” model for it will be advantageous.
  3. The number of cluster folders is the maximum number of custom models that will potentially need to be built
  4. The files in each cluster folder can be used as the inputs for building “Train with labels” models in Form Recognizer
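The automation estimate in point 2 above is just the fraction of pages that landed in a cluster. A sketch of the arithmetic, assuming a dict of cluster folder sizes (the folder names and counts are illustrative):

```python
def automation_percentage(cluster_sizes, total_pages):
    """Estimate the share of pages Form Recognizer could automate:
    pages grouped into clusters divided by all pages in '1p-docs'."""
    clustered = sum(cluster_sizes.values())
    return 100.0 * clustered / total_pages

clusters = {"I1-1": 420, "I1-9": 300, "I1-13": 80}
pct = automation_percentage(clusters, 1000)  # -> 80.0
```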


How to build robust Form Recognizer models

Once you have the grouped files that are the output of the accelerator, you have what’s required to build “Train with labels” models. It is now essential that variations within the pages of the same template are part of the training (labelled) data, so that you can achieve good accuracy in a production system. We recommend the following approach to achieve this.

  1. Randomly pick 5 to 10 pages from a cluster, label the fields, and train the model - let’s call it v1
  2. Infer all documents in the cluster with model v1
  3. Record the docTypeConfidence score for each page. This is a number between 0 and 1 indicating how confident Form Recognizer is of having detected the right page layout.
  4. Randomly pick 10 pages with low confidence scores, add them to the input data, and build the model again (now v2)
  5. Infer all documents with v2 model
  6. Repeat this process until the confidence score is above 50% for every document. At this point you have a robust model.
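Steps 3 and 4 above amount to sampling from the low-confidence tail of the docTypeConfidence scores. A sketch of that selection (hypothetical helper; `scores` maps page names to docTypeConfidence values):

```python
import random

def pick_low_confidence(scores, k=10, threshold=0.5, seed=None):
    """Randomly pick up to k pages whose docTypeConfidence is below
    the threshold, to add to the next round of labelled training data."""
    low = [page for page, score in scores.items() if score < threshold]
    rng = random.Random(seed)
    return rng.sample(low, min(k, len(low)))

scores = {"a.pdf": 0.95, "b.pdf": 0.30, "c.pdf": 0.45, "d.pdf": 0.80}
picked = pick_low_confidence(scores, k=2, seed=0)  # two of the low-score pages
```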


And that’s it! Please let us know what you think of the code and report any issues you find on GitHub.

For more information on Azure Form Recognizer, please check the official Azure Form Recognizer documentation.
