Business Document Processing with AI

Business documents are special. Whether it's invoices, payment advice, bills of lading, or goods receipts: documents are hugely relevant to daily business operations. And they abound with rich table structures, special terms, reference numbers, numerical quantities, and financial amounts.

Traditional document processing follows a two-step approach. It first turns pixels into characters and words with optical character recognition (OCR), then tries to extract relevant entities, like a vendor name or a total price. There are three main challenges with this approach:

  1. OCR doesn't work all that well. Error rates for individual characters remain significant. Today's complex OCR pipelines try to hide this behind several iterative steps, including language models. That smudged third character after the "t" and the "h"? It's probably an "e". This approach works well for letters, articles and books. But for business documents with many out-of-vocabulary words like product names, reference numbers, and financial amounts, language models are less impactful.
  2. Passing only the completed words from OCR to entity extraction turns all gray areas into black and white. Even where the OCR isn't quite sure, it has to make a call. Passing richer information about probabilities to entity extraction can help disambiguate the close calls based on broader document context.
  3. Many entity extraction approaches feed one character or one word at a time into a recurrent neural network, e.g. using long short-term memory neurons (LSTMs) or gated recurrent units (GRUs). That works well for a sentence or two, like in chatbot scenarios. But when texts become longer like this article, recurrent networks have forgotten the beginning when they reach the end. That can be partially addressed by feeding the text into a second network reading it backwards. What cannot be addressed well at all are documents with two-dimensional structures like tables. Their positional meaning is lost when you turn them into one long spaghetti.
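To make the last point concrete, here is a toy illustration (pure Python, with made-up table contents) of how serializing a table into reading order discards exactly the column structure that a sequence model would need:

```python
# A tiny illustration of why flattening a 2-D table into a 1-D token
# stream destroys positional meaning. Column membership is obvious in
# the grid but invisible in the flattened sequence.
table = [
    ["Item",     "Qty", "Price"],
    ["Widget A", "2",   "9.99"],
    ["Widget B", "5",   "4.50"],
]

# Reading order (left-to-right, top-to-bottom), as an OCR pipeline
# would emit it before entity extraction:
flattened = [cell for row in table for cell in row]
print(flattened)

# The header "Qty" sits at position 1, while the quantity "2" it
# labels sits at position 4 -- the fact that they share a column is
# gone, and the distance between them grows with table width.
print(flattened.index("Qty"), flattened.index("2"))  # -> 1 4
```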

The AI teams at SAP have found a way to process business documents which addresses these three challenges. The Chargrid represents a document as an image with many color channels. Instead of three channels for red, green and blue, the Chargrid has one color channel for the letter "a", one for the letter "b", and so on.
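As a rough sketch of that encoding (illustrative only; the alphabet, coordinate conventions, and exact encoding in the Chargrid paper differ), one can build such a grid by painting each character's bounding box into its own channel:

```python
import numpy as np

# A minimal sketch of the chargrid idea: one channel per character of
# a small alphabet, with a 1 wherever that character is printed on the
# page. The alphabet and box format here are assumptions for the demo.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}

def make_chargrid(height, width, boxes):
    """boxes: list of (char, top, left, bottom, right) in pixel coords.
    Returns an array of shape (len(ALPHABET), height, width) with a 1
    in channel k wherever character ALPHABET[k] appears."""
    grid = np.zeros((len(ALPHABET), height, width), dtype=np.float32)
    for char, top, left, bottom, right in boxes:
        idx = CHAR_TO_IDX.get(char.lower())
        if idx is not None:
            grid[idx, top:bottom, left:right] = 1.0
    return grid

# "a1" printed on a 4x8 page: 'a' fills the left half, '1' the right.
g = make_chargrid(4, 8, [("a", 0, 0, 4, 4), ("1", 0, 4, 4, 8)])
print(g.shape)  # (36, 4, 8)
print(g[CHAR_TO_IDX["a"]].sum(), g[CHAR_TO_IDX["1"]].sum())  # 16.0 16.0
```

The result is shaped exactly like a multi-channel image, which is what lets standard convolutional architectures consume it directly.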

Raw document and chargrid representation

This representation unlocks a broad range of neural network techniques from computer vision for processing text:

  • With Chargrid, OCR becomes an ultra-dense object detection task. The basic object detection problem is well known in computer vision, and can be efficiently addressed with fully convolutional networks. Chargrid-OCR consists of an encoder and two decoder stages, which are strictly feed-forward and run well on accelerators like GPUs. The encoder applies several convolutions to the data, reducing its spatial size while increasing the depth of the latent representation. Downsampling is critical here, as high-resolution A4 or letter documents are significantly larger than typical ImageNet picture sizes. The two decoders each apply a series of deconvolutions, which make the representation wider and shallower again. The first decoder semantically segments the image, labelling each pixel with the corresponding character. The second decoder generates region proposals in the form of bounding boxes. As every pixel generates a bounding box candidate, strong data reduction is required. Chargrid-OCR solves this with a graph-based approach and non-maximum suppression.
  • Entity extraction with Chargrid also becomes a multi-object detection task. The basic network architecture is highly similar to Chargrid-OCR: first a convolutional encoder stage which turns the document into a smaller but deeper latent representation. Then two parallel deconvolutional decoders, which generate a segmentation into entities of interest, and bounding box proposals, respectively. The fully convolutional, feed-forward only nature of the network allows it to run efficiently on modern accelerator hardware, and to gracefully deal with many different formats.
Chargrid representation and predicted entities: six examples
  • Combination with language models. While Chargrid is strong on out-of-vocabulary terms commonly found in business documents, the addition of language models can make it even stronger. BERTgrid combines the character based representation for segmentation and bounding boxes with pre-trained contextual word vectors from BERT. Using this representation in parallel to the character embeddings further improves performance on in-vocabulary words, e.g. headers, product and service descriptions, as well as footers.
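Since every pixel proposes a bounding box, the pruning step matters. Here is a minimal sketch of generic score-based non-maximum suppression with made-up boxes and scores. This is plain IoU-based NMS, not the graph-based variant from the Chargrid-OCR paper:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-scoring box, drop candidates that
    overlap a kept box too much, and repeat. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Three near-duplicate candidates for one character, plus one distinct box:
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (0, 1, 10, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7, 0.6]
print(nms(boxes, scores))  # -> [0, 3]
```

Four candidates collapse to two detections: the overlapping trio is reduced to its best-scoring member, while the distinct box survives untouched.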

Chargrid is processing millions of business documents at SAP today, and the team continues to add new document-centric services. Learn more here.

I would like to extend my thanks to the spirited researchers and engineers in the deep learning center of excellence at SAP, as well as to the enterprising AI business services team at SAP. You took this idea, ran with it, and made it real. Thank you for this. All figures from original papers used with permission.

References

Chargrid: Towards Understanding 2D Documents (Katti, Reisswig, Guder, Brarda, Bickel, Höhne, Faddoul; EMNLP 2018)

Chargrid-OCR: End-to-end Trainable Optical Character Recognition for Printed Documents using Instance Segmentation (Reisswig, Katti, Spinaci, Höhne; NeurIPS 2019 Document Intelligence Workshop) 

BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding (Denk, Reisswig; NeurIPS 2019 Document Intelligence Workshop) 

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin, Chang, Lee, Toutanova; CoRR 2019)

Deep Learning (Goodfellow, Bengio, Courville; MIT Press 2016)
