Document Processing with Microsoft Generative AI
In my day-to-day interactions with Microsoft customers, I see many organizations still getting started with generative AI and large language models. Every day a ton of new material is released about new services, new features of existing services, third-party software, or new models that can help you on your AI journey. When I read or listen to much of this, I notice it often carries an assumption: above all else, that you, the reader or listener, already know the wide variety of things you can do with large language models.
By now, just about anybody with a computer or mobile device has probably used a chat interface like ChatGPT, Claude, or Microsoft’s M365 Copilot. Outside of work, you may have gotten to the point where you use this type of interface more often than a web browser to look up information. If you do use a browser, you’ll likely get an AI summary much like the response you’d get from your AI assistant of choice. At work, your organization may have connected M365 Copilot to internal resources like SharePoint and Teams, allowing you to answer business questions grounded in information from your organization. And maybe your knowledge of what is possible ends there, or is accentuated by a handful of presentations you may have seen about AI.
Understanding more about what you can do with AI in business goes hand in hand with understanding the flow of business. Most businesses have some things in common: people search for information, people save information, and often this involves working with documents in some way. Understanding AI at the next level includes understanding the problems people have in doing these things. Taking 30 minutes to find information rather than single-digit minutes. Spending time entering information into software when the information is already available in another format (like a document’s contents or its metadata). Or realizing that things which were virtually impossible three years ago, like taking 10 million historical documents and automatically coding them with business metadata such as a category, are now possible.
There are some basic patterns for using large language models that can be applied to many potential AI use cases, in other words business process improvements that can have a huge impact on operations. The focus of this article is things you can do with documents that you may not have learned just yet.
Document Content Extraction
Prior to large language models and generative AI, there was machine learning, where you could train a model to do a task. This works great in some cases, especially mathematically oriented ones like predicting the inventory you should keep on hand or deciding what to charge for a new product. It was possible to build models to do text-oriented things like extracting content from a document, but the models needed to be trained on many examples or they might not handle variety in your document layouts. For instance, while the following two invoices look similar to the human eye, an undertrained model meant to extract certain details may not be able to handle the slight variation between the two.
There were companies that built their businesses around machine learning models built for a single purpose, like reviewing contracts, extracting details from an invoice, or reading the contents of an insurance card. Microsoft has a service called Document Intelligence which has offered models like this for many years.
Along with out-of-the-box models, the service gives you the ability to take a base model and train it further to meet your specific needs. This style of document handling may still work great in some cases, but generative AI has opened up many more possibilities, such as cases where your documents not only come in a wide variety of layouts but where a new format could arrive at any time. Machine learning models care about the layout of your document. Large language models do not.
That said, let us walk through the following scenario:
· Extract certain details from a document of unknown format
· Enter those details into a database
· Check that each detail was captured, using a confidence score
· Alert a human for review if certain details were not captured satisfactorily
Step One. Use an OCR model to extract the document contents.
In Azure Document Intelligence, you can use its OCR/Read model to do this step. It is a very powerful OCR model that can not only read text but can extract text from images and handle very large document sizes.
The front-end interface for Document Intelligence allows you to upload a document and run an analysis which provides multiple outputs. Above, for a relatively simple invoice, the analysis result shows three tabs: one showing content broken out into paragraphs (not shown), one showing a JSON output of the contents (shown), and a “code” tab that shows the code you could write to automate this process, with language choices including C#, Python, and JavaScript. Out of all the JSON output, which includes details on the position of every element of the layout, all we care about is the raw content, highlighted above under the “content” tag. That raw output looks like this:
"content": "INVOICE\n333 3rd Ave Seattle, WA 12345 Phone: 123-456-7890\nPURCHASED BY: Liane Cormier The Social Strategists 4321 Maplewood Ave\nNashville, TN 54321\nPhone: 111-222-3333\nINVOICE # 100 DATE: 1/30/23\nSHIP TO: Liane Cormier The Social Strategists 4321 Maplewood Ave Nashville, TN 54321 Phone: 111-222-3333\nCOMMENTS OR SPECIAL INSTRUCTIONS: Due upon receipt\nQUANTITY\nDESCRIPTION\nUNIT PRICE\nTOTAL\n1 TB\nCloud service\n99.99\n99.99\n1TB\n10.00\n10.00\n0 12345 67890 5\nSubtotal\n109.99\nSales tax\n4.99\nShipping and handling\n0.00\nTOTAL DUE\n114.98",
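If you are calling the service’s API rather than clicking through the studio, pulling that raw content field out of the analysis JSON takes only a few lines. The sketch below uses a trimmed stand-in for the real response, and assumes the documented `analyzeResult` envelope; verify the exact keys against the API version you use:

```python
import json

# Trimmed stand-in for the analysis response; the real payload also carries
# pages, words, and bounding-box geometry that this workflow does not need.
analysis_json = """
{
  "analyzeResult": {
    "content": "INVOICE\\n333 3rd Ave Seattle, WA 12345 Phone: 123-456-7890\\nINVOICE # 100 DATE: 1/30/23"
  }
}
"""

result = json.loads(analysis_json)

# The raw text is the only field the rest of the workflow needs.
raw_content = result["analyzeResult"]["content"]
print(raw_content)
```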
To the human eye, not the most useful output, right? But to a large language model, it is a different story.
Above is a view of the Chat Playground feature in Microsoft Foundry. Notice a few highlighted things:
· Under the Deployment header you define the large language model you want to work with
· The blue section is a prompt. In this case I am asking the model to extract a long list of details from the messy blob of text extracted above. I then ask it to return the response in CSV format with headers, along with a confidence score represented as a percentage from 1 to 100.
· The black section is the output, and as you can see it has done exactly what I asked it to do.
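The prompt itself is just text, so it is easy to template in code once you move beyond the playground. This is a rough sketch; the field list is illustrative, not the exact list used in the screenshot:

```python
# Illustrative field list; adjust to the details your documents carry.
FIELDS = [
    "Invoice Number", "Invoice Date", "Purchaser Name",
    "Purchase Address", "Subtotal", "Sales Tax", "Total Due",
]

def build_extraction_prompt(raw_content: str) -> str:
    """Builds the instruction sent to the model alongside the raw OCR text."""
    return (
        "Extract the following details from the document text below: "
        + ", ".join(FIELDS) + ".\n"
        "Return the response in CSV format with a header row, and add a "
        "column named Confidence holding a score from 1 to 100 per field.\n\n"
        "Document text:\n" + raw_content
    )

prompt = build_extraction_prompt("INVOICE\n333 3rd Ave Seattle, WA 12345 ...")
print(prompt)
```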
This use of Microsoft Foundry is a way to quickly test that something you want to do with AI works before actually building anything or writing any code. With this workflow you have:
· Used the Document Intelligence OCR model to extract the raw contents of a document with no concern for the document’s format
· Fed that output to a large language model and asked it to do something such as extract certain details
· Received the output as requested
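When you are ready to automate that same call, the chat request is a small JSON payload posted to your model deployment’s endpoint. The sketch below only builds the request; the resource name, deployment name, and API version are placeholders to replace with your own, and the payload shape follows the Azure OpenAI chat completions API:

```python
import json

# Placeholders -- substitute your own resource, deployment, and API version.
endpoint = "https://<your-resource>.openai.azure.com"
deployment = "<your-deployment>"
api_version = "2024-02-01"

url = (
    f"{endpoint}/openai/deployments/{deployment}"
    f"/chat/completions?api-version={api_version}"
)

payload = json.dumps({
    "messages": [
        {"role": "system",
         "content": "You extract invoice details and reply only in CSV."},
        {"role": "user",
         "content": "Extract Invoice Number, Date, and Total Due from: "
                    "<raw OCR text goes here>"},
    ],
    "temperature": 0,  # deterministic output suits extraction tasks
})

# POST `payload` to `url` with an api-key header using any HTTP client.
print(url)
print(payload)
```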
Further things you could do once you have tested your theory against a few documents using the front ends of Document Intelligence and Microsoft Foundry:
· Write a process to ingest the CSV output into a SQL Server database
· Write a process to query what was written to the database, looking for missing details or confidence scores below a threshold like 85%
· Build a process to alert a person or group via email or Teams anytime something about a document’s findings is not satisfactory, including missing details
· None of these steps are AI related. These are things you can do after AI has done the heavy lifting of extracting details from the document’s contents.
· And finally, you can completely automate the process by building a workflow to call the APIs of Document Intelligence and Microsoft Foundry along with the steps above. This would include a starter step like watching a storage location for new documents to process.
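The non-AI follow-up steps above can be sketched with nothing but the standard library. This example ingests the model’s CSV output into SQLite (standing in for SQL Server) and flags rows with missing details or a confidence score below 85; all table and column names are illustrative:

```python
import csv
import io
import sqlite3

# Stand-in for the CSV the model returned; the second row is deliberately
# incomplete and low-confidence so the quality check has something to flag.
model_output = """InvoiceNumber,InvoiceDate,TotalDue,Confidence
100,1/30/23,114.98,98
101,,109.99,72
"""

conn = sqlite3.connect(":memory:")  # swap for your SQL Server connection
conn.execute(
    "CREATE TABLE invoice_details ("
    "invoice_number TEXT, invoice_date TEXT, total_due TEXT, confidence INTEGER)"
)

# Step one: ingest the CSV output into the database.
for row in csv.DictReader(io.StringIO(model_output)):
    conn.execute(
        "INSERT INTO invoice_details VALUES (?, ?, ?, ?)",
        (row["InvoiceNumber"], row["InvoiceDate"],
         row["TotalDue"], int(row["Confidence"])),
    )

# Step two: flag anything with a missing detail or confidence below 85%.
flagged = conn.execute(
    "SELECT invoice_number FROM invoice_details "
    "WHERE confidence < 85 OR invoice_date = '' OR total_due = ''"
).fetchall()

for (invoice_number,) in flagged:
    # Step three: in production, send the email or Teams alert here.
    print(f"Needs human review: invoice {invoice_number}")
```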
Document Intelligence used to be the primary service for document handling. Now Microsoft offers a new service called Content Understanding that wraps some of the steps above into one service, including extracting the contents of a document and then using a large language model to do the detailed extraction. Documentation on this service is already becoming fairly lengthy due to its wide variety of capabilities: What is Azure Content Understanding in Foundry Tools? - Foundry Tools | Microsoft Learn. For any workflow similar to the one above, Content Understanding may be where you want to begin: see if it meets the needs of your use case, then move on to other Azure tools like Document Intelligence if needed.
Plus there are many other ways to build this type of workflow in the Microsoft ecosystem, including Copilot Studio/Power Platform or using something like Azure Logic Apps to orchestrate a custom workflow. Both offer lower-code ways of building an AI-driven process than wrapping Python or C# code around API calls, writing data to a SQL Server database, and so on. My advice for building AI-driven processes is always to start with as low-code an approach as possible, then write code when it makes sense. For this particular example, a limitation of Copilot Studio/Power Platform may be that its internal document reader (OCR) can only handle documents up to a certain size. That said, it also has a connector for calling APIs, which you could use to call the Document Intelligence OCR model and receive the output. This is where a partnership with your Microsoft representatives, like myself, can come in handy: helping you figure out which service is best for which use case when starting on your journey.
Document Management with Fabric
Microsoft Fabric is an end-to-end data management platform. That used to mean managing structured data from sources like relational databases or CSV/JSON/XML files. In the Lakehouse feature, one of the many choices for storing data in Fabric, there is also a place to store unstructured files like PDFs and images:
Notice that the Lakehouse has two sections, Tables and Files. Tables are where you store structured data, like the details extracted above: invoice number, purchase address, and total amount. The Files section is where you can store any type of file, including PDFs, Word documents, Excel files, CSVs, images, and audio files. Notice that in the highlighted cms_raw folder there are several CSV files and an image file. In the workflow discussed above, when it is time for automation you will need a place to pick up the files you want to process. In an on-premises network you would normally put your files on a Windows/Linux file server or something like an FTP server. In Azure you might put those files in ADLS Gen2 storage, an enterprise-grade storage service that lets you choose between quick retrieval and archive-style retrieval, where you can wait a bit longer to get your file and pay less for data that is not needed immediately.
Microsoft Fabric’s Lakehouse storage is ADLS Gen2 storage behind the scenes, but unlike the stand-alone Azure service, you do not have to manage it in any way. When you stand up a Lakehouse inside Fabric, you get the storage and Files section in as little as seconds of setup time. This follows a big theme with Fabric: the ability to do data management with much less setup and management effort compared to past technologies. And unstructured data, because of generative AI, can now be a much bigger part of that strategy.
There’s even a Windows desktop/virtual desktop interface into this Files section of the Fabric Lakehouse.
This means setting up an automation to copy files into Fabric can be as easy as setting up an automated copy between Windows servers using PowerShell, batch commands, or any language or process of choice.
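For example, a scheduled job that sweeps a drop folder into the mounted Files section can be a few lines in any scripting language. This Python sketch uses illustrative paths; the destination is wherever the OneLake desktop interface mounts your Lakehouse Files section on your machine, and the function name is mine, not part of any product:

```python
import shutil
from pathlib import Path

def sweep_drop_folder(drop_folder: Path, lakehouse_files: Path) -> list[str]:
    """Copy new PDFs from a drop folder into the Lakehouse Files mount.

    `lakehouse_files` would be something like the locally mounted
    <Workspace>/<Lakehouse>.Lakehouse/Files/cms_raw folder (illustrative).
    """
    copied = []
    for pdf in sorted(drop_folder.glob("*.pdf")):
        target = lakehouse_files / pdf.name
        if not target.exists():  # skip files copied on a previous run
            shutil.copy2(pdf, target)
            copied.append(pdf.name)
    return copied
```

Run it on a schedule with Task Scheduler or cron and new documents land in Fabric ready for the processing workflow above.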
For those familiar with copying data from outside of Azure into a storage account, you also have the option to interact with the URL of the Fabric Files section, as it exposes an ADLS Gen2-compatible endpoint. You can use well-known utilities like AzCopy if taking this route.
And why else should you store your files in Fabric as step one of any process? How about data governance and content discovery? This is where free parts of Microsoft Purview can scan your Fabric deployment and then allow you to classify your folders and documents to help people find this potentially powerful data. Maybe there is no compelling AI use for a set of documents you would normally put on a file server today, but what about 3-6 months from now? Yes, Purview can scan other storage mechanisms, but why not make all of your important data, structured or not, available from one place?
This common AI pattern of extracting particular details from a document could easily be extended to classifying the document or summarizing its contents for easier consumption, all things that large language models are great at doing. But you have to realize this is possible, be aware of the business problems these patterns can be applied to, and know the toolset that can be used to build an AI-driven solution. Microsoft not only offers multiple ways to build these solutions but also partners with you to build out an AI strategy, offers data management tools like Fabric and Purview, and knows what works from partnering with organizations similar to yours that are already blazing trails with generative AI.
If the idea of reading a file and typing information from it into a web form as any part of your job sounds exciting, generative AI may not be for you. Since that is likely not fulfilling for most people, hopefully this article helps you show your organization how human labor can be shifted to higher-level work after replacing mundane tasks with automated or semi-automated processes like this one. The document processing pattern discussed here can be applied in many places in a business. The first step is asking yourself, “Where can it be applied in mine?”