Does machine learning slow down the consistent standardization required for digitization?

The number of applications that combine optical character recognition (OCR) with machine learning algorithms to analyze text documents increases from year to year. These applications

  • find certain pieces of information,
  • extract them, and
  • provide them as structured data for subsequent automatic processing.

This approach is illustrated by the ExpenseIt feature, which is part of the SAP Concur app and which I use for travel expense reporting. With my smartphone, I take a photo of, for example, a hotel invoice that I received on paper. After the image has been analyzed, all relevant data is usually read out and entered automatically into the digital travel expense form. From time to time I have to correct or supplement individual entries. This saves me the time of searching for the required information in the invoice and typing it in myself. The same happens when I receive an invoice as a PDF document and upload it: it is first converted into an image file (e.g. in PNG format), which serves as input for machine-learning-based OCR.

In principle, such apps help to conceal fundamental deficits in digitization instead of solving them. This is easy to see in the example above. I had booked the hotel via the reservation website of a well-known provider with whom my employer had negotiated favorable terms and conditions. My customer profile contains both business and personal information, including my credit card and e-mail address. One could only speak of end-to-end digitization if I no longer had to fill in any details by hand when checking in at the hotel and if no signature were required. In addition, I should receive the hotel invoice by e-mail as a PDF/A-3 attachment containing all the information as structured XML data in a standardized schema - preferably in ZUGFeRD 2.0 format.

The same applies to other invoices, receipts and further documents, which are often available in electronic form as PDF files but do not contain any embedded XML data.

On the one hand, as a user I like this kind of intelligent optical character recognition (iOCR). On the other hand, the intensive use of machine learning algorithms slows down the consistent standardization that is necessary for the end-to-end digitization of the processes associated with creating and using such documents.

Only when all the information contained in invoices, receipts, contracts and various other documents is made available from the very beginning as structured data (as within ZUGFeRD 2.0 invoices), each based on a standardized XML or JSON schema, can true end-to-end digitization be achieved. Otherwise, the existing gaps must continue to be bridged either manually - by reading and typing - or automatically, with relatively high effort.
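To illustrate the difference, here is a minimal sketch of consuming structured invoice data directly. The element names below are made up for illustration; they are not the actual ZUGFeRD 2.0 / UN/CEFACT CII schema. The point is that no OCR and no machine learning is needed at all once the data arrives structured:

```python
import xml.etree.ElementTree as ET

# Illustrative only: a minimal invoice fragment with hypothetical
# element names, NOT the real ZUGFeRD 2.0 schema.
INVOICE_XML = """
<invoice>
  <seller>Hotel Example</seller>
  <issueDate>2019-05-14</issueDate>
  <grossAmount currency="EUR">238.00</grossAmount>
</invoice>
"""

def read_invoice(xml_text: str) -> dict:
    """Read invoice fields directly from structured XML data."""
    root = ET.fromstring(xml_text)
    amount = root.find("grossAmount")
    return {
        "seller": root.findtext("seller"),
        "issue_date": root.findtext("issueDate"),
        "amount": float(amount.text),
        "currency": amount.get("currency"),
    }

print(read_invoice(INVOICE_XML))
```

A few lines of deterministic parsing replace an entire OCR pipeline - and the result is exact, not merely "usually correct".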

An essential prerequisite for comprehensive digitization in two areas - online banking and Elster tax declarations - was the definition of standardized data structures. Without them, the current level would not have been reached.

In my opinion, it is time to start standardizing structured data in other areas as well - far beyond the topic of e-invoices. For example:

  • Employment contracts and pay slips
  • Rental contracts - also from the landlord's perspective
  • Loan agreements
  • Statutory health and pension insurance
  • Company pension scheme and working-time account
  • Insurance policies
  • Subscriptions and memberships
  • Contracts with municipal utilities, water and energy suppliers
  • Contracts with telecom, internet or cable TV providers
  • Travel contracts and bookings
  • Software license agreements (e.g. Office 365)
  • Product registrations and warranties

In order to promote digitization, I’m sure that public data dictionaries with country-specific characteristics would be very useful. They could simplify and accelerate the definition and use of structured data.
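What might one entry in such a public data dictionary look like? The following sketch is purely hypothetical - the field names, the pattern and the validation function are my own assumptions, not an existing standard - but it shows how a shared definition could be applied uniformly by every software manufacturer:

```python
import re

# Hypothetical sketch of one public data-dictionary entry; the field
# names and pattern are assumptions, not taken from any real standard.
DATA_DICTIONARY = {
    "date_of_birth": {
        "type": "string",
        "pattern": r"^\d{4}-\d{2}-\d{2}$",  # ISO 8601 calendar date
        "description": "Date of birth as YYYY-MM-DD",
    },
}

def validate(field: str, value: str) -> bool:
    """Check a value against its data-dictionary entry."""
    entry = DATA_DICTIONARY[field]
    return re.fullmatch(entry["pattern"], value) is not None

print(validate("date_of_birth", "1975-03-21"))   # conforms
print(validate("date_of_birth", "21.03.1975"))   # does not conform
```

Country-specific characteristics (address formats, tax identifiers, etc.) could be expressed as additional entries or variants of such a dictionary.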

Every software manufacturer still defines its own data structures, independently of its competitors. From an economic point of view, this has been causing extremely high and unnecessary costs for decades - costs that are hard to justify today. When a company migrates its data from one system 'A' to another system 'B', the consequences of this Babylonian data confusion become obvious.

How easy could the digital transfer of contract information be if standardized XML or JSON structures ensured convenient data portability? What sense does it make today for every software manufacturer to invent or use its own IT definition of the data element "date of birth"?
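The cost of this fragmentation can be sketched in a few lines. The three vendor formats below are illustrative assumptions, but they show the kind of translation table every migration project has to build today - and that a single agreed standard would make unnecessary:

```python
from datetime import datetime

# Hypothetical examples of how three vendors might each encode the
# same "date of birth"; the formats are illustrative assumptions.
VENDOR_FORMATS = {
    "system_a": "%d.%m.%Y",   # 21.03.1975
    "system_b": "%m/%d/%Y",   # 03/21/1975
    "system_c": "%Y-%m-%d",   # 1975-03-21 (ISO 8601)
}

def to_standard(value: str, system: str) -> str:
    """Translate a vendor-specific date into one agreed ISO 8601 form."""
    parsed = datetime.strptime(value, VENDOR_FORMATS[system])
    return parsed.date().isoformat()

# Every migration between systems needs such glue code today.
print(to_standard("21.03.1975", "system_a"))  # prints 1975-03-21
```

With a common standard, this translation layer - multiplied across thousands of data elements and hundreds of systems - would simply disappear.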

© Jochem Schültke, all rights reserved

#Digitalisierung #ZUGFeRD20 #eRechnung #Standardisierung #MachineLearning #OCR #OpticalCharacterRecognition #XML #JSON #StrukturierteDaten #SAP #SAPConcur #Concur #ExpenseIt
