Does machine learning slow down the consistent standardization required for digitization?

The number of applications that combine optical character recognition (OCR) with machine learning algorithms to analyze text documents increases from year to year. These applications

  • find certain pieces of information,
  • extract them, and
  • provide them as structured data for subsequent automatic processing.

This approach is illustrated by the ExpenseIt feature, which is part of the SAP Concur app and which I use for travel expense reporting. With my smartphone, I take a photo of, for example, a hotel invoice that I received on paper. After the image has been analyzed, all relevant data is usually read out and entered automatically into the digital travel expense form. From time to time I have to correct or supplement individual entries. This saves me the time of searching for the required information in the invoice and typing it in myself. The same happens when I receive an invoice as a PDF document and upload it: it is first converted into an image file (e.g. in PNG format), which serves as input for machine-learning-based OCR.

In principle, such apps help to conceal fundamental deficits in digitization instead of solving them. This is easy to see in the example above. I had booked the hotel via the reservation website of a well-known provider with whom my employer had negotiated favorable terms and conditions. My customer profile contains both business and personal information, including my credit card and e-mail address. One could only speak of end-to-end digitization if I no longer had to fill in any details by hand when checking in at the hotel and if no signature were required. In addition, I should receive the hotel invoice by e-mail as a PDF/A-3 attachment containing all the information as structured XML data in a standardized schema - preferably in ZUGFeRD 2.0 format.

The same applies to other invoices, receipts and further documents, which are often available in electronic form as PDF files but do not contain any embedded XML data.

On the one hand, as a user I like this kind of intelligent optical character recognition (iOCR). On the other hand, the intensive use of machine learning algorithms slows down the consistent standardization that is necessary for the end-to-end digitization of the processes associated with creating and using such documents.

Only when all the information contained in invoices, receipts, contracts and various other documents is made available from the very beginning as structured data (as within ZUGFeRD 2.0 invoices), each based on a standardized XML or JSON schema, can true end-to-end digitization be achieved. Otherwise, the existing gaps must continue to be bridged either manually - by reading and typing - or automatically, with relatively high effort.
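To illustrate the difference, here is a minimal sketch of consuming structured invoice data directly. The element names below are made up for illustration; they are not the actual ZUGFeRD 2.0 / UN/CEFACT CII schema. The point is that no OCR and no machine learning is needed at all once the data arrives structured:

```python
import xml.etree.ElementTree as ET

# Illustrative only: a minimal invoice fragment with hypothetical
# element names, NOT the real ZUGFeRD 2.0 schema.
INVOICE_XML = """
<invoice>
  <seller>Hotel Example</seller>
  <issueDate>2019-05-14</issueDate>
  <grossAmount currency="EUR">238.00</grossAmount>
</invoice>
"""

def read_invoice(xml_text: str) -> dict:
    """Read invoice fields directly from structured XML data."""
    root = ET.fromstring(xml_text)
    amount = root.find("grossAmount")
    return {
        "seller": root.findtext("seller"),
        "issue_date": root.findtext("issueDate"),
        "amount": float(amount.text),
        "currency": amount.get("currency"),
    }

print(read_invoice(INVOICE_XML))
```

A few lines of deterministic parsing replace an entire OCR pipeline - and the result is exact, not merely "usually correct".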

An essential prerequisite for comprehensive digitization in two areas - online banking and Elster tax declarations - was the definition of standardized data structures. Without them, the current level would not have been reached.

In my opinion, it is time to start standardizing structured data in other areas as well - far beyond the topic of e-invoices. For example:

  • Employment contracts and pay slips
  • Rental contracts - also from the landlord's perspective
  • Loan agreements
  • Statutory health and pension insurance
  • Company pension scheme and working-time account
  • Insurance policies
  • Subscriptions and memberships
  • Contracts with municipal utilities, water and energy suppliers
  • Contracts with telecom, internet or cable TV providers
  • Travel contracts and bookings
  • Software license agreements (e.g. Office 365)
  • Product registrations and warranties

In order to promote digitization, I’m sure that public data dictionaries with country-specific characteristics would be very useful. They could simplify and accelerate the definition and use of structured data.
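What might one entry in such a public data dictionary look like? The following sketch is purely hypothetical - the field names, the pattern and the validation function are my own assumptions, not an existing standard - but it shows how a shared definition could be applied uniformly by every software manufacturer:

```python
import re

# Hypothetical sketch of one public data-dictionary entry; the field
# names and pattern are assumptions, not taken from any real standard.
DATA_DICTIONARY = {
    "date_of_birth": {
        "type": "string",
        "pattern": r"^\d{4}-\d{2}-\d{2}$",  # ISO 8601 calendar date
        "description": "Date of birth as YYYY-MM-DD",
    },
}

def validate(field: str, value: str) -> bool:
    """Check a value against its data-dictionary entry."""
    entry = DATA_DICTIONARY[field]
    return re.fullmatch(entry["pattern"], value) is not None

print(validate("date_of_birth", "1975-03-21"))   # conforms
print(validate("date_of_birth", "21.03.1975"))   # does not conform
```

Country-specific characteristics (address formats, tax identifiers, etc.) could be expressed as additional entries or variants of such a dictionary.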

Every software manufacturer still defines its own data structures, independently of its competitors. From an economic point of view, this has been causing extremely high and unnecessary costs for decades - costs that are hard to justify today. When a company migrates its data from one system 'A' to another system 'B', the consequences of this Babylonian data confusion become obvious.

How easy could the digital transfer of contract information be if standardized XML or JSON structures ensured convenient data portability? What sense does it make today for every software manufacturer to invent or use its own IT definition of the data element "date of birth"?
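The cost of this fragmentation can be sketched in a few lines. The three vendor formats below are illustrative assumptions, but they show the kind of translation table every migration project has to build today - and that a single agreed standard would make unnecessary:

```python
from datetime import datetime

# Hypothetical examples of how three vendors might each encode the
# same "date of birth"; the formats are illustrative assumptions.
VENDOR_FORMATS = {
    "system_a": "%d.%m.%Y",   # 21.03.1975
    "system_b": "%m/%d/%Y",   # 03/21/1975
    "system_c": "%Y-%m-%d",   # 1975-03-21 (ISO 8601)
}

def to_standard(value: str, system: str) -> str:
    """Translate a vendor-specific date into one agreed ISO 8601 form."""
    parsed = datetime.strptime(value, VENDOR_FORMATS[system])
    return parsed.date().isoformat()

# Every migration between systems needs such glue code today.
print(to_standard("21.03.1975", "system_a"))  # prints 1975-03-21
```

With a common standard, this translation layer - multiplied across thousands of data elements and hundreds of systems - would simply disappear.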

© Jochem Schültke, all rights reserved

#Digitalisierung #ZUGFeRD20 #eRechnung #Standardisierung #MachineLearning #OCR #OpticalCharacterRecognition #XML #JSON #StrukturierteDaten #SAP #SAPConcur #Concur #ExpenseIt
