Data Extraction from PDF Documents
ETL (Extraction, Transformation, and Load) usually draws on files designed for data migration: CSV, spreadsheets, or departmental databases (Access, FileMaker Pro, dBase, etc.). PDF files generally do not fall into that category - in some cases the publishers are actively hostile to having any data extracted. Good examples of the latter are investment advisor portfolio recommendations and trade association membership directories. However, PDF-format ‘documents of record’ are published by practically everyone in the world, and many contain one or more data tables.
It is easy to open a PDF, navigate to a page, highlight a block of text, copy it to the clipboard, and paste it into some other application, whether Notepad or Excel. If the PDF is 8000 pages long, transferring data page by page this way is impractical. Some PDFs are ‘published’ daily, so it would be necessary to open the file each day, copy each page one by one, and paste it manually into the target system. Documents with relevant content in spreadsheet-style ‘rows and columns’ may not paste as expected, particularly when a single cell containing multiple paragraphs spans multiple pages.
Two approaches to extracting data from PDFs are OCR and ‘decoding’ PDF content with PDF-centric libraries. ‘OCR’ typically means using libraries such as OpenCV (Computer Vision) to extract text from raster graphics. This might ‘work’, but it may not produce a body of text in a convenient structure. ‘Decoding’ PDFs via libraries involves importing the entire file into a PDF object, then iterating through the pages collection, converting the compressed binary content streams (LZW or, more commonly, Flate encoded) to ASCII strings. The ASCII strings are made up of various kinds of ‘markup’ designed to specify fonts, canvas size, page orientation, the placement of text and graphics on the page, and the spacing of words and letters within text lines. The resulting ‘decoded’ text is a ‘stew’ of markup and embedded text.
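As an illustration, here is a minimal sketch of the ‘decoding’ approach using the open-source pypdf library. The filename is a placeholder, and the sketch assumes the page actually has a content stream:

    # Minimal sketch: decompress one page's content stream with pypdf.
    # 'tax_roll.pdf' is a placeholder filename.
    from pypdf import PdfReader

    reader = PdfReader("tax_roll.pdf")
    page = reader.pages[0]

    # get_contents() returns the page's content stream object;
    # get_data() yields the decompressed (Flate/LZW-decoded) bytes.
    raw = page.get_contents().get_data()

    # The decompressed stream is mostly ASCII markup mixed with text.
    print(raw.decode("latin-1")[:2000])

Printing the first couple of thousand characters is usually enough to see the ‘stew’ of markup and embedded text described above.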
PDF Publishers
Most ‘productivity’ software packages can publish to PDF - Microsoft Office, LibreOffice, Corel WordPerfect, etc. There are licensed and open source PDF libraries for performing these operations under program control, particularly for merging collections of documents in various formats into a single PDF, or splitting a PDF into multiple extracts by page range or other criteria. When used to generate new documents, merge collections of documents together, or convert from other document file types, these operations tend to be fairly straightforward. Picking text items or lines out of the markup for conversion to spreadsheets or databases, however, is full of pitfalls.
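For example, a merge and a page-range split under program control might look like the following sketch with pypdf (all filenames are placeholders):

    # Sketch: merge two PDFs, then split a page range into a new file.
    from pypdf import PdfReader, PdfWriter

    # Merge two documents into one.
    merged = PdfWriter()
    merged.append("report_part1.pdf")
    merged.append("report_part2.pdf")
    with open("merged.pdf", "wb") as f:
        merged.write(f)

    # Extract a page range (pages 10-19, zero-based) into a new file.
    extract = PdfWriter()
    extract.append("merged.pdf", pages=(10, 20))
    with open("extract.pdf", "wb") as f:
        extract.write(f)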
The screenshot above shows what a typical text stream looks like after it has been decompressed. This example is a ‘Final Tax Roll’ printout from a city in New York State. The actual text is enclosed in parentheses. Much of the surrounding markup controls the appearance or placement of the text. The text itself is in some cases broken up - strings that should appear as a single unbroken field are sometimes separated into chunks.
‘Tj’ and ‘TJ’ are the text-showing operators; because PDF markup is postfix, each follows the string (or array of strings) it draws, effectively marking the end of a text block. In the extraction sample supplied here they are translated to tildes (~) to act as field separators. ‘Td’ and ‘TD’ text-positioning markers are translated to new lines, which in Windows text files become Carriage Return/Line Feed pairs. Red-bordered enclosures show examples of ‘unusually long’ strings split at arbitrary locations.
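The translation scheme described above can be sketched with pypdf's ContentStream, which exposes the decoded stream as (operands, operator) pairs. The filename is a placeholder, and byte strings with custom font encodings are skipped for brevity:

    # Sketch: walk the content-stream operators, emitting '~' at each
    # Tj/TJ text block and a new line at each Td/TD positioning marker.
    from pypdf import PdfReader
    from pypdf.generic import ContentStream, TextStringObject

    reader = PdfReader("tax_roll.pdf")
    page = reader.pages[0]
    stream = ContentStream(page.get_contents(), reader)

    out = []
    for operands, operator in stream.operations:
        if operator == b"Tj":                      # show a single string
            if isinstance(operands[0], TextStringObject):
                out.append(str(operands[0]) + "~")
        elif operator == b"TJ":                    # show an array of strings/kerns
            for item in operands[0]:
                if isinstance(item, TextStringObject):
                    out.append(str(item))
            out.append("~")
        elif operator in (b"Td", b"TD", b"T*"):    # text positioning -> new line
            out.append("\n")

    print("".join(out))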
The extracted text appears as shown above, with the tilde field separators and line breaks. Some ‘field collections’ arrive in a single string, separated by padding blanks. In such cases it is necessary to identify the distinct fields within these strings and split them apart.
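One simple way to split such padded strings is on runs of two or more blanks; the sample line below is purely illustrative:

    # Sketch: split a padded, single-string 'field collection' on runs
    # of two or more blanks. The sample line is invented for illustration.
    import re

    line = "123 MAIN ST        SMITH JOHN          210 X 105     0001234"
    fields = re.split(r"\s{2,}", line.strip())
    print(fields)   # ['123 MAIN ST', 'SMITH JOHN', '210 X 105', '0001234']

This works only when fields are never separated by a single blank; otherwise fixed column offsets or per-document regular expressions are needed.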
The XML here shows the ‘producer’ of the document - in this case it appears to be Adobe Acrobat. At some point, however, this report was imported into or manipulated by Microsoft Word. The original document has the appearance of a mainframe printout on 14" wide fanfold paper, although this might also be a landscape legal-sized (14" x 8.5") page, scaled and rotated.
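The producer shown in the XMP metadata has a counterpart in the PDF's document information dictionary, which pypdf exposes directly (a sketch; the filename is a placeholder):

    # Sketch: read the document-information fields that identify
    # the producing and creating applications.
    from pypdf import PdfReader

    reader = PdfReader("tax_roll.pdf")
    info = reader.metadata
    if info is not None:
        print(info.producer)   # e.g. 'Adobe Acrobat ...'
        print(info.creator)    # application that created the original document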
‘Tax roll’ documents in this format are published by a large number of cities and towns in New York State, and the report origins can be as varied as actual mainframe printouts, Crystal Reports output, or original typed-in word processing documents. The resulting text streams are arranged in a variety of ways. There is no predictability between cities, or even between different years for the same city.
Public Records
Given that PDF is an attempt at a ‘universal’ standard, government agencies at all levels publish certain of their records in this format. Usually PDF is the output format chosen when there is little expectation of re-import into databases or other specialized applications, or when the agency would prefer that the information not be imported, or even consumed by the public at all. While compiling crime statistics is a necessary task in some government agencies, many governments would prefer that such information remain undigested, particularly in comparison to neighboring or comparable jurisdictions.
AI training might be satisfied with ‘more or less’ accurate capture via OCR - an error rate of 1% or 0.1% might be acceptable. In situations where minor variations in spelling can have major effects, such as ‘wanted persons’ databases, even one transcription error in 10,000 records can have significant undesirable consequences. In such circumstances it is necessary to extract text exactly as it is represented in the file, and process it programmatically into the target records.
Many such documents are mostly ‘boilerplate text’, so the extractable content is a (perhaps insignificant) subset of the entire document. Some reasonably reliable means is necessary to begin extraction at the right point and to recognize when all of the data has been captured, even if multiple pages remain in the document.
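One workable pattern is to scan each page's text for begin and end markers. In this sketch the marker patterns and filename are hypothetical and would differ for each document type:

    # Sketch: capture pages between assumed begin/end markers.
    import re
    from pypdf import PdfReader

    START = re.compile(r"FINAL ASSESSMENT ROLL")   # assumed start-of-data marker
    END = re.compile(r"\*+ GRAND TOTALS \*+")      # assumed end-of-data marker

    reader = PdfReader("tax_roll.pdf")
    capturing, rows = False, []
    for page in reader.pages:
        text = page.extract_text()
        if not capturing and START.search(text):
            capturing = True
        if capturing:
            rows.append(text)
            if END.search(text):
                break                              # data ends; skip trailing pages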
This is a ‘bread and butter’ DevOps problem, often requiring dozens of separate extraction programs - one for each distinct document type, document publisher, or application purpose. Some are run once; others are part of an ETL routine that periodically ‘wakes up’, downloads files, and runs the ETL on the file or files.
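A minimal shape for such a recurring routine might be the following; the URL, the daily schedule, and the extract_tax_roll() stub are all placeholders standing in for a real per-document extraction program:

    # Sketch: a periodic 'wake up, download, extract' ETL loop.
    import time
    import urllib.request

    def extract_tax_roll(path: str) -> None:
        # Placeholder for one of the per-document extraction
        # programs described above.
        print(f"extracting {path}")

    while True:
        url = "https://example.gov/rolls/latest.pdf"   # assumed source URL
        urllib.request.urlretrieve(url, "latest.pdf")
        extract_tax_roll("latest.pdf")
        time.sleep(24 * 60 * 60)                       # wake up once a day

In practice this scheduling would more likely be handled by cron, a task scheduler, or an orchestration tool rather than a sleep loop.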
In the 1960s, ‘mainframe’ programs ‘merged’ punched-card transaction decks with magnetic tape or disk ‘master files’. PCs shifted this from the fixed 80-column card to the variable-width text file, such as CSV. Spreadsheet data entry creates a better presentation, displaying dates and numeric fields with appropriate alignments, but the data is still processed as clearly defined ‘rows and columns’. PDFs are a distinct type of ‘original input’, where data is mixed in with a ‘kitchen sink’ of company logos, freeform text, multimedia, 3D viewable objects, and so forth.
NOTE: The PDF Standard
The PDF format was originally defined by Adobe. It has since become an ISO standard, managed by the PDF Association (pdfa.org). The current standard is ISO 32000-2:2020. The specification can be downloaded for free from pdfa.org; however, anyone doing so must register and agree not to distribute the document to anyone else. It is over 1,000 pages long.