Automating Invoice Processing with OCR + LLMs + FastAPI + Streamlit

Over the last few days, I built a small end-to-end invoice automation system that goes from PDF upload to structured data, with duplicate detection to guard against fraud and re-processing.

The goal was simple:

  • eliminate manual data entry
  • automatically extract invoice fields
  • block duplicate invoices
  • expose APIs and UI for business users

What the system does

  • Upload invoice PDFs
  • Extract text using OCR
  • Use an LLM to parse key fields (invoice number, vendor, date, total)
  • Store invoices in SQLite
  • Detect duplicate invoices via checksum
  • View stored invoices in a Streamlit dashboard
  • Expose the pipeline through FastAPI backend APIs

This turned into a very practical AI + backend engineering exercise, applicable to accounts payable (AP) automation and fintech workflows.

[Figure: Local System Test]



High-Level Architecture

+------------------+
|  PDF Invoice     |
|  Upload (UI)     |
|  Streamlit       |
+--------+---------+
         |
         v
+------------------+
| FastAPI Backend  |
+------------------+
         |
         v
+------------------+       +-------------------+
|  OCR Layer       |       | LLM Extraction    |
| (PDF -> text)    |-----> | (Field Parsing)   |
+------------------+       +-------------------+
         |
         v
+------------------------------+
| Duplicate Detection Module   |
| (checksum + DB lookup)       |
+------------------------------+
         |
         v
+------------------------------+
|  SQLite Database             |
|  invoices table              |
+------------------------------+
         |
         v
+------------------------------+
| Streamlit Viewer             |
| list + search invoices       |
+------------------------------+
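
The diagram can be read as one function that passes each upload through the stages in order. The following is a minimal sketch, not the code from the repository: process_invoice, the ocr and extract_fields callables, and the in-memory store dictionary are all hypothetical stand-ins for the real OCR layer, LLM call, and SQLite table.

```python
import hashlib
from typing import Callable, Dict


def process_invoice(
    pdf_bytes: bytes,
    ocr: Callable[[bytes], str],             # hypothetical OCR stage: PDF bytes -> text
    extract_fields: Callable[[str], dict],   # hypothetical LLM stage: text -> fields
    store: Dict[str, dict],                  # stand-in for the invoices table
) -> dict:
    """Run one upload through the stages of the diagram, top to bottom."""
    text = ocr(pdf_bytes)
    fields = extract_fields(text)

    # Duplicate detection: hash the extracted text and look it up.
    checksum = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if checksum in store:
        return {"status": "duplicate", "checksum": checksum}

    store[checksum] = fields
    return {"status": "stored", "checksum": checksum, "fields": fields}
```

The sketch follows the order in the diagram. Since the checksum depends only on the OCR text, the duplicate check could also run before the LLM call, which would avoid paying for extraction on a resubmitted file.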
        

Key AI Component — LLM Field Extraction

Instead of brittle regex-only parsing, an LLM helps map messy OCR text into structured JSON fields such as:

  • invoice_number
  • vendor
  • invoice_date
  • total_amount

This allows support for:

  ✔ different invoice layouts
  ✔ different vendors
  ✔ noisy OCR text
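
To make the extraction step concrete, here is a rough sketch of how OCR text could be turned into the four fields listed above. It is an illustration under assumptions, not the repository's implementation: the prompt wording, the InvoiceFields dataclass, and the call_llm parameter (any function that takes a prompt string and returns the model's reply as a string) are hypothetical. No specific LLM client is assumed.

```python
import json
from dataclasses import dataclass
from typing import Callable

REQUIRED_FIELDS = ("invoice_number", "vendor", "invoice_date", "total_amount")

# Hypothetical prompt wording; tune it for whichever model is used.
PROMPT_TEMPLATE = (
    "Extract the following fields from the invoice text and reply with "
    "JSON only, using exactly these keys: {keys}.\n\nInvoice text:\n{text}"
)


@dataclass
class InvoiceFields:
    invoice_number: str
    vendor: str
    invoice_date: str
    total_amount: float


def extract_fields(ocr_text: str, call_llm: Callable[[str], str]) -> InvoiceFields:
    """Ask the LLM for JSON, then validate it before trusting it."""
    prompt = PROMPT_TEMPLATE.format(keys=", ".join(REQUIRED_FIELDS), text=ocr_text)
    raw = call_llm(prompt)

    # Replies are sometimes wrapped in extra prose, so keep only the outermost {...}.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("LLM reply contained no JSON object")
    data = json.loads(raw[start:end + 1])

    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError(f"LLM reply is missing fields: {missing}")

    return InvoiceFields(
        invoice_number=str(data["invoice_number"]).strip(),
        vendor=str(data["vendor"]).strip(),
        invoice_date=str(data["invoice_date"]).strip(),
        # Accept "1,250.50" as well as 1250.5.
        total_amount=float(str(data["total_amount"]).replace(",", "")),
    )
```

Validating the reply matters because the model output is untrusted input: a missing key or a non-numeric total raises an error here instead of reaching the database.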


🔐 Duplicate Invoice Detection (very important)

To prevent:

  • double payment
  • invoice fraud
  • resubmissions

I added:

  • PDF text checksum
  • DB uniqueness constraint

Flow:

  1. extract text
  2. compute a SHA-256 hash of the extracted text
  3. check if checksum already exists
  4. if exists → block storage and flag duplicate
  5. else insert into DB
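
The flow above can be sketched as follows. This is a minimal illustration that uses the standard-library sqlite3 module rather than SQLAlchemy, to stay self-contained. The table layout, the function names, and the whitespace normalization before hashing are assumptions, not details taken from the project.

```python
import hashlib
import sqlite3


def text_checksum(text: str) -> str:
    # Assumption: collapse whitespace so spacing differences in OCR output
    # do not change the hash.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema; UNIQUE on checksum is the DB-level safeguard.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS invoices (
               id INTEGER PRIMARY KEY,
               checksum TEXT NOT NULL UNIQUE,
               invoice_number TEXT,
               vendor TEXT,
               invoice_date TEXT,
               total_amount REAL
           )"""
    )


def store_invoice(conn: sqlite3.Connection, text: str, fields: dict) -> bool:
    """Return True if stored, False if the invoice was flagged as a duplicate."""
    checksum = text_checksum(text)
    row = conn.execute(
        "SELECT 1 FROM invoices WHERE checksum = ?", (checksum,)
    ).fetchone()
    if row is not None:
        return False  # step 4: block storage and flag duplicate
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO invoices "
                "(checksum, invoice_number, vendor, invoice_date, total_amount) "
                "VALUES (?, ?, ?, ?, ?)",
                (checksum, fields.get("invoice_number"), fields.get("vendor"),
                 fields.get("invoice_date"), fields.get("total_amount")),
            )
    except sqlite3.IntegrityError:
        return False  # a concurrent insert won the race; the constraint caught it
    return True
```

The explicit lookup gives a clean "duplicate" answer in the common case, while the uniqueness constraint still holds if two uploads of the same file arrive at the same moment.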

This mirrors real AP automation systems.


What I learned

  • OCR + LLMs complement each other
  • Databases still matter in AI projects
  • Duplicate detection must be designed early
  • Small automation = big business value
  • FastAPI + Streamlit = super quick prototyping stack


Tech Stack

  • FastAPI
  • Streamlit
  • SQLite + SQLAlchemy
  • Python OCR
  • LLM prompt parsing
  • REST APIs


✅ Real-world applications

  • Accounts Payable Automation
  • Fintech onboarding
  • Expense invoice verification
  • Procurement systems
  • ERP integrations

https://github.com/nagrgkgc/invoice_ai

How would you tackle invoicing in different formats?


By Nagaraja Kharvi