Automating Invoice Processing with OCR + LLMs + FastAPI + Streamlit

Nagaraja Kharvi

Published Dec 31, 2025

Over the last few days, I built a small but powerful end-to-end invoice automation system — from PDF upload to structured data, with duplicate detection to prevent fraud and re-processing.

The goal was simple:

eliminate manual data entry
automatically extract invoice fields
block duplicate invoices
expose APIs and UI for business users

What the system does

Upload invoice PDFs
Extract text using OCR
Use an LLM to parse key fields (invoice number, vendor, date, total)
Store invoices in SQLite
Detect duplicate invoices via checksum
View stored invoices in a Streamlit dashboard
Expose the pipeline through FastAPI backend APIs

This turned into a very practical AI + backend engineering exercise, applicable to accounts payable (AP) automation and fintech workflows.

High-Level Architecture

+------------------+
|  PDF Invoice     |
|  Upload (UI)     |
|  Streamlit       |
+--------+---------+
         |
         v
+------------------+
| FastAPI Backend  |
+------------------+
         |
         v
+------------------+       +-------------------+
|  OCR Layer       |       | LLM Extraction    |
| (PDF -> text)    |-----> | (Field Parsing)   |
+------------------+       +-------------------+
         |
         v
+------------------------------+
| Duplicate Detection Module   |
| (checksum + DB lookup)       |
+------------------------------+
         |
         v
+------------------------------+
|  SQLite Database             |
|  invoices table              |
+------------------------------+
         |
         v
+------------------------------+
| Streamlit Viewer             |
| list + search invoices       |
+------------------------------+
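The flow in the diagram can be sketched as a single function with pluggable stages. This is a minimal sketch, not the code from the repository: `process_invoice`, `PipelineResult`, `fake_ocr`, and `fake_extract` are hypothetical names, and an in-memory dict stands in for the SQLite table so the example runs without any external services.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class PipelineResult:
    status: str                               # "stored" or "duplicate"
    checksum: str
    fields: Optional[Dict[str, str]] = None   # parsed fields when stored


def process_invoice(
    pdf_bytes: bytes,
    ocr: Callable[[bytes], str],
    extract_fields: Callable[[str], Dict[str, str]],
    store: Dict[str, Dict[str, str]],
) -> PipelineResult:
    """Run one upload through OCR -> LLM extraction -> duplicate check -> storage."""
    text = ocr(pdf_bytes)                     # OCR layer (PDF -> text)
    fields = extract_fields(text)             # LLM extraction (field parsing)
    checksum = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if checksum in store:                     # duplicate detection
        return PipelineResult("duplicate", checksum)
    store[checksum] = fields                  # stand-in for the invoices table
    return PipelineResult("stored", checksum, fields)


# Stand-in stages (assumptions) so the sketch runs without an OCR engine or an LLM.
def fake_ocr(pdf_bytes: bytes) -> str:
    return pdf_bytes.decode("utf-8")


def fake_extract(text: str) -> Dict[str, str]:
    return {"invoice_number": text.split()[1]}


store: Dict[str, Dict[str, str]] = {}
upload = b"Invoice INV-001 Acme 2025-12-01 Total 1250.00"
print(process_invoice(upload, fake_ocr, fake_extract, store).status)  # stored
print(process_invoice(upload, fake_ocr, fake_extract, store).status)  # duplicate
```

Passing the OCR and extraction stages in as callables keeps each box in the diagram replaceable, which also makes the flow easy to test with stubs.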

Key AI Component — LLM Field Extraction

Instead of brittle regex-only parsing, an LLM helps map messy OCR text into structured JSON fields such as:

invoice_number
vendor
invoice_date
total_amount

This allows support for:

✔ different invoice layouts
✔ different vendors
✔ noisy OCR text
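One way to approach the extraction step is to ask the model for a JSON object and validate the reply before anything reaches the database. The sketch below is illustrative only: the prompt wording, the `InvoiceFields` dataclass, and the `call_llm` callable (a placeholder for whatever LLM client is used) are assumptions, not the project's actual code.

```python
import json
from dataclasses import dataclass
from typing import Callable

FIELDS = ("invoice_number", "vendor", "invoice_date", "total_amount")

# Hypothetical prompt; the real project may word this differently.
PROMPT_TEMPLATE = (
    "Extract these fields from the invoice text and reply with one JSON object "
    "using exactly these keys: {keys}. Use null for anything missing.\n\n"
    "Invoice text:\n{text}"
)


@dataclass
class InvoiceFields:
    invoice_number: str
    vendor: str
    invoice_date: str
    total_amount: float


def build_prompt(ocr_text: str) -> str:
    return PROMPT_TEMPLATE.format(keys=", ".join(FIELDS), text=ocr_text)


def parse_llm_reply(reply: str) -> InvoiceFields:
    """Parse the model's reply, tolerating extra prose around the JSON object."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in LLM reply")
    data = json.loads(reply[start:end + 1])
    missing = [key for key in FIELDS if data.get(key) in (None, "")]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Totals may come back as strings such as "1,250.00"; strip thousands separators.
    total = float(str(data["total_amount"]).replace(",", ""))
    return InvoiceFields(
        invoice_number=str(data["invoice_number"]).strip(),
        vendor=str(data["vendor"]).strip(),
        invoice_date=str(data["invoice_date"]).strip(),
        total_amount=total,
    )


def extract_fields(ocr_text: str, call_llm: Callable[[str], str]) -> InvoiceFields:
    """call_llm is any function that takes a prompt and returns the model's text."""
    return parse_llm_reply(call_llm(build_prompt(ocr_text)))
```

Raising on missing or malformed fields keeps bad extractions out of the table, so they can be routed to manual review instead of being stored silently.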

🔐 Duplicate Invoice Detection (very important)

To prevent:

double payment
invoice fraud
resubmissions

I added:

PDF text checksum
DB uniqueness constraint

Flow:

extract text
compute SHA256 hash
check if checksum already exists
if exists → block storage and flag duplicate
else insert into DB

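This mirrors the duplicate checks that real AP automation systems run before payment. Because the checksum is computed over the extracted text, it catches exact resubmissions; an invoice whose extracted text differs even slightly produces a different hash and passes this check.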
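The checksum flow can be sketched with the standard library alone. The project lists SQLite + SQLAlchemy in its stack; this example swaps in the built-in `sqlite3` module to stay self-contained, and the table columns and function names (`text_checksum`, `init_db`, `store_invoice`) are assumptions rather than the repository's actual schema.

```python
import hashlib
import sqlite3


def text_checksum(text: str) -> str:
    """SHA-256 hex digest of the extracted invoice text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def init_db(conn: sqlite3.Connection) -> None:
    # The UNIQUE constraint is the safety net behind the explicit lookup.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS invoices (
            id INTEGER PRIMARY KEY,
            invoice_number TEXT,
            vendor TEXT,
            invoice_date TEXT,
            total_amount REAL,
            checksum TEXT NOT NULL UNIQUE
        )
        """
    )


def store_invoice(conn: sqlite3.Connection, text: str, fields: dict) -> bool:
    """Insert the invoice and return True, or return False if it is a duplicate."""
    checksum = text_checksum(text)
    exists = conn.execute(
        "SELECT 1 FROM invoices WHERE checksum = ?", (checksum,)
    ).fetchone()
    if exists:
        return False  # block storage and flag duplicate
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO invoices "
                "(invoice_number, vendor, invoice_date, total_amount, checksum) "
                "VALUES (?, ?, ?, ?, ?)",
                (
                    fields.get("invoice_number"),
                    fields.get("vendor"),
                    fields.get("invoice_date"),
                    fields.get("total_amount"),
                    checksum,
                ),
            )
    except sqlite3.IntegrityError:
        # Another request inserted the same checksum between lookup and insert.
        return False
    return True


conn = sqlite3.connect(":memory:")
init_db(conn)
invoice_text = "Invoice INV-001 Acme 2025-12-01 Total 1250.00"
invoice = {"invoice_number": "INV-001", "vendor": "Acme",
           "invoice_date": "2025-12-01", "total_amount": 1250.0}
print(store_invoice(conn, invoice_text, invoice))  # True
print(store_invoice(conn, invoice_text, invoice))  # False
```

Keeping both the lookup and the uniqueness constraint means the lookup handles the common case with a clear result, while the constraint still blocks a duplicate if two uploads race each other.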

What I learned

OCR + LLMs complement each other
Databases still matter in AI projects
Duplicate detection must be designed early
Small automation = big business value
FastAPI + Streamlit = super quick prototyping stack

Tech Stack

FastAPI
Streamlit
SQLite + SQLAlchemy
Python OCR
LLM prompt parsing
REST APIs

✅ Real-world applications

Accounts Payable Automation
Fintech onboarding
Expense invoice verification
Procurement systems
ERP integrations

https://github.com/nagrgkgc/invoice_ai
