🚀 Building a Local Document RAG System Using Node.js, Supabase, and OpenAI
AI-powered search is exploding right now, and Retrieval-Augmented Generation (RAG) is at the heart of it. From chatbots that understand your company policies to internal knowledge assistants, RAG lets you query private documents using natural language—securely and accurately.
In this article, I’ll walk you through how I built a fully local document RAG application using Node.js and Express, Supabase (Postgres + pgvector), OpenAI embeddings, and pdf-parse.
📄 The system reads PDFs directly from a local MyDocs/ folder, processes them automatically, and stores embeddings in Supabase for fast semantic search.
🧑‍💻 You can find the project’s code here:
Let’s break it down.
💡 What We’re Building
Imagine placing a PDF like Policies.pdf inside a folder. Now you want to ask:
“What is the leave policy?” “How many casual leaves are allowed?” “What is the work-from-home guideline?”
This RAG system:

- reads PDFs from the local MyDocs/ folder
- chunks and embeds their text
- stores the vectors in Supabase
- answers natural-language questions about your documents

All from your local machine.
🏗️ Architecture Overview
Here’s how the flow works end-to-end:
MyDocs/ PDFs
↓
pdf-parse extracts text
↓
Chunking (1000 chars + 200 overlap)
↓
OpenAI Embeddings (1536-d vectors)
↓
Supabase (pgvector)
↓
Semantic Search (match_documents RPC)
↓
LLM (GPT-4.1-mini or any model)
↓
Answer to User
Simple, modular, and scalable.
📂 Folder Structure
Document-RAG-App/
│
├── MyDocs/
│ ├── Policies.pdf # Local documents
│
├── index.js # Main RAG backend
├── package.json
├── .env # API keys
└── README.md
Just drop your PDFs into MyDocs/ and hit the indexing API.
🔧 Setting Up the Project
Step 1: Install Dependencies
npm install express cors dotenv openai @supabase/supabase-js pdf-parse
Step 2: Environment Variables
Create .env:
OPENAI_API_KEY=your_openai_key
SUPABASE_URL=https://your-project-url.supabase.co
SUPABASE_ANON_KEY=your_anon_key
PORT=3000
🗄️ Supabase Setup (Vector DB)
Supabase provides Postgres + pgvector, a perfect fit for RAG apps.
1. Enable the vector extension
create extension if not exists vector;
2. Create the document table
create table if not exists "MyPDFDocuments" (
id bigserial primary key,
content text,
embedding vector(1536),
title text,
source text,
path text,
created_at timestamptz default now()
);
3. Create a vector index (recommended)
create index if not exists mypdfdocuments_embedding_idx
on "MyPDFDocuments"
using ivfflat (embedding vector_cosine_ops)
with (lists = 100);
4. Add the semantic search RPC function
create or replace function match_documents(
query_embedding vector(1536),
match_threshold float,
match_count int
)
returns table (
id bigint,
content text,
similarity float
)
language plpgsql
as $$
begin
return query
select
d.id,
d.content,
1 - (d.embedding <=> query_embedding) as similarity
from "MyPDFDocuments" d
where 1 - (d.embedding <=> query_embedding) > match_threshold
order by similarity desc
limit match_count;
end;
$$;
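From Node, this function can be called through the Supabase client’s `rpc` helper. Here’s a minimal sketch; the `matchDocuments` wrapper name and the default threshold/count values are my own:

```javascript
// Thin wrapper around the match_documents RPC defined above.
// `supabase` is the client created with createClient(url, key).
async function matchDocuments(supabase, queryEmbedding, threshold = 0.5, count = 5) {
  const { data, error } = await supabase.rpc('match_documents', {
    query_embedding: queryEmbedding,
    match_threshold: threshold,
    match_count: count,
  });
  if (error) throw error;
  return data; // [{ id, content, similarity }, ...]
}
```

Each row comes back with the `similarity` score computed in SQL, so the caller never has to touch raw vectors.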
📥 Indexing Local PDF Files
Your app exposes a single endpoint:
▶️ POST /index-docs
This:

- scans the MyDocs/ folder for PDF files
- extracts and chunks their text
- embeds each chunk with OpenAI
- inserts the vectors into Supabase

Example:
curl -X POST http://localhost:3000/index-docs
You’ll see logs like:
Indexing PDF: MyDocs/Policies.pdf
❓ Ask Anything With Natural Language
▶️ POST /query
Request body:
{
"query": "What is the leave policy?"
}
Example query:
curl -X POST http://localhost:3000/query \
-H "Content-Type: application/json" \
-d '{"query": "What is the leave policy?"}'
The app:

- embeds your question
- retrieves the most similar chunks via the match_documents RPC
- asks the LLM to answer using only those chunks

Response:
{
"answer": "According to the company leave policy..."
}
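The query path can be sketched as a single function. Again, this is an illustrative sketch rather than the repo’s exact code: the `answerQuery` name, the system prompt, and the threshold/count values are my own assumptions, with `openai` and `supabase` assumed to be the clients created at startup:

```javascript
// Sketch of the logic behind POST /query.
async function answerQuery({ openai, supabase, query }) {
  // 1. Embed the user's question.
  const emb = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query,
  });

  // 2. Retrieve the most similar chunks via the match_documents RPC.
  const { data: chunks, error } = await supabase.rpc('match_documents', {
    query_embedding: emb.data[0].embedding,
    match_threshold: 0.5,
    match_count: 5,
  });
  if (error) throw error;

  // 3. Ask the LLM to answer using only the retrieved context.
  const context = chunks.map((c) => c.content).join('\n---\n');
  const completion = await openai.chat.completions.create({
    model: 'gpt-4.1-mini',
    messages: [
      { role: 'system', content: 'Answer using only the provided context.' },
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${query}` },
    ],
  });
  return { answer: completion.choices[0].message.content };
}
```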
🧠 Understanding the RAG Pipeline
1. PDF Extraction (pdf-parse)
Extracts raw text from each page of the PDF.
2. Chunking
The extracted text is split into chunks of:

- 1000 characters each
- with a 200-character overlap between neighbours

This overlap ensures context continuity during vector search.
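A simple sliding-window chunker is all this takes. A minimal sketch (the function name is mine):

```javascript
// Split text into fixed-size chunks with overlap, so sentences that
// straddle a chunk boundary still appear intact in at least one chunk.
function chunkText(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  const step = chunkSize - overlap; // 800 chars of new text per chunk
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}
```

With the defaults, each chunk repeats the last 200 characters of the previous one.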
3. Embeddings
Uses:

- Model: text-embedding-3-small
- 1536-dimensional vectors
4. Vector Storage
Each chunk becomes a vector row in Supabase.
5. Semantic Search
Cosine similarity picks the most relevant chunks.
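The `<=>` operator in pgvector computes cosine distance, and `1 - distance` is the similarity the RPC returns. In plain JavaScript, the same score looks like this:

```javascript
// Cosine similarity between two equal-length vectors:
// dot(a, b) / (|a| * |b|), ranging from -1 to 1 (1 = same direction).
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

In production, the database does this, accelerated by the ivfflat index, so the Node app never computes similarities itself.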
6. Final Answer
LLM generates a precise answer using those chunks.
🐛 Troubleshooting