Building an Agentic Engineering RAG Pipeline and Application over a Weekend

Introduction

In late 2025, I caught up with an old college friend, C Girish, who is also a sharp data scientist. We were having a nostalgic conversation about college days when the discussion drifted to AI and machine learning applications. Over coffee we debated a question that had been gnawing at me for months: can an in‑house AI truly become agentic, not just answering questions but actually reasoning, calculating and citing sources like an engineer?

I’d read theory papers and tinkered with simple retrieval‑augmented generation (RAG) demos, but many ideas remained abstract. C Girish had hands‑on experience with tool orchestration, multi‑modal retrieval, stateful reasoning and verification pipelines. Something clicked. Over one weekend I went from dabbling in prototypes to architecting an end‑to‑end pipeline that could ingest complex engineering documents, extract formulas, perform calculations and produce plots on demand. That weekend project quickly grew into a full‑fledged Agentic RAG application.

This report documents the journey from theory to practice, explains why Agentic RAG is more than a buzzword, and describes how modern frameworks like LangGraph and Mistral 7B enable stateful, tool‑using agents. The sources I referred to while developing the system are listed in the references at the end.

Why Engineering Documents Are Challenging

Engineering PDFs, manuals and standards are notoriously messy. They combine prose with mathematical equations, cross‑referenced tables, diagrams and images. Text‑only parsers break formulas, merge table cells and lose the hierarchy of sections. Visual models are needed to recognize layout; specialized extractors preserve tables and formulas; and metadata must track page and section context. Neglecting layout and style results in loss of semantic information. To handle this complexity our pipeline starts with robust preprocessing and validation.

From Theory to Practice: Building the Pipeline

Preprocessing & Validation

Engineering documents come in various formats—born‑digital PDFs, scanned images, hybrid files—and may be hundreds of pages long. The pipeline therefore begins with validation:

  • Format & integrity checks: Verify PDF structure, page count and file size to avoid corrupt or oversized files.
  • Metadata extraction: Capture author, document type, version, sections and language.
  • Robust parsing: Use PyMuPDF for primary text extraction with pdfplumber and PyPDF as fallbacks. If text is missing or garbled, apply OCR (Tesseract) to scanned pages.
  • Formula detection: Identify LaTeX or MathML expressions for preservation.

Every attempt is logged, and if one parser fails another is tried. This multi‑tool resilience prevents catastrophic failures and records extraction quality metrics.
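The fallback logic can be sketched roughly as follows. The parser callables here are placeholders for thin wrappers around PyMuPDF, pdfplumber, PyPDF and a Tesseract OCR pass; the function and parameter names are my own illustration, not the production code:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")

def extract_with_fallbacks(path, parsers):
    """Try each (name, parser_fn) in order, logging every attempt.

    Each parser_fn takes a file path and returns extracted text, or
    raises on failure. Returns (parser_name, text) for the first
    parser that yields non-empty text.
    """
    for name, parser_fn in parsers:
        try:
            text = parser_fn(path)
            if text and text.strip():
                log.info("parser %s succeeded on %s", name, path)
                return name, text
            log.warning("parser %s returned empty text for %s", name, path)
        except Exception as exc:
            log.warning("parser %s failed on %s: %s", name, path, exc)
    raise RuntimeError(f"all parsers failed for {path}")
```

In the real pipeline the list would be ordered from fastest to most expensive, ending with OCR, and each log record would also carry the extraction quality metrics mentioned above.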

Semantic Chunking & Rich Metadata

Instead of splitting text at arbitrary character lengths, the pipeline uses semantic chunking:

  • Break at natural boundaries—sections, paragraphs or list items.
  • Preserve formulas and tables alongside their descriptions.
  • Track hierarchical relationships (section → subsection → paragraph) to enable context reconstruction.
  • Include overlaps so that cross‑sentential context is retained for retrieval.
  • Attach rich metadata such as document name, page number, section title, document type and extraction confidence.

This approach yields chunks that respect the document’s logic and support fine‑grained retrieval. A typical metadata object might include the fields shown below:

metadata = {
    "doc_name": "XYZ_Standard.pdf",
    "page": 7,
    "section": "3.2 Flow Coefficient Equations",
    "doc_type": "standard",
    "version": "2012",
    "contains_formula": True,
    "contains_table": False,
    "chunk_id": "doc_001_chunk_042",
    "parent_chunk": "doc_001_chunk_041",
    "confidence_score": 0.95,
}
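The boundary-plus-overlap idea can be sketched in a few lines. This is a deliberately minimal version, assuming paragraphs have already been segmented; the production chunker also respects section hierarchy and keeps formulas and tables with their descriptions:

```python
def semantic_chunks(paragraphs, max_chars=800, overlap=1):
    """Group paragraphs into chunks without splitting any paragraph,
    carrying `overlap` trailing paragraphs into the next chunk so
    cross-sentential context survives retrieval.
    """
    chunks, current, size = [], [], 0
    for para in paragraphs:
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:] if overlap else []
            size = sum(len(p) for p in current)
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```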

Specialized Extraction: Formulas & Tables

Standard text extraction misses formulas embedded as images and mangles table structures. To address this, the pipeline uses vision models and specialized tools:

  • Nougat (vision model): Nougat is a Visual Transformer that performs OCR on scientific documents and converts them to markup. By running a vision model over detected formula regions, formulas are output as LaTeX rather than fuzzy text.
  • Camelot / Tabula: These libraries extract tables from PDFs, preserving cell structure and metadata. Tables are stored as DataFrames for computation and as markdown for retrieval.
  • Multi‑pass verification: A three‑pass process—fast, deep and verification—ensures problematic pages are reprocessed with OCR and vision models and that outputs from different passes agree.

Quality Gates & Human‑in‑the‑Loop

Quality cannot be an afterthought when formulas and correlations feed engineering calculations. Four gates check every document:

  1. Extraction quality: Enforce a minimum text extraction ratio; flag files with missing formulas or suspiciously few tables relative to their type.
  2. Semantic validation: Use an LLM to validate extracted equations and units; reject hallucinations and inconsistent units.
  3. Cross‑reference validation: Verify that page references and section links remain intact.
  4. Duplicate detection: Use embeddings to merge near‑duplicate chunks across documents.

Chunks scoring below a confidence threshold are routed to a human reviewer. A web interface displays the PDF alongside the extracted text, allowing engineers to correct errors. Their feedback flows back into the pipeline, improving the models and extraction logic over time.
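The routing decision itself is simple; a minimal sketch, assuming each chunk carries a `confidence_score` field like the metadata object shown earlier (the threshold value here is illustrative):

```python
REVIEW_THRESHOLD = 0.8  # assumed cutoff; tuned per document type in practice

def route_chunks(chunks, threshold=REVIEW_THRESHOLD):
    """Split chunks into (auto_accepted, needs_review) by confidence.

    Low-confidence chunks go to the human review queue rendered by
    the web interface; the rest flow straight into indexing.
    """
    accepted, review = [], []
    for chunk in chunks:
        target = accepted if chunk["confidence_score"] >= threshold else review
        target.append(chunk)
    return accepted, review
```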

Retrieval & Embedding Optimization

Once documents are ingested and chunked, we build an index for retrieval:

  • Dense + sparse hybrid search: Combine dense embeddings (for semantic similarity) with BM25 or TF‑IDF scores (for keyword matches) to improve recall and precision.
  • Metadata filtering: Filter results by document type, section, language or the presence of formulas and tables.
  • Embedding cache & deduplication: Cache embeddings separately to accelerate reindexing and merge near‑duplicates to reduce noise.
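The hybrid scoring idea can be illustrated with a toy ranker. The keyword score below is a crude stand-in for BM25/TF‑IDF, and the blending weight `alpha` is an assumption, not a tuned value:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, text):
    """Sparse stand-in: fraction of query terms present in the text."""
    terms = set(query.lower().split())
    hits = sum(1 for t in terms if t in text.lower())
    return hits / len(terms) if terms else 0.0

def hybrid_rank(query, query_vec, docs, alpha=0.6):
    """Blend dense and sparse scores; each doc is (text, embedding)."""
    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text),
         text)
        for text, vec in docs
    ]
    return [text for _, text in sorted(scored, reverse=True)]
```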

Metadata Enrichment

Beyond basic metadata, an LLM extracts higher‑level concepts—key topics, equation names, referenced standards and cross‑cited pages. This enrichment enables advanced queries (e.g., “find all equations related to pressure drop in Section 3.2”) and supports automatic citation generation.
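Once the enriched fields exist, a query like the one above reduces to metadata filtering. A sketch, assuming the LLM writes its extracted concepts into a `topics` list alongside the metadata fields shown earlier:

```python
def find_equations(chunks, topic, section_prefix):
    """Answer queries like 'all equations related to pressure drop in
    Section 3.2' by filtering on enriched metadata.

    `topics` is the hypothetical LLM-added concept list; the other
    field names follow the metadata object shown above.
    """
    return [
        c for c in chunks
        if c["contains_formula"]
        and c["section"].startswith(section_prefix)
        and topic in c.get("topics", [])
    ]
```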

Monitoring & Continuous Improvement

Structured logs track metrics such as formula accuracy, table accuracy, average confidence, manual review rate and reprocessing rate. A monitoring dashboard flags regressions and highlights which documents or sections need retraining or pipeline tweaks.
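Aggregating those logs into dashboard numbers is straightforward; a sketch with illustrative event and field names (the real events carry more detail):

```python
from collections import Counter

def pipeline_metrics(events):
    """Roll structured ingestion events up into dashboard metrics.

    Each event is a dict like
    {"status": "ok" | "review" | "reprocess", "confidence": float};
    the field names are assumptions for this sketch.
    """
    counts = Counter(e["status"] for e in events)
    total = len(events) or 1
    return {
        "avg_confidence": sum(e["confidence"] for e in events) / total,
        "manual_review_rate": counts["review"] / total,
        "reprocessing_rate": counts["reprocess"] / total,
    }
```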


Architecting Agentic Workflows

Building a pipeline is only half the story; making the AI agentic requires orchestrating multiple tools and reasoning steps. Inspired by Agentic RAG principles, the system implements a stateful workflow:

retrieve → verify → calculate → plot → cite

  1. Retrieve: Query the vector store for relevant chunks with context.
  2. Verify: Check formulas, units and tag names against the retrieved sources; reject hallucinations.
  3. Calculate: Substitute numbers into equations, perform unit conversions, and show step‑by‑step calculations.
  4. Plot: Generate engineering plots (e.g., pressure vs. temperature) using Python libraries; embed these into the response.
  5. Cite: Provide inline citations with page and section references so the user can verify every statement.

An AI agent that follows this workflow isn’t just a chatbot; it is a junior engineer that can retrieve, reason, calculate and justify its answers.
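In the real system each of these steps is a LangGraph node; as a stand-alone illustration, the control flow can be sketched in plain Python (the tool registry and state-dict shape are my own simplification of how graph state is threaded through nodes):

```python
def run_agent(query, tools):
    """Drive the retrieve -> verify -> calculate -> plot -> cite loop.

    `tools` maps step names to callables that take and return a shared
    state dict, mirroring how each graph node reads and updates state.
    Verification failure short-circuits the remaining steps.
    """
    state = {"query": query, "trace": []}
    for step in ("retrieve", "verify", "calculate", "plot", "cite"):
        state = tools[step](state)
        state["trace"].append(step)
        if state.get("rejected"):  # e.g. a hallucinated formula was caught
            break
    return state
```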


Tools & Technology Behind the Pipeline

Nougat, Vision Models & Table Extraction

Extracting formulas from images requires more than OCR. Nougat, a Visual Transformer model, converts scanned documents into LaTeX markup, overcoming limitations of line‑based OCR. For tables, Camelot and Tabula detect cell boundaries and export DataFrames; these structured representations are used for calculations and cross‑section comparisons.

LayoutLM: Respecting Layout & Style

Conventional language models treat text sequentially and ignore layout. LayoutLM jointly models text and layout information in documents. It introduces 2‑D positional embeddings and image features to capture the spatial arrangement of tokens. LayoutLM achieves state‑of‑the‑art results on form understanding and receipt understanding tasks[1]. In the pipeline, LayoutLM helps detect section boundaries and semantic groupings, enabling smarter chunking and preserving context.

LangGraph: Orchestration & Durability

Agentic workflows require a runtime capable of managing state, tool calls and long‑running processes. LangGraph is a low‑level orchestration framework designed for stateful agents. It provides durable execution, streaming, human‑in‑the‑loop integration and comprehensive memory. LangGraph does not abstract away prompts; it focuses on executing graphs of nodes (LLM calls or tool invocations) and handling transitions. The pipeline uses LangGraph to implement the retrieve–verify–calculate–plot–cite loop, resume processes after interruptions and allow human inspection of intermediate states.

Mistral 7B: Powering the LLM Layer

For the language model layer we selected Mistral 7B, a 7.3‑billion‑parameter model that outperforms Llama 2 13B across benchmarks. It uses Grouped‑Query Attention and Sliding Window Attention for efficient long‑sequence processing. Mistral 7B is released under the Apache 2.0 license, enabling free commercial use. Its small footprint allows deployment on local hardware while offering competitive reasoning and coding performance[4].

Mistral 7B was fine‑tuned on instruction datasets, yielding a chat model that rivals larger 13B models. In our application it handles queries, interprets equations, generates reasoning steps and produces plots—all while citing source chunks.

Building the Full‑Fledged Application

After refining the pipeline, we integrated it into a web application. The stack includes:

  • Backend: Django serves API endpoints, manages users, stores ingestion metadata and handles scheduling of extraction jobs. LangGraph orchestrates the LLM and tool calls, while LangChain provides connectors to vector stores and embeddings.
  • LLM & Agents: Mistral 7B serves as the core LLM, driving reasoning and generation. LangGraph builds the agent graph connecting retrieval, verification, calculation, plotting and citation nodes.
  • Frontend: Bootstrap, CSS and JavaScript provide a responsive UI. Engineers can upload documents, issue queries and view results with formulas rendered in LaTeX, plots embedded inline and citations linking back to the source.
  • Additional services: A vector store (e.g., FAISS) stores embeddings; a graph database holds cross‑reference relationships; and a job scheduler processes new documents incrementally.

Key Features

  • Document ingestion: Upload standards, manuals, research papers or scanned documents and ingest them through the multi‑pass pipeline.
  • Formula & correlation retrieval: Search for specific formulas and correlations within ingested documents; the system surfaces them with context and citations.
  • Engineering calculations: Insert variables into retrieved equations, perform unit conversions and compute results step by step.
  • Plot generation: Create engineering plots (e.g., pressure–temperature curves) directly from input values and retrieved formulas, returning interactive charts in the UI.
  • Agentic reasoning: The agent decides which tool to call next, when to verify, and when to ask for more context. It can route out‑of‑scope questions to external search or to a failsafe handler, reflecting the architecture of Agentic RAG.
  • Traceability: Every answer includes page‑level citations and a link to the source, enabling engineers to verify or challenge the AI’s reasoning.

Architecture & Tech Stack Summary

[Architecture diagram: indicative only; many other libraries were utilized for full production.]

Lessons Learned & Impact

Agentic RAG vs Traditional RAG

Traditional RAG pipelines simply retrieve chunks and append them to the LLM prompt. Agentic RAG incorporates an intelligent agent that can decide which database to use, how to route a query, call APIs and evaluate results. This makes retrieval more accurate, responsive and adaptable. Our experiment confirmed these benefits: the agent could choose between different document collections, decide to perform a calculation or search the web, and gracefully handle out‑of‑scope requests.

For our case, I developed two small tools: one that writes short Python scripts for calculations, and another for plot generation.
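The core idea behind the calculation tool is to let the LLM propose an arithmetic expression but never hand it to exec(). A minimal sketch of that safeguard, assuming a whitelist-based evaluator (this is an illustration of the approach, not the production tool):

```python
import ast
import math
import operator as op

# Whitelisted operators for safely evaluating model-generated arithmetic.
_OPS = {
    ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul,
    ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg,
}

def safe_eval(expr, variables=None):
    """Evaluate an arithmetic expression without exec(), so a
    model-generated calculation cannot run arbitrary code."""
    variables = variables or {}

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            if node.id in variables:
                return variables[node.id]
            if hasattr(math, node.id):  # allow constants like pi, e
                return getattr(math, node.id)
            raise ValueError(f"unknown variable {node.id}")
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("disallowed expression")

    return _eval(ast.parse(expr, mode="eval"))
```

The plotting tool follows the same pattern, with the generated script restricted to a matplotlib call against values the agent has already verified.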

The Value of Interdisciplinary Collaboration

This project began with a conversation between two friends—one steeped in AI theory, the other grounded in engineering practice. Combining insights from AI, signal processing, software engineering and process engineering produced a system no single discipline could have built alone. Conversations across domains are catalysts for innovation.

The Future of Engineering AI

The success of this pipeline and application suggests a new paradigm for engineering AI:

  • Document‑grounded reasoning: Models must use original documents as their foundation, not just rely on pretraining. Tools like LayoutLM prove that modelling text and layout jointly yields better understanding.
  • Agentic orchestration: Stateful agents that can plan, call tools, verify results and update context outperform single‑shot prompts. LangGraph and LangChain provide the infrastructure to build these systems.
  • Fine‑tuned, open models: Open‑source models like Mistral 7B deliver strong performance in reasoning and code while remaining flexible and licence‑free.
  • Human oversight: HITL remains critical in high‑stakes domains. Human reviewers correct extraction errors, provide feedback for model retraining and ensure safe deployment.

Conclusion

What began as a weekend curiosity blossomed into a production‑ready Agentic RAG application. By addressing the messy reality of engineering documents, adopting vision models and layout‑aware transformers, and orchestrating a stateful agent with LangGraph, the system retrieves formulas and correlations, performs calculations, generates plots and cites sources—all through a web interface. Behind the scenes, the 7‑billion‑parameter Mistral 7B model powers reasoning while Django and Bootstrap deliver a polished user experience. The result is not just a chatbot but a trustworthy assistant for engineers, marking a step towards document‑grounded, multi‑modal, agentic AI systems.

References:

  1. LayoutLM: Pre-training of Text and Layout for Document Image Understanding https://arxiv.org/pdf/1912.13318
  2. Nougat: Neural Optical Understanding for Academic Documents https://www.researchgate.net/publication/373437974_Nougat_Neural_Optical_Understanding_for_Academic_Documents
  3. LangGraph overview - Docs by LangChain https://docs.langchain.com/oss/python/langgraph/overview
  4. Mistral 7B | Mistral AI https://mistral.ai/news/announcing-mistral-7b
  5. What is Agentic RAG ? The Architecture https://www.deepchecks.com/glossary/agentic-rag/

