🚀 Breaking: Google just dropped LangExtract! Struggling to extract information from messy, unstructured documents with high accuracy? Google just open-sourced LangExtract, a Python library designed to pull structured data with surgical precision. Whether it's clinical notes, legal contracts, or complex reports, you can now transform "wall of text" chaos into clean, usable data in just a few lines of code. Why this is a game-changer for devs: • 📍 Source Grounding: It doesn't just extract data; it maps every single entity back to its exact source location in the document. No more "black box" hallucinations—you can audit every result. • 📐 Schema Enforcement: Define your output once. LangExtract ensures consistent, structured JSON that actually fits your database. • ⚡ Built for Scale: Handles massive documents with ease using parallel processing and smart chunking. • 📊 Visual Validation: It automatically generates interactive HTML visualizations, letting you see the extractions highlighted directly on the original text. • 🤖 Model Agnostic: It's not just for Google Gemini. It works with Ollama, local open-source models, and even OpenAI. • 🧠 Few-Shot Power: No fine-tuning required. It learns your specific domain (medical, finance, manufacturing, etc.) with just a few examples. The best part? It's completely open source. No hidden API fees, no usage limits, and full transparency. Ready to stop parsing and start extracting? 🔗 https://lnkd.in/g6gw6-M8 #AI #Python #OpenSource #DataScience #LLM #GoogleAI #MachineLearning #DocumentExtraction
Google Open-Sources LangExtract for Accurate Document Extraction
We all know how challenging it can be to turn messy, unstructured documents into clean, usable data. It is often one of the most time-consuming parts of data engineering. I recently came across a new open-source library from Google called LangExtract, and I thought it was worth sharing with this network. It uses LLMs to help extract structured data, but what I really appreciate is its focus on transparency. Unlike many other tools, it provides "grounding" for every piece of information it extracts—meaning it points you exactly to where in the text the data came from. For anyone working in fields where accuracy and traceability are critical (like healthcare, legal, or finance), this feels like a very thoughtful solution. It also handles long documents gracefully, which is a nice bonus. If you have been looking for a more reliable way to handle unstructured text, I highly recommend checking out the repository. Link: https://lnkd.in/gb6VMJRn #DataEngineering #OpenSource #MachineLearning #Python #Google #AI
Your API isn't slow. It's telling you something. A few weeks ago, I started thinking about how we debug performance issues. When response times spike, we open dashboards. When errors increase, we check logs. And then begins the familiar ritual: scrolling through hundreds of lines, trying to connect patterns. But the logs already contain the story. We just don't always read them the right way. So I built an AI-powered API Performance Bottleneck Analyzer. It's a FastAPI microservice that: • Analyzes structured API logs • Detects P95 latency spikes and high error rates • Flags database-heavy patterns like N+1 queries • Uses GPT-4 to explain what might actually be happening Instead of just saying "this endpoint is slow," it suggests practical next steps (indexing, caching, async processing) based on the behavior it detects. What I enjoyed most wasn't the AI part. It was understanding how performance, architecture, and observability connect beneath the surface. This project made me look at logs differently. Built with FastAPI, Python, and GPT-4. Curious what backend engineers think about AI-assisted performance analysis 👇 https://lnkd.in/ej4_pxbe #BackendEngineering #SystemDesign #FastAPI #OpenAI #PerformanceEngineering
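The latency and error-rate detection described above needs no AI at all; here is a minimal, dependency-free sketch (the log schema with `latency_ms` and `status` fields is a hypothetical stand-in for real structured logs, and the thresholds are illustrative defaults):

```python
from statistics import quantiles

def analyze_endpoint(logs, p95_threshold_ms=500.0, error_rate_threshold=0.05):
    """Flag an endpoint whose P95 latency or error rate exceeds a threshold.

    `logs` is a list of dicts like {"latency_ms": 120.0, "status": 200},
    one entry per request (hypothetical schema for illustration).
    """
    latencies = sorted(r["latency_ms"] for r in logs)
    # quantiles(..., n=100) returns the 1st..99th percentile cut points;
    # index 94 is the 95th percentile.
    p95 = quantiles(latencies, n=100)[94] if len(latencies) >= 2 else latencies[0]
    errors = sum(1 for r in logs if r["status"] >= 500)
    error_rate = errors / len(logs)

    findings = []
    if p95 > p95_threshold_ms:
        findings.append(f"P95 latency spike: {p95:.0f} ms")
    if error_rate > error_rate_threshold:
        findings.append(f"High error rate: {error_rate:.1%}")
    return {"p95_ms": p95, "error_rate": error_rate, "findings": findings}
```

In the real service, the `findings` list would be what gets handed to the LLM for a plain-language explanation and remediation suggestions.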
One recent release that excited me this week: Google's LangExtract I try to share AI/ML updates that are actually useful for engineers building real systems, and LangExtract is a great example of that. What is LangExtract? This is a new open-source Python library from Google that extracts structured data from unstructured text using LLMs - with traceability, schema control, and visualization built in. Why this excites me as an AI/ML engineer: 🔹 Precise source grounding Every extracted entity maps back to the exact text span, making AI outputs auditable and explainable - critical for enterprise and regulated use cases. 🔹 Reliable structured outputs with schema enforcement You define the schema and examples, and the model follows it - reducing hallucinations and making downstream pipelines production-friendly. 🔹 Optimized for long documents Uses chunking, parallel processing, and multi-pass extraction to handle large documents (a real pain point in RAG and document AI systems). 🔹 Interactive visualization Automatically generates HTML visualizations to review extracted entities directly in the original text, great for debugging and human-in-the-loop workflows. 🔹 Flexible LLM support Works with cloud models like Gemini, OpenAI models, and local models via Ollama, with a plugin system for custom providers. 🔹 Few-shot extraction without fine-tuning You can define tasks with examples instead of training custom models - super practical for domain-specific extraction. Where I see this being useful: * Enterprise RAG pipelines * Document intelligence platforms * Knowledge graph generation * Healthcare, legal, and finance compliance workflows * Large-scale enterprise data ingestion pipelines As AI moves from chatbots to production-grade data infrastructure, tools like this make structured, traceable, and scalable AI pipelines much more practical.
I’ll keep sharing AI tools, system design insights, and updates that genuinely excite me and are useful for AI engineers building real-world systems. Link here - https://lnkd.in/gzrNjaH5 #AI #MachineLearning #LLM #DataEngineering #SystemDesign #GoogleAI #OpenSource #GenerativeAI
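To make "precise source grounding" concrete, here is a toy sketch of the idea (not LangExtract's API): every extracted string is mapped back to a character span in the source, so a reviewer can verify each claim against the original text.

```python
def ground_entities(text, entities):
    """Map each extracted entity string back to its character span in the
    source text. Toy illustration of source grounding only; LangExtract
    performs this per-entity mapping internally.
    """
    grounded, cursor = [], 0
    for entity in entities:
        start = text.find(entity, cursor)
        if start == -1:            # not found verbatim: leave ungrounded
            grounded.append((entity, None))
            continue
        end = start + len(entity)
        grounded.append((entity, (start, end)))
        cursor = end               # search forward for repeated mentions
    return grounded

note = "Patient given 250 mg amoxicillin twice daily."
spans = ground_entities(note, ["250 mg", "amoxicillin"])
# Every span points a reviewer at the exact supporting text:
for entity, span in spans:
    if span:
        assert note[span[0]:span[1]] == entity
```

The audit property is the point: any grounded output can be re-checked mechanically against the document, which is what makes this style of extraction viable in regulated settings.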
Okay, so microGPT just entered the chat: a full GPT in 243 lines of pure Python. No PyTorch, no dependencies. In Andrej Karpathy's words: "This is the full algorithmic content. Everything else is just efficiency." That "everything else" is what I want to talk about. Because before a single token is generated, data has to be ingested, stored, and retrieved - fast. The more I build with AI systems, the more I see the same truth repeat: the bottleneck is rarely the model. It's the system around it. When you strip away the hype, a RAG pipeline or an agent workflow is ultimately a read/write challenge. And if you don't understand how your storage layer behaves under pressure, your AI system will choke at scale. B-Trees vs. LSM Trees - how would one choose? Most of us reach for Postgres by default, and for good reason. - B-Trees (Postgres, MySQL) update pages in place on disk. They are optimized for consistent reads and transactional workloads: ledgers, user accounts, relational data. But under sustained write pressure (logs, embeddings, event streams), random I/O gets expensive. That's where LSM Trees come in. - LSM Trees (RocksDB, Cassandra, many modern storage engines) buffer writes in memory, append sequentially to disk, and compact in the background. You trade read amplification for massive ingestion throughput. One practical middle ground: use a log (e.g., Kafka) to absorb write spikes, then asynchronously persist consistency-critical data into a transactional store. You get burst capacity without sacrificing correctness. If you want to go deeper on storage internals, check out this video by Ben Dicken: https://lnkd.in/e-JWaXps Scaling AI isn't just about picking the right model. It's about understanding storage engines, I/O patterns, data structures, and compaction strategies. The infrastructure under the model is the model's ceiling.
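The LSM write path described above (buffer in memory, flush sequentially as sorted immutable runs, read newest-first) fits in a few lines; this toy sketch deliberately skips compaction, durability, and on-disk storage:

```python
class TinyLSM:
    """Minimal LSM-tree sketch: writes land in an in-memory memtable that is
    flushed as an immutable sorted run; reads check the memtable first, then
    runs from newest to oldest. Compaction and the WAL are omitted.
    """
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []                     # newest run last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value         # in-memory write: no random disk I/O
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sequential append of a sorted, immutable run (an "SSTable" on disk)
        self.runs.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):    # read amplification: scan runs
            if key in run:
                return run[key]
        return None
```

The `get` loop is the read amplification being traded away: a B-Tree answers a point read from one page walk, while an LSM may have to consult several runs (which is exactly what Bloom filters and compaction mitigate in real engines).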
#SoftwareEngineering #DatabaseInternals #SystemDesign #BackendDevelopment #DistributedSystems #AIEngineering #Postgres #ComputerScience #TechTalks
🔥 I just built something most ML engineers never bother to prove: Vector Databases are NOT always the answer. I built a full document retrieval benchmark system from scratch, comparing: Vector DB search (FAISS + SentenceTransformers) vs Index-based retrieval (PageIndex using TF-IDF) …and evaluated them on real, production-relevant metrics. 📊 What the data actually showed: 1. Avg Query Latency → PageIndex: 0.24 ms ✅ → Vector DB: 6.1 ms 2. Index Build Time → PageIndex: 0.004 s ✅ → Vector DB: 1.88 s 3. Memory Usage → PageIndex: 105 MB ✅ → Vector DB: 420 MB 4. Semantic Recall → PageIndex: — → Vector DB: Better ✅ 🧠 The takeaway no one tells you: PageIndex was ~25× faster and used ~4× less memory, with equal recall on keyword-heavy queries. Dense embeddings do win for semantic understanding. But if your queries are structured or keyword-rich, you may be burning GPU compute, memory, and latency for zero gain. 🛠️ What I built (end-to-end): ✅ TF-IDF pipeline with cosine similarity retrieval ✅ FAISS semantic search using all-MiniLM-L6-v2 ✅ Evaluation suite: Recall@K, Precision@K, MRR, P95 latency ✅ Memory profiling at every pipeline stage using psutil ✅ Web app to upload PDFs, run queries, and compare both systems side-by-side ✅ Reproducible benchmark outputs exported to CSV 🚀 Why this matters for real production systems: Before paying for Pinecone, Chroma, or Weaviate, benchmark your actual query patterns. In many real-world systems: Keyword-first + semantic fallback Hybrid retrieval often outperforms pure vector search at a fraction of the cost. This kind of tradeoff analysis is what separates engineers who use tools from engineers who understand systems. 🔗 GitHub: https://lnkd.in/edRH4xcA Tech stack: Python · FAISS · SentenceTransformers · scikit-learn · Django · PyMuPDF · psutil · pandas
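For anyone reproducing the evaluation side of a benchmark like this, Recall@K and MRR (two of the metrics in the suite above) are only a few lines each; a minimal sketch:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant doc ids that appear in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(queries):
    """Mean Reciprocal Rank over a list of (retrieved_ids, relevant_ids) pairs:
    1/rank of the first relevant hit per query, averaged; 0 if no hit."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```

Running both metric functions over the same query set for each retriever is what makes a latency/recall tradeoff claim like "equal recall, 25× faster" checkable rather than anecdotal.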
RAG in 50 Lines of Python. And It Actually Works. If you’re trying to understand Retrieval-Augmented Generation without drowning in theory, a simple Python + LangChain setup makes it click fast. At its core, RAG is not magic. It’s two components: An index. A language model. First, you split your documents into chunks. Then you generate embeddings. Store them in a vector database like Chroma. At query time, you embed the user’s question, retrieve the most similar chunks, and inject them into the prompt before calling the LLM. That’s it. The beauty of the “naive” RAG flow is how transparent it is. Vector store -> similarity search -> retriever Retriever output -> formatted context -> prompt Prompt -> LLM -> grounded answer When you add a strict instruction like “Only answer based on the provided context,” you can clearly see the difference between grounded responses and hallucination. Ask it something outside the indexed data, and it correctly says, “I don’t know.” That’s the moment RAG stops being hype and becomes engineering. What I like about the Python + LangChain approach is how composable it is. Chroma handles storage. OpenAI handles embeddings and generation. LangChain wires the flow together cleanly. No heavy infrastructure. No overengineering. Just chunk -> embed -> store -> retrieve -> generate. From there, you can evolve: Better chunking strategies Metadata filtering Hybrid search Reranking Caching Evaluation pipelines The naive version is small. The production version is architecture. If you’re learning AI integration, don’t start with agents. Start with a simple RAG loop and understand every line. That’s where real intuition forms. 
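The chunk -> embed -> store -> retrieve -> generate loop above can be shown end to end, minus the LLM call itself. This dependency-free sketch swaps the real embedding model and Chroma for a toy term-frequency vector and an in-memory list, and stops at prompt assembly; everything past that is one API call:

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a term-frequency vector (stands in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(question, chunks, k=2):
    """Retrieve the k most similar chunks and inject them into the prompt."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    context = "\n".join(ranked[:k])
    return (f"Only answer based on the provided context.\n"
            f"Context:\n{context}\n\nQuestion: {question}")

chunks = ["Chroma stores embeddings.",
          "Paris is the capital of France.",
          "RAG retrieves then generates."]
prompt = build_prompt("What is the capital of France?", chunks, k=1)
# The top-ranked chunk lands in the prompt; the LLM call would come next.
```

Swapping `embed` for a real embedding model and the `sorted` call for a vector-store similarity search gives the production shape without changing the flow, which is exactly why the naive version is worth understanding line by line.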
Explore more: https://lnkd.in/gUdZjByu #Python #Java #ML #DevOps #Data #LangChain #RAG #LLM #AI #VectorSearch #OpenAI
Dear Data Analyst, You are not paid to build models. You are paid to reduce uncertainty. You learned Python, SQL, prompt engineering. You fine-tune LLMs. You deploy RAG pipelines. You optimize token usage. Then the CEO asks a basic question: "Will this feature increase retention next quarter?" And the answer is buried in a notebook no one reads. AI is not about larger models or cleaner embeddings. It is about changing a decision before money is spent. If your chatbot has 95 percent accuracy but no one uses its output, it is a demo. If your forecast shifts budget allocation by 12 percent and improves margin, that is IMPACT. A strong AI function is not measured by experiments run. It is measured by decisions altered. Stop optimizing models in isolation. Start optimizing consequences.
The era of $50K enterprise document extraction software is over. Google just killed it with a few lines of Python. It's called LangExtract - and it just crossed 31,000+ GitHub stars in 7 months. Here's why every backend engineer should know about it:

The problem it solves:
→ Unstructured text (clinical notes, contracts, reports)
→ Structured, auditable JSON
→ With CHARACTER-LEVEL source grounding

That last part is the game changer. Every single entity it extracts maps back to exact start:end character offsets in your source document. Character-level grounding means you can verify every extraction against the source - no blind trust required. Full provenance trail.

How the pipeline works:
1. Raw text → smart chunking (sentence-aware splits via max_char_buffer)
2. Parallel LLM extraction across chunks (up to 20 workers)
3. Multi-pass extraction for maximum recall (extraction_passes=3)
4. Source grounding → exact CharInterval(start, end) per entity
5. Deduplicated, schema-consistent structured output
6. Interactive HTML visualization for human review

What makes it different:
→ Character-level provenance - click any entity, see exactly where it came from
→ Zero fine-tuning - a natural-language prompt plus few-shot examples is all you need
→ 6+ LLM backends - Gemini, OpenAI, Ollama, vLLM, Outlines, llama-cpp-python
→ Built for regulated industries - healthcare, legal, finance
→ Interactive visualization - self-contained HTML files for reviewing entities in context
→ Apache 2.0 - completely free

The quick version:

import langextract as lx

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)
# → AnnotatedDocument with CharInterval(start, end) per entity

After 17+ years building backend systems that process financial documents, this is the extraction library I wish existed a decade ago. If you're processing unstructured text at scale - especially in regulated industries where "where did this data come from?" matters - LangExtract is worth your weekend. 🔗 https://lnkd.in/dAKSMbUi Which use case would you try this on first? Drop it below 👇 Repost if someone in your network processes documents for a living.
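Step 1 of the pipeline described above, sentence-aware chunking under a character budget, can be sketched in a few lines. This is an illustration of the idea, not LangExtract's actual chunker, and `max_chars` here merely stands in for the role of its `max_char_buffer` parameter:

```python
import re

def chunk_text(text, max_chars=200):
    """Pack whole sentences into chunks of at most `max_chars`, so no
    extraction window cuts a sentence in half. A single sentence longer
    than the budget becomes its own oversize chunk.
    (Illustrative sketch only, not LangExtract's implementation.)
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Keeping sentences intact is what lets the downstream extraction step return character offsets that still make sense when stitched back into the full document.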
🚀 𝐓𝐡𝐞 𝐏𝐲𝐭𝐡𝐨𝐧 𝐄𝐜𝐨𝐬𝐲𝐬𝐭𝐞𝐦: 𝐒𝐤𝐢𝐥𝐥𝐬 𝐄𝐯𝐞𝐫𝐲 𝐃𝐞𝐯𝐞𝐥𝐨𝐩𝐞𝐫 𝐒𝐡𝐨𝐮𝐥𝐝 𝐌𝐚𝐬𝐭𝐞𝐫

Python is not just a programming language anymore — it's an entire ecosystem powering data science, machine learning, web development, automation, and artificial intelligence.

Python Certification Course: https://lnkd.in/dBBCzZyh

For developers who want to stay competitive in today's tech landscape, understanding the Python ecosystem is a game changer. Here are some of the most important tools and libraries every developer should know:

🔹 Data Analysis — Pandas: clean, transform, and analyze structured data efficiently.
🔹 Machine Learning — Scikit-learn: a powerful library for building predictive models and implementing ML algorithms.
🔹 Deep Learning — PyTorch & TensorFlow: widely used frameworks for building neural networks and advanced AI systems.
🔹 Natural Language Processing — NLTK: enables machines to understand and process human language.
🔹 Computer Vision — OpenCV: image processing, facial recognition, object detection, and more.
🔹 Web Scraping — BeautifulSoup: extracts useful data from websites for research, analytics, or automation.
🔹 APIs — FastAPI: a modern framework for building high-performance APIs quickly.
🔹 Full-Stack Web Development — Django: a robust framework for building scalable web applications.
🔹 Lightweight Web Development — Flask: perfect for small applications, prototypes, and microservices.
🔹 Big Data Processing — PySpark: lets Python developers work with large-scale distributed data systems.
🔹 Workflow Automation — Apache Airflow: schedules and manages complex data pipelines.
🔹 Scientific Computing — NumPy: the foundation for numerical computing and mathematical operations in Python.
🔹 Visualization — Matplotlib: transforms raw data into meaningful charts and visual insights.
🔹 ML App Deployment — Streamlit: quickly converts ML models into interactive web applications.
🔹 Desktop Applications — Kivy: build cross-platform desktop and mobile applications.
🔹 Cloud Automation — Boto3: interact with AWS services directly from Python.
🔹 AI Agents — LangChain: build intelligent AI applications powered by large language models.
🔹 Web Automation — Selenium: automate browser tasks for testing, scraping, and repetitive workflows.
This tool bridges the implementation gap perfectly. I've tested similar extraction workflows, and the source grounding feature addresses trust concerns non-tech founders have with AI outputs.