Since my last post on 𝗟𝗟𝗠-𝗖𝗼𝘂𝗻𝗰𝗶𝗹 got some good responses, I thought I’d share my experience with another library I explored about a month ago: 𝗟𝗮𝗻𝗴𝗘𝘅𝘁𝗿𝗮𝗰𝘁 by 𝗚𝗼𝗼𝗴𝗹𝗲.

At a high level, LangExtract is a direct interface for data extraction: you pass raw text as input, define your own custom schema, and the library extracts values based on that schema. Conceptually, it’s simple and quite powerful. One thing I liked is its flexibility: it serves multiple use cases, including PDF content extraction, which is a real problem space on its own.

But there are a few limitations I ran into that are worth highlighting.

First, LangExtract does 𝗻𝗼𝘁 𝗶𝗻𝗴𝗲𝘀𝘁 𝗣𝗗𝗙𝘀 𝗼𝗿 𝗼𝘁𝗵𝗲𝗿 𝗳𝗶𝗹𝗲𝘀 𝗱𝗶𝗿𝗲𝗰𝘁𝗹𝘆. It only accepts raw text as a string. So if you’re working with PDFs, PPTs, or similar formats, you need to build your own wrapper: extract the text with a PDF or PPT reader first, then pass that text into LangExtract. It works, but it adds extra engineering overhead.

Second, while the library describes a JSON-style structure for defining schemas, 𝗻𝗲𝘀𝘁𝗲𝗱 𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝘀 𝗮𝗿𝗲 𝘃𝗲𝗿𝘆 𝗹𝗶𝗺𝗶𝘁𝗲𝗱. You can define fields at one level, but going deeper becomes a problem. For example, a patient → address → street hierarchy can’t be represented cleanly. Instead, you end up defining separate flat entities and extracting them independently, which feels restrictive for complex real-world data.

That said, I still think LangExtract is important. Its real potential, in my opinion, will show up if it integrates 𝗩𝗶𝘀𝗶𝗼𝗻-𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀 (𝗩𝗟𝗠𝘀). If OCR and visual understanding become native, users could ingest PDFs, scanned documents, or images directly, without building custom wrappers. That would be a real game changer.

Overall, LangExtract is a solid idea with clear strengths, but it has some practical gaps today. I’m curious to see how it evolves, especially around multimodal ingestion.

Would love to hear if others here have tried it or faced similar constraints.

GitHub repo: https://lnkd.in/dDb3_WPX

#LangExtract #LangChain #LLMs #GenerativeAI #InformationExtraction #Google #LLMTools
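The flat-entity workaround described above can be sketched in plain Python. This is illustrative only, not LangExtract's API: the idea is to flatten the nested patient → address → street record into independent flat entities, each of which can then be defined as its own flat extraction class, with the parent link carried as an explicit attribute since the hierarchy itself is lost.

```python
# Illustrative sketch: flatten a nested patient -> address -> street record
# into the independent flat entities that a flat schema can handle.
def flatten_patient(record):
    flat = [{"class": "patient", "text": record["name"]}]
    address = record.get("address", {})
    if "street" in address:
        flat.append({
            "class": "street",
            "text": address["street"],
            # keep the link explicitly, since the hierarchy itself is lost
            "attributes": {"patient": record["name"]},
        })
    return flat

entities = flatten_patient(
    {"name": "Jane Roe", "address": {"street": "7 Elm Road"}}
)
```

Each resulting flat entity then maps to one extraction class in the schema, extracted independently and re-joined afterwards.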
LangExtract Review: Strengths and Limitations in Text Extraction
We all know how challenging it can be to turn messy, unstructured documents into clean, usable data. It is often one of the most time-consuming parts of data engineering.

I recently came across a new open-source library from Google called LangExtract, and I thought it was worth sharing with this network. It uses LLMs to help extract structured data, but what I really appreciate is its focus on transparency. Unlike many other tools, it provides "grounding" for every piece of information it extracts—meaning it points you exactly to where in the text the data came from.

For anyone working in fields where accuracy and traceability are critical (like healthcare, legal, or finance), this feels like a very thoughtful solution. It also handles long documents gracefully, which is a nice bonus.

If you have been looking for a more reliable way to handle unstructured text, I highly recommend checking out the repository.

Link: https://lnkd.in/gb6VMJRn

#DataEngineering #OpenSource #MachineLearning #Python #Google #AI
Building a robust AI system taught me one thing: the model is just a small part of the work.

I recently built a scalable, event-driven AI architecture. While prompts are important, I realized that the real magic happens in the pipelines, the data structures, and the engineering decisions behind the scenes. Here’s a high-level look at how I approached building something meant for scale and reliability, not just a demo:

1️⃣ Foundation: Stability over novelty
I chose Django + PostgreSQL because they are predictable, stable, and incredibly expressive for complex systems.

2️⃣ Understanding context with embeddings
Instead of relying on keyword matching, I used semantic embeddings to represent text as vectors. This allows the system to understand intent and meaning. These vectors are stored directly in PostgreSQL using pgvector, keeping the architecture simple and operationally efficient.

3️⃣ Retrieval before generation (RAG)
To keep responses grounded, the system first retrieves relevant information using vector search and only then asks the language model to reason over that context. This reduces hallucination and makes the output more reliable.

4️⃣ Asynchronous processing for real workloads
Tasks like embedding generation or heavy analysis shouldn’t block user requests. I used Celery with Redis to move this work to background workers, keeping APIs fast and the user experience smooth.

5️⃣ Engineering for scale and cost awareness
From indexing vectors to caching frequent queries and using background jobs, the focus was on building something that can grow without unnecessary complexity or early over-spending.

This project reinforced a simple idea for me: LLMs generate text, but engineering delivers value.

Still learning, always open to constructive feedback.

#AIEngineering #SystemDesign #RAG #VectorSearch #Embeddings #Python #Django #PostgreSQL #pgvector #Celery #Redis #BackendEngineering #ScalableSystems #SoftwareArchitecture
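The retrieval step in (3) boils down to nearest-neighbor search over stored vectors. A minimal pure-Python sketch of the distance computation behind it (pgvector's cosine-distance operator `<=>` is 1 minus cosine similarity), with toy vectors standing in for real embeddings:

```python
import math

def cosine_distance(a, b):
    # What pgvector's <=> operator computes: 1 - cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Toy "embeddings" for stored chunks (real ones have hundreds of dimensions).
store = {
    "reset password": [0.9, 0.1, 0.0],
    "billing policy": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # embedding of the user question

# Equivalent of: SELECT ... ORDER BY embedding <=> query LIMIT 1
best = min(store, key=lambda k: cosine_distance(store[k], query))
```

In production this loop is replaced by an indexed SQL query, which is exactly why keeping the vectors inside PostgreSQL simplifies the architecture.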
🚀 Breaking: Google just dropped LangExtract!

Tired of wrestling information out of messy, unstructured documents? Google just open-sourced LangExtract, a Python library designed to pull structured data with surgical precision. Whether it’s clinical notes, legal contracts, or complex reports and documents, you can now transform "wall of text" chaos into clean, usable data in just a few lines of code.

Why this is a game-changer for devs:

• 📍 Source Grounding: It doesn't just extract data; it maps every single entity back to its exact source location in the document. No more "black box" hallucinations—you can audit every result.
• 📐 Schema Enforcement: Define your output once. LangExtract ensures consistent, structured JSON that actually fits your database.
• ⚡ Built for Scale: Handles massive documents with ease using parallel processing and smart chunking.
• 📊 Visual Validation: It automatically generates interactive HTML visualizations, letting you see the extractions highlighted directly on the original text.
• 🤖 Model Agnostic: It’s not just for Google Gemini. It works with Ollama, local open-source models, and even OpenAI.
• 🧠 Few-Shot Power: No fine-tuning required. It learns your specific domain (medical, finance, manufacturing, etc.) with just a few examples.

The best part? It’s completely open source. No hidden API fees, no usage limits, and full transparency.

Ready to stop parsing and start extracting?
🔗 https://lnkd.in/g6gw6-M8

#AI #Python #OpenSource #DataScience #LLM #GoogleAI #MachineLearning #DocumentExtraction
𝗣𝗿𝗼𝗷𝗲𝗰𝘁: 𝗔𝗜 𝗥𝗲𝗰𝗲𝗶𝗽𝘁 𝗣𝗮𝗿𝘀𝗲𝗿

I built an AI application that converts unstructured photos of receipts into clean, structured JSON data. My goal was to replace manual data entry by using multi-modal LLMs to read images while ensuring the output is accurate and strictly validated.

𝗧𝗲𝗰𝗵 𝗦𝘁𝗮𝗰𝗸: Python (FastAPI), Groq SDK (Llama Vision), Docker, Pydantic, Streamlit.

𝗞𝗲𝘆 𝗙𝗲𝗮𝘁𝘂𝗿𝗲𝘀:

𝟏. 𝐕𝐢𝐬𝐮𝐚𝐥 𝐈𝐧𝐭𝐞𝐥𝐥𝐢𝐠𝐞𝐧𝐜𝐞: Used llama-4-scout (via Groq) to extract text. This allows the system to understand context, handling crumpled receipts or complex layouts.
𝟐. 𝐑𝐨𝐛𝐮𝐬𝐭 𝐕𝐚𝐥𝐢𝐝𝐚𝐭𝐢𝐨𝐧: I implemented strict Pydantic models to validate the AI's output. If the model hallucinates a date format or misses a required field, the backend catches and cleans it before the data reaches the user.
𝟑. 𝐂𝐮𝐬𝐭𝐨𝐦 𝐑𝐚𝐭𝐞 𝐋𝐢𝐦𝐢𝐭𝐢𝐧𝐠: I built a rate limiter to stop spam and prevent the API from getting overwhelmed. This keeps my AI usage within limits without needing heavy external tools like Redis.
𝟒. 𝐌𝐢𝐜𝐫𝐨𝐬𝐞𝐫𝐯𝐢𝐜𝐞𝐬 𝐀𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐮𝐫𝐞: I split the application into two distinct services (frontend and backend) and orchestrated them using Docker Compose, creating a clean, production-ready environment.
𝟓. 𝐃𝐲𝐧𝐚𝐦𝐢𝐜 𝐔𝐈: I built an interactive Streamlit dashboard that visualizes confidence scores and automatically detects currency symbols (e.g., switching between $ and ₹).

𝗟𝗶𝘃𝗲 𝗗𝗲𝗺𝗼: https://lnkd.in/gTwi7XNC
𝗔𝗣𝗜 𝗗𝗼𝗰𝘀 (𝗦𝘄𝗮𝗴𝗴𝗲𝗿 𝗨𝗜): https://lnkd.in/gJ3m8gsP
𝗦𝗼𝘂𝗿𝗰𝗲 𝗖𝗼𝗱𝗲: https://lnkd.in/gFQzsPsH

#GenerativeAI #Python #FastAPI #BackendDeveloper #Groq #ComputerVision #Docker #SoftwareEngineering
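The validation layer in (2) can be sketched with Pydantic v2. The field names and the date rule here are illustrative, not the project's actual schema: a strict model coerces well-formed values (like a numeric string price) and raises on hallucinated formats so the backend can catch them before they reach the user.

```python
import re
from pydantic import BaseModel, ValidationError, field_validator

class ReceiptItem(BaseModel):
    name: str
    price: float

class Receipt(BaseModel):
    merchant: str
    date: str
    items: list[ReceiptItem]
    total: float

    @field_validator("date")
    @classmethod
    def check_date(cls, v: str) -> str:
        # Illustrative rule: reject anything that isn't YYYY-MM-DD.
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", v):
            raise ValueError("date must be YYYY-MM-DD")
        return v

raw = {"merchant": "Cafe X", "date": "2024-05-01",
       "items": [{"name": "latte", "price": "4.50"}], "total": 4.5}
receipt = Receipt.model_validate(raw)  # coerces "4.50" -> 4.5

bad_date_caught = False
try:
    Receipt.model_validate({**raw, "date": "May 1st"})  # hallucinated format
except ValidationError:
    bad_date_caught = True
```

The same pattern extends to required-field checks: a missing field fails `model_validate`, which the API layer turns into a retry or an error response.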
Your laptop has 47 browser tabs open. Here's the AI workflow that closed them.

The Problem: Research = chaos:
- 12 tabs on competitor pricing
- 8 tabs on industry trends
- 15 tabs of docs I "might need"
- 6 articles to "read later"
Every session = 2 hours of context switching.

What I Built: A content-processing agent that reads, summarizes, and organizes everything.

Workflow: Paste URL into Slack → Python fetches content → Claude extracts insights → Categorizes: Competitor/Docs/News → Saves to Notion → Daily digest

The Code:

from langchain.document_loaders import WebBaseLoader
# (on newer LangChain versions this lives in langchain_community.document_loaders)
from langchain.chains.summarize import load_summarize_chain

def process_url(url):
    # claude (the chat model client) and save_to_notion() are defined elsewhere
    loader = WebBaseLoader(url)
    docs = loader.load()
    chain = load_summarize_chain(llm=claude, chain_type="stuff")
    summary = chain.run(docs)
    save_to_notion(url, summary)
    return summary

Real Usage:
Monday: 23 saved articles. Output: "3 pricing changes, 2 trends, 1 case study." Reading: 8 minutes vs 3 hours.
Wednesday: Competitor launched a feature. Flagged 6 hours after launch. Adjusted roadmap same day.

Results (1 Month):
→ 89 articles processed
→ 12 hours → 3 hours weekly
→ 4 opportunities found

Stack: Slack | Python | Claude | Notion | n8n
Cost: $6/month

Key Features:
✅ Auto-tagging
✅ Duplicate detection
✅ Quote extraction
✅ Weekly trends

Unexpected Win: The trend report spotted "AI code review" mentioned 4x. Added it to the roadmap. Would've missed it with scattered tabs.

Mistakes Fixed:
❌ Too-short summaries
❌ Paywall issues
✅ Priority flags added
✅ Working search built

The Prompt: "Summarize in 3 parts: 1. Main point (2 sentences) 2. Key data (bullets) 3. Actions for SaaS. Focus: competitive intel, trends, technical details"

Real Benefit: Not closing tabs. USING saved info vs hoarding links. The knowledge base is now:
- Searchable
- Organized
- Actually helpful

Framework:
1. ID the hoarding habit
2. Build intake
3. Automate: fetch → summarize → save
4. Create retrieval
5. Add digests

Truth: You don't have a saving problem. You have a finding/using problem. AI solves that.
How many tabs open right now? What are you hoarding that needs summarizing? #ProductivityHacks #AIAutomation #Python #LangChain #AITools #StartupLife #BuildInPublic #WorkflowOptimization #MachineLearning
🚀 𝗣𝘆𝘁𝗵𝗼𝗻 𝗖𝗵𝗲𝗮𝘁 𝗦𝗵𝗲𝗲𝘁 𝗘𝘃𝗲𝗿𝘆 𝗗𝗮𝘁𝗮 𝗣𝗿𝗼𝗳𝗲𝘀𝘀𝗶𝗼𝗻𝗮𝗹 𝗦𝗵𝗼𝘂𝗹𝗱 𝗠𝗮𝘀𝘁𝗲𝗿 (𝗣𝗮𝗻𝗱𝗮𝘀 & 𝗡𝘂𝗺𝗣𝘆)

Python continues to be the backbone of 𝗱𝗮𝘁𝗮 𝗮𝗻𝗮𝗹𝘆𝘀𝗶𝘀, 𝗺𝗮𝗰𝗵𝗶𝗻𝗲 𝗹𝗲𝗮𝗿𝗻𝗶𝗻𝗴, 𝗮𝗻𝗱 𝗔𝗜 𝘄𝗼𝗿𝗸𝗳𝗹𝗼𝘄𝘀. This cheat sheet is a compact reminder of the 𝗺𝗼𝘀𝘁-𝘂𝘀𝗲𝗱 𝗣𝗮𝗻𝗱𝗮𝘀 𝗮𝗻𝗱 𝗡𝘂𝗺𝗣𝘆 𝗼𝗽𝗲𝗿𝗮𝘁𝗶𝗼𝗻𝘀 you’ll apply daily in real-world projects.

🔹 𝗣𝗮𝗻𝗱𝗮𝘀: 𝗧𝘂𝗿𝗻𝗶𝗻𝗴 𝗥𝗮𝘄 𝗗𝗮𝘁𝗮 𝗶𝗻𝘁𝗼 𝗜𝗻𝘀𝗶𝗴𝗵𝘁𝘀
Pandas is your go-to library for structured data handling. 𝗞𝗲𝘆 𝗮𝗿𝗲𝗮𝘀 𝘁𝗼 𝗳𝗼𝗰𝘂𝘀 𝗼𝗻:
• 📅 𝐃𝐚𝐭𝐞𝐭𝐢𝐦𝐞 𝐅𝐮𝐧𝐜𝐭𝐢𝐨𝐧𝐬 – Extract year, month, day for time-series analysis
• 🔤 𝐒𝐭𝐫𝐢𝐧𝐠 𝐎𝐩𝐞𝐫𝐚𝐭𝐢𝐨𝐧𝐬 – Clean and preprocess messy text data
• 🔀 𝐌𝐞𝐫𝐠𝐢𝐧𝐠 & 𝐉𝐨𝐢𝐧𝐢𝐧𝐠 – Combine datasets like a pro
• 📊 𝐀𝐠𝐠𝐫𝐞𝐠𝐚𝐭𝐢𝐨𝐧 & 𝐆𝐫𝐨𝐮𝐩𝐁𝐲 – Summarize data efficiently
• 🧹 𝐌𝐢𝐬𝐬𝐢𝐧𝐠 𝐃𝐚𝐭𝐚 𝐇𝐚𝐧𝐝𝐥𝐢𝐧𝐠 – Detect and treat null values
• 🪟 𝐖𝐢𝐧𝐝𝐨𝐰 𝐅𝐮𝐧𝐜𝐭𝐢𝐨𝐧𝐬 – Rolling, expanding, and exponential metrics for trend analysis
👉 These operations are essential for EDA, feature engineering, and reporting.

🔹 𝗡𝘂𝗺𝗣𝘆: 𝗛𝗶𝗴𝗵-𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝗡𝘂𝗺𝗲𝗿𝗶𝗰𝗮𝗹 𝗖𝗼𝗺𝗽𝘂𝘁𝗶𝗻𝗴
NumPy powers fast mathematical operations under the hood of ML frameworks. 𝗖𝗼𝗿𝗲 𝗰𝗼𝗻𝗰𝗲𝗽𝘁𝘀 𝘁𝗼 𝗺𝗮𝘀𝘁𝗲𝗿:
• 🔢 𝐀𝐫𝐫𝐚𝐲 𝐂𝐫𝐞𝐚𝐭𝐢𝐨𝐧 & 𝐒𝐡𝐚𝐩𝐞 𝐌𝐚𝐧𝐢𝐩𝐮𝐥𝐚𝐭𝐢𝐨𝐧 – Control dimensions with confidence
• 📐 𝐋𝐢𝐧𝐞𝐚𝐫 𝐀𝐥𝐠𝐞𝐛𝐫𝐚 – Dot products, matrix operations, eigenvalues
• 📈 𝐒𝐭𝐚𝐭𝐢𝐬𝐭𝐢𝐜𝐚𝐥 𝐅𝐮𝐧𝐜𝐭𝐢𝐨𝐧𝐬 – Mean, variance, percentiles
• 🎲 𝐑𝐚𝐧𝐝𝐨𝐦 𝐒𝐚𝐦𝐩𝐥𝐢𝐧𝐠 – Generate reproducible experimental data
• ✂️ 𝐈𝐧𝐝𝐞𝐱𝐢𝐧𝐠 & 𝐒𝐥𝐢𝐜𝐢𝐧𝐠 – Efficient data access without loops
👉 Strong NumPy fundamentals = faster models and optimized pipelines.

🎯 𝗪𝗵𝘆 𝘁𝗵𝗶𝘀 𝗺𝗮𝘁𝘁𝗲𝗿𝘀
If you can confidently use these functions, you can:
✅ Perform clean and logical EDA
✅ Build reliable features for ML models
✅ Debug data issues faster
✅ Write efficient, production-ready code

💡 𝗧𝗶𝗽: Don’t just memorize functions—apply them on real datasets. Muscle memory comes from practice.

What part of Pandas or NumPy do you find most critical in your daily work?
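A compact example touching several of the Pandas areas above at once (datetime accessor, missing-data handling, groupby aggregation, and a rolling window), on a tiny made-up sales table:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02",
                            "2024-01-03", "2024-01-04"]),
    "store": ["A", "A", "B", "B"],
    "sales": [100.0, np.nan, 150.0, 130.0],
})

df["month"] = df["date"].dt.month                     # datetime accessor
df["sales"] = df["sales"].fillna(df["sales"].mean())  # fill the NaN with the mean
by_store = df.groupby("store")["sales"].sum()         # aggregation / groupby
rolling = df["sales"].rolling(window=2).mean()        # window function
```

Running each line in a notebook and inspecting the intermediate results is exactly the kind of practice the tip above recommends.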
The "Hello World" of GenAI is over. Gluing a Python script to an API and calling it an "app" isn't engineering. In 2026, the real challenge isn't the model, it’s the architecture.

Code Vipassana Season 13 is back to solve the biggest bottleneck in AI: the data layer. We are building 100% Postgres-compatible AI apps with Google Cloud!

Why does this matter? Most AI apps are slow and insecure because they move data out of the database to process it. We are teaching you how to move the intelligence into the database. By the end of these 4 sessions, you won't just have a "project", you’ll have a blueprint for enterprise-grade systems.

What you will learn:
- In-Database Intelligence: Forget complex Python loops. Learn to perform multimodal analysis (seeing/thinking) directly within SQL.
- Massive Scale: Master ScaNN indexing to search 1 million+ vectors with sub-second latency.
- Zero Trust Security: Use Row-Level Security (RLS) to ensure your AI agents only see what they are authorized to see, preventing data leaks by design.

The outcomes for you:
- Master the "Thinking" Stack: Move beyond basic RAG to build real-time reasoning systems with Gemini 3 Flash.
- Postgres-Native Expertise: Since AlloyDB is 100% Postgres compatible, you’re gaining skills that apply to the world's most popular database.
- Credibility: This is a hands-on, practitioner-led mission. No slide decks, just architecture and implementation.

Stop building wrappers. Start building the future.

🗓️ Dates: Jan 21, 22, 23, & 26
⏰ Time: 8:00 PM – 9:30 PM (IST)
📍 For any queries, feel free to reach out to Abirami.

Register from the link in the comments!
Day 9 — Embeddings & Vector Databases (the foundation of Semantic Search + RAG) 🔎🧠

Traditional search matches keywords. But humans search by meaning. That’s where embeddings come in.

1) What are Embeddings?
An embedding is a numeric representation of text (or images/audio) that captures its meaning. So instead of comparing words, we compare vectors:
- “How to reset my password?”
- “I forgot my login credentials”
These look different as text, but embeddings place them close together because they mean the same thing.
✅ Embeddings enable semantic search: search by intent, not exact keywords.

2) What is a Vector Database?
A vector database stores embeddings and lets you quickly find the “closest” matches using similarity search.
Think: Query → convert to embedding → find nearest vectors → return best chunks
Popular use cases:
- Document Q&A (RAG)
- Internal knowledge search (Confluence, PDFs, runbooks)
- Recommendation systems (“similar items”)
- Customer support ticket matching

3) How RAG uses Embeddings (simple flow)
- Break documents into chunks
- Create embeddings for each chunk
- Store them in a vector DB
- When a user asks a question → embed the question
- Retrieve the most relevant chunks
- Send chunks + question to the LLM to generate an answer
✅ This makes answers more accurate, grounded, and up-to-date.

4) Quick example (business scenario)
User: “What’s our on-call escalation process?”
RAG retrieves the exact policy section from your runbook and the LLM answers with the right steps.

Start learning today (hands-on):
- OpenAI Cookbook (embeddings + RAG examples): https://lnkd.in/gBfXKjW6
- Vector DB concepts (Pinecone Learn): https://lnkd.in/guQUVwhe
- Sentence Transformers (popular embedding library): https://www.sbert.net/

👉 Tomorrow (Day 10): RAG best practices — chunking, overlap, re-ranking, and how to improve answer quality.
#ArtificialIntelligence #Embeddings #VectorDatabase #SemanticSearch #RAG #LLM #GenerativeAI #AIEngineering #MLOps #TechTrends #LearnAI #AICommunity #100DaysOfAI #Day9 #TechVentureLLC
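The RAG flow in (3) can be sketched end to end with a toy bag-of-words "embedding" standing in for a real model. This is illustrative only: a production system uses a trained embedding model and a vector database, but the chunk → embed → store → retrieve → prompt pipeline has exactly this shape.

```python
from collections import Counter
import math

VOCAB = ["reset", "password", "login", "escalation", "oncall", "policy"]

def embed(text):
    # Toy embedding: word counts over a fixed vocabulary.
    cleaned = text.lower().replace("-", "").replace("?", "")
    counts = Counter(cleaned.split())
    return [counts[w] for w in VOCAB]

def similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "To reset your password open the login page",
    "The oncall escalation policy pages the secondary after 15 minutes",
]
index = [(c, embed(c)) for c in chunks]            # steps 1-3: chunk, embed, store

question = "What is our on-call escalation policy?"
q = embed(question)                                 # step 4: embed the question
best_chunk = max(index, key=lambda p: similarity(p[1], q))[0]  # step 5: retrieve
prompt = f"Context: {best_chunk}\n\nQuestion: {question}"      # step 6: to the LLM
```

Swapping `embed` for a real model (e.g., a Sentence Transformers encoder) and `index` for a vector DB turns this sketch into the Day 4 runbook scenario.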
⚡ 𝗔 𝗣𝘆𝘁𝗵𝗼𝗻 𝗳𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸 𝗳𝗿𝗼𝗺 𝗠𝗶𝗰𝗿𝗼𝘀𝗼𝗳𝘁 𝘁𝗵𝗮𝘁 𝗹𝗶𝗴𝗵𝘁𝘀 𝘂𝗽 𝗔𝗜 𝗮𝗴𝗲𝗻𝘁𝘀 🤖⚡

Most agent frameworks can run agents. They can’t help agents learn from experience. Improvement usually means:
❌ manual prompt tweaking
❌ retraining from scratch
❌ breaking working logic

That’s exactly what Agent Lightning fixes. 💡

Agent Lightning is an open-source Python framework from Microsoft that adds a training layer on top of your existing agents — without rewriting their core logic.

It works with setups you already use 👇
• LangChain
• AutoGen
• OpenAI Agents SDK

🧠 What’s different? Agent frameworks used to execute. They didn’t improve. Agent Lightning introduces a clean loop for learning over time 👇
• Capture agent traces (prompts, actions, outcomes) 📜
• Define reward functions aligned to your goals 🎯
• Apply reinforcement learning to improve behavior 🔁
All without throwing away what already works.

🚀 Key Features
• Works with existing agent stacks (LangChain, AutoGen, etc.)
• Adds a training loop with minimal code changes
• Supports RL, prompt tuning, and supervised fine-tuning
• Automatically logs prompts, actions, and rewards
• Fully customizable reward functions per use case

🔓 100% open source.

💡 Why this matters
We’re moving from agents that execute instructions to agents that adapt, learn, and improve in production. If you’re building long-lived agents, this is the missing piece between running and learning.

👉 GitHub Repo: https://lnkd.in/gJrmfGGf

#AI #AIAgents #AgenticAI #Python #LangChain #AutoGen #ReinforcementLearning #MohammadKShah

♻️ 𝗥𝗲𝗽𝗼𝘀𝘁 to help other builders discover this
➕ 𝗙𝗼𝗹𝗹𝗼𝘄 Mohammad Karimulla, PMP® for more content that makes complex AI topics feel simple.
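The capture → reward → learn loop described above can be illustrated in plain Python. This is not Agent Lightning's actual API, just the shape of the idea: score captured traces with a custom reward function, then hand the (trace, reward) pairs to an optimizer. The trace fields here are made up for the example.

```python
# Illustrative only: a custom reward function over captured agent traces.
def reward(trace):
    # Reward goal completion, lightly penalize wasted tool calls.
    score = 1.0 if trace["goal_met"] else 0.0
    score -= 0.1 * trace["tool_calls"]
    return score

traces = [
    {"goal_met": True, "tool_calls": 2},   # succeeded in 2 calls
    {"goal_met": True, "tool_calls": 6},   # succeeded, but wastefully
    {"goal_met": False, "tool_calls": 1},  # failed fast
]

# (trace, reward) pairs are what an RL or fine-tuning loop would consume.
scored = [(t, reward(t)) for t in traces]
best = max(scored, key=lambda p: p[1])[0]
```

The point of a framework like this is that the reward definition stays this small while the trace capture and the learning step are handled for you.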
OpenAI's data agent is a great example of structured / SQL data done right: https://lnkd.in/gDrQE8Pm

🎥🔊🖼️ Multimodal data requires Python, which raises the bar:
1. Schemas aren't explicit - they must be inferred from code.
2. Lineage isn't explicit either. Extracting it from code is harder, but feasible.

The upside: Python as a single language removes an entire layer of context and simplifies reasoning.

✨ The true meaning of the data lives in the code ✨