⚡ Stop Building Batch Pipelines. Real-Time Is the New Default.

One of the most underrated yet fast-trending GitHub projects right now is Pathway — and it’s a big deal for anyone building AI or data-driven products.

🔗 GitHub: https://lnkd.in/g94iR7-c

Why Pathway is gaining traction 👇
🔁 Real-time data processing (not batch-only like traditional ETL)
🧠 Built for AI pipelines — streaming → embeddings → inference
🐍 Python-native (no JVM pain)
⚙️ Unified ETL + streaming + analytics
🚀 Perfect for LLM apps, RAG systems, fraud detection, live analytics

If you’re building:
• AI agents
• LLM-powered products
• Streaming analytics
• Real-time dashboards
…this is a framework you should absolutely explore.

The future of data isn’t offline processing. It’s live, continuous, and AI-native.

⭐ Star it. Study it. Build with it.

#GitHubTrending #DataEngineering #AI #LLM #RealTimeData #OpenSource #Startups
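The batch-versus-streaming distinction the post leans on fits in a few lines of plain Python. This is a conceptual sketch only, not Pathway's API: a batch job recomputes its answer from the full dataset on every run, while a streaming job folds each new event into running state.

```python
def batch_average(events):
    """Batch: re-reads the full dataset on every run."""
    return sum(events) / len(events)

class StreamingAverage:
    """Streaming: O(1) state updated per event, result always current."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count  # fresh answer after every event

stream = StreamingAverage()
for price in [10.0, 20.0, 30.0]:
    live_avg = stream.update(price)

# Same result, but the streaming version never re-reads old events.
assert live_avg == batch_average([10.0, 20.0, 30.0]) == 20.0
```

Frameworks like Pathway generalize this idea to joins, windows, and embeddings, but the core trade is the same: incremental state instead of full recomputation.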
The "Inconvenient Truth" about GenAI Engineering 🚀

Everyone is chasing Python and GenAI prototypes. But as a Senior Engineer who has spent years building at scale, I see a massive gap in the conversation. Here is the debate: is GenAI just a "toy" until it meets a Java enterprise backend?

I’ve spent thousands of hours at the intersection of Human Capital Data and Agentic Workflows, and here is my contrarian take:

1. Python builds the "Brain," but Java builds the "Infrastructure."
You can create a brilliant multi-agent workflow with LangGraph in a Jupyter notebook. But if you can't wrap it in a high-concurrency Spring Boot microservice that handles 100k+ requests with 99.9% uptime, it will never leave the lab.

2. Agents are the new Microservices.
We used to orchestrate static APIs. Now, we orchestrate LLMs. The transition from REST to agentic workflows isn't a trend; it’s the new standard for enterprise architecture. If you don't know how to deploy LangChain inside a Kubernetes cluster, you aren't building a product; you're building a demo.

3. Data is the only real Moat.
Whether you use Databricks or Azure, your AI is only as smart as your data pipeline. Large-scale data only becomes a strategic asset when your SQL and big-data backend are robust enough to feed the LLM exactly what it needs, when it needs it.

4. The "Unicorn" of 2026.
The market doesn't need more "Prompt Engineers." It needs Full-Stack AI Architects who can bridge the gap between a Python ML model and a production-grade Java ecosystem.

The Debate: Is Java's legendary stability the only thing keeping the AI revolution from breaking? Or is the "Python-first" world finally ready to kill the enterprise monolith?

I want to hear from the Architects. Agree or disagree? 👇

#GenAI #Java #SpringBoot #SoftwareEngineering #MachineLearning #SystemDesign #BigData #Python #EnterpriseAI #CloudComputing #LLM
Since my last post on 𝗟𝗟𝗠-𝗖𝗼𝘂𝗻𝗰𝗶𝗹 got some good responses, I thought I’d share an experience with another library I explored about a month ago: 𝗟𝗮𝗻𝗴𝗘𝘅𝘁𝗿𝗮𝗰𝘁 by 𝗚𝗼𝗼𝗴𝗹𝗲.

At a high level, LangExtract gives you an interface to work directly with data extraction. You pass raw text as input, define your own custom schema, and the library extracts values based on that schema. Conceptually, it’s simple and quite powerful. One thing I liked is its flexibility: it can be used for multiple use cases, including PDF content extraction, which is a real problem space on its own.

But there are a few limitations I ran into that are worth highlighting.

First, LangExtract does 𝗻𝗼𝘁 𝗶𝗻𝗴𝗲𝘀𝘁 𝗣𝗗𝗙𝘀 𝗼𝗿 𝗼𝘁𝗵𝗲𝗿 𝗳𝗶𝗹𝗲𝘀 𝗱𝗶𝗿𝗲𝗰𝘁𝗹𝘆. It only accepts raw text as a string. So if you’re working with PDFs, PPTs, or similar formats, you need to build your own wrapper: extract text with a PDF or PPT reader first, then pass that text into LangExtract. It works, but it adds extra engineering overhead.

Second, while the library mentions a JSON-style structure for defining schemas, 𝗻𝗲𝘀𝘁𝗲𝗱 𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝘀 𝗮𝗿𝗲 𝘃𝗲𝗿𝘆 𝗹𝗶𝗺𝗶𝘁𝗲𝗱. You can define fields at one level, but going deeper becomes a problem. For example, if you model a patient → address → street hierarchy, you can’t represent it cleanly in a hierarchical way. Instead, you end up defining separate flat entities and extracting them independently, which feels restrictive for complex real-world data.

That said, I still think LangExtract is important. Its real potential, in my opinion, will show up if it integrates 𝗩𝗶𝘀𝗶𝗼𝗻-𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀 (𝗩𝗟𝗠𝘀). If OCR and visual understanding become native, users could directly ingest PDFs, scanned documents, or images without building custom wrappers. That would be a real game changer.

Overall, LangExtract is a solid idea with clear strengths, but also some practical gaps today. I’m curious to see how it evolves, especially around multimodal ingestion. Would love to hear if others here have tried it or faced similar constraints.

GitHub Repo: https://lnkd.in/dDb3_WPX

#LangExtract #LangChain #LLMs #GenerativeAI #InformationExtraction #Google #LLMTools
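The flattening workaround described above (separate flat entities instead of a nested patient → address → street schema) can be sketched in plain Python. This is illustrative only, not LangExtract's API; the field names and the `patient_id` link are my own invention.

```python
# A nested record you'd *like* to model hierarchically...
nested_record = {
    "patient": {
        "name": "Jane Doe",
        "address": {"street": "1 Main St", "city": "Springfield"},
    }
}

def flatten(record):
    """Split a nested record into independent flat entities, linked by a key."""
    patient = record["patient"]
    patient_entity = {"type": "patient", "id": "p1", "name": patient["name"]}
    address_entity = {
        "type": "address",
        "patient_id": "p1",  # explicit reference replaces the lost nesting
        "street": patient["address"]["street"],
        "city": patient["address"]["city"],
    }
    return [patient_entity, address_entity]

entities = flatten(nested_record)
assert entities[1]["patient_id"] == entities[0]["id"]
```

The join key carries the hierarchy that the flat schema can't express, which is exactly the extra bookkeeping the post calls restrictive.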
⚡ 𝗔 𝗣𝘆𝘁𝗵𝗼𝗻 𝗳𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸 𝗳𝗿𝗼𝗺 𝗠𝗶𝗰𝗿𝗼𝘀𝗼𝗳𝘁 𝘁𝗵𝗮𝘁 𝗹𝗶𝗴𝗵𝘁𝘀 𝘂𝗽 𝗔𝗜 𝗮𝗴𝗲𝗻𝘁𝘀 🤖⚡

Most agent frameworks can run agents. They can’t help agents learn from experience. Improvement usually means:
❌ manual prompt tweaking
❌ retraining from scratch
❌ breaking working logic

That’s exactly what Agent Lightning fixes. 💡

Agent Lightning is an open-source Python framework from Microsoft that adds a training layer on top of your existing agents — without rewriting their core logic. It works with setups you already use 👇
• LangChain
• AutoGen
• OpenAI Agents SDK

🧠 What’s different? Agent frameworks used to execute; they didn’t improve. Agent Lightning introduces a clean loop for learning over time 👇
• Capture agent traces (prompts, actions, outcomes) 📜
• Define reward functions aligned to your goals 🎯
• Apply reinforcement learning to improve behavior 🔁

All without throwing away what already works.

🚀 Key Features
• Works with existing agent stacks (LangChain, AutoGen, etc.)
• Adds a training loop with minimal code changes
• Supports RL, prompt tuning, and supervised fine-tuning
• Automatically logs prompts, actions, and rewards
• Fully customizable reward functions per use case

🔓 100% open source.

💡 Why this matters: we’re moving from agents that execute instructions to agents that adapt, learn, and improve in production. If you’re building long-lived agents, this is the missing piece between running and learning.

👉 GitHub Repo: https://lnkd.in/gJrmfGGf

#AI #AIAgents #AgenticAI #Python #LangChain #AutoGen #ReinforcementLearning

♻️ 𝗥𝗲𝗽𝗼𝘀𝘁 to help other builders discover this ➕ 𝗙𝗼𝗹𝗹𝗼𝘄 Mohammad Karimulla, PMP® for more content that makes complex AI topics feel simple.
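The capture → reward → improve loop described above can be sketched conceptually in plain Python. This is not Agent Lightning's API; the trace format, reward function, and "pick the best prompt variant" step are illustrative stand-ins for what a real RL or prompt-tuning loop would do.

```python
traces = []  # (prompt variant, outcome) records captured at runtime

def reward(outcome):
    """Reward aligned to a goal: here, task success minus a retry penalty."""
    return 1.0 if outcome["success"] else -0.5 * outcome["retries"]

def record(variant, outcome):
    traces.append({"variant": variant, "reward": reward(outcome)})

# Simulated runs of two prompt variants
record("v1", {"success": False, "retries": 2})
record("v1", {"success": True, "retries": 0})
record("v2", {"success": True, "retries": 0})
record("v2", {"success": True, "retries": 0})

def best_variant(traces):
    """A crude 'training' step: keep the variant with the highest mean reward."""
    scores = {}
    for t in traces:
        scores.setdefault(t["variant"], []).append(t["reward"])
    return max(scores, key=lambda v: sum(scores[v]) / len(scores[v]))

assert best_variant(traces) == "v2"
```

A real framework replaces the final step with RL or supervised fine-tuning, but the loop's shape (log everything, score it, update behavior) is the same.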
Most AI coding tools try to solve every bug from scratch. But why are we making them reinvent the wheel when humans have already shared the answers on GitHub?

A new framework called MemGovern changes this by turning millions of messy GitHub discussions into structured "experience cards." It filters out social chatter and saves the actual technical logic — like root causes and fix strategies — so an AI agent can look them up when it gets stuck.

This matters because it moves AI beyond simple guessing. By letting agents "search and browse" historical human fixes, they become much more accurate and efficient at resolving real-world software issues.

One practical takeaway: quality beats quantity. The researchers found that providing an AI with clean, structured memory is far more effective and stable than just dumping raw, noisy data into a prompt.

How do you think we can best bridge the gap between historical human expertise and AI reasoning?

https://lnkd.in/dWKHkk4Y
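A toy sketch of the "experience card" idea: keep the technical signal from a GitHub thread, drop the social chatter. The card fields and chatter filter below are my own guesses at the concept, not MemGovern's actual schema or pipeline.

```python
CHATTER = {"thanks", "+1", "same issue", "any update?"}

def build_card(issue_title, comments):
    """Distill a noisy thread into a structured, searchable memory entry."""
    technical = [c for c in comments if c.lower().strip() not in CHATTER]
    return {
        "symptom": issue_title,
        "root_cause": technical[0] if technical else None,
        "fix_strategy": technical[-1] if len(technical) > 1 else None,
    }

card = build_card(
    "TypeError when loading config",
    ["+1", "The loader assumes YAML but gets JSON", "Sniff the format before parsing"],
)
assert card["root_cause"] == "The loader assumes YAML but gets JSON"
assert card["fix_strategy"] == "Sniff the format before parsing"
```

The point of the structure is retrieval: an agent can match a new error against `symptom` fields instead of re-reading raw threads.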
Building a robust AI system taught me one thing: the model is just a small part of the work.

I recently built a scalable, event-driven AI architecture. While prompts are important, I realized that the real magic happens in the pipelines, the data structures, and the engineering decisions behind the scenes. Here’s a high-level look at how I approached building something meant for scale and reliability, not just a demo:

1️⃣ Foundation: stability over novelty
I chose Django + PostgreSQL because they are predictable, stable, and incredibly expressive for complex systems.

2️⃣ Understanding context with embeddings
Instead of relying on keyword matching, I used semantic embeddings to represent text as vectors. This allows the system to understand intent and meaning. These vectors are stored directly in PostgreSQL using pgvector, keeping the architecture simple and operationally efficient.

3️⃣ Retrieval before generation (RAG)
To keep responses grounded, the system first retrieves relevant information using vector search and only then asks the language model to reason over that context. This reduces hallucination and makes the output more reliable.

4️⃣ Asynchronous processing for real workloads
Tasks like embedding generation or heavy analysis shouldn’t block user requests. I used Celery with Redis to move this work to background workers, keeping APIs fast and the user experience smooth.

5️⃣ Engineering for scale and cost awareness
From indexing vectors to caching frequent queries and using background jobs, the focus was on building something that can grow without unnecessary complexity or early over-spending.

This project reinforced a simple idea for me: LLMs generate text, but engineering delivers value. Still learning, always open to constructive feedback.

#AIEngineering #SystemDesign #RAG #VectorSearch #Embeddings #Python #Django #PostgreSQL #pgvector #Celery #Redis #BackendEngineering #ScalableSystems #SoftwareArchitecture
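The retrieval-before-generation step (point 3) reduces to ranking stored vectors by cosine similarity and handing only the top match to the LLM. In the architecture above that ranking is pgvector's job inside PostgreSQL; the pure-Python sketch below shows the same math with made-up document vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "vector store": document -> embedding (real embeddings have hundreds of dims)
documents = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
}

def retrieve(query_vec, k=1):
    """Return the k most similar documents to ground the LLM's answer."""
    ranked = sorted(documents, key=lambda d: cosine(query_vec, documents[d]), reverse=True)
    return ranked[:k]

# A query vector close to "refund policy" retrieves that document as context.
assert retrieve([0.85, 0.15, 0.0]) == ["refund policy"]
```

Only the retrieved text is placed in the prompt, which is what keeps the model's answer grounded in your own data.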
𝗣𝗿𝗼𝗷𝗲𝗰𝘁: 𝗔𝗜 𝗥𝗲𝗰𝗲𝗶𝗽𝘁 𝗣𝗮𝗿𝘀𝗲𝗿

I built an AI application that converts unstructured photos of receipts into clean, structured JSON data. My goal was to replace manual data entry by using multi-modal LLMs to read images while ensuring the output is accurate and strictly validated.

𝗧𝗲𝗰𝗵 𝗦𝘁𝗮𝗰𝗸: Python (FastAPI), Groq SDK (Llama Vision), Docker, Pydantic, Streamlit.

𝗞𝗲𝘆 𝗙𝗲𝗮𝘁𝘂𝗿𝗲𝘀:
𝟏. 𝐕𝐢𝐬𝐮𝐚𝐥 𝐈𝐧𝐭𝐞𝐥𝐥𝐢𝐠𝐞𝐧𝐜𝐞: Used llama-4-scout (via Groq) to extract text. This allows the system to understand context, handling crumpled receipts or complex layouts.
𝟐. 𝐑𝐨𝐛𝐮𝐬𝐭 𝐕𝐚𝐥𝐢𝐝𝐚𝐭𝐢𝐨𝐧: I implemented strict Pydantic models to validate the AI's output. If the model hallucinates a date format or misses a required field, the backend catches and cleans it before the data reaches the user.
𝟑. 𝐂𝐮𝐬𝐭𝐨𝐦 𝐑𝐚𝐭𝐞 𝐋𝐢𝐦𝐢𝐭𝐢𝐧𝐠: I built a rate limiter to stop spam and prevent the API from getting overwhelmed. This keeps my AI usage within limits without needing heavy external tools like Redis.
𝟒. 𝐌𝐢𝐜𝐫𝐨𝐬𝐞𝐫𝐯𝐢𝐜𝐞𝐬 𝐀𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐮𝐫𝐞: I split the application into two distinct services (frontend and backend) and orchestrated them using Docker Compose, creating a clean, production-ready environment.
𝟓. 𝐃𝐲𝐧𝐚𝐦𝐢𝐜 𝐔𝐈: I built an interactive Streamlit dashboard that visualizes confidence scores and automatically detects currency symbols (e.g., switching between $ and ₹).

𝗟𝗶𝘃𝗲 𝗗𝗲𝗺𝗼: https://lnkd.in/gTwi7XNC
𝗔𝗣𝗜 𝗗𝗼𝗰𝘀 (𝗦𝘄𝗮𝗴𝗴𝗲𝗿 𝗨𝗜): https://lnkd.in/gJ3m8gsP
𝗦𝗼𝘂𝗿𝗰𝗲 𝗖𝗼𝗱𝗲: https://lnkd.in/gFQzsPsH

#GenerativeAI #Python #FastAPI #BackendDeveloper #Groq #ComputerVision #Docker #SoftwareEngineering
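The validation layer (feature 2) can be sketched with stdlib dataclasses instead of the post's Pydantic models, so the example stays dependency-free; the idea is identical: reject or repair malformed LLM output before it reaches the user. The `Receipt` fields are illustrative, not the project's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Receipt:
    merchant: str
    date: str      # must be ISO YYYY-MM-DD, enforced below
    total: float

    def __post_init__(self):
        if not self.merchant:
            raise ValueError("merchant is required")
        # Catch hallucinated date formats the way a Pydantic validator would.
        datetime.strptime(self.date, "%Y-%m-%d")
        self.total = float(self.total)  # coerce string totals like "12.50"

ok = Receipt(merchant="Corner Cafe", date="2024-05-01", total="12.50")
assert ok.total == 12.5

try:
    Receipt(merchant="Corner Cafe", date="May 1st", total=12.5)
    raised = False
except ValueError:
    raised = True
assert raised  # malformed date caught before reaching the user
```

Pydantic adds nicer error messages and JSON-schema generation on top, but the contract is the same: the model's output is untrusted input.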
Stop Googling shell commands. Start generating them. 🚀

I’ve recently been using llmshell-cli, and it has completely changed the way I interact with my terminal during local development. If you’re like me, you probably spend a decent chunk of your day switching between the terminal and a browser to look up specific Docker, find, or git commands. This tool bridges that gap by bringing the power of LLMs directly into the CLI.

Why I’m finding it useful for my projects:
✅ Zero Context Switching: I can just type my intent in plain English, and it gives me the exact shell command I need instantly.
✅ Local + Remote Flexibility: I can toggle between using remote APIs like OpenAI/Anthropic for complex tasks, or keep everything private by running local models via Ollama.
✅ Safe Execution: It doesn’t just run commands blindly; it generates them for review, allowing me to learn the syntax as I go.
✅ Pure Productivity: It has significantly reduced the friction in my workflow, especially when dealing with complex infrastructure tasks.

If you’re a developer, DevOps engineer, or just someone who spends a lot of time in the terminal, you should definitely check this out. Huge shoutout to the project!

Check it out on PyPI:
🔗 https://lnkd.in/g6-npH2J
🔗 https://lnkd.in/guCs_pvb

#DeveloperTools #Productivity #LLM #Terminal #CLI #Python #OpenSource #AI #CodingEfficiency
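The "safe execution" point is a pattern worth copying in any LLM tool: propose the command as text, and only execute after explicit approval. A minimal sketch in plain Python (not llmshell-cli's actual implementation; `propose_command` is a stand-in for the LLM call):

```python
import subprocess

def propose_command(intent):
    """Stand-in for the LLM: map a plain-English intent to a shell command."""
    lookup = {"list large files": "find . -size +100M -type f"}
    return lookup.get(intent, "echo 'no suggestion'")

def run_if_approved(command, approved):
    """Nothing executes until the user has reviewed and confirmed."""
    if not approved:
        return None
    return subprocess.run(command, shell=True, capture_output=True, text=True)

cmd = propose_command("list large files")
assert cmd == "find . -size +100M -type f"
assert run_if_approved(cmd, approved=False) is None  # review step blocks execution
```

Separating generation from execution is what turns an LLM shell helper from a risky autocomplete into a learning tool: you read the syntax before it runs.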
Backend design choices quietly shape data quality.

Data problems often start long before data science. Most issues we later label as “dirty data” are actually design decisions made upstream: API contracts that were never clearly defined, inputs that were never validated, logs that were added as an afterthought instead of as a first-class signal. By the time the data reaches analysis or modeling, the damage is already done.

I’ve learned that backend systems don’t just move data. They define its meaning. Every loose schema, optional field, or silent failure creates ambiguity that compounds downstream. Analysts compensate with assumptions. Models learn patterns that don’t really exist. What looks like a modeling issue is often a systems issue.

Once you start treating data as a product of backend design, priorities shift. Validation becomes essential. Logging becomes intentional. Contracts become explicit.

Good data is rarely “cleaned” into existence. It’s designed that way from the start.

#datascience #AI #backend #python #machinelearning #dataanalysis
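One concrete form of "contracts become explicit": a strict schema check at the API boundary that turns ambiguous input into a loud error instead of silently-stored dirty data. A minimal sketch with illustrative field names:

```python
# Explicit contract: every event must carry these fields with these types.
REQUIRED = {"user_id": int, "event": str, "timestamp": float}

def validate_event(payload):
    """Fail loudly at the boundary so ambiguity never reaches downstream data."""
    errors = []
    for field, ftype in REQUIRED.items():
        if field not in payload:
            errors.append(f"missing required field: {field}")
        elif not isinstance(payload[field], ftype):
            errors.append(f"{field} must be {ftype.__name__}")
    if errors:
        raise ValueError("; ".join(errors))
    return payload

assert validate_event({"user_id": 7, "event": "login", "timestamp": 1.7e9})

try:
    validate_event({"event": "login"})  # loose input rejected, not stored
    ok = False
except ValueError:
    ok = True
assert ok
```

The cost is a few lines per endpoint; the payoff is that analysts downstream never have to guess what a missing `user_id` meant.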
Step-by-step roadmap for AI Solutioning.

Basics:
- Python basics
- OOP / modular programming
- Jupyter Notebooks & interactive development
- Working with REST APIs
- Project structure in Python modules
- Dependency management
- git, GitHub, and GitHub Actions

LLM Workflow:
1. How LLMs work at a high level
2. Token limits and context windows
3. LLM temperature (0 = deterministic, no creativity)
4. Few-shot prompting, chain-of-thought prompting, system prompts
5. Structured outputs
6. Function calling / tool calling
7. Multimodal capabilities
8. Working with native APIs — OpenAI SDK, Anthropic SDK

Integration:
1. LangChain — chat models (ChatOpenAI, ChatAnthropic) and other LLM wrappers
2. Prompt Templates — creating reusable prompt templates
3. Output Parsers — structured outputs
4. Chains — sequential task execution (LCEL)
5. Tool & Function Calling — creating and using tools, the @tool decorator
6. Agents — ReAct agents
7. Memory — ConversationBufferMemory, ConversationSummaryMemory
8. LangGraph — building workflow graphs with nodes and edges
9. State Management — defining and managing state across workflow steps
10. Conditional Edges — dynamic routing based on state or output
11. Human-in-the-Loop — adding breakpoints and human approval steps
12. Checkpointing — saving and resuming workflow state
13. Cycles & Loops — building iterative workflows that loop until conditions are met
14. Multi-Agent Systems — coordinating multiple agents in a workflow

RAG:
1. Understand what RAG is and what vector embeddings are
2. Chunking strategies using OCR and layout-detection models via libraries
3. Use vector databases — pgvector, Chroma, Weaviate
4. Build ingestion/retrieval pipelines to embed and store documents
5. Vector search, hybrid search, multi-query retrieval, RRF, re-ranking

Production Systems:
1. Turn your prototypes into scalable, production-ready AI applications
2. Build APIs with FastAPI & Pydantic for validation
3. Docker & Docker Compose — containerize and deploy your AI apps
4. PostgreSQL + pgvector + database design
5. Celery + Redis — background workers for long-running AI tasks

LLM Monitoring, Evaluation & Safety:
1. LangSmith/Langfuse for tracing
2. Capture traces — log all LLM calls including inputs, outputs, latencies, and token costs
3. Centralized logging — ELK stack for log aggregation and analysis
4. Guardrails for safety and security, such as prompt-injection protection and PII guardrails
5. Create evaluations to test the system regularly
6. Set up health checks and alerts — monitor uptime and system health with tools like Sentry

Deployment:
1. Cloud providers — learn the basics of AWS, Azure, GCP
2. Set up VM or container services — EC2 / Cloud Run / App Engine
3. Deploy via Docker — containerize and ship your AI applications
4. Manage environment variables and secrets — .env files, secret managers
5. CI/CD basics with GitHub Actions/Jenkins — automate testing and deployment
6. Understand the role Kubernetes plays
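The RAG section mentions RRF (Reciprocal Rank Fusion) for combining vector and keyword search results. The algorithm itself is tiny; a stdlib sketch (`k=60` is the conventional constant from the original RRF paper):

```python
def rrf(rankings, k=60):
    """Fuse ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic search ranking
keyword_hits = ["doc_b", "doc_a", "doc_d"]  # BM25-style keyword ranking

fused = rrf([vector_hits, keyword_hits])
# doc_a and doc_b appear near the top of both lists, so they win the fusion.
assert set(fused[:2]) == {"doc_a", "doc_b"}
```

Because RRF only uses ranks, not raw scores, it sidesteps the problem of normalizing incomparable similarity scores across the two retrievers.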
I have been learning how to work closely with Claude Code. When the repository is designed well, the AI behaves like a competent collaborator. When it is not, the AI is little more than an autocomplete engine. Below are the design choices that made the difference for us!

1. CLAUDE.md as a Project Operating Manual
The most important file in the repository is a markdown file at the root named CLAUDE.md. This file contains the information a new contributor would need on day one:
- How code is executed (for example, pixi run python rather than calling python directly)
- Where data and databases live
- Non-negotiable conventions, such as which statistical values to use
Claude reads this file automatically and uses it as persistent project context. Anything that must be done correctly every time belongs here.

2. Explicit Skills for Domain Knowledge
We have a .claude/skills/ directory to encode domain expertise. Each skill is a short markdown document that describes concepts, metrics, configuration locations, and expected outputs. Examples include skills for reproducibility snapshots, context loading, and structured writeups. Invoking a skill loads the domain context before any code is written, which dramatically reduces errors and misinterpretations. This is the point where the AI begins to understand the science.

3. Consistent Analysis Script Structure
Every analysis script follows the same pattern: a clear research question in the docstring, a typer CLI interface, structured logging, and predictable output paths. Because the structure is consistent, the AI learns it. When asked to create a new analysis, it already knows how arguments, logging, and outputs should be handled without further instruction.

4. Tooling That Favors My Use Case
- pixi for reproducible environments
- just for task orchestration
- typer for CLI interfaces
- loguru for logging
- DuckDB for analysis and storage
- Snakemake for pipeline structure
Tools that rely on implicit configuration or complex shell logic consistently caused problems. Declarative systems are easier for both humans and AI to reason about.

5. Self-Documenting Data Schemas
DuckDB tables include column-level comments describing semantics. This allows the AI to inspect schemas and understand what values represent without additional explanation. This small investment significantly reduced clarification cycles.

6. Planning Before Execution
For multi-step or high-impact changes, I use a planning step where the AI explores the codebase and writes an implementation plan without making edits. Only after review does execution begin.

7. A Repeatable Workflow
A typical workflow now looks like:
- context loading
- analysis
- structured writeup
- reproducibility snapshot
- commit and push

In recent sessions, this setup allowed the AI to generate full analysis scripts, diagnose visualization bugs by understanding the data, produce correct plots, and create clean, domain-aware commits. The improvement came from structure!
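The consistent script skeleton from point 3 might look like the sketch below. I have swapped typer and loguru for stdlib argparse and logging so it runs with no dependencies, and the research question, flags, and paths are invented for illustration; the shape is what matters, because a consistent shape is what the AI learns to reproduce.

```python
"""Research question: does signup cohort affect week-4 retention?"""
import argparse
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger(__name__)

def main(cohort: str, out_dir: Path) -> Path:
    out_path = out_dir / f"retention_{cohort}.csv"  # predictable output path
    log.info("analyzing cohort=%s -> %s", cohort, out_path)
    # ... the actual analysis would run here ...
    return out_path

parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--cohort", default="2024-01")
parser.add_argument("--out-dir", type=Path, default=Path("results"))
args = parser.parse_args(["--cohort", "2024-02"])  # explicit argv for the demo
result = main(args.cohort, args.out_dir)
assert result == Path("results/retention_2024-02.csv")
```

Docstring states the question, the CLI declares the inputs, the output path is derivable from the arguments: every script answers "what does this do and where does the result go" without being read in full.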