Scaling Recommendation Engine to 200ms Response Time

I built a recommendation engine that had to respond in under 200ms. Here's what I learned about the gap between "it works" and "it works at scale."

The first version was straightforward: a Python service that takes user behavioral data, scores items, and returns a ranked list. In development it worked great. In production with real traffic, it was way too slow.

The problem wasn't the algorithm. It was when we were doing the work. We were computing recommendations at request time, so every API call triggered a fresh scoring pass over the dataset. At low traffic, fine. At real traffic, timeouts.

The fix was separating the work into two parts:

→ Precompute: a background pipeline that scored and ranked recommendations ahead of time based on behavioral signals, then wrote the results to Redis
→ Serve: the API just read from Redis. No computation at request time. Sub-200ms, consistently.

But the harder part wasn't the caching. It was knowing which ranking strategy to trust. We had multiple approaches. Instead of picking one based on gut feeling, we ran them side by side and compared them on three signals:

1. Engagement: did users actually click or act on what we recommended?
2. Latency: did the serving path stay fast?
3. Coverage: were we recommending the same 20 items to everyone, or actually personalizing?

That comparison was more valuable than any single optimization. It turned "we think this ranking is better" into "here's the data, pick the tradeoff you want."

The takeaway: personalization is easy to demo and hard to ship. The difference is knowing what to precompute, what to serve live, and having the discipline to measure which approach actually works instead of guessing.

#softwareengineering #python #recommendationsystems
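The precompute/serve split described above can be sketched roughly as below. This is a minimal illustration, not the actual service's code: the `score_items` ranking and the plain dict standing in for Redis are assumptions (in production the dict's `get`/`set` calls would map to a redis-py client's `r.get`/`r.set`).

```python
import json

# Stand-in for Redis so the sketch is self-contained; in production
# this would be a redis-py client (cache[...] = x -> r.set, cache.get -> r.get).
cache = {}

def score_items(user_events, catalog):
    """Toy scoring: rank items by how often the user interacted with
    them. Real behavioral signals would be much richer than this."""
    counts = {item: user_events.count(item) for item in catalog}
    return sorted(catalog, key=lambda item: counts[item], reverse=True)

def precompute(user_id, user_events, catalog, top_k=10):
    """Background pipeline: do the expensive scoring ahead of time
    and write the ranked result where the API can read it."""
    ranked = score_items(user_events, catalog)[:top_k]
    cache[f"recs:{user_id}"] = json.dumps(ranked)

def serve(user_id):
    """Request path: a single cache read, no scoring at request time."""
    raw = cache.get(f"recs:{user_id}")
    return json.loads(raw) if raw else []

# The batch job runs ahead of time; the API call is just a lookup.
precompute("u1", user_events=["b", "a", "b"], catalog=["a", "b", "c"])
print(serve("u1"))  # ['b', 'a', 'c']
```

The key property is that nothing on the `serve` path scales with dataset size: however expensive scoring gets, the request handler stays a constant-time read.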
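Of the three signals, coverage is the least obvious to quantify. One common way to measure it (an assumption here, not necessarily the metric the team used) is catalog coverage: the fraction of the catalog that appears in at least one user's served list.

```python
def catalog_coverage(served_lists, catalog_size):
    """Fraction of the catalog that appeared in at least one user's
    recommendations. Near 0 means everyone gets the same few items;
    higher values suggest the system is actually personalizing."""
    seen = set()
    for recs in served_lists:
        seen.update(recs)
    return len(seen) / catalog_size

# Two users who received nearly identical lists from a 100-item catalog:
# only 3 distinct items were ever recommended, so coverage is low.
print(catalog_coverage([["a", "b"], ["a", "c"]], catalog_size=100))  # 0.03
```

Tracked per ranking strategy alongside engagement and latency, a number like this turns "are we personalizing?" from a hunch into a comparable figure.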
