Building a resilient data pipeline is about more than just writing a scraper. 🏗️

I’m excited to share my latest project, CityPulse AI. The challenge was building an agent that could handle dynamic web content while maintaining a structured, cloud-based data store.

Key Engineering Highlights:
🔹 Resilient Scraping: Implemented a hybrid engine using SerpApi for speed, with a Selenium fallback to handle dynamic roadblocks.
🔹 Cloud Persistence: Integrated Supabase to move beyond static local files, allowing for scalable data storage and future trend analysis.
🔹 Geospatial Analysis: Used Mapbox and Plotly to transform raw coordinates into actionable heatmaps.

This project was a great exercise in full-stack data engineering—from raw ingestion to interactive visualization.

Source Code: https://lnkd.in/grsAKpTC
Live App: https://lnkd.in/gV5WgF_4

#DataEngineering #PostgreSQL #Python #ETL #SoftwareDevelopment #Streamlit #Supabase #Selenium #GoogleMapsAPI #CloudComputing #FullStackDeveloper #MarketIntelligence #BusinessIntelligence #LeadGeneration #DataDriven #Innovation
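A minimal sketch of the hybrid fetch pattern described above: try SerpApi first for speed, then fall back to Selenium when a page needs a real browser. Function names, the `google_maps` engine choice, and the parsing step are assumptions for illustration, not CityPulse AI's actual code.

```python
# Hypothetical hybrid scraper: fast API path with a browser fallback.
from selenium import webdriver
from serpapi import GoogleSearch  # pip install google-search-results


def fetch_results(query: str, api_key: str) -> list[dict]:
    """Return raw result dicts, preferring the SerpApi path."""
    try:
        search = GoogleSearch({"engine": "google_maps", "q": query, "api_key": api_key})
        data = search.get_dict()
        results = data.get("local_results", [])
        if results:
            return results
    except Exception as exc:  # network errors, quota limits, etc.
        print(f"SerpApi path failed ({exc}); falling back to Selenium")

    # Fallback: drive a headless browser so JavaScript-rendered content loads.
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(f"https://www.google.com/maps/search/{query}")
        html = driver.page_source
        # Parse `html` with your scraper of choice before returning.
        return [{"raw_html": html}]
    finally:
        driver.quit()
```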
More Relevant Posts
🚀 Just published a new deep‑dive for backend engineers and AI builders!

I walked through how to build a DuckDuckGo Search Storage API using FastAPI + SQLModel — a clean, modern stack that’s perfect for lightweight search pipelines, data ingestion, and AI‑powered applications.

This piece breaks down:
🔹 Designing a modular API architecture
🔹 Integrating DuckDuckGo search programmatically
🔹 Persisting structured results with SQLModel
🔹 Clean async patterns for high‑performance workloads
🔹 Why this pattern is ideal for LLM agents, retrieval layers, and microservices

If you're exploring search augmentation, RAG pipelines, or lightweight data services, this is a practical blueprint you can adapt instantly.

📘 Read the full blog: Building a DuckDuckGo Search Storage API with FastAPI and SQLModel
https://lnkd.in/gigV2dAq
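For a rough idea of the pattern, here is a minimal sketch assuming the `duckduckgo_search` package and a SQLite file: fetch results, persist them with SQLModel, and expose the flow as a FastAPI route. Model, route, and result-dict keys are assumptions; the linked blog has the real design.

```python
# Hypothetical DuckDuckGo-to-database endpoint with FastAPI + SQLModel.
from duckduckgo_search import DDGS
from fastapi import FastAPI
from sqlmodel import Field, Session, SQLModel, create_engine


class SearchResult(SQLModel, table=True):
    id: int | None = Field(default=None, primary_key=True)
    query: str
    title: str
    url: str
    snippet: str


engine = create_engine("sqlite:///searches.db")
SQLModel.metadata.create_all(engine)
app = FastAPI()


@app.post("/search/{query}")
def search_and_store(query: str) -> list[SearchResult]:
    # DDGS().text(...) typically yields dicts with title/href/body keys.
    rows = [
        SearchResult(query=query, title=r["title"], url=r["href"], snippet=r["body"])
        for r in DDGS().text(query, max_results=10)
    ]
    with Session(engine) as session:
        session.add_all(rows)
        session.commit()
        for row in rows:
            session.refresh(row)
    return rows
```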
Scrolling through negative headlines every day? I decided to change that.

dailypositive.news automatically highlights positive news that actually matters. It’s an automated data pipeline + backend system that aggregates news from 16+ trusted sources and uses AI to filter noise and highlight articles with genuine positive human impact.

The frontend is intentionally simple. It focuses on highlighting the day's most relevant positive news. The real work is in the pipeline, data flow, and backend architecture.

🔧 What I built:

🧠 Data Pipeline (Python 3.12)
A fully automated pipeline running 24/7:
• Hourly and daily jobs fetching RSS feeds from BBC, Nature, MIT, Forbes, The Guardian, HBR and 10+ other sources
• GPT-4o-mini scoring each article on positivity and human impact (0-1 scale)
• Batch AI processing (20 articles per call, structured JSON output), bringing AI cost down to cents per day
• PostgreSQL deduplication by URL (ON CONFLICT DO NOTHING)
• Graceful degradation: one source failing never stops the pipeline

🧩 REST API (Node.js + TypeScript)
• Express.js with strict TypeScript
• Helmet security headers and rate limiting (100 req / 10 min)
• PostgreSQL connection pooling (pg)
• Flexible filters by score, category, source, date range, country and language

☁️ Infrastructure (AWS)
• Single EC2 instance running Docker Compose
• 4 containers: PostgreSQL, pipeline, API and Nginx
• Nginx as reverse proxy and static file server
• HTTPS with Let’s Encrypt (Certbot auto-renewal)
• Custom domain

🎯 Key engineering decisions
• Dual AI scoring: positivity alone isn’t enough.
• Final score = positivity × 0.45 + human impact × 0.55
• Batch-first AI design to reduce cost and latency.
• Database-level guarantees instead of application logic.
• Fully containerized services for isolation and reproducibility.

📊 Result
A production system running 24/7, automatically curating positive news from around the world. Total infrastructure cost under $3 per month.

🔗 Link in the comments.

I built this project to practice backend and data engineering concepts on a real-world problem. Feedback is very welcome.

#Python #TypeScript #NodeJS #PostgreSQL #Docker #AWS #OpenAI #Backend
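Two of the decisions above are easy to show in a few lines: the weighted final score and pushing URL deduplication down to PostgreSQL. This is a minimal sketch; the table and column names are assumptions, not the project's actual schema.

```python
# Hypothetical scoring + dedup insert for the positive-news pipeline.
import psycopg2

POSITIVITY_WEIGHT, IMPACT_WEIGHT = 0.45, 0.55


def final_score(positivity: float, human_impact: float) -> float:
    """Combine the two 0-1 model scores as described in the post."""
    return positivity * POSITIVITY_WEIGHT + human_impact * IMPACT_WEIGHT


def store_article(conn, url: str, title: str, positivity: float, impact: float) -> None:
    # ON CONFLICT DO NOTHING makes the database, not application code,
    # responsible for silently skipping articles already ingested.
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO articles (url, title, score)
            VALUES (%s, %s, %s)
            ON CONFLICT (url) DO NOTHING
            """,
            (url, title, final_score(positivity, impact)),
        )
    conn.commit()
```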
Building a simple Retail Credit Assessment with GenAI

I've been working on a side project to explore how LLMs can assist in retail credit workflows. I just deployed v0.0.1.

It's a straightforward "Credit Analyst Helper" that takes loan application data and uses GPT-3.5 to generate a quick risk summary. It’s not meant to replace anyone, but to see if we can speed up the initial data review for analysts.

The Tech Stack:
• Frontend: Next.js + Tailwind
• Backend: Python FastAPI + Google Cloud Run
• Database: Neon Postgres
• AI: OpenAI GPT-3.5
• Auth: Firebase

You can test the prototype here: https://retail-credit.finpulselabs.com

Open to feedback on the code or the concept!

#Python #NextJS #SideProject #FinTech #OpenAI #RetailCredit #CreditRisk
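A minimal sketch of the "Credit Analyst Helper" flow, assuming a FastAPI backend and the OpenAI chat API: accept structured loan-application fields and ask GPT-3.5 for a short risk summary. The field names and prompt are illustrative, not the deployed app's schema.

```python
# Hypothetical risk-summary endpoint: loan application in, LLM summary out.
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment


class LoanApplication(BaseModel):
    applicant_income: float
    loan_amount: float
    loan_term_months: int
    credit_score: int


@app.post("/risk-summary")
def risk_summary(application: LoanApplication) -> dict:
    prompt = (
        "You are assisting a retail credit analyst. Summarise the key risk "
        f"factors in this application:\n{application.model_dump()}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return {"summary": response.choices[0].message.content}
```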
Two Sum in Rust 🚀

🧠 Core Idea
Here, we maintain a HashMap that stores: value → index.

For each element in the array:
• Compute its complement: complement = target − current_value
• Perform a hash map lookup to check whether the complement already exists.
• If found, we immediately return the stored index and the current index — yielding the required pair.
• If not found, insert the current value and its index into the hash map.

This ensures:
• No repetition of elements
• Only a single traversal of the input array
• O(n) time complexity
• O(n) auxiliary space complexity

⚙️ Why This Works
Instead of checking every possible pair of numbers, we keep track of the numbers we’ve already seen. For each new number, we check if the number needed to reach the target has appeared before. If it has, we return both positions right away. Because each value is inserted only after the lookup, the current element can never be paired with itself, so the answer is always a pair of distinct indices.

🛠 Tech Stack
Language: Rust
Data Structure: HashMap (std::collections::HashMap)

This approach scales well, and it’s a great example of how fundamental data structures can unlock major performance improvements.

#Rust #DataStructures #Algorithms #HashMap #LeetCode #ProblemSolving #TimeComplexity #SoftwareEngineering #CodingJourney
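For readers who don't write Rust, here is the same single-pass, hash-map idea sketched in Python. It is purely illustrative; the post's implementation uses Rust's std::collections::HashMap.

```python
# Language-agnostic sketch of the single-pass two-sum approach.
def two_sum(nums: list[int], target: int) -> tuple[int, int] | None:
    seen: dict[int, int] = {}          # value -> index of elements already visited
    for i, value in enumerate(nums):
        complement = target - value
        if complement in seen:         # the needed partner appeared earlier
            return seen[complement], i
        seen[value] = i                # record current value only after the check
    return None


assert two_sum([2, 7, 11, 15], 9) == (0, 1)
```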
Big Data is the foundation. Small Data is the destination.

I’ve discussed dbt models, Snowflake optimization, and the MWAA environment that orchestrates it all. But raw data, no matter how scalable, is just noise until it's rendered.

Recently, I shifted my focus from the backend mechanics to the frontend physics. A database can hold a billion rows, but a human mind can only process a few distinct signals at once.

In the past, the output of data engineering was a static dashboard. You delivered the chart and walked away. With the rise of GenAI and high-performance WebGL, I saw an opportunity to go further.

Integrating an LLM and a physics engine requires a different approach. Data pipelines are rigid; they demand exact schemas and perfect idempotency. But the data itself is in motion. You have to apply different principles if you want to understand how it moves. I didn't just want to serve the data; I wanted to simulate its physics.

I pivoted the frontend architecture to move away from rows and columns and toward "data physics":
🌎 Gravity: I use UMAP (dimensionality reduction) to group stocks by semantic similarity, not just sector codes
🏎️ Velocity: Sentiment isn't just a score; it's a vector with direction and momentum
🗿 Mass: Trade volume determines the weight and pull of a node

I didn't use a massive enterprise warehouse for this layer. I stripped the stack down to the essentials:
🛠️ The Engine: Python & UMAP-Learn (handling the UMAP and physics calculations)
🧠 The Brain: OpenAI & Supabase Edge Functions (vectorizing the news)
🔎 The Lens: React + Deck.gl (rendering the physics at 60fps)

We're told we need to scale up. But the most interesting engineering challenge right now is knowing how to scale down. Figuring out how to distill terabytes of noise into a single, high-fidelity signal that a human can actually use.

We're used to listing coordinates; don't forget we also have the tools to draw the map.

See the physics engine live: catincloud.io

#aiengineering #dataengineering #genai #visualization #react #deckgl #python #scikitlearn #aws #physics
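A rough sketch of the "gravity" and "mass" steps, assuming umap-learn and synthetic input: UMAP collapses high-dimensional stock embeddings into 2D coordinates so semantically similar tickers land near each other, and trade volume becomes node weight for the renderer. Array shapes and field names are assumptions, not the project's actual pipeline.

```python
# Hypothetical "data physics" preprocessing: UMAP layout + volume-as-mass.
import numpy as np
import umap  # pip install umap-learn

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(500, 768))    # e.g. news/sentiment vectors per stock
volumes = rng.lognormal(mean=12, size=500)  # trade volume per stock

reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
coords = reducer.fit_transform(embeddings)  # (500, 2) layout positions

# "Mass": normalize volume so the frontend can scale node size and pull.
node_mass = volumes / volumes.max()
nodes = [
    {"x": float(x), "y": float(y), "mass": float(m)}
    for (x, y), m in zip(coords, node_mass)
]
```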
🚀 Build Update: I shipped a full-stack RAG application.

I built DebugAI to challenge myself to create a production-ready AI application from scratch. The goal was simple: create a tool that parses error logs and uses RAG (Retrieval-Augmented Generation) to find relevant Stack Overflow discussions. It’s a practical implementation of modern AI engineering patterns, focusing on performance and observability.

🛠️ The Tech Stack (The "Real" Work):
• Backend: FastAPI (Python) - async architecture
• Database: Supabase (PostgreSQL + pgvector)
• Performance: Redis caching (for instant results)
• Observability: Custom cost tracking & analytics
• Search: Semantic search using OpenAI embeddings
• Frontend: Next.js 14 + Tailwind CSS

🔮 What's Next?
This project was step one. Now that I have the foundation, I'm planning to experiment with Agentic Workflows. My next goal is to build a "Self-Evolving AI Engineer" (SEAE). The idea is to move beyond simple Retrieval-Augmented Generation (RAG) and try to build agents that can self-diagnose and learn from feedback loops. It's a big learning curve, but that's the fun part.

Check out the GitHub repository: https://lnkd.in/e5hj4gVx
Check out the demo: https://lnkd.in/em_9MVhK

#SoftwareEngineering #FastAPI #RAG #Supabase #Redis #ProjectShowcase #LearningInPublic #ai
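A minimal sketch of the retrieval step, assuming OpenAI embeddings and a pgvector table reachable over psycopg2: embed the parsed error text, then ask Postgres for the closest Stack Overflow chunks. The table name, columns, and use of the `<=>` cosine-distance operator are assumptions about the schema, not DebugAI's actual code.

```python
# Hypothetical semantic-search query against a pgvector-backed table.
import psycopg2
from openai import OpenAI

client = OpenAI()


def retrieve_similar(conn, error_text: str, k: int = 5) -> list[tuple]:
    embedding = client.embeddings.create(
        model="text-embedding-3-small", input=error_text
    ).data[0].embedding
    vector_literal = "[" + ",".join(str(x) for x in embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT title, url, body
            FROM so_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vector_literal, k),
        )
        return cur.fetchall()
```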
Built a RAG Chatbot from Scratch — Designed for Real‑World Performance

I recently built a Retrieval‑Augmented Generation (RAG) chatbot that answers nutrition‑related questions using a textbook as its knowledge base. This was a full end‑to‑end project: data processing, vector search, LLM integration, evaluation, and deployment. My focus throughout was reliability, explainability, and efficiency - the qualities that matter in production systems.

Tech Stack
• Data: 1,680 textbook chunks using sentence‑based chunking (10 sentences per chunk)
• SpaCy for semantic segmentation
• Embeddings: all‑mpnet‑base‑v2 (768D)
• Vector Store: PostgreSQL + pgvector on Supabase (ACID‑compliant, no vendor lock‑in, IVFFlat indexing)
• LLM: Gemma 7B (4‑bit quantized), running on an 8GB GPU
• Frontend: Lovable

Key Outcomes
• 85%+ RAGAS scores across precision, recall, relevancy, and faithfulness
• ~100ms query latency for 1,680 embeddings
• Zero hallucinations on the test set
• Clean, modular, fully reproducible pipeline
• Cost‑efficient infrastructure without dedicated vector DBs

What I Learned
1. Chunking strategy has a bigger impact on retrieval quality than the embedding model. Switching to sentence‑based chunks improved performance by ~20%.
2. PostgreSQL + pgvector is more than capable for workloads under 10M vectors and avoids unnecessary complexity and cost.
3. Few‑shot prompting dramatically reduces hallucinations - three curated examples brought it down from ~30% to near zero.

Thanks to Raj Abhijit Dandekar and the Vizuara Technologies Private Limited team - their RAG course provided the fundamentals that shaped this build.

GitHub
Full code and evaluation pipeline: https://lnkd.in/eZ5GziQE

If you’ve worked on RAG systems or are exploring them, I’d love to exchange insights.

#MachineLearning #RAG #DataScience #LLM #AIEngineering #PostgreSQL #Python
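The chunking lesson is easy to illustrate. Below is a minimal sketch of sentence-based chunking with spaCy, grouping 10 sentences per chunk as described in the post; the helper name and the `en_core_web_sm` model choice are assumptions, not the project's exact code.

```python
# Hypothetical sentence-based chunker for RAG preprocessing.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm


def sentence_chunks(text: str, sentences_per_chunk: int = 10) -> list[str]:
    sents = [s.text.strip() for s in nlp(text).sents]
    return [
        " ".join(sents[i : i + sentences_per_chunk])
        for i in range(0, len(sents), sentences_per_chunk)
    ]
```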
I built a mini Pinecone from scratch in Go 🚀

Wanted to deeply understand how vector databases work under the hood, so I built one myself.

What I implemented:
→ HNSW (Hierarchical Navigable Small World) algorithm for O(log n) similarity search
→ Cosine, Euclidean & Dot Product distance metrics
→ MongoDB-style metadata filtering ($eq, $gt, $in, $and, $or...)
→ Binary disk persistence with index serialization
→ OpenAI embedding integration for text-to-vector
→ REST API + CLI interface

The interesting parts:
The HNSW algorithm is fascinating - it builds a multi-layer graph where higher layers act as "express lanes" for navigation. Search starts at the top and greedily descends, achieving approximate nearest neighbor in logarithmic time.

For persistence, I designed a custom binary format that stores vectors and serializes the entire HNSW graph structure, so the index doesn't need rebuilding on restart.

Tech stack: Pure Go with minimal dependencies (just godotenv + gorilla/mux)

What I learned:
• Why approximate search beats exact search at scale
• How graph-based indices outperform tree-based ones for high dimensions
• The trade-offs between recall, speed, and memory in ANN algorithms

Vector databases aren't magic - they're elegant algorithms solving the curse of dimensionality.

Code is open source. Link: https://lnkd.in/g5e6qC-P

#golang #vectordatabase #machinelearning #systemdesign #opensource
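As a language-agnostic reference for the three supported metrics (the project itself implements them in Go), here is a small Python/NumPy sketch, plus the exact brute-force baseline that HNSW exists to avoid at scale.

```python
# Illustrative distance metrics and an exact O(n) baseline for comparison.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))


def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))


def brute_force_top_k(query: np.ndarray, vectors: list[np.ndarray], k: int = 5):
    # Exact linear scan; an HNSW index trades a little recall to skip this.
    sims = [cosine_similarity(query, v) for v in vectors]
    return np.argsort(sims)[::-1][:k]
```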
🛑 THE END OF MANUAL DATA SCRIPTING. 🕵️♀️

Why spend hours writing boilerplate code when AI can build your entire data pipeline in seconds?

I built DataScout — an end-to-end Intelligence Platform that turns natural language into production-ready scripts for SQL, Python, and C++. Most tools stop at "Text-to-SQL." I took it further. Whether you are a business lead looking for insights or a developer needing a C++/SQLite script, DataScout delivers.

The Power of DataScout:
🗣️ Conversational Intelligence: Leverages Google Gemini 1.5 Flash to translate human thought into complex logic.
🛠️ Multi-Language Scripting Engine: Instantly generates and exports production-ready code in SQL, Python (Pandas), and C++ (SQLite). Build once, deploy anywhere.
💡 Automated Business Insights: It doesn’t just show data; an integrated AI Analyst Agent interprets the results to provide executive-level strategy points.
📈 Instant Visual Storytelling: Detects trends and renders high-impact interactive charts automatically.
🔍 Deep Data Health Audits: A "zero-trust" profiling system that audits every file for missing values and quality issues before you start.
🕒 Smart Context History: Remembers your analytical path, allowing for rapid-fire iterative data exploration.

Stop being the bottleneck. Start being the architect. 🚀

📂 GitHub: https://lnkd.in/gQAVsMA2
🌐 Live App: https://lnkd.in/gunwxYyG

#GenerativeAI #DataEngineering #Python #Cplusplus #Streamlit #GeminiAI #DataScience #Automation #SQL #BTech2026 #SoftwareDevelopment
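A minimal sketch of the text-to-SQL core, assuming the google-generativeai Python SDK: hand Gemini 1.5 Flash a schema plus a natural-language question and get a query back. The prompt, schema format, and function name are assumptions for illustration, not DataScout's actual pipeline.

```python
# Hypothetical natural-language-to-SQL call using Gemini 1.5 Flash.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # in practice, load from env/config
model = genai.GenerativeModel("gemini-1.5-flash")


def question_to_sql(question: str, schema: str) -> str:
    prompt = (
        "You write SQLite queries.\n"
        f"Schema:\n{schema}\n"
        f"Question: {question}\n"
        "Return only the SQL, no explanation."
    )
    return model.generate_content(prompt).text.strip()


sql = question_to_sql(
    "Top 5 products by revenue last quarter",
    "sales(product TEXT, revenue REAL, sold_at DATE)",
)
```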
🚀 FastAPI Async:

async def (The Definition): Defines a "Coroutine." It tells FastAPI that this function will perform I/O operations and should not block the main execution thread.

await (The Yield): The pause button. It tells the Event Loop: "I’m waiting for data (DB/API). Go handle other incoming requests while I sit idle."

asyncio.gather (Parallel Execution): Used to fire multiple coroutines at once. It returns a list of results only after the slowest task completes.

asyncio.TaskGroup (The 2026 Standard): Introduced in Python 3.11 for "Structured Concurrency." It manages multiple tasks safely; if one task fails, it cleans up the others automatically to prevent "zombie" processes.

📂 When to use async def vs. def

Use async def for I/O-bound tasks:
• Calling external REST APIs (via httpx).
• Querying databases (via asyncpg, motor, or SQLAlchemy async sessions).
• Reading/writing files or interacting with cloud storage (S3/GCS).

Use def (standard) for CPU-bound tasks:
• Heavy data manipulation (Pandas, NumPy).
• Image processing or machine learning model inference.
• Using legacy "blocking" libraries (like requests or psycopg2).
Note: FastAPI automatically offloads these to a separate threadpool so they don't freeze your API.

💡 Best Practices & Golden Rules
• No time.sleep in async: Never use time.sleep() inside an async def block. It stops the entire event loop (and your whole server). Use await asyncio.sleep() instead.
• Use async drivers: You only get the performance benefits of async if your database driver is also asynchronous. Using a blocking driver inside an async function is a performance bottleneck.
• Leverage TaskGroup for reliability: Move away from gather for complex workflows. TaskGroup provides better error handling and ensures that if one part of your parallel logic crashes, the whole group is handled gracefully.
• Keep it non-blocking: If you have a massive for loop that takes 2 seconds to calculate something, don't put it in an async def. Move that logic to a def function or a background task.
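A minimal sketch of these patterns, assuming httpx and Python 3.11+: an async endpoint that fans out two I/O calls with asyncio.TaskGroup, and a plain def endpoint for CPU-bound work that FastAPI runs in its threadpool. The URLs and function names are illustrative only.

```python
# Hypothetical FastAPI app showing async I/O fan-out vs. threadpool CPU work.
import asyncio

import httpx
from fastapi import FastAPI

app = FastAPI()


async def fetch_json(client: httpx.AsyncClient, url: str) -> dict:
    response = await client.get(url)  # await yields to the event loop while waiting
    response.raise_for_status()
    return response.json()


@app.get("/dashboard")
async def dashboard() -> dict:
    async with httpx.AsyncClient() as client:
        async with asyncio.TaskGroup() as tg:  # structured concurrency (3.11+)
            users = tg.create_task(fetch_json(client, "https://api.example.com/users"))
            orders = tg.create_task(fetch_json(client, "https://api.example.com/orders"))
    # If either task raised, TaskGroup cancelled the other and re-raised here.
    return {"users": users.result(), "orders": orders.result()}


@app.get("/report")
def heavy_report() -> dict:  # plain def: FastAPI offloads it to a threadpool
    total = sum(i * i for i in range(10_000_000))  # CPU-bound stand-in
    return {"total": total}
```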