Fix Slow APIs with EXPLAIN ANALYZE, Not New Infrastructure

I've been heads-down building for a while and haven't shared much here. Changing that.

Here's something I keep running into: teams reach for new infrastructure when the real problem is their queries.

At a previous company, our API response times were climbing. The conversation started drifting toward caching layers, read replicas, maybe a new service. Before any of that, I spent a day with EXPLAIN ANALYZE and pg_stat_statements.

What I found:
→ A few joins were scanning full tables because the indexes didn't match the actual query patterns in production
→ One N+1 had been there so long everyone assumed "that endpoint is just slow"
→ A couple of queries were sorting in Ruby what PostgreSQL could have sorted faster itself

Three changes. No new infrastructure. API response times dropped by over 60%.

The lesson I keep relearning: most performance problems aren't architecture problems. They're query problems. And query problems are cheap to fix if you measure before you redesign.

If your API feels slow, run EXPLAIN ANALYZE before you add a service. You might save yourself months.

Everyone's talking about AI-powered observability. Meanwhile, EXPLAIN ANALYZE is free and tells you exactly what's wrong.

#postgresql #backendengineering #softwareengineering
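For anyone who hasn't tried this workflow, here is a minimal sketch of what that day looked like in practice. The table and column names are made up for illustration, not the actual production schema:

```sql
-- Find the queries that cost the most in aggregate
-- (requires the pg_stat_statements extension; column names per PostgreSQL 13+)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Then inspect the plan of a suspect query (illustrative tables and columns)
EXPLAIN ANALYZE
SELECT o.*
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.region = 'EU'
ORDER BY o.created_at DESC
LIMIT 50;
```

The first query points you at where the time actually goes; the second shows whether PostgreSQL is using the indexes you think it is.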
More Relevant Posts
One of the most common performance mistakes I've seen in backend codebases: over-indexing.

The assumption: more indexes = faster queries. In practice, not really.

What actually happens:
-> Every write (INSERT / UPDATE / DELETE) updates every index - write-heavy tables take a real hit
-> The query planner may ignore your index and choose a sequential scan if it's cheaper
-> Index bloat quietly eats disk and slows down maintenance like VACUUM

I've seen cases where just removing unused indexes improved write latency noticeably - especially on write-heavy tables.

What works better:
-> EXPLAIN ANALYZE first - understand what PostgreSQL is actually doing
-> Index the columns you actually filter and sort on
-> Use composite indexes with the right column order
-> Drop unused indexes (use pg_stat_user_indexes)

Indexes should be intentional, not automatic.

What's the worst indexing mistake you've seen? 👇

#PostgreSQL #BackendEngineering #SystemDesign #SoftwareEngineering
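If you want to audit this on your own database, a quick starting point using the standard statistics view mentioned above (the threshold and ordering are just one reasonable choice):

```sql
-- Indexes that have never been used for a scan since stats were last reset,
-- largest first. Judge idx_scan over a realistic traffic window.
SELECT schemaname,
       relname       AS table_name,
       indexrelname  AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
-- Before dropping anything, check the index isn't backing a unique or
-- primary-key constraint, and that it isn't only used by rare batch jobs.
```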
"Just use Postgres." Modern software engineering has become a subscription management simulator. I finally stopped the madness and consolidated most of my specialized infrastructure into a single source of truth: Postgres. Postgres has been in active development for three decades. It is basically the Skyrim of databases, a rock-solid foundation you can mod until it replaces your entire stack. The technical reality: NoSQL: JSONB + GIN indices give you document-store flexibility with ACID compliance. Search: TS_VECTOR handles full-text search. I am glad I am not the only one who realized Elasticsearch is usually an expensive layer of overkill. Vector DB: pgvector with HNSW indices solves the hybrid search problem natively. Message Queue: FOR UPDATE SKIP LOCKED creates reliable queues without adding a new service. Time Series: Partitioning + BRIN indices handle massive telemetry without the B-tree bloat. API Layer: Row-Level Security (RLS) can eliminate hundreds of lines of boilerplate middleware. The result is one connection string, one backup strategy, and zero distributed consistency headaches. Stop over-engineering for Google-scale problems you do not have yet. Pick the tool that has been battle-tested since the 90s and just start shipping.
Our API response time jumped from 120ms → 600ms overnight.

No code deployed. No infra change. No incidents reported. Just... slower.

Here’s how I debugged it in 40 minutes 👇

Step 1: Isolate the symptom
CloudWatch showed the spike started at 11:42 PM. But here’s the interesting part: P95 latency spiked, P50 stayed normal. That usually means large payloads, heavy queries, or edge-case traffic. Not a full-system slowdown.

Step 2: Eliminate the usual suspects
I checked the obvious first:
Lambda cold starts? ❌ Warm instances were also slow
DB connection pool? ❌ Only 42% utilized
External APIs? ❌ Not in the request path
That narrowed it down to one likely culprit:
➡️ The database query itself.

Step 3: Inspect the query plan
Ran EXPLAIN ANALYZE on the main trade lookup query. Result:
Sequential scan on 2.1M rows
Estimated cost: 48,000
The index was no longer being chosen
Why? As the table grew, PostgreSQL recalculated its cost estimates and changed the execution plan automatically. Silent. Invisible. Expensive.

Step 4: Fix it
Added a composite index on (user_id, created_at DESC).
Immediately after:
The query planner switched to an Index Scan
P95 dropped from 600ms → 89ms

The real lesson
Your system can break without deployments, because performance bugs often come from:
Data growth
Query planner decisions
Traffic shape changes
Hidden thresholds
EXPLAIN ANALYZE isn’t just an optimization tool. It’s a production survival tool. And if you’re not tracking P95 latency, you’re blind to what power users are experiencing.

My takeaway
As systems scale, the code may stay the same — but behavior changes. That’s where engineering gets interesting.

Curious: what’s the sneakiest production bug you’ve debugged? Drop it in the comments 👇 (Real stories only — those are always the best lessons.)

If this was useful, repost it so more engineers see it.

#PostgreSQL #BackendEngineering #NodeJS #SystemDesign #SoftwareEngineering #Debugging #AWS #Performance #DevOps
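For reference, the shape of that fix. The table and column names here are illustrative, not the actual schema from the incident:

```sql
-- Composite index matching the lookup's filter (user_id) and sort (created_at DESC).
-- CONCURRENTLY avoids blocking writes, but cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_trades_user_created
    ON trades (user_id, created_at DESC);

-- Re-check the plan: the Seq Scan over millions of rows should become an Index Scan.
EXPLAIN ANALYZE
SELECT *
FROM trades
WHERE user_id = 12345
ORDER BY created_at DESC
LIMIT 50;
```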
🚀 Rethinking Backend Responsibility with Row Level Security (RLS)

Recently, I explored how Row Level Security (RLS) in PostgreSQL can fundamentally change the way we design application backends.

Traditionally, access control is handled at the backend layer — APIs decide what a user can read or modify. But with RLS, this responsibility can be enforced directly at the database level. You can define fine-grained policies that control which rows a user is allowed to access. So even if a client communicates directly with the database, it doesn’t imply unrestricted access.

👉 The database itself becomes the gatekeeper.

This led to an interesting realization:
• Not every application needs a heavy backend for simple CRUD use cases
• Direct client → database interaction can be safe with properly defined policies
• Security is not about hiding the database, but about controlling access

That said, backend systems are still essential for:
• Rate limiting
• Caching
• Queues
• Complex business logic

So it’s not about eliminating the backend — it’s about choosing the right level of abstraction based on the problem.

This perspective really helped me think more clearly about when to rely on backend logic and when to push responsibility into the database.

Learned this while following insights from Piyush Garg sir and exploring the official docs:
📖 https://lnkd.in/gF5kU4E9

Still exploring deeper into policy design and real-world applications of RLS.

#postgresql #backend #systemdesign #security #webdevelopment #developers
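A minimal sketch of what such a policy can look like. The table, column, and the app.current_user_id session setting are assumptions chosen for illustration; real policy design depends on how you map users to database roles:

```sql
-- Each user may only see and modify their own rows in "documents".
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

CREATE POLICY documents_owner_only ON documents
    USING      (owner_id = current_setting('app.current_user_id')::bigint)   -- rows visible to SELECT/UPDATE/DELETE
    WITH CHECK (owner_id = current_setting('app.current_user_id')::bigint);  -- rows allowed on INSERT/UPDATE

-- The connecting (non-owner, non-superuser) role declares its identity per session:
SET app.current_user_id = '42';
```

Note that table owners and superusers bypass RLS by default, so the client-facing role has to be a separate, restricted role for the gatekeeping to hold.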
🚀 𝐁𝐮𝐢𝐥𝐭 𝐚 𝐏𝐫𝐨𝐝𝐮𝐜𝐭𝐢𝐨𝐧-𝐒𝐭𝐲𝐥𝐞 𝐁𝐚𝐜𝐤𝐞𝐧𝐝 𝐢𝐧 𝐆𝐨 𝐰𝐢𝐭𝐡 𝐏𝐨𝐬𝐭𝐠𝐫𝐞𝐒𝐐𝐋 (𝐃𝐨𝐜𝐤𝐞𝐫𝐢𝐳𝐞𝐝)

Recently I moved from an in-memory store to a real DB and went beyond basic DB connectivity to build a near production-style backend service in Go.

🔧 𝐓𝐞𝐜𝐡 𝐒𝐭𝐚𝐜𝐤:
Go (net/http, database/sql)
PostgreSQL (running via Docker)
REST API

🧱 𝐖𝐡𝐚𝐭 𝐈 𝐢𝐦𝐩𝐥𝐞𝐦𝐞𝐧𝐭𝐞𝐝:
🔹 Dockerized database: ran PostgreSQL using Docker
🔹 Clean architecture: structured the project as main → handler → store → database, separating HTTP logic from the database layer
🔹 Database layer: used INSERT ... RETURNING id for efficient writes; QueryRow for single-row queries, Query for multi-row queries
🔹 Production practices: context-aware DB calls (context.WithTimeout), connection pooling (SetMaxOpenConns, etc.), proper error handling (avoiding log.Fatal in business logic)
🔹 API endpoints: POST /products → create product, GET /products → fetch all products

💡 𝐊𝐞𝐲 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠𝐬:
The difference between a driver and database/sql
Why RETURNING matters in PostgreSQL
How real backend services are structured

📈 𝐖𝐡𝐚𝐭’𝐬 𝐧𝐞𝐱𝐭:
Transactions (for real-world scenarios like payments)
Exploring pgx for high-performance database access

This project helped me bridge the gap between “it works” and “it’s production-ready.”

#golang #postgresql #docker #backend #softwareengineering #learninginpublic
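On why RETURNING matters: it collapses the usual insert-then-select round trip into a single statement. A minimal sketch with an assumed products table (this is the statement a Go handler would run via QueryRow(...).Scan(&id)):

```sql
-- Without RETURNING you would INSERT, then run a second query to fetch the new id.
-- With RETURNING, PostgreSQL hands the generated id back in the same round trip.
INSERT INTO products (name, price)
VALUES ('keyboard', 49.99)
RETURNING id;
```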
I accidentally created a 𝗱𝗲𝗮𝗱𝗹𝗼𝗰𝗸 in PostgreSQL while testing my ticket booking system.

I was experimenting with `SELECT ... FOR UPDATE` and ran different queries in parallel. At one point, I ended up with this pattern:
- Transaction 1 locked seat 3 → then tried to update seat 1
- Transaction 2 locked seat 1 → then tried to access seat 3

That was enough to create a circular wait. Both transactions blocked each other, and Postgres resolved it by killing one:
`ERROR: deadlock detected`

This wasn’t a database issue. It was a logic problem. If a transaction needs to lock multiple rows, it must do so in a consistent order across all transactions. For example, if a user selects multiple seats:

```sql
SELECT * FROM seats
WHERE id IN (<selected_seat_ids>)
ORDER BY id ASC
FOR UPDATE;
```

This ensures every transaction locks rows in the same sequence, preventing circular waits.

What I took away from this:
- Deadlocks don’t need scale — even small experiments can trigger them
- They’re not random bugs — they’re a direct result of circular wait conditions
- If your transaction touches multiple rows, ordering matters

Also learned: once a transaction hits an error, it’s unusable until you ROLLBACK (psql flags this by showing `!#` in the prompt).

Chai Aur Code Hitesh Choudhary Piyush Garg

#WebDevelopment #PostgreSQL #BackendEngineering #Docker #Database
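For anyone who wants to reproduce the circular wait described above, here is roughly what the two parallel sessions looked like (the seats table and ids are the example from the post; run each session in its own psql connection):

```sql
-- Session 1:
BEGIN;
SELECT * FROM seats WHERE id = 3 FOR UPDATE;  -- locks seat 3

-- Session 2 (second connection):
BEGIN;
SELECT * FROM seats WHERE id = 1 FOR UPDATE;  -- locks seat 1

-- Session 1:
SELECT * FROM seats WHERE id = 1 FOR UPDATE;  -- blocks, waiting on session 2

-- Session 2:
SELECT * FROM seats WHERE id = 3 FOR UPDATE;  -- blocks on session 1; PostgreSQL
                                              -- detects the cycle and aborts one
                                              -- transaction: ERROR: deadlock detected
```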
𝗢𝗙𝗙𝗦𝗘𝗧 𝗽𝗮𝗴𝗶𝗻𝗮𝘁𝗶𝗼𝗻 𝘄𝗼𝗿𝗸𝘀 𝗴𝗿𝗲𝗮𝘁 — 𝘂𝗻𝘁𝗶𝗹 𝘆𝗼𝘂𝗿 𝘁𝗮𝗯𝗹𝗲 𝗴𝗿𝗼𝘄𝘀. 𝗧𝗵𝗲𝗻 𝗶𝘁 𝗯𝗲𝗰𝗼𝗺𝗲𝘀 𝗼𝗻𝗲 𝗼𝗳 𝘁𝗵𝗲 𝗺𝗼𝘀𝘁 𝗲𝘅𝗽𝗲𝗻𝘀𝗶𝘃𝗲 𝗼𝗽𝗲𝗿𝗮𝘁𝗶𝗼𝗻𝘀 𝗶𝗻 𝘆𝗼𝘂𝗿 𝗮𝗽𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻.

Here is why it breaks, and the pattern that fixes it permanently.

𝗛𝗼𝘄 𝗢𝗙𝗙𝗦𝗘𝗧 𝗽𝗮𝗴𝗶𝗻𝗮𝘁𝗶𝗼𝗻 𝘄𝗼𝗿𝗸𝘀
The concept is simple: to get page N, skip (N−1) × pageSize rows and return the next batch. Every ORM generates this automatically. It looks completely harmless.

𝗪𝗵𝘆 𝗶𝘁 𝗶𝘀 𝗮 𝘁𝗶𝗺𝗲 𝗯𝗼𝗺𝗯
OFFSET does not skip rows cheaply. PostgreSQL must physically read and discard every row before the offset position. Page 100 means 1,980 wasted row reads. Page 500 means 9,980. There is no index that fixes this — it is a fundamental property of how OFFSET works.

On a 200,000-row table:
→ Page 1: 1.2ms
→ Page 100: 22ms
→ Page 500: 108ms
→ Page 2,500: 520ms
At 1 million rows, deep pages take seconds. Every single request.

There is a second problem: 𝗿𝗼𝘄 𝗱𝗿𝗶𝗳𝘁. If a new record is inserted while a user is paginating, every subsequent offset shifts by one. Records are silently skipped or shown twice. No error. No warning.

𝗧𝗵𝗲 𝗳𝗶𝘅: 𝗰𝘂𝗿𝘀𝗼𝗿 𝗽𝗮𝗴𝗶𝗻𝗮𝘁𝗶𝗼𝗻
Instead of counting rows to skip, remember where you left off. Use the last seen row's value as a boundary in the WHERE clause. PostgreSQL jumps directly to the right position using an index — no rows discarded, no drift, O(1) regardless of how deep into the dataset you are. The next page request simply sends the ID (or timestamp) of the last item returned. Page 1 and page 50,000 cost exactly the same.

𝗪𝗵𝗲𝗻 𝘁𝗼 𝘀𝘄𝗶𝘁𝗰𝗵
OFFSET is fine under ~25,000 rows. The moment you expect significant growth, add cursor support now. Retrofitting it later means changing the API contract for every client that consumes the endpoint — a far more painful migration.

𝗙𝘂𝗹𝗹 𝗡𝗲𝘀𝘁𝗝𝗦/𝗣𝗿𝗶𝘀𝗺𝗮 𝗶𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻, 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝗯𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸𝘀, 𝗮𝗻𝗱 𝗮 𝗰𝗼𝗺𝗽𝗮𝗿𝗶𝘀𝗼𝗻 𝘁𝗮𝗯𝗹𝗲 𝗶𝗻 𝘁𝗵𝗲 𝗣𝗗𝗙 𝗮𝘁𝘁𝗮𝗰𝗵𝗲𝗱.

🔖 Save this before your next API review.
♻️ Repost to help someone avoid a painful production incident.

#PostgreSQL #Pagination #BackendEngineering #APIDesign #PerformanceTuning #NestJS #Prisma #SoftwareEngineering
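In raw SQL the two patterns contrast like this. The orders table, page size, and :last_* placeholders are illustrative (the placeholders stand for values your API takes from the last row of the previous page):

```sql
-- OFFSET pagination: PostgreSQL reads and discards every row before the offset.
SELECT * FROM orders
ORDER BY created_at DESC, id DESC
LIMIT 20 OFFSET 1980;            -- "page 100" at pageSize 20

-- Cursor (keyset) pagination: jump straight to the boundary via an index.
SELECT * FROM orders
WHERE (created_at, id) < (:last_created_at, :last_id)   -- row-wise comparison against the cursor
ORDER BY created_at DESC, id DESC
LIMIT 20;

-- Supporting index so the boundary lookup is an index scan:
CREATE INDEX idx_orders_cursor ON orders (created_at DESC, id DESC);
```

Including the id in both the cursor and the sort keeps the ordering stable when many rows share the same timestamp.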
Multi-master replication in Postgres sounds great until you hit the operational complexity. This is a solid attempt at abstracting that away and making it usable in real systems. Definitely worth checking out if you care about scaling beyond a single writer.
PostgreSQL has natively supported logical replication since version 10. But while those foundational primitives have been there for years, actually configuring a true, active-active multi-master setup has remained painfully manual.

Once you choose to move beyond the single-writer bottleneck, you immediately encounter the challenge of full-mesh topology. You have to manually configure, sync, and handle conflict resolution across an increasingly fragile web of database nodes.

Reading through the OpenAI blog released in January 2026 reinforced my thoughts on the complexity of multi-master systems and why many teams avoid them. Before that blog release, I had been tinkering with a simple multi-writer system on Docker Desktop from my personal computer. The result of that experimentation led me to build pgconverge.

Pgconverge is an open-source CLI tool designed to automate multi-master logical replication in Postgres. It abstracts away the heavy lifting of node synchronisation and the dreaded n(n-1) complexity so you can focus on scaling your infrastructure, not writing custom replication scripts.

I have documented what I learned while trying to build pgconverge in a 7-part series. I have released the first two articles and will be rolling out the remaining five over the coming days.

GitHub: https://lnkd.in/e74jx7hu
Why Multi-Master? The Problem with Single-Writer Databases: https://lnkd.in/esavkuhu
Inside Pgconverge: Navigating the N×(N-1) Complexity of Full Mesh Replication: https://lnkd.in/eSDNMfrp

You can also read OpenAI's blog on how they scaled a single-writer PostgreSQL database to power ChatGPT at massive scale: https://lnkd.in/ew2U58Ct

I would love for fellow infrastructure and backend engineers to break it, test it, and share feedback.

#PostgreSQL #DistributedSystems #DatabaseArchitecture #BackendEngineering #SystemDesign
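For context, these are the native primitives (available since PostgreSQL 10) that such tooling builds on; the publication, subscription, table, and host names below are illustrative. Note this sketch is one direction only: a true multi-master mesh needs the equivalent configured in both directions for every node pair, which is exactly the n(n-1) manual work the post describes:

```sql
-- On the publishing node:
CREATE PUBLICATION app_pub FOR TABLE orders, customers;

-- On the subscribing node (runs outside a transaction block):
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=node1 dbname=app user=replicator password=...'
    PUBLICATION app_pub;
```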
I usually just spin up MongoDB and call it a day. However, I wanted to explore how databases work under the hood, so I built a mini-database from scratch using Node.js and a plain .txt file.

This project initially seemed a bit crazy, but it forced me to learn about:
- Using Node streams to prevent memory crashes
- Safely updating and deleting records in a flat file
- The importance of basic indexing as a lifesaver

You don't truly understand a tool until you attempt to build a basic version of it yourself. I wrote a quick breakdown of the code and what I learned.

#Nodejs #Backend #SystemDesign
https://lnkd.in/gcdZ2yQj
Sometimes everything in your system works fine. Then one day, traffic spikes… and multiple requests try to update the same data at the same time.

Now you get weird issues:
Duplicate orders
Overbooked seats
Negative inventory
Not because of bugs. Because of concurrent updates.

---

This is where distributed locking comes in. The idea is simple: only one process should modify a resource at a time. Everyone else has to wait.

---

What actually happens
Let’s say two requests try to update the same product stock.
Without locking:
Both read stock = 10
Both reduce it
The final value becomes wrong
With locking:
The first request gets the lock
The second request waits
Updates happen safely

---

Where this is used
Payment processing
Inventory management
Booking systems
Scheduled jobs
Anywhere consistency matters.

---

Common ways to implement
Database locks: simple, but can affect performance.
Redis locks (like Redisson): fast and commonly used in distributed systems.
Zookeeper / etcd: used in large-scale systems.

---

Why this matters
In distributed systems:
Multiple instances run in parallel
Race conditions are common
Data can get corrupted silently
Locks help keep things consistent.

---

But be careful: locks can slow things down. If not handled properly, they can even cause deadlocks. Use them only where necessary.

---

Simple takeaway: when multiple processes touch the same data, coordination becomes essential.

---

Where in your system could two requests clash at the same time without you noticing?

#Java #SpringBoot #Programming #SoftwareDevelopment #Cloud #AI #Coding #Learning #Tech #Technology #WebDevelopment #Microservices #API #Database #SpringFramework #Hibernate #MySQL #BackendDevelopment #CareerGrowth #ProfessionalDevelopment #RDBMS #PostgreSQL #backend
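As a minimal sketch of the "database lock" option for the stock example above (table, column, and id are assumptions for illustration): the row lock forces the second request to wait until the first commits, so both reads can no longer see stock = 10 at the same time.

```sql
BEGIN;

-- Lock the product row; a concurrent request running the same statement waits here.
SELECT stock FROM products WHERE id = 42 FOR UPDATE;

-- Safe to decrement now: no other transaction can read-modify-write this row in parallel.
UPDATE products
SET stock = stock - 1
WHERE id = 42 AND stock > 0;

COMMIT;  -- releases the row lock and lets the waiting request proceed
```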