By default, PostgreSQL does not log slow queries. The `log_min_duration_statement` parameter ships at `-1` (disabled), which means every performance regression happens in the dark until a user complains or an application timeout fires. The worst part? By the time you notice, the query has been degrading for months. A query that took 50ms six months ago now takes 500ms, and the root cause is buried under months of schema changes and data growth. That is what silent performance drift looks like.

Three things I see teams get wrong with slow query monitoring:

**1. The threshold is either too high or too low.** Setting `log_min_duration_statement` to 10 seconds catches only catastrophic queries. Meanwhile, a stream of 1-2 second queries collectively dominates your database load. Start at 250ms for transactional workloads -- it captures meaningful slowness without flooding logs.

**2. They optimize outliers instead of total load.** A single 800ms query is less important than a 50ms query running 10,000 times per hour (500 seconds of total load). Use `pg_stat_statements` and sort by `total_exec_time`, not `max_exec_time`. The standard deviation column (`stddev_exec_time`) reveals plan instability -- queries that are sometimes fast and sometimes slow.

**3. They never enable auto_explain.** `log_min_duration_statement` tells you which queries are slow. `auto_explain` tells you why. Set `auto_explain.log_min_duration = 500ms` to automatically capture execution plans for slow queries. Leave `auto_explain.log_analyze = off` in production to avoid paying instrumentation overhead on every statement -- the estimated plan is enough for diagnosis.

Slow query logging, pg_stat_statements, and auto_explain form a three-layer observability stack that catches regressions before users notice. I wrote a practical guide with the exact configuration, detection queries, and a prevention strategy: https://lnkd.in/exV6FcAq

#PostgreSQL #DatabasePerformance #SlowQueries #DBA #DevOps #SoftwareEngineering
PostgreSQL Slow Query Monitoring Best Practices
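For reference, a minimal sketch of the three layers described above. The parameter values are the starting points from the post, not universal defaults, and changing `shared_preload_libraries` requires a restart:

-- postgresql.conf (or ALTER SYSTEM): layer 1 (slow query log) and layer 3 (auto_explain)
shared_preload_libraries = 'pg_stat_statements,auto_explain'
log_min_duration_statement = '250ms'
auto_explain.log_min_duration = '500ms'
auto_explain.log_analyze = off

-- layer 2: rank by total load, not worst single execution
-- (PostgreSQL 13+ column names; requires CREATE EXTENSION pg_stat_statements in the database)
SELECT query, calls,
       round(total_exec_time::numeric, 1)  AS total_ms,
       round(mean_exec_time::numeric, 1)   AS mean_ms,
       round(stddev_exec_time::numeric, 1) AS stddev_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;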
Pop quiz: what happens when Postgres runs out of transaction IDs?

A) Queries slow down
B) You get a warning in the logs
C) The entire database goes read-only and your application stops working

The answer is C. And it's not a theoretical edge case. It's a countdown that has ticked on every Postgres database that isn't vacuuming properly.

Welcome to Week 2 of April Data Drops and it's our very own birthday gal Chandini kurada talking about VACUUM.

Here's the short version: Postgres uses a multi-version concurrency model. When you update or delete a row, the old version sticks around. These dead tuple versions pile up. VACUUM is the process that cleans them out. Autovacuum does this automatically. In theory.

In practice, autovacuum's defaults are tuned for politeness, not performance. On big, busy tables, it falls behind. Dead tuples stack up. Tables bloat. Performance degrades gradually — so gradually you don't notice until it's a crisis.

And if VACUUM falls far enough behind, Postgres starts running out of usable transaction IDs. When the counter gets close to wrapping around, Postgres does the only safe thing it can: it shuts down all writes. Completely.

We saw a SaaS company 48 hours away from this exact scenario. A 200GB table. Autovacuum falling behind for weeks. Nobody noticed until the warning showed up in the logs. 48 hours from total write shutdown on a production database.

Today's video:
→ How VACUUM and autovacuum actually work under the hood
→ Queries to check if your tables are falling behind right now
→ Which autovacuum settings to tune (and what to set them to)
→ The wraparound doomsday clock and how to keep it far from midnight

VACUUM is not optional. It's oxygen.

#AprilDataDrops #PostgreSQL #DataDrop8 #VACUUM #Database #DevOps #OpenSourceDB
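The video has the exact queries; a minimal version of the two checks it describes -- dead-tuple buildup and wraparound distance -- might look like this:

-- tables with the most dead tuples and when autovacuum last ran
SELECT relname, n_dead_tup, n_live_tup, last_autovacuum, last_vacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- how far each database is from transaction ID wraparound
-- (autovacuum forces an aggressive vacuum at autovacuum_freeze_max_age, 200 million by default)
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;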
Shoutout to Chandini kurada from OpenSource DB for breaking down a critical topic that often gets ignored until it's too late.

Pop quiz: what happens when Postgres runs out of transaction IDs?

A) Queries slow down
B) You get a warning in the logs
C) The entire database goes read-only and your application stops working

The answer is C. And it's not theoretical — it's a ticking clock in every system where VACUUM isn't keeping up.

Most teams rely on autovacuum and assume it will handle everything. But in real-world workloads, it often falls behind. Dead tuples build up. Tables bloat. Performance drops — slowly, then suddenly. And if it falls too far behind? Postgres runs out of usable transaction IDs → and shuts down all writes. Completely.

In this Data Drop, Chandini covers:
→ How VACUUM and autovacuum work under the hood
→ How to check if your tables are falling behind
→ What to tune in autovacuum
→ How to avoid transaction ID wraparound

VACUUM is not optional. It's oxygen.

We are happy to share this #AprilDataDrops initiative of our supporting partner OpenSource DB — one PostgreSQL video, every day this April. Check all Data Drop videos here: https://lnkd.in/gqidmmQp

Aarti NR | Kalyani M | Keerthi Seetha | Praveena Sivasankar

#PostgresWomenIndia #AprilDataDrops #PostgreSQL #DataDrop8 #VACUUM #autovacuum #performance #WomenInTech
🐘 PgBouncer is great — but it's not the whole story.

If you run PostgreSQL, chances are you're using PgBouncer for connection pooling. It's simple, efficient, and does one thing very well. But at some point, you start hitting limitations:
- no query routing
- no read/write split
- no visibility into traffic
- limited control beyond pooling

That's exactly why we wrote this post:
👉 moving from PgBouncer to ProxySQL (for PostgreSQL)

ProxySQL is not just a pooler. It's a SQL-aware proxy that can:
- route queries based on rules
- split reads/writes
- multiplex connections
- integrate with HA setups
- provide observability

So the real question becomes:
👉 when is PgBouncer enough, and when do you need more?

This post from Rahim Kanji is the first in a series exploring that transition.
📖 https://lnkd.in/g9H3uVuh

Curious to hear from PostgreSQL users: are you hitting limits with PgBouncer? Or is it still "good enough" for your use case?

#PostgreSQL #PgBouncer #ProxySQL #DevOps #SRE #Database #OpenSource
pgwatch is one of the best open-source PostgreSQL monitoring tools out there. It is mature, flexible, and genuinely excellent at what it does. Any SQL query becomes a metric. It scales to hundreds of instances. It plugs straight into your existing Grafana stack.

So why did we build myDBA.dev? Because we kept seeing the same pattern: teams collect metrics beautifully but struggle with the "now what?" part. A Grafana dashboard shows n_dead_tup at 2.4 million. But is autovacuum disabled? Is a long-running transaction blocking it? What is the exact ALTER TABLE command to fix it? The gap between seeing a number and knowing what to do about it is where hours of debugging live.

Here is what we built to close that gap:

1. 75+ health checks that generate copy-pasteable SQL fixes calculated from your actual server configuration -- not generic recommendations
2. An index advisor that analyzes your query workload and produces CREATE INDEX or DROP INDEX CONCURRENTLY statements with estimated impact
3. Automatic EXPLAIN plan collection so you can compare old and new plans when a query regresses
4. Extension monitoring for TimescaleDB, pgvector, and PostGIS -- domains pgwatch does not cover
5. XID wraparound detection that identifies blockers and generates recovery scripts specific to your situation

pgwatch answers "what is happening in my database?" We answer "what is happening, what does it mean, and what should I do about it?" Both are valid approaches. The right choice depends on whether your team needs data or direction.

Full comparison: https://lnkd.in/eDkhbnVh

#PostgreSQL #OpenSource #DatabaseMonitoring #pgwatch #DevOps
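For the "is a long-running transaction blocking autovacuum?" question above, a quick manual check -- independent of either tool -- is to look at the oldest open transactions:

-- oldest open transactions; anything hours old can hold back dead-tuple cleanup
SELECT pid, now() - xact_start AS xact_age, state, left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 5;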
Monitor your WAL generation rate before it becomes a problem.

Every write in PostgreSQL goes through the Write-Ahead Log. Every byte of WAL must be written to disk, sent to replicas, and archived for backups. High WAL generation means high I/O, replication lag, and backup bloat.

On PostgreSQL 14+:

SELECT wal_records, wal_fpi,
       pg_size_pretty(wal_bytes) AS total_wal_generated,
       stats_reset
FROM pg_stat_wal;

`wal_fpi` (full page images) is the interesting one. After every checkpoint, the first modification to each page writes the entire 8 KB page to WAL — not just the change. This is a safety mechanism, but it means frequent checkpoints generate much more WAL.

Check your checkpoint frequency (on PostgreSQL 17+ these counters moved to the new pg_stat_checkpointer view):

SELECT checkpoints_timed, checkpoints_req,
       pg_size_pretty(buffers_checkpoint * 8192::bigint) AS checkpoint_data
FROM pg_stat_bgwriter;

If `checkpoints_req` (forced checkpoints) is high relative to `checkpoints_timed` (scheduled checkpoints), your `max_wal_size` is too low. The default of 1 GB is often insufficient for production workloads. A good starting point:

max_wal_size = 4GB
min_wal_size = 1GB

Check this weekly. Sudden spikes in WAL generation usually mean something changed — a bulk operation, a new feature with heavy writes, or a configuration regression.

#PostgreSQL #Database #WAL #Performance #DevOps
Every Postgres queue built on SKIP LOCKED + DELETE eventually turns into a VACUUM problem. You ship it, it works for a week, and then dead tuples start piling up. Index bloat. Autovacuum chasing its tail. The dashboard that was green last Tuesday is suddenly the reason you're on a call at 9pm.

Nikolay Samokhvalov just shipped PgQue, a resurrection of PgQ (the queue architecture Skype built for messaging hundreds of millions of users back in the day). Pure PL/pgSQL. One SQL file. pg_cron to tick. No sidecar daemon.

The trick is snapshot-based batching plus TRUNCATE-based table rotation instead of per-row deletes. Rotate partitions, truncate the old one, done. No bloat because there are no dead tuples to clean up. The tradeoff is end-to-end delivery latency in the 1-2 second range, which for plenty of workloads is fine.

https://lnkd.in/gC6HTbfP

This is the kind of thing I love about the Postgres ecosystem. Someone looked at an architectural pattern that's been quietly battle-tested for a decade, noticed the zero-bloat property, and packaged it as SQL you can read in an afternoon. No new infrastructure. No vendored runtime. Just the database you already run.

What's your current queue setup looking like?

#PostgreSQL #DatabaseEngineering #Backend #OpenSource
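For context, the pattern the first paragraph describes -- not PgQue's design -- is the classic consumer loop against a hypothetical jobs table, where every consumed row becomes a dead tuple:

-- classic SKIP LOCKED + DELETE consumer (hypothetical jobs table and columns)
WITH next_job AS (
    SELECT id
    FROM jobs
    ORDER BY id
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
DELETE FROM jobs
USING next_job
WHERE jobs.id = next_job.id
RETURNING jobs.*;  -- every row consumed this way leaves a dead tuple for VACUUM to clean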
If you're running PostgreSQL without EXPLAIN plan tracking, you're flying blind.

I know, I know. You've got query duration metrics. You've got pg_stat_statements. You know which queries are slow. But do you know WHY they're slow?

Execution time tells you there's a problem. The query plan tells you what the problem actually is. A sequential scan on a 50 million row table. A nested loop where a hash join would be 100x faster. An index that exists but the planner decided not to use it.

Here's the thing that catches people: plans change silently. You deploy new code that inserts a million rows. Autovacuum runs and updates statistics. Your table crosses a size threshold where the planner switches strategies. The exact same SQL query -- character for character -- can have a completely different plan on Monday than it did on Friday. And you won't notice until users start complaining.

What makes this hard is that plans are invisible by default. PostgreSQL doesn't log them. It doesn't track them over time. It doesn't tell you when a plan changes. You have to ask explicitly, and by the time you think to ask, the damage is usually done.

I've watched teams spend hours debugging a "slow database" by staring at pg_stat_statements numbers, tweaking shared_buffers, even adding hardware -- when the root cause was a single query that switched from an index scan to a sequential scan after a statistics update. Five minutes with EXPLAIN would have found it.

Most teams only look at EXPLAIN when something is already on fire. By then you're doing forensics, not prevention.

#PostgreSQL #EXPLAIN #QueryOptimization #Database #DevOps
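Those five minutes look roughly like this (table and filter are hypothetical; note that ANALYZE actually executes the query, so be careful with statements that write):

-- what the planner chose, and what it actually cost
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders               -- hypothetical table
WHERE customer_id = 42;   -- look for "Seq Scan" where you expected "Index Scan"

-- plain EXPLAIN (no ANALYZE) is safe anywhere: estimated plan only, query is not run
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;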
🚀 SequelPG v0.11.1 is live

If you work with PostgreSQL every day, this will feel familiar:
You run a query
You tweak it
You come back to something from yesterday
You try to remember what actually worked

Most tools treat query history as just a log. I don't think that's enough. In this release, I rebuilt Query History from scratch. Now it's something you actually use:
- Quickly find past queries
- Reuse them without rewriting
- Debug faster with less context switching

I also refactored the Database Tools layer. You won't "see" most of it — but you'll feel it:
- More consistency
- Better performance
- Stronger foundation for what's coming next

I'm not trying to add more features. I'm trying to reduce friction when working with data.

Full release notes: https://lnkd.in/dFmaV_xH

If you use PostgreSQL, I'd really value your feedback.

#PostgreSQL #DeveloperTools #IndieHacker #BuildInPublic #SwiftUI #DX
🧩 PgBouncer is simple. The stack around it usually isn't.

That's not a criticism of PgBouncer. It does its job well. The issue is that, in a lot of production PostgreSQL environments, pooling is only one part of the story. Before long, teams are also adding pieces for:
→ read/write routing
→ failover awareness
→ traffic control and query rules
→ observability
→ caching

That's usually when the conversation changes. You're no longer just choosing a pooler. You're deciding whether it still makes sense to keep all of that logic spread across scripts, sidecars, and application code, or move more of it into a proxy layer that was built to handle it.

📝 That's what Part 1 of this new series is about — where PgBouncer fits well, where the operational glue starts to build up, and why ProxySQL becomes worth a serious look once you need more control, visibility, and flexibility in front of PostgreSQL.

💬 If you're using PgBouncer today, what's the biggest thing you're still managing outside the pooler?

👉 Part 1 — PgBouncer to ProxySQL: Rethinking the PostgreSQL Middle Tier
🔗 https://lnkd.in/dbd_xRVg

#PostgreSQL #ProxySQL #PgBouncer #DatabaseEngineering #SRE
A PostgreSQL table with 500 million rows doesn't just slow down queries. It slows down everything: VACUUM takes hours, index builds lock the table, and deleting old data generates massive WAL and leaves bloat behind.

Table partitioning splits a large table into smaller physical pieces while keeping it as a single logical table. The query planner uses partition pruning to scan only the relevant partitions. Here's what matters in practice (a minimal sketch follows below):

1. The partition key must appear in your most common WHERE clauses. If 90% of your queries filter on event_timestamp, partition by timestamp. A partition key that queries don't filter on provides zero benefit and only adds planning overhead. Always verify with EXPLAIN before committing to a scheme -- if pruning doesn't activate, the partitioning isn't helping.

2. Partition count matters more than people think. Monthly partitions for 5 years = 60 partitions (reasonable). Daily partitions for 5 years = 1,825 partitions (the planner slows down noticeably). The planner evaluates each partition during query planning. Keep it manageable, or use TimescaleDB, which is specifically optimized for high partition counts.

3. The biggest operational win is instant data removal. DROP TABLE on a partition takes milliseconds and generates no per-row WAL. Compare that to DELETE FROM events WHERE event_timestamp < '2024-02-01' on 100 million rows -- that takes an hour, generates massive WAL, and leaves dead tuples for VACUUM to clean up.

The biggest gotcha: you can't convert an existing table to a partitioned table in place. You need to create a new partitioned table, migrate data in batches, and swap with a rename in a single transaction. For zero-downtime migrations, add a trigger or logical replication to capture writes during the migration.

Automate partition lifecycle with pg_partman or cron. A missing future partition causes INSERT failures. A forgotten old partition wastes storage.

Full guide with range, list, hash strategies, migration patterns, and pg_partman setup: https://lnkd.in/eGUj8zXC

#PostgreSQL #TablePartitioning #DatabasePerformance #DataEngineering #DevOps #SRE
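The sketch referenced above: a minimal range-partitioning example, assuming a hypothetical events table partitioned by month:

-- parent table holds no data of its own, just the partitioning rule
CREATE TABLE events (
    event_timestamp timestamptz NOT NULL,
    payload         jsonb
) PARTITION BY RANGE (event_timestamp);

-- monthly partitions: FROM is inclusive, TO is exclusive
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- verify pruning (point 1): the plan should mention only events_2024_02
EXPLAIN SELECT count(*)
FROM events
WHERE event_timestamp >= '2024-02-01' AND event_timestamp < '2024-02-10';

-- instant removal of a month of data (point 3)
DROP TABLE events_2024_01;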