Two Kinds of Database Performance Problems

I've sat in a lot of post-mortem meetings where someone says "we just need to optimize better."

Usually they're wrong.

Database performance problems come in two completely different categories. They look identical until you've already spent months on the wrong approach.

The first kind responds to optimization. You add an index, queries speed up, the improvement holds. Done.

The second kind responds to optimization temporarily. You add the index. Queries speed up. Three months later you're back in the same meeting, staring at the same metrics trending in the same direction.

That pattern is the signal. I call it the architectural boundary. Cross it without recognizing it, and you can optimize forever without actually solving the problem.

How to tell which one you're dealing with

The most reliable test isn't any single fix. It's the pattern across fixes.

Think about the last 12 months of performance work on your system. Are you solving different problems each time, or variations of the same problem? Are fixes holding, or buying progressively less runway? Is the interval between optimization cycles getting shorter?

Diverse symptoms, targeted fixes, improvements that hold: you're in the right architecture. Keep going.

Same symptoms recurring, each fix buying less time than the last, the cycle accelerating: you've crossed the architectural boundary. The optimization is working. It's just fighting a battle it can't win.

I call this the recurrence test. It gives you language for what you already know, so you can say it out loud in a meeting.

Why certain workloads hit this wall

Not every system runs into this. But there's a specific class of workload where it's almost inevitable: high-volume, append-heavy data that gets queried analytically. Time-series telemetry. Financial tick data. Operational metrics. Any system where data accumulates continuously and queries shift from point lookups to scans and aggregates on live data.

For this workload pattern, the mismatch between your architecture and your requirements runs three layers deep.

Layer one: you're reading data you'll never use. When your dominant query pattern is analytical scans across time ranges, and your storage model packs all columns together in rows, no index strategy resolves that. A query scanning 50 million rows to pull two columns reads every other column too, on every single row. That's not a configuration problem. That's physics.
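
To make the shape concrete, here's a minimal sketch. The table and column names are hypothetical, but the query pattern is the one that punishes row storage:

    -- Hypothetical telemetry table: row storage packs all columns together.
    CREATE TABLE metrics (
        time      timestamptz NOT NULL,
        device_id int         NOT NULL,
        value     double precision,
        payload   jsonb   -- wide column the analytical queries never touch
    );

    -- Needs two columns, but Postgres must read every column of every row
    -- in the time range off disk to answer it.
    SELECT device_id, avg(value)
    FROM metrics
    WHERE time > now() - INTERVAL '30 days'
    GROUP BY device_id;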

Layer two: you're paying for features you never use. MVCC keeps concurrent reads and writes correct. Extremely valuable for transactional data. But if you're inserting 50,000 rows per second of data that will never be updated, you're paying the full cost of that concurrency model on every insert. Each row carries 23 bytes of transaction metadata. At 50K inserts per second, that's overhead on 4.3 billion rows per day that will never be modified. No configuration setting removes that structural cost.
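
The arithmetic is worth writing out, because the numbers are the argument:

    50,000 rows/sec × 86,400 sec/day ≈ 4.32 billion rows/day
    4.32 billion rows × 23 bytes/row ≈ 99 GB/day

That's roughly 99 GB per day of transaction bookkeeping for data that will never see a concurrent update.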

Layer three: the maintenance never stops growing. Autovacuum, ANALYZE, background statistics collection. As data volume grows, the time these tasks take grows with them. Tuning adjusts priority, not necessity. The work still has to happen, and it competes with your production workload in ways that compound indefinitely.
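
You can watch this pressure accumulate with Postgres's own statistics views; a quick diagnostic sketch (standard catalog views, no extensions required):

    -- Which tables are accumulating dead tuples, and when were they last vacuumed?
    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, autovacuum_count
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;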

This is the core distinction. Operational problems respond to tuning. Architectural problems compound with every additional row you write.

The cost of getting this wrong

Both failure modes are expensive, but they're not symmetrical.

Migrate when optimization would have been sufficient: you've spent two to eight weeks on an unnecessary project. Bad. But bounded.

Keep optimizing when the architecture is the actual constraint: the cost compounds. Teams working through this pattern typically lose months per year to ongoing optimization cycles. And the migration they eventually do gets more expensive the longer they wait.

At 10 million rows, migration takes days. At 500 million, weeks. At a billion plus, you're looking at months. Sigh.

Every quarter spent on optimizations that aren't sticking pays twice: once for the engineering time, and again for the increased migration cost when you eventually get there.

What the right architecture actually changes

If your workload is high-volume, append-heavy, and primarily analytical, the answer doesn't require leaving Postgres. It requires extending Postgres with primitives built for that specific pattern.

TimescaleDB addresses each of those three layers directly.

Hypercore, TimescaleDB's hybrid row/columnar engine, resolves the storage mismatch. Recent data stays in row format for writes. Older data automatically converts to columnar format for analytical scans. A query scanning millions of rows to extract two columns reads only those columns, from compressed columnar chunks.
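
A minimal sketch of turning that on, reusing the hypothetical metrics table from earlier (it must first be a hypertable, shown in the next sketch, and the 7-day threshold is an illustrative assumption, not a recommendation):

    -- Enable columnar compression; segment by the column queries group on.
    ALTER TABLE metrics SET (
        timescaledb.compress,
        timescaledb.compress_segmentby = 'device_id',
        timescaledb.compress_orderby   = 'time DESC'
    );

    -- Chunks older than 7 days convert to columnar format automatically.
    SELECT add_compression_policy('metrics', INTERVAL '7 days');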

Hypertables and chunk exclusion address the MVCC overhead. Hypertables partition data automatically by time. Chunk exclusion lets the query planner skip entire time partitions without scanning them. Query latency stays bounded as data grows, not because you tuned something, but because the planner knows which chunks are irrelevant. Compressed chunks also dramatically reduce autovacuum load.
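
In sketch form, again with the hypothetical table and an assumed one-day chunk interval:

    -- Partition by time; each chunk covers one day.
    SELECT create_hypertable('metrics', 'time',
                             chunk_time_interval => INTERVAL '1 day');

    -- The planner excludes every chunk outside the WHERE clause's time range
    -- at plan time, so cost tracks the window queried, not total table size.
    SELECT device_id, max(value)
    FROM metrics
    WHERE time > now() - INTERVAL '1 day'
    GROUP BY device_id;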

Continuous aggregates handle the maintenance scaling: incremental materialized views that refresh in the background, updating only what changed since the last refresh. Dashboards stay fast without batch jobs or stale results.
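
A sketch of what that looks like (view name, bucket width, and policy offsets are all assumptions to adjust per workload):

    CREATE MATERIALIZED VIEW metrics_hourly
    WITH (timescaledb.continuous) AS
    SELECT time_bucket('1 hour', time) AS bucket,
           device_id,
           avg(value) AS avg_value
    FROM metrics
    GROUP BY bucket, device_id;

    -- Background refresh recomputes only buckets that changed since last run.
    SELECT add_continuous_aggregate_policy('metrics_hourly',
        start_offset      => INTERVAL '3 days',
        end_offset        => INTERVAL '1 hour',
        schedule_interval => INTERVAL '1 hour');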

The optimization work that was recurring becomes unnecessary because the architecture handles it at the right layer.

And it's still Postgres. Same SQL. Same extensions. Same tooling your team already knows.

The decision

Optimization and architecture solve different categories of problems. The skill is knowing which category you're in before you've invested months in the wrong one.

If your optimization work is targeted, diverse, and the fixes hold, you're in the right architecture. Keep going.

If the recurrence test comes back positive (same symptoms, shrinking runway, accelerating cycles), the architecture is the constraint. One more optimization buys a quarter. The right architectural change buys years.

That gap is the decision.

I've been thinking through this pattern for a while and finally wrote it up: https://b.link/7nrf32cn. It covers the recurrence test, the three layers of architectural mismatch for append-heavy workloads, and what actually changes when you fix the architecture. Worth a read if you've ever closed a performance ticket and watched the same problem come back three months later.
