Deleting a row in Postgres doesn't remove it from disk. It marks the row invisible and leaves the rest to VACUUM. Postgres just sets a single field on the tuple header and moves on.

Here's what DELETE actually does:
1. Postgres locates the target tuple on its heap page.
2. It sets xmax on that tuple to the current transaction ID.
3. That's it. The tuple stays on the page, unchanged, in place.
4. Future transactions check xmax during visibility evaluation. If it's set to a committed transaction, the row is invisible to them.
5. The space the tuple occupies is not freed; it remains on the page.
6. VACUUM is the only process that can reclaim that space, and only after no active transaction can still see the deleted version.

A table with heavy deletes doesn't shrink. It accumulates dead tuples: invisible rows that still get read during sequential scans. A 10-million-row table where 8 million rows were deleted is not a 2-million-row table in terms of storage or scan cost.

A DELETE tells Postgres a row is gone. VACUUM is what makes that true on disk.

Most of us treat DELETE as an immediate removal. How long did it take you to unlearn that assumption in Postgres?

#Postgres #PostgreSQL #BackendEngineering
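The lifecycle above can be sketched as a toy Python model. This is illustrative only, not Postgres code: the tuple header is reduced to just xmin/xmax, and commit status (pg_xact) to a plain set.

```python
# Toy model of Postgres DELETE: setting xmax marks a tuple dead,
# but the bytes stay on the page until VACUUM removes them.

committed = set()  # txids known to have committed (stands in for pg_xact)

def delete(tup, txid):
    tup["xmax"] = txid          # the only write DELETE performs on the tuple

def is_visible(tup):
    if tup["xmin"] not in committed:
        return False            # creator never committed: tuple never existed
    if tup["xmax"] != 0 and tup["xmax"] in committed:
        return False            # deleter committed: tuple is dead (invisible)
    return True

page = [{"xmin": 100, "xmax": 0, "data": "row-1"}]
committed.add(100)

delete(page[0], 205)
committed.add(205)              # the deleting transaction commits

assert not is_visible(page[0])  # invisible to new snapshots...
assert len(page) == 1           # ...but still physically on the page
```

Note how the rollback case falls out for free: if txid 205 never enters `committed`, `is_visible` still returns True, which is exactly why a rolled-back DELETE costs nothing to undo.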
Postgres DELETE doesn't remove from disk, VACUUM does
Updating a row in Postgres doesn't change it. It writes a brand-new version and leaves the old one behind. Every UPDATE is internally two operations: the old tuple is marked dead, and a new tuple is written with the changed values.

Here's what happens inside every UPDATE:
1. Postgres locates the current tuple on its heap page.
2. The old tuple's xmax is set to the current transaction ID, marking it dead for future readers.
3. A new tuple is written with xmin set to the current transaction ID and xmax = 0.
4. If the new tuple fits on the same page, Postgres performs a HOT (Heap-Only Tuple) update: no index entries are touched.
5. If it doesn't fit, the new tuple lands on a different page, and every index pointing to that row must be updated too.
6. The old tuple stays on disk until VACUUM reclaims it.

This is why UPDATE-heavy tables bloat. Every change leaves a dead tuple behind. A row updated a thousand times has a thousand dead versions sitting in the heap: all invisible to queries, all consuming space, all getting scanned.

How do you track UPDATE-driven bloat in production? Is it something your team monitors proactively?

#Postgres #PostgreSQL #BackendEngineering
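The two-operation nature of UPDATE can be shown in a few lines of toy Python (a simplified model: real pages, HOT chains, and index maintenance are omitted):

```python
# Toy model of a Postgres UPDATE: mark the old version dead,
# then append a brand-new version. The old tuple is never modified in place
# beyond its xmax field.

def update(heap, idx, new_data, txid):
    old = heap[idx]
    old["xmax"] = txid                                # old version marked dead
    new = {"xmin": txid, "xmax": 0, "data": new_data}
    heap.append(new)                                  # new version written
    return new

heap = [{"xmin": 100, "xmax": 0, "data": "v1"}]
update(heap, 0, "v2", 200)

assert len(heap) == 2                    # both versions occupy heap space
assert heap[0]["xmax"] == 200            # dead until VACUUM reclaims it
assert heap[1] == {"xmin": 200, "xmax": 0, "data": "v2"}
```

Run `update` a thousand times on the same logical row and the heap holds a thousand dead versions, which is precisely the bloat described above.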
Postgres doesn't know if a row is visible to your query by looking at the row. It knows by checking who wrote it and whether that transaction is still alive. Every visibility decision is a transaction ID lookup, evaluated fresh against the current snapshot.

Here's how Postgres decides what you see:
1. Every transaction gets a monotonically increasing transaction ID (txid) at start.
2. At query time, Postgres builds a snapshot, capturing the current txid and the list of all in-progress transaction IDs.
3. For every tuple, Postgres evaluates two fields: xmin (who created it) and xmax (who deleted it).
4. xmin must belong to a committed transaction that was visible at snapshot time; otherwise the tuple never existed for this query.
5. xmax must be 0, or belong to an aborted or not-yet-committed transaction; otherwise the tuple is deleted.
6. The commit status of every transaction ID is stored in pg_xact, a bitmap Postgres checks to confirm committed or aborted.

This is why a rolled-back DELETE leaves the tuple fully intact. xmax is set but points to an aborted transaction. Postgres reads pg_xact, confirms the abort, and treats the tuple as live.

The row isn't marked visible or invisible. Postgres works that out fresh, every time, for every tuple.

#PostgreSQL #DatabaseEngineering #BackendEngineering
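These rules fit in one small function. The sketch below is a toy model, not the real heuristics in heapam_visibility.c: `committed` stands in for pg_xact, and a snapshot is reduced to the set of txids that were in progress when it was taken.

```python
# Toy snapshot-based visibility check, following the xmin/xmax rules above.

committed = {100, 205}   # stand-in for pg_xact: txids that committed

def visible(tup, snapshot_in_progress):
    xmin, xmax = tup["xmin"], tup["xmax"]
    # Rule: the creator must have committed before the snapshot was taken.
    if xmin in snapshot_in_progress or xmin not in committed:
        return False
    # Rule: the deleter must be absent, aborted, in progress, or uncommitted.
    if xmax != 0 and xmax in committed and xmax not in snapshot_in_progress:
        return False
    return True

# A tuple DELETEd by txid 300... but txid 300 rolled back, so it never
# appears in `committed`. pg_xact confirms the abort; the tuple stays live.
tup = {"xmin": 100, "xmax": 300, "data": "row"}
assert visible(tup, snapshot_in_progress=set())

# Same tuple, but the deleter committed: now invisible.
committed.add(300)
assert not visible(tup, snapshot_in_progress=set())
```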
A Postgres table is not a list of rows. It's a collection of 8KB pages, and rows are just byte sequences packed inside them. Most engineers think of a table as an ordered structure: insert a row, it goes to the end. Postgres thinks in pages. Every table is a heap file, split into fixed 8KB blocks. Rows go wherever there's space.

Every page consists of the following:
1. A page header: metadata about the page itself.
2. Tuples: the actual rows, packed from the bottom up.
3. Item pointers: fixed slots at the top, each pointing to a tuple's location within the page.
4. Every tuple carries its own header: which transaction inserted it, which deleted it, whether it's currently visible.

When a page fills up, Postgres moves to the next one. No global directory tracks which row lives on which page, and the heap is intentionally unordered. This is why a sequential scan reads every page regardless of how many rows match. Without an index, there's no way to skip pages.

A table with heavy updates and deletes accumulates dead tuples: old row versions still sitting in pages, occupying space, getting scanned on every read. The physical layout of rows directly affects query cost, yet storage internals almost never appear in Postgres tutorials.

When did you first run into a dead tuple problem?

#PostgreSQL #DatabaseEngineering #BackendEngineering
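The page layout above can be simulated in miniature. This is a toy model with made-up header and pointer sizes (the real values live in bufpage.h), but it shows the key property: item pointers grow down from the top while tuple data grows up from the bottom, and the page is full when they meet.

```python
# Toy 8KB heap page: item pointers at the top, tuple data packed
# from the bottom. Sizes are illustrative, not Postgres's exact ones.

PAGE_SIZE = 8192
HEADER = 24          # rough size of the page header
POINTER = 4          # rough size of one item pointer slot

class Page:
    def __init__(self):
        self.pointers = []       # slot number -> tuple offset on the page
        self.lower = HEADER      # end of the item-pointer array (grows up)
        self.upper = PAGE_SIZE   # start of the tuple data area (grows down)

    def add(self, tuple_bytes):
        need = len(tuple_bytes) + POINTER
        if self.upper - self.lower < need:
            return False         # page full: caller moves on to the next page
        self.upper -= len(tuple_bytes)
        self.pointers.append(self.upper)
        self.lower += POINTER
        return True

page = Page()
for _ in range(4):
    assert page.add(b"x" * 2000)     # 4 tuples of ~2KB fit...
assert not page.add(b"x" * 2000)     # ...a fifth doesn't: next page, please
```

A row's physical address is just (block number, item-pointer slot), which is exactly what a Postgres ctid is.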
I've been working on hybrid search in Postgres and wrote down the smallest version that still explains the idea. This first piece is about the setup: why I kept text and embeddings in the same database, why `pgvector` and `pg_trgm` were enough for this case, and what the basic write path looked like.
I put together a short presentation on intermediate PostgreSQL. The goal was to focus less on syntax and more on correctness, performance, and reasoning — the patterns that begin to matter when queries hit real traffic. It covers: 🎯 schema evolution 🎯 timestamps and time zones 🎯 DISTINCT and query shape 🎯 atomic updates and row locks 🎯 ON CONFLICT 🎯 transactions and SKIP LOCKED 🎯 indexing for text queries Would be curious which PostgreSQL topic gave you the biggest learning curve. #PostgreSQL #Database #Backend #Performance
Tudor Golubenco: Introducing Xata OSS, a Postgres platform with branching, now Apache 2.0 🌟

A standing ovation for Tudor Golubenco 👏 who just announced that the core of Xata is now open source under the Apache 2.0 license. Two features stand out:

Copy-on-write branching: Git-style branches for your PostgreSQL database, so you can clone an environment cheaply and experiment without regrets.

Scale-to-zero compute: Postgres consumes resources only when it's actually serving work.

You can find the full article here: https://postgr.es/p/8C_

#PostgreSQLOpenSource #XataMagic #TudorGolubenco #DatabasingLikeABoss
That 10MB JSON column you're querying? It's not in your table. Postgres moved it somewhere else the moment it was inserted. Postgres has an 8KB page size, and a single row cannot exceed that boundary. TL;DR: Postgres stores large values in a TOAST table.

Here's exactly what happens:
1. Every table with columns that can exceed ~2KB gets a dedicated TOAST table created automatically alongside it.
2. When a value crosses ~2KB, Postgres first tries to compress it (pglz by default).
3. If compression brings it under the threshold, it stays inline on the heap page.
4. If it's still too large, Postgres slices the value into ~2KB chunks and stores them in the TOAST table, leaving only a small pointer on the original row.
5. On read, Postgres fetches and reassembles those chunks.

This is why SELECT id, name FROM large_table is fast even when the table has a 10MB payload column. Postgres never fetches the TOAST chunks unless the query explicitly needs them. The cost surfaces when queries do need the large value: every TOAST fetch is additional I/O, and potentially many round trips to reassemble a single column value.

TOAST is the Postgres mechanism for data that simply doesn't belong on the same page as the row that owns it.

#Postgres #BackendEngineering
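The compress-then-chunk decision can be sketched as a toy write/read path. Everything here is illustrative: zlib stands in for pglz, the threshold and chunk sizes are round numbers, and `toast_table` is just a dict keyed by (value id, chunk number), mirroring how real TOAST chunks are addressed.

```python
# Toy TOAST path: compress first; if the value is still over the threshold,
# slice it into ~2KB chunks stored out of line, keeping only a pointer inline.
import os
import zlib  # stand-in for Postgres's pglz compression

THRESHOLD = 2000      # stand-in for the ~2KB TOAST threshold
CHUNK = 2000          # stand-in for the ~2KB TOAST chunk size

toast_table = {}      # (value_id, chunk_no) -> chunk bytes

def toast_write(value_id, raw):
    data = zlib.compress(raw)
    if len(data) <= THRESHOLD:
        return ("inline", data)            # compressed value fits on the page
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    for n, c in enumerate(chunks):
        toast_table[(value_id, n)] = c
    return ("toast_pointer", value_id)     # only this tiny pointer stays inline

def toast_read(stored):
    kind, ref = stored
    if kind == "inline":
        return zlib.decompress(ref)
    parts, n = [], 0
    while (ref, n) in toast_table:         # the extra I/O: fetch every chunk
        parts.append(toast_table[(ref, n)])
        n += 1
    return zlib.decompress(b"".join(parts))

big = os.urandom(100_000)                  # incompressible, so it gets chunked
stored = toast_write(1, big)
assert stored[0] == "toast_pointer"
assert toast_read(stored) == big           # many chunk fetches to rebuild it
assert toast_write(2, b"tiny")[0] == "inline"
```

The asymmetry in the post falls out directly: a query that never touches the big column never calls `toast_read`, so it never pays for the chunks.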
PostgreSQL maintains two tiny files alongside every table that affect performance: the Visibility Map & the Free Space Map.

Visibility Map (VM):
- 2 bits per page: all_visible, all_frozen
- all_visible decides whether a query touches the heap; all_frozen decides whether VACUUM does
- If all_visible = 1, PostgreSQL skips the heap entirely and serves the query from the index (index-only scan)
- If all_visible = 0, it must fetch the heap page and check row visibility via xmin/xmax
- If all_frozen = 1, all rows on the page are frozen, and VACUUM skips the page entirely (saves I/O)

Free Space Map (FSM):
- 1 byte per page (0–255), recording roughly how much free space it has
- On INSERT or UPDATE, PostgreSQL reads that byte to jump directly to a page with room
- No byte, no shortcut: it would scan the table looking for space

The catch is that both maps go stale quickly due to PostgreSQL's MVCC nature:
- UPDATE = INSERT + mark old row dead
- DELETE = mark row dead, space not reusable yet
- In both cases: VM bit cleared, FSM not updated until space is reclaimed

VACUUM fixes all of this:
- Removes dead tuples
- Rewrites FSM bytes to reflect real free space
- Sets all_visible = 1 when the page is clean
- Sets all_frozen = 1 when all rows are old enough to freeze
- Prevents transaction ID wraparound (after ~2B transactions, IDs wrap; frozen rows are immune)

So, next time a query gets slow, don't just ask "is the query bad?" or "is the index covered?" Ask:
- When did VACUUM last run on this table?
- How many dead tuples are accumulating?
- Are our index-only scans actually skipping the heap?

#PostgreSQL #DatabaseInternals #BackendEngineering #Performance #pgpulse
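The index-only-scan decision driven by the VM can be sketched in a toy model. This is illustrative only: the VM is reduced to a dict of all_visible bits, and the heap fetch to a string label.

```python
# Toy visibility map: an index-only scan may skip the heap fetch
# only for pages whose all_visible bit is still set.

vm = {0: True, 1: True, 2: False}   # page number -> all_visible bit

def index_only_lookup(index_entry):
    page, key = index_entry
    if vm.get(page, False):
        return key, "served from index"    # heap page skipped entirely
    return key, "heap fetch required"      # must check xmin/xmax on the page

assert index_only_lookup((0, "a"))[1] == "served from index"
assert index_only_lookup((2, "c"))[1] == "heap fetch required"

# The staleness problem in one line: any write to a page clears its bit,
# and only VACUUM sets it again.
vm[0] = False                              # an UPDATE touched page 0
assert index_only_lookup((0, "a"))[1] == "heap fetch required"
```

This is why an "index-only" scan on a write-heavy table can silently degrade into ordinary heap fetches between VACUUM runs.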
This is a great breakdown of how PostgreSQL actually behaves under the hood, especially the Visibility Map (VM) & Free Space Map (FSM). What we consistently see in production is this: teams don't struggle with understanding these concepts, they struggle with observing them in real time. Questions like: Are index-only scans actually skipping heap reads? Is autovacuum keeping up or silently falling behind? How quickly is XID age increasing? By the time performance drops, VM bits are already cleared, the FSM is outdated, and bloat has started accumulating. pgpulse is designed to surface these internals (VM, FSM, vacuum behavior, bloat, and more) in real time, before they become production issues. #pgpulse #supabase #databases #postgres #database #performance
Postgres can also reclaim dead tuples on a page through HOT (Heap-Only Tuple) pruning, so even a plain SELECT can clean up a page without waiting for VACUUM.
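A toy sketch of that pruning idea, heavily simplified: real HOT pruning redirects line pointers within the page and uses proper snapshot horizons, but the core move is the same, namely discarding versions whose deleting transaction committed before any open snapshot could still need them.

```python
# Toy HOT pruning: while reading a page, drop tuple versions whose
# deleter committed before the oldest transaction still running.
# (Illustrative: real pruning works on line pointers, not raw txid compares.)

def prune(page, oldest_active_txid):
    kept = [t for t in page
            if not (t["xmax"] != 0 and t["xmax"] < oldest_active_txid)]
    page[:] = kept        # space freed inside the page, no VACUUM involved

page = [
    {"xmin": 10, "xmax": 50, "data": "v1"},  # dead; deleter long committed
    {"xmin": 50, "xmax": 0,  "data": "v2"},  # live current version
]
prune(page, oldest_active_txid=60)
assert page == [{"xmin": 50, "xmax": 0, "data": "v2"}]
```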