Postgres doesn't know if a row is visible to your query by looking at the row. It knows by checking who wrote it and whether that transaction is still alive. Every visibility decision is a transaction ID lookup, evaluated fresh against the current snapshot.

Here's how Postgres decides what you see:

1. Every transaction gets a monotonically increasing transaction ID (txid) at start.
2. At query time, Postgres builds a snapshot, capturing the current txid and the list of all in-progress transaction IDs.
3. For every tuple, Postgres evaluates two header fields: xmin (who created it) and xmax (who deleted it).
4. xmin must belong to a committed transaction that was visible at snapshot time. Otherwise the tuple never existed for this query.
5. xmax must be 0 or belong to an aborted or not-yet-committed transaction; otherwise the tuple is deleted.
6. The commit status of every transaction ID is stored in pg_xact, the commit log Postgres checks to confirm committed or aborted.

This is why a rolled-back DELETE leaves the tuple fully intact: xmax is set but points to an aborted transaction. Postgres reads pg_xact, confirms the abort, and treats the tuple as live.

The row isn't marked visible or invisible. Postgres works that out fresh, every time, for every tuple.

#PostgreSQL #DatabaseEngineering #BackendEngineering
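The rules above can be sketched as a toy visibility function. This is a simplified model under stated assumptions, not the actual Postgres code: `committed` stands in for pg_xact, and the snapshot is reduced to a single txid plus a set of in-progress transactions.

```python
# Toy model of Postgres tuple visibility. All names are illustrative:
# `committed` stands in for pg_xact, `in_progress` for the snapshot's
# list of running transactions.

def is_visible(xmin, xmax, snapshot_xid, in_progress, committed):
    """Return True if a tuple (xmin, xmax) is visible to the snapshot."""
    # xmin must be a committed transaction the snapshot could already
    # see: not in progress, and not started after the snapshot.
    creator_visible = (
        xmin in committed
        and xmin not in in_progress
        and xmin <= snapshot_xid
    )
    if not creator_visible:
        return False  # the tuple never existed for this query
    # xmax must be 0 (never deleted) or belong to a transaction that
    # aborted or has not committed from this snapshot's point of view.
    if xmax == 0:
        return True
    deleter_committed = (
        xmax in committed
        and xmax not in in_progress
        and xmax <= snapshot_xid
    )
    return not deleter_committed

# A rolled-back DELETE: xmax is set, but txid 200 aborted (it never
# enters the committed set), so the tuple is still live.
committed = {100}
print(is_visible(100, 200, 300, set(), committed))          # True
print(is_visible(100, 200, 300, set(), committed | {200}))  # False
```

Running the rolled-back DELETE case shows the point of the post: the tuple only becomes invisible once the deleter actually commits.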
Deleting a row in Postgres doesn't remove it from disk. It marks it invisible and leaves the rest to VACUUM. Postgres just sets a single field on the tuple header and moves on.

Here's what DELETE actually does:

1. Postgres locates the target tuple on its heap page.
2. It sets xmax on that tuple to the current transaction ID.
3. That's it. The tuple stays on the page, unchanged, in place.
4. Future transactions check xmax during visibility evaluation. If it's set to a committed transaction, the row is invisible to them.
5. The space the tuple occupies is not freed; it remains on the page.
6. VACUUM is the only process that can reclaim that space, and only after no active transaction can still see the deleted version.

A table with heavy deletes doesn't shrink. It accumulates dead tuples: invisible rows that still get read during sequential scans. A 10 million row table where 8 million rows were deleted is not a 2 million row table in terms of storage or scan cost.

A DELETE tells Postgres a row is gone. VACUUM is what makes that true on disk.

Most of us treat DELETE as an immediate removal. How long did it take you to unlearn that assumption in Postgres?

#Postgres #PostgreSQL #BackendEngineering
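The delete-is-just-a-header-write idea can be modeled in a few lines. Everything here is illustrative: the `Tuple` class and the simplified `vacuum` rule are toy stand-ins, not Postgres's internal structures.

```python
# Toy heap page showing that DELETE only stamps xmax on the tuple
# header; the tuple itself stays in place until VACUUM reclaims it.

class Tuple:
    def __init__(self, data, xmin):
        self.data = data
        self.xmin = xmin
        self.xmax = 0  # 0 = never deleted

page = [Tuple({"id": i}, xmin=10) for i in range(5)]

def delete(tup, txid):
    tup.xmax = txid  # that's the whole write: one header field

delete(page[0], 42)

# The page still holds all five tuples; one is merely marked dead.
dead = [t for t in page if t.xmax != 0]
print(f"tuples on page: {len(page)}, dead: {len(dead)}")  # 5, 1

def vacuum(page, oldest_active_txid):
    # Simplified rule: reclaim a tuple only when its deleter committed
    # before every still-active transaction's snapshot.
    return [t for t in page if t.xmax == 0 or t.xmax >= oldest_active_txid]

page = vacuum(page, oldest_active_txid=100)
print(f"after vacuum: {len(page)}")  # 4
```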
I used JSON in PostgreSQL for 6 months before someone told me I was doing it wrong. Here is what nobody explains clearly: JSON and JSONB are not the same type with different names. They are completely different storage mechanisms.

JSON stores your data as plain text. Every time you query it, PostgreSQL re-parses the entire string. You cannot put a GIN index on it. On a table with 1 million rows, that query takes 4 seconds.

JSONB stores your data as binary. Parsed once on write, never again. You can put a GIN index on any key inside the blob. The same query on the same 1 million rows takes 4 milliseconds. That is not a typo: 4 seconds vs 4 milliseconds, 1000x faster. One word changed.

There is one other difference nobody mentions: JSON keeps duplicate keys and preserves key order. Sounds useful. It is not. Key order in JSON is meaningless, and duplicate keys make your data silently inconsistent. JSONB deduplicates keys automatically: last value wins. Clean data, every time.

The only reason to use JSON over JSONB is if you genuinely need to preserve exact whitespace and key ordering in your output. In 3 years of PostgreSQL work, I have needed that exactly zero times.

Default to JSONB. Add a GIN index on the columns you query. Watch your slow JSON queries disappear.

#JSON #JSONB #DatabaseOptimization #QueryOptimization #PerformanceTuning #DataModeling #PostgreSQL #Database #DataEngineering #Backend #GIS
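The last-value-wins deduplication is easy to see with any parser that behaves the way jsonb does. Python's `json` module also keeps the last duplicate key, which makes for a quick illustration; this demonstrates the semantics, not Postgres itself.

```python
import json

# A document with a duplicate key. Postgres's json type would store
# this text verbatim, duplicates and all; jsonb deduplicates on write,
# keeping the last value. Python's json parser does the same.
doc = '{"status": "draft", "status": "published"}'

parsed = json.loads(doc)
print(parsed)  # {'status': 'published'} - last value wins
```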
Updating a row in Postgres doesn't change it. It writes a brand new version and leaves the old one behind. Every UPDATE is internally two operations: the old tuple is marked dead and a new tuple is written with the changed values.

Here's what happens inside every UPDATE:

1. Postgres locates the current tuple on its heap page.
2. The old tuple's xmax is set to the current transaction ID, marking it dead for future readers.
3. A new tuple is written with xmin set to the current transaction ID and xmax = 0.
4. If the new tuple fits on the same page and no indexed column changed, Postgres performs a HOT update (Heap Only Tuple): no index entries are touched.
5. Otherwise, the new tuple lands on a different page and every index pointing to that row must be updated too.
6. The old tuple stays on disk until VACUUM reclaims it.

This is why UPDATE-heavy tables bloat. Every change leaves a dead tuple behind. A row updated a thousand times has a thousand dead versions sitting in the heap: all invisible to queries, all consuming space, all getting scanned.

How do you track UPDATE-driven bloat in production? Is it something your team monitors proactively?

#Postgres #PostgreSQL #BackendEngineering
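The two-operation nature of UPDATE can be sketched as "mark old version dead, append new version". This is a toy model with illustrative names, only meant to show why a row updated a thousand times leaves a thousand dead versions behind.

```python
# Toy heap: a list of tuple versions, each carrying xmin/xmax headers.

heap = []

def insert(data, txid):
    heap.append({"xmin": txid, "xmax": 0, "data": data})

def update(row_id, new_data, txid):
    for t in heap:
        # Find the live version of this row (xmax == 0).
        if t["data"].get("id") == row_id and t["xmax"] == 0:
            t["xmax"] = txid                          # old version marked dead
            insert({**new_data, "id": row_id}, txid)  # new version appended
            return

insert({"id": 1, "qty": 0}, txid=10)
for i in range(1000):
    update(1, {"qty": i + 1}, txid=11 + i)

live = [t for t in heap if t["xmax"] == 0]
dead = [t for t in heap if t["xmax"] != 0]
print(len(live), len(dead))  # 1 1000
```

One live version, a thousand dead ones: all still sitting in the heap until a (here unmodeled) VACUUM pass reclaims them.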
A Postgres table is not a list of rows. It's a collection of 8KB pages, and rows are just byte sequences packed inside them.

Most engineers think of a table as an ordered structure: insert a row, it goes to the end. Postgres thinks in pages. Every table is a heap file, split into fixed 8KB blocks. Rows go wherever there's space.

Every page consists of the following:

1. A page header: metadata about the page itself.
2. Tuples: the actual rows, packed from the bottom up.
3. Item pointers: fixed slots at the top, each pointing to a tuple's location within the page.
4. Every tuple carries its own header: which transaction inserted it, which deleted it, whether it's currently visible.

When a page fills up, Postgres moves to the next one. No global directory tracks which row lives on which page, and the heap is intentionally unordered. This is why a sequential scan reads every page regardless of how many rows match. Without an index, there's no way to skip pages.

A table with heavy updates and deletes accumulates dead tuples: old row versions still sitting in pages, occupying space, getting scanned on every read. The physical layout of how rows are stored in Postgres directly affects query cost.

Storage internals almost never appear in Postgres tutorials. When did you first run into a dead tuple problem?

#PostgreSQL #DatabaseEngineering #BackendEngineering
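The page layout above lends itself to a small back-of-the-envelope calculator. The sizes match the documented defaults (8192-byte pages, a 24-byte page header, 4-byte line pointers); everything else, including the function itself, is a simplification that ignores alignment and per-tuple header overhead.

```python
# Rough free-space math for an 8KB heap page: header at the front,
# one item pointer per tuple at the top, tuple data packed from the
# bottom. Ignores alignment padding and tuple header overhead.

PAGE_SIZE = 8192     # default block size
PAGE_HEADER = 24     # PageHeaderData
ITEM_POINTER = 4     # one line pointer per tuple

def free_space(tuple_sizes):
    """Approximate free bytes on a page holding tuples of the given sizes."""
    used = PAGE_HEADER + len(tuple_sizes) * ITEM_POINTER + sum(tuple_sizes)
    return PAGE_SIZE - used

# 60 rows of ~120 bytes each:
print(free_space([120] * 60))  # 8192 - 24 - 240 - 7200 = 728
```

Once `free_space` drops below the size of an incoming tuple, Postgres simply moves on to another page: the heap has no ordering to maintain.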
Why Did PostgreSQL Consume 16 GB If work_mem Was Set to 64 MB?

The answer comes from a PostgreSQL core investigation by Tomas Vondra (pgsql-hackers). What actually happened:

1. The query plan contained a Hash Join node with a huge expected row count and large width.
2. PostgreSQL tried to stay within work_mem, but the Hash Join realized the data wouldn't fit in the allocated 64 MB.
3. To cope, it started splitting the data into smaller chunks (batches) and spilling them to temp files on disk.
4. Due to the enormous data volume, PostgreSQL created 1 million batches.

Here's the math: 1 million batches = 1 million temp files for the hash table + 1 million files for the outer relation. Total: 2 million open BufFile objects. Each file gets an 8 KB buffer allocated in RAM.

2,000,000 × 8 KB = ~16 GB of memory!

This behavior can trigger the OOM killer on any PostgreSQL below version 16.

The fix (committed to PostgreSQL 16+): https://lnkd.in/dGGq9WC3
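The arithmetic from the investigation checks out, and is worth writing down explicitly: the memory comes from the per-file buffers, not from work_mem at all.

```python
# Reproducing the math: each batch opens a temp file for both sides
# of the join, and every open BufFile keeps an 8 KB buffer in RAM.

BATCHES = 1_000_000
SIDES = 2                  # inner hash table + outer relation
BUFFILE_BUFFER = 8 * 1024  # 8 KB buffer per open BufFile

open_files = BATCHES * SIDES
ram = open_files * BUFFILE_BUFFER
print(f"{open_files:,} BufFiles -> {ram / 2**30:.1f} GiB")
# 2,000,000 BufFiles -> 15.3 GiB (roughly 16 GB in decimal units)
```

None of this is counted against work_mem, which is why the 64 MB limit offered no protection before the PostgreSQL 16 fix.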
PostgreSQL maintains two tiny files alongside every table that affect performance: the Visibility Map and the Free Space Map.

Visibility Map (VM):
- 2 bits per page: all_visible and all_frozen
- all_visible decides whether a query touches the heap; all_frozen decides whether VACUUM does
- If all_visible = 1, PostgreSQL skips the heap entirely and serves the query from the index (index-only scan)
- If all_visible = 0, it must fetch the heap page and check row visibility via xmin/xmax
- If all_frozen = 1, all rows on the page are frozen and VACUUM skips the page entirely (saves I/O)

Free Space Map (FSM):
- 1 byte per page (0–255)
- On INSERT or UPDATE, PostgreSQL reads that byte to find a page with room and jumps directly to it
- No byte, no shortcut: it would have to scan the table looking for space

The catch is that both maps go stale quickly due to PostgreSQL's MVCC nature:
- UPDATE = INSERT + mark the old row dead
- DELETE = mark the row dead; its space is not reusable yet
- In both cases the VM bit is cleared and the FSM is not updated until the space is reclaimed

VACUUM fixes all of this:
- Removes dead tuples
- Rewrites FSM bytes to reflect real free space
- Sets all_visible = 1 when the page is clean
- Sets all_frozen = 1 when all rows are old enough to freeze
- Prevents transaction ID wraparound (txids are 32-bit and wrap after ~2 billion transactions; frozen rows are immune)

So, next time a query gets slow, don't just ask "is the query bad?" or "is the index covered?" Ask:
- When did VACUUM last run on this table?
- How many dead tuples are accumulating?
- Are our index-only scans actually skipping the heap?

#PostgreSQL #DatabaseInternals #BackendEngineering #Performance #pgpulse
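The all_visible shortcut can be sketched as a toy index-only scan. This is an illustrative model, not Postgres code: `vm_all_visible` stands in for the visibility map and `heap_is_visible` for the xmin/xmax check on the heap page.

```python
# Toy index-only scan: if a page's all_visible bit is set, the heap
# fetch is skipped entirely; otherwise each row must be checked on
# the heap page via its xmin/xmax headers.

def index_only_scan(matches, vm_all_visible, heap_is_visible):
    """matches: (page, tid) pairs returned by the index."""
    heap_fetches = 0
    rows = []
    for page, tid in matches:
        if vm_all_visible.get(page):
            rows.append(tid)   # VM guarantees every row on the page is visible
        else:
            heap_fetches += 1  # stale VM bit: must fetch the heap page
            if heap_is_visible(page, tid):
                rows.append(tid)
    return rows, heap_fetches

vm = {1: True, 2: False}  # page 2's bit was cleared by a recent write
rows, fetches = index_only_scan(
    [(1, "a"), (1, "b"), (2, "c")], vm, lambda p, t: True
)
print(rows, fetches)  # ['a', 'b', 'c'] 1
```

This is exactly the "are our index-only scans actually skipping the heap?" question: the more VM bits are stale, the more an index-only scan degrades into ordinary heap fetches.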
"Next time a query gets slow, don't just ask 'is the query bad?' or 'is the index covered?'" Instead, add this to your arsenal.
This is a great breakdown of how PostgreSQL actually behaves under the hood, especially the Visibility Map (VM) and Free Space Map (FSM). What we consistently see in production is this: teams don't struggle with understanding these concepts; they struggle with observing them in real time. Questions like:

- Are index-only scans actually skipping heap reads?
- Is autovacuum keeping up or silently falling behind?
- How quickly is XID age increasing?

By the time performance drops, VM bits are already cleared, the FSM is outdated, and bloat has started accumulating. pgpulse is designed to surface these internals (VM, FSM, vacuum behavior, bloat, and more) in real time, before they become production issues.

#pgpulse #supabase #databases #postgres #database #performance
#postgres users, an interesting deep dive into what VACUUM can do for the performance of your databases. I liked the three important questions you need to ask whenever your DB has slow queries:

- When did VACUUM last run on this table?
- How many dead tuples are accumulating?
- Are our index-only scans actually skipping the heap?

#postgres #pgpulse #supabase #neon #railway #DatabaseInternals #BackendEngineering #Performance
Two queries hit the same row at the same time. One is writing. The other reads instantly: no wait, no lock, no queue. Most databases can't do that. Postgres can, because of Multi-Version Concurrency Control (MVCC)!

In Postgres, readers never wait for writers and writers never wait for readers, because they're not looking at the same version of the data.

Here's how Postgres pulls this off:

1. Every transaction gets a transaction ID (txid) at start.
2. When a query runs, Postgres takes a snapshot of which transactions are committed, in progress, or not yet started at that exact moment.
3. For every tuple it encounters, Postgres checks xmin and xmax against that snapshot.
4. If xmin is committed and visible to the snapshot, and xmax is 0 or not yet committed, the tuple is visible.
5. A writer creating a new tuple version doesn't touch what the reader sees; both versions coexist on the heap simultaneously.

This is why long-running transactions are dangerous. The snapshot they hold stays open, and VACUUM won't touch any dead tuple that snapshot can still see. One idle transaction silently stalls cleanup for the entire database!

Consistency in Postgres isn't enforced with locks. It's enforced with versions.

#PostgreSQL #DatabaseEngineering #BackendEngineering #SoftwareArchitecture
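The coexisting-versions idea can be shown with two row versions and two snapshots. A toy sketch under simplifying assumptions: the snapshot is just a txid plus an in-progress set, and the version list stands in for the heap.

```python
# Two versions of the same row coexist on the (toy) heap; each
# snapshot independently picks the version it can see. Illustrative
# model, not Postgres internals.

versions = [
    {"xmin": 10, "xmax": 20, "data": "v1"},  # written by 10, replaced by 20
    {"xmin": 20, "xmax": 0,  "data": "v2"},  # written by 20, still live
]

def read(snapshot_xid, in_progress):
    """Return the row version visible to this snapshot."""
    for v in versions:
        created = v["xmin"] <= snapshot_xid and v["xmin"] not in in_progress
        deleted = (v["xmax"] != 0 and v["xmax"] <= snapshot_xid
                   and v["xmax"] not in in_progress)
        if created and not deleted:
            return v["data"]

# A reader whose snapshot started while txid 20 was still in progress
# sees v1; a later reader sees v2. Neither ever waits for the writer.
print(read(25, in_progress={20}))   # v1
print(read(25, in_progress=set()))  # v2
```

As long as the first reader's snapshot stays open, v1 cannot be reclaimed: that is the long-running-transaction hazard described above.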
Have you ever used pg_xact or transaction ID information directly to diagnose a visibility or bloat issue in production?