A Postgres table is not a list of rows. It's a collection of 8KB pages, and rows are just byte sequences packed inside them.

Most engineers think of a table as an ordered structure: insert a row, it goes to the end. Postgres thinks in pages. Every table is a heap file, split into fixed 8KB blocks, and rows go wherever there's space.

Every page consists of the following:
1. A page header - metadata about the page itself
2. Item pointers - fixed slots at the top, each pointing to a tuple's location within the page
3. Tuples - the actual rows, packed from the bottom up

Every tuple also carries its own header: which transaction inserted it, which deleted it, whether it's currently visible.

When a page fills up, Postgres moves to the next one. No global directory tracks which row lives on which page; the heap is intentionally unordered. This is why a sequential scan reads every page regardless of how many rows match. Without an index, there's no way to skip pages.

A table with heavy updates and deletes accumulates dead tuples: old row versions still sitting in pages, occupying space, getting scanned on every read. The physical layout of rows in Postgres directly affects query cost, yet storage internals almost never appear in Postgres tutorials.

When did you first run into a dead tuple problem? #PostgreSQL #DatabaseEngineering #BackendEngineering
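You can see this layout directly with the pageinspect extension that ships with Postgres. A minimal sketch, assuming a hypothetical table named users:

```sql
-- Hypothetical example: "users" is a placeholder table name.
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Item pointers and tuple headers for page 0 of the heap:
-- lp = slot number, lp_off/lp_len = tuple location within the 8KB page,
-- t_xmin/t_xmax = inserting/deleting transaction IDs from the tuple header.
SELECT lp, lp_off, lp_len, t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('users', 0));
```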
Postgres Table Storage: Pages and Tuples
I've been working on hybrid search in Postgres and wrote down the smallest version that still explains the idea. This first piece is about the setup: why I kept text and embeddings in the same database, why `pgvector` and `pg_trgm` were enough for this case, and what the basic write path looked like.
That 10MB JSON column you're querying? It's not in your table. Postgres moved it somewhere else the moment it was inserted.

Postgres has an 8KB page size, and a single row cannot exceed that boundary. TL;DR: Postgres stores large values in a TOAST table. Here's exactly what happens:

1. Every table with columns that can exceed ~2KB gets a dedicated TOAST table created automatically alongside it.
2. When a value crosses ~2KB, Postgres first tries to compress it using LZ compression.
3. If compression brings it under the threshold, it stays inline on the heap page.
4. If it's still too large, Postgres slices the value into ~2KB chunks and stores them in the TOAST table, leaving only a small pointer on the original row.
5. On read, Postgres fetches and reassembles those chunks.

This is why SELECT id, name FROM large_table is fast even when the table has a 10MB payload column: Postgres never fetches the TOAST chunks unless the query explicitly needs them.

The cost surfaces when queries do need the large value. Every TOAST fetch is additional I/O, and potentially many page reads to reassemble a single column value.

TOAST is the Postgres mechanism for data that simply doesn't belong on the same page as the row that owns it. #Postgres #BackendEngineering
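You can check whether Postgres created a TOAST table for one of your tables by looking at pg_class. A small sketch, with large_table as a placeholder name:

```sql
-- Hypothetical example: "large_table" is a placeholder table name.
-- reltoastrelid points to the automatically created TOAST table (0 if none).
SELECT reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE relname = 'large_table';

-- The TOAST table itself stores each oversized value as ~2KB chunks,
-- keyed by (chunk_id, chunk_seq), which Postgres reassembles on read.
```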
An index doesn't tell Postgres where your row is. It gives Postgres a starting point, and then Postgres walks a tree to find it.

Indexes are multi-level structures: every lookup is a traversal, and the speed comes from how few levels that traversal takes. Here's what a B-Tree index looks like physically:

1. The index is a file of 8KB pages, just like the heap, but structured differently.
2. At the top sits the root page, a single page containing keys and pointers to the pages below.
3. Below the root are internal pages; each holds a sorted range of keys and pointers to the next level down.
4. At the bottom sit leaf pages. These hold the actual index keys and their corresponding ctid values, pointing to the exact heap page and slot where the row lives.
5. Every leaf page holds a pointer to the next leaf page, forming a linked list across the bottom of the tree.

When Postgres executes an index scan, it starts at the root, compares the search key against the keys on each internal page, follows the right pointer downward, and lands on the correct leaf page. From there, the ctid tells it exactly which heap page to fetch.

A typical B-Tree on a large table is only 3-4 levels deep. That's 3-4 page reads to locate any row, regardless of table size.

The speed of an index isn't magic. It's the result of needing to read only 3 or 4 pages out of millions to find exactly what you're looking for. #PostgreSQL #BackendEngineering #Indexes
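The tree depth is directly observable with pageinspect, and the ctid a leaf entry points to is a hidden system column on every row. A sketch, assuming a hypothetical table users with primary key index users_pkey:

```sql
-- Hypothetical example: "users" / "users_pkey" are placeholder names.
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- "level" is the height of the tree above the leaves; small values even on
-- large tables are why lookups cost only a few page reads.
SELECT root, level FROM bt_metap('users_pkey');

-- ctid = (heap page, slot): the physical address an index leaf entry stores.
SELECT ctid, id FROM users WHERE id = 1;
```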
🚀 What *actually* happens inside PostgreSQL when you run a query?

Most developers use PostgreSQL daily… but very few understand what’s happening behind the scenes. Here’s a simple breakdown 👇

🟢 1. You run a query
Example: SELECT * FROM users WHERE id = 1;

🧠 2. Query Parser kicks in
PostgreSQL checks:
* Is the SQL valid?
* Do the tables/columns exist?

⚙️ 3. Planner / Optimizer
This is where the magic happens ✨ Postgres decides:
* Use an index? 🔍
* Or scan the full table? 📦

🚀 4. Executor runs the query
Now PostgreSQL actually fetches the data.

⚡ 5. Memory check (super important)
* If the data is in cache (shared buffers) → FAST ⚡
* If not → load from disk → slower

💾 6. Disk access
Data is stored in pages (8KB blocks). Postgres reads only the required pages, not the entire table 👌

🔁 7. For WRITE queries (INSERT / UPDATE / DELETE)
Example: UPDATE users SET age = 30 WHERE id = 1;
Here’s what REALLY happens:
1. The data is updated in memory
2. The change is written to the WAL (Write-Ahead Log) 🧾
3. A success response is sent ✅
4. The actual disk write happens later
👉 This is called: “Log first, write later”

🛡️ 8. Crash Safety (WAL)
If the system crashes 💥 Postgres:
* Replays the WAL
* Restores the data
No data loss 🔥

🔄 9. MVCC (Concurrency magic)
Instead of overwriting:
Old row → stays
New row → created
So:
* Readers don’t block writers
* Writers don’t block readers

🧹 10. VACUUM (cleanup crew)
Old row versions = “dead tuples”. Postgres cleans them using VACUUM / Autovacuum 🤖

🎯 Final Flow (Simple)
Query → Parser → Planner → Executor → Memory → Disk → WAL → Result

💡 Takeaway: PostgreSQL is not just a database. It’s a highly optimized system with:
✔ Smart query planning
✔ Efficient caching
✔ Crash recovery (WAL)
✔ High concurrency (MVCC)

If you’re preparing for backend/system design interviews, understanding this gives you a HUGE edge.

#PostgreSQL #BackendEngineering #SystemDesign #Databases #SoftwareEngineering #plpgsql
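Most of the read path above is visible in EXPLAIN. A minimal sketch, reusing the hypothetical users table from the example:

```sql
-- ANALYZE runs the query for real; BUFFERS shows whether the 8KB pages came
-- from shared buffers ("shared hit") or had to be loaded from disk ("read").
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE id = 1;
```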
Postgres doesn't know if a row is visible to your query by looking at the row. It knows by checking who wrote it and whether that transaction is still alive. Every visibility decision is a transaction ID lookup, evaluated fresh against the current snapshot.

Here's how Postgres decides what you see:

1. Every transaction gets a monotonically increasing transaction ID (txid) at start.
2. At query time, Postgres builds a snapshot, capturing the current txid and the list of all in-progress transaction IDs.
3. For every tuple, Postgres evaluates two fields: xmin (who created it) and xmax (who deleted it).
4. xmin must belong to a committed transaction that was visible at snapshot time. Otherwise the tuple never existed for this query.
5. xmax must be 0 or belong to an aborted or not-yet-committed transaction; otherwise the tuple is deleted.
6. The commit status of every transaction ID is stored in pg_xact, a bitmap Postgres checks to confirm committed or aborted.

This is why a rolled-back DELETE leaves the tuple fully intact: xmax is set but points to an aborted transaction. Postgres reads pg_xact, confirms the abort, and treats the tuple as live.

The row isn't marked visible or invisible. Postgres works that out fresh, every time, for every tuple. #PostgreSQL #DatabaseEngineering #BackendEngineering
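xmin and xmax aren't hidden theory; they are system columns you can select on any table. A sketch, with users as a placeholder table name:

```sql
-- Hypothetical example: "users" is a placeholder table name.
-- xmax = 0 means no transaction has deleted (or tried to delete) this tuple.
SELECT ctid, xmin, xmax, id FROM users WHERE id = 1;

-- In another session: BEGIN; DELETE FROM users WHERE id = 1; ROLLBACK;
-- Re-running the SELECT then shows a nonzero xmax, yet the row is still
-- visible, because pg_xact records that transaction as aborted.
```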
PostgreSQL maintains two tiny files alongside every table that affect performance: the Visibility Map and the Free Space Map.

Visibility Map (VM):
- 2 bits per page: all_visible, all_frozen
- all_visible decides whether a query touches the heap; all_frozen decides whether VACUUM does
- If all_visible = 1, PostgreSQL skips the heap entirely and serves the query from the index (index-only scan)
- If all_visible = 0, it must fetch the heap page and check row visibility via xmin/xmax
- If all_frozen = 1, all rows on the page are frozen and VACUUM skips the page entirely (saves I/O)

Free Space Map (FSM):
- 1 byte per page (0–255), a coarse measure of free space
- On INSERT or UPDATE, PostgreSQL reads that byte to find a page with room and jumps directly to it
- No FSM entry, no shortcut: it would have to scan the table looking for space

The catch is that both maps go stale quickly due to PostgreSQL's MVCC nature:
- UPDATE = INSERT + mark the old row dead
- DELETE = mark the row dead; the space is not reusable yet
- In both cases: the VM bit is cleared, and the FSM is not updated until the space is reclaimed

VACUUM fixes all of this:
- Removes dead tuples
- Rewrites FSM bytes to reflect real free space
- Sets all_visible = 1 when the page is clean
- Sets all_frozen = 1 when all rows are old enough to freeze
- Prevents transaction ID wraparound (after ~2B transactions, IDs wrap; frozen rows are immune)

So next time a query gets slow, don't just ask "is the query bad?" or "is the index covered?" Ask:
- When did VACUUM last run on this table?
- How many dead tuples are accumulating?
- Are our index-only scans actually skipping the heap?

#PostgreSQL #DatabaseInternals #BackendEngineering #Performance #pgpulse
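Both maps can be inspected with extensions that ship with Postgres: pg_visibility and pg_freespacemap. A sketch, with users as a placeholder table name:

```sql
-- Hypothetical example: "users" is a placeholder table name.
CREATE EXTENSION IF NOT EXISTS pg_visibility;
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- The VM's two bits, per heap page:
SELECT blkno, all_visible, all_frozen FROM pg_visibility_map('users');

-- The FSM's per-page estimate of available free space, in bytes:
SELECT blkno, avail FROM pg_freespace('users');
```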
Next time a query gets slow, don't just ask "is the query bad?" or "is the index covered?" Instead, add the Visibility Map and the Free Space Map to your arsenal.
This is a great breakdown of how PostgreSQL actually behaves under the hood, especially the Visibility Map (VM) and Free Space Map (FSM). What we consistently see in production is this: teams don't struggle with understanding these concepts, they struggle with observing them in real time. Questions like:
- Are index-only scans actually skipping heap reads?
- Is autovacuum keeping up or silently falling behind?
- How quickly is XID age increasing?
By the time performance drops, the VM bits are already cleared, the FSM is outdated, and bloat has started accumulating. pgpulse is designed to surface these internals (VM, FSM, vacuum behavior, bloat, and more) in real time, before they become production issues. #pgpulse #supabase #databases #postgres #database #performance
#postgres users, an interesting deep dive into what VACUUM can do for the performance of your databases. I liked the three questions you need to ask whenever your DB has slow queries:
𝗪𝗵𝗲𝗻 𝗱𝗶𝗱 𝗩𝗔𝗖𝗨𝗨𝗠 𝗹𝗮𝘀𝘁 𝗿𝘂𝗻 𝗼𝗻 𝘁𝗵𝗶𝘀 𝘁𝗮𝗯𝗹𝗲?
𝗛𝗼𝘄 𝗺𝗮𝗻𝘆 𝗱𝗲𝗮𝗱 𝘁𝘂𝗽𝗹𝗲𝘀 𝗮𝗿𝗲 𝗮𝗰𝗰𝘂𝗺𝘂𝗹𝗮𝘁𝗶𝗻𝗴?
𝗔𝗿𝗲 𝗼𝘂𝗿 𝗶𝗻𝗱𝗲𝘅-𝗼𝗻𝗹𝘆 𝘀𝗰𝗮𝗻𝘀 𝗮𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝘀𝗸𝗶𝗽𝗽𝗶𝗻𝗴 𝘁𝗵𝗲 𝗵𝗲𝗮𝗽?
#postgres #pgpulse #supabase #neon #railway #DatabaseInternals #BackendEngineering #Performance
You run UPDATE on a row in Postgres. Seems simple. Postgres doesn't actually update the row. It writes a new version somewhere else and leaves the old one behind as a dead tuple. DELETE does the same thing: the row isn't gone, just flagged as no longer visible.

This is MVCC. It's how readers and writers can hammer the same table without blocking each other. Different transactions see different valid versions of the same row, because the old versions are still hanging around for anyone who needs them.

It's pretty clever, and it's also why an UPDATE-heavy table can quietly turn into a bloated mess. A table that should be 400MB ends up being 40GB, stuffed full of row versions from the past. Scans get slower. Indexes bloat. Queries that used to be quick start dragging. That's what VACUUM is grinding away at in the background, and on busy tables it falls behind.

Two things worth knowing:

1. UPDATE is basically DELETE plus INSERT under the hood. Every UPDATE leaves a dead tuple. Update the same row a thousand times a second and you're generating a thousand dead tuples a second.

2. HOT updates are the escape hatch. If you're not touching indexed columns and there's room on the same page, Postgres skips the index update and keeps the bloat down. Worth knowing when you're designing schemas.

You won't notice any of this until you're staring at a huge table that should be tiny, wondering where it all went. It went into the past. Postgres just hadn't got round to vacuuming it up yet.
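The symptoms described here are all countable in pg_stat_user_tables, including how many updates took the HOT path. A sketch, with users as a placeholder table name:

```sql
-- Hypothetical example: "users" is a placeholder table name.
-- n_dead_tup rising faster than autovacuum clears it is the bloat signal;
-- n_tup_hot_upd vs n_tup_upd shows how often the HOT escape hatch applied.
SELECT n_live_tup, n_dead_tup, n_tup_upd, n_tup_hot_upd,
       last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'users';
```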