PostgreSQL Double‑Layer Caching

1. What is double‑layer caching?

PostgreSQL stores data in 8 kB blocks. When you run a query, the database needs those blocks in memory. PostgreSQL has its own cache – shared_buffers. But it also relies on the operating system’s page cache. So every block can live in two places at the same time inside RAM. That’s the double‑layer cache.

L1 – shared_buffers

  • In‑process, managed by PostgreSQL’s buffer manager
  • Directly accessible for queries, writes, checksums

L2 – OS page cache

  • Managed by the kernel (Linux: page cache)
  • Contains file pages from the data directory, including heap tables, indexes, WAL

Together they look like this:


Figure – The two caching layers. A block can be present in both, wasting RAM, but the second layer can also serve a read without touching disk.



2. Inside shared_buffers – the database’s own cache

shared_buffers is a fixed‑size area of memory PostgreSQL allocates at startup.


When PostgreSQL wants a page, it needs:

  • Which table?
  • Which fork?
  • Which block inside that fork?

So it identifies a page like this:

(relation, fork, block_number)        

A relation fork = a separate file on disk belonging to a table, used for a specific purpose

Types of forks (super simple)

For one table (users), PostgreSQL may have:

1. main fork (most important)

  • Stores actual data (rows)

  • This is what you normally think of as “the table”

2. FSM fork (Free Space Map)

  • Tracks which pages have empty space
  • Helps PostgreSQL know: “Where can I insert new data?”

3. VM fork (Visibility Map)

  • Tracks which pages are all‑visible (contain no dead rows)
  • Helps optimize queries (e.g., it enables index‑only scans and lets VACUUM skip pages)
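To make the fork idea concrete, here is a tiny sketch of how the three forks of one table map to files in the data directory. The database OID (16384) and relfilenode (12345) are made‑up example values; the `_fsm`/`_vm` suffixes are PostgreSQL's real naming convention.

```python
# Sketch: how the three forks of one relation map to files on disk.
# 16384 (database OID) and 12345 (relfilenode) are illustrative values.

def fork_path(db_oid: int, relfilenode: int, fork: str) -> str:
    """Return the data-directory path for a given fork of a relation."""
    suffix = {"main": "", "fsm": "_fsm", "vm": "_vm"}[fork]
    return f"base/{db_oid}/{relfilenode}{suffix}"

for fork in ("main", "fsm", "vm"):
    print(fork_path(16384, 12345, fork))
# base/16384/12345      <- rows
# base/16384/12345_fsm  <- free space map
# base/16384/12345_vm   <- visibility map
```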

When a query needs a block, the buffer manager:

  1. Computes a hash from the (relation, fork, block_number) tag
  2. Looks up the buffer in the buffer table
  3. If found → hit → returns a pointer to the buffer (no copy)
  4. If not → miss → allocates a free buffer or evicts a victim, then reads the block through the kernel
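The lookup steps above can be modelled as a plain hash table keyed by the page tag. This is a toy model, not PostgreSQL's actual buffer manager: the real buffer table maps tags to buffer IDs and involves partition locks and pinning, all of which are omitted here.

```python
# Toy model of the buffer-table lookup, keyed by (relation, fork, block).
# Real PostgreSQL stores buffer IDs, not page bytes, and takes locks.

buffer_table = {}  # tag -> 8 kB page

def read_block(tag, read_from_kernel):
    """Return (page, status) for `tag`, loading it on a miss."""
    if tag in buffer_table:          # steps 1-3: hash + lookup -> hit
        return buffer_table[tag], "hit"
    page = read_from_kernel(tag)     # step 4: miss -> read via the kernel
    buffer_table[tag] = page
    return page, "miss"

tag = ("users", "main", 42)
page, status1 = read_block(tag, lambda t: b"\x00" * 8192)
page, status2 = read_block(tag, lambda t: b"\x00" * 8192)
print(status1, status2)  # first access misses, second hits
```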

The shared_buffers parameter controls its size. A typical starting point is 25 % of system RAM on a dedicated database server.

Important: by default, PostgreSQL never bypasses the kernel; it always reads through the OS page cache.


Clock‑sweep algorithm in detail

PostgreSQL does not use a precise LRU (Least Recently Used) list. Instead it uses a clock‑sweep (also called the “second chance” or “clock” algorithm) to decide which buffer to evict when a new block must be read.

The buffer pool is treated as a circular array. Each buffer descriptor has two important counters:

  • usage_count – can range from 0 up to a maximum (typically 5). Starts at 1 when a page is first read into the buffer.
  • refcount – number of backends currently pinning the page; a page with refcount > 0 cannot be evicted.

A global clock hand points to the next candidate buffer, and the eviction logic works like this:

while a victim has not been chosen:
   if buffer's refcount > 0:
       skip (pinned)
   else if buffer's usage_count > 0:
       usage_count -= 1          // "second chance"
   else:
       choose this buffer as victim

   advance the clock hand to the next buffer        

This means a frequently accessed page has its usage_count incremented on each access and will survive multiple sweeps of the clock hand before it finally reaches 0 and is evicted. It approximates LRU with low overhead – there is no global doubly‑linked list to maintain and shuffle on every access.

Example

  • Hot page P has usage_count = 4.
  • The clock hand sweeps past it, decrementing the count to 3, 2, 1, then 0. P stays in memory the whole time.
  • On the next pass, if no backend is pinning the page (refcount = 0), P is chosen as the victim and evicted.
  • But if the page is accessed between sweeps (each access bumps usage_count by 1, up to the cap of 5), the clock hand keeps giving it second chances – P effectively stays cached as long as it is touched regularly.

This design keeps the bookkeeping per buffer – a backend only touches the usage_count of the buffer it is accessing – and avoids a central LRU lock, which is critical for concurrency in a multi‑backend database.
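The pseudocode above can be turned into a runnable sketch. The buffer state here is simplified to just usage_count and refcount; real buffer descriptors carry much more, and real PostgreSQL raises an error if every buffer stays pinned.

```python
# Runnable sketch of the clock-sweep eviction from the pseudocode above.
# Simplification: a buffer is just {usage_count, refcount}; if every
# buffer were pinned this loop would spin, where real PG errors out.

def clock_sweep(buffers, hand):
    """Return (victim_index, new_hand) for a circular buffer pool."""
    n = len(buffers)
    while True:
        buf = buffers[hand]
        if buf["refcount"] == 0:
            if buf["usage_count"] > 0:
                buf["usage_count"] -= 1        # "second chance"
            else:
                return hand, (hand + 1) % n    # victim found
        hand = (hand + 1) % n                  # skip pinned / advance

# Three buffers: a hot page, a pinned page, and a cold page.
pool = [
    {"usage_count": 4, "refcount": 0},  # hot  -> survives this sweep
    {"usage_count": 0, "refcount": 1},  # pinned -> never evicted
    {"usage_count": 0, "refcount": 0},  # cold -> first victim
]
victim, hand = clock_sweep(pool, 0)
print(victim)  # the cold page (index 2) is evicted before the hot one
```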


3. OS page cache – the kernel’s invisible help

Every read() call on a data file goes through Linux’s page cache. So when PostgreSQL asks for block 42 of base/16384/12345, the kernel:

  • checks its own cache
  • if found → copies it straight to PostgreSQL’s shared_buffers (extremely fast)
  • if not → reads from disk, populates the page cache, then copies to shared_buffers

This means a block evicted from shared_buffers may still sit in the OS cache. If PostgreSQL requests it again, it will be served from RAM – no disk I/O, just a kernel copy.

That’s the good side. The bad side: the same 8 kB block can consume 16 kB of RAM – one copy in shared_buffers, one in the page cache. This is the “double‑buffering” waste.
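A quick back‑of‑envelope calculation shows what this waste can add up to. The 60 % overlap fraction below is a made‑up assumption for illustration; you can measure the real overlap on your system with pg_buffercache and pgfincore.

```python
# Back-of-envelope cost of double buffering. The overlap fraction is an
# assumed example value, not a measurement.

shared_buffers = 4 * 1024**3      # example: 4 GiB L1 cache
overlap = 0.60                    # assume 60% of L1 pages also sit in L2

duplicated_bytes = int(shared_buffers * overlap)
print(f"{duplicated_bytes / 1024**3:.1f} GiB of RAM holds duplicate copies")
```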



4. Read path – two hits, two misses

Let’s walk through a single block read request.



Figure 2 – A miss in L1 may still be an L2 hit, avoiding physical disk I/O.

  • L1 hit → ~hundreds of nanoseconds (memory access + lock)
  • L2 hit → ~microseconds (kernel copy, no disk)
  • Disk miss → milliseconds

The planner uses effective_cache_size to estimate how much total cache (L1+L2) is available. It does not allocate memory – it’s a hint for the optimizer.
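Since effective_cache_size is only a planner hint, a common rule of thumb is to set it to roughly the total RAM the combined L1+L2 caches can occupy, often quoted as 50–75 % of system memory on a dedicated server. The helper below encodes that rule of thumb; the 75 % figure is a convention, not a PostgreSQL requirement.

```python
# Rough heuristic for effective_cache_size: a fraction of total RAM that
# shared_buffers plus the OS page cache can plausibly occupy. The 0.75
# factor is a common rule of thumb, assumed here for illustration.

def suggest_effective_cache_size(total_ram_gb: float) -> float:
    """Suggest effective_cache_size in GB (~75% of RAM on a dedicated box)."""
    return round(total_ram_gb * 0.75, 1)

print(suggest_effective_cache_size(64))  # on a 64 GB server
```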



Ring buffers – protecting the cache from sequential scans

Large sequential scans (and VACUUM) would flush all hot pages out of shared_buffers if they used the full pool. PostgreSQL uses a ring buffer – a tiny set of buffers (256 kB by default) – for these operations.


Figure – Ring buffer for a sequential scan. Most blocks end up only in the OS page cache, leaving shared_buffers hot pages untouched.

So a full table scan warms the OS page cache but hardly touches the main buffer pool. Subsequent scans (or lookups that miss the pool) can still enjoy L2 hits.
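The effect of the ring is easy to see in a toy model: no matter how many blocks the scan reads, only the ring's few slots are reused, so the rest of shared_buffers is untouched. The numbers below reflect the 256 kB default mentioned above (32 slots of 8 kB); the modelling itself is a simplification.

```python
# Toy ring buffer for a sequential scan: 32 slots * 8 kB = 256 kB,
# matching the default mentioned above. Slot reuse is simplified to
# round-robin; real PG also handles dirty pages in the ring.

ring_size = 32
ring = [None] * ring_size
touched = set()

for block_no in range(10_000):      # scan a 10,000-block table
    slot = block_no % ring_size     # recycle the same few slots
    ring[slot] = block_no
    touched.add(slot)

print(len(touched))  # only 32 buffer slots used for the whole scan
```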



5. Inspecting what is really cached

Use the pg_buffercache extension to see which tables and indexes occupy shared_buffers:

SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS size_in_buffer
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.relfilenode IS NOT NULL
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;        

To estimate what the OS cache holds, use tools like vmtouch or the pgfincore extension (which uses mincore() and posix_fadvise).



Summary – key takeaways

  1. PostgreSQL always reads data through the OS page cache; it never bypasses it with direct I/O by default.
  2. A block can reside simultaneously in shared_buffers (L1) and the OS page cache (L2) – this speeds up re‑reads but wastes RAM.
  3. Ring buffers protect shared_buffers from large sequential scans, leaving hot pages in L1 and warming the OS cache.
  4. Tune based on your workload: small hot datasets prefer a big L1; huge datasets with mixed access rely on the OS cache as a scalable second layer.

Double‑layer caching isn’t a flaw – it’s a conscious design choice that gives you two chances to avoid a disk read. Understand it, size it, and your PostgreSQL queries will fly.
