PostgreSQL Double‑Layer Caching
1. What is double‑layer caching?
PostgreSQL stores data in 8 kB blocks. When you run a query, the database needs those blocks in memory. PostgreSQL has its own cache – shared_buffers. But it also relies on the operating system’s page cache. So every block can live in two places at the same time inside RAM. That’s the double‑layer cache.
L1 – shared_buffers
L2 – OS page cache
Together they look like this:
Figure – The two caching layers. A block can be present in both, wasting RAM, but the second layer can also serve a read without touching disk.
2. Inside shared_buffers – the database’s own cache
shared_buffers is a fixed‑size area of memory PostgreSQL allocates at startup.
When PostgreSQL wants a page, it needs to know which file the page belongs to and where the page sits inside that file. So it identifies a page like this:
(relation, fork, block_number)
A relation fork = a specific file of a table used for a specific purpose
Types of forks (super simple)
For one table (users), PostgreSQL may have:
1. main fork (most important)
This is what you normally think of as the table
2. FSM fork (Free Space Map)
Tracks how much free space each page has, so inserts can quickly find a page with room
3. VM fork (Visibility Map)
Tracks which pages contain only tuples visible to every transaction, so VACUUM can skip them and index‑only scans can avoid visiting the heap
When a query needs a block, the buffer manager:
1. Builds the tag (relation, fork, block_number) and looks it up in a shared hash table that maps tags to buffer slots.
2. On a hit, pins the buffer and returns it – no system call at all.
3. On a miss, picks a victim buffer to evict (via the clock‑sweep algorithm described below), writes it out first if it is dirty, and read()s the requested block from the kernel into that slot.
The shared_buffers parameter controls its size. A typical starting point is 25 % of system RAM on a dedicated database server.
Important: PostgreSQL never bypasses the kernel; it always reads through the OS page cache.
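The lookup path can be sketched in miniature. This is an illustrative Python model, not PostgreSQL source: the class and its data are invented, and eviction is elided.

```python
# Toy model of the buffer manager's tag -> slot lookup.
class BufferManager:
    def __init__(self, n_buffers):
        self.slots = [None] * n_buffers        # the cached 8 kB pages
        self.lookup = {}                       # tag -> slot index (shared hash table)
        self.free = list(range(n_buffers))     # unused slots

    def read_block(self, tag):
        """tag = (relfilenode, fork, block_number)"""
        if tag in self.lookup:                 # L1 hit: no system call at all
            return self.slots[self.lookup[tag]]
        # L1 miss: ask the kernel via read(); the OS page cache (L2)
        # may still satisfy this without touching the disk.
        page = f"data-for-{tag}"               # stands in for the real read()
        slot = self.free.pop()                 # real PG evicts via clock sweep
        self.slots[slot] = page
        self.lookup[tag] = slot
        return page

mgr = BufferManager(4)
p1 = mgr.read_block((12345, "main", 42))   # miss: loaded through the kernel
p2 = mgr.read_block((12345, "main", 42))   # hit: served from shared_buffers
```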
Clock‑sweep algorithm in detail
PostgreSQL does not use a precise LRU (Least Recently Used) list. Instead it uses a clock‑sweep (also called the “second chance” or “clock” algorithm) to decide which buffer to evict when a new block must be read.
The buffer pool is treated as a circular array. Each buffer descriptor has two important counters: a usage_count, incremented each time the buffer is accessed (capped at 5), and a refcount (pin count), the number of backends currently using the buffer.
A global clock hand points to the next candidate buffer, and the eviction logic works like this:
while a victim has not been chosen:
    if buffer's refcount > 0:
        skip (pinned)
    else if buffer's usage_count > 0:
        usage_count -= 1      // "second chance"
    else:
        choose this buffer as victim
    advance the clock hand to the next buffer
This means a frequently accessed page has its usage_count incremented every time a backend uses the buffer (up to the cap of 5) and will survive multiple sweeps of the clock hand before it finally reaches 0 and is evicted. It approximates LRU with low overhead – there is no global doubly‑linked list to maintain and shuffle on every access.
Example
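A runnable sketch of the sweep, in Python. The buffer names and counts here are invented purely for illustration.

```python
# Minimal clock-sweep simulation: pinned buffers are skipped,
# hot buffers lose a "second chance", cold buffers get evicted.
class Buffer:
    def __init__(self, name):
        self.name = name
        self.usage_count = 0   # bumped on access (PostgreSQL caps it at 5)
        self.refcount = 0      # > 0 means pinned by a backend

def clock_sweep(buffers, hand=0):
    """Return (victim_index, new_hand_position)."""
    while True:
        pos = hand % len(buffers)
        buf = buffers[pos]
        hand += 1                      # the hand always advances
        if buf.refcount > 0:
            continue                   # pinned: never evict
        if buf.usage_count > 0:
            buf.usage_count -= 1       # "second chance"
            continue
        return pos, hand               # cold and unpinned: evict this one

bufs = [Buffer("A"), Buffer("B"), Buffer("C")]
bufs[0].usage_count = 2   # A is hot
bufs[1].refcount = 1      # B is pinned
# C is cold and unpinned, so it is chosen first:
victim, hand = clock_sweep(bufs)
print(bufs[victim].name)  # → C
```

Note how A survives the pass: its usage_count merely drops from 2 to 1, so it would take two more full sweeps before A becomes a candidate.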
This design avoids a central LRU lock – each backend simply bumps usage_count on the buffer it is using, via an atomic operation on that buffer's header – which is critical for concurrency in a multi‑backend database.
3. OS page cache – the kernel’s invisible help
Every read() call on a data file goes through Linux's page cache. So when PostgreSQL asks for block 42 of base/16384/12345, the kernel:
1. Checks whether that range of the file is already in the page cache.
2. If it is, copies it straight into PostgreSQL's buffer – a RAM‑to‑RAM copy, no disk I/O.
3. If not, reads it from disk, stores it in the page cache, and then copies it over.
This means a block evicted from shared_buffers may still sit in the OS cache. If PostgreSQL requests it again, it will be served from RAM – no disk I/O, just a kernel copy.
That’s the good side. The bad side: the same 8 kB block can consume 16 kB of RAM – one copy in shared_buffers, one in the page cache. This is the “double‑buffering” waste.
4. Read path – two hits, two misses
Let’s walk through a single block read request. There are three possible outcomes:
1. L1 hit – the block is already in shared_buffers: fastest path, no system call.
2. L1 miss, L2 hit – PostgreSQL issues a read(), but the kernel serves it from the page cache: still no physical disk I/O.
3. L1 miss, L2 miss – the kernel must go to disk: the slowest path.
Figure 2 – A miss in L1 may still be an L2 hit, avoiding physical disk I/O.
The planner uses effective_cache_size to estimate how much total cache (L1+L2) is available. It does not allocate memory – it’s a hint for the optimizer.
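The read path can be modeled as a toy two‑layer cache. Both caches are plain dicts here and every name is invented; the point is only to show the three outcomes and the double‑buffering effect.

```python
# Toy model of the two cache layers.
shared_buffers = {}   # L1: PostgreSQL's own cache
page_cache = {}       # L2: the kernel's page cache
stats = {"l1_hit": 0, "l2_hit": 0, "disk": 0}

def read_block(blkno):
    if blkno in shared_buffers:
        stats["l1_hit"] += 1          # fastest: no system call
        return shared_buffers[blkno]
    if blkno in page_cache:
        stats["l2_hit"] += 1          # RAM-to-RAM kernel copy, no disk I/O
        data = page_cache[blkno]
    else:
        stats["disk"] += 1            # physical disk read
        data = f"block-{blkno}"
        page_cache[blkno] = data      # the kernel caches what it reads
    shared_buffers[blkno] = data      # the block now lives in BOTH layers
    return data

read_block(42)            # cold: disk read
read_block(42)            # L1 hit
del shared_buffers[42]    # evicted from L1 (e.g. by the clock sweep)
read_block(42)            # L1 miss, but L2 hit: no disk I/O
print(stats)              # → {'l1_hit': 1, 'l2_hit': 1, 'disk': 1}
```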
Ring buffers – protecting the cache from sequential scans
Large sequential scans (and VACUUM) would flush all hot pages out of shared_buffers if they used the full pool. PostgreSQL uses a ring buffer – a tiny set of buffers (256 kB by default) – for these operations.
Figure – Ring buffer for a sequential scan. Most blocks end up only in the OS page cache, leaving shared_buffers hot pages untouched.
So a full table scan warms the OS page cache but hardly touches the main buffer pool. Subsequent scans (or lookups that miss the pool) can still enjoy L2 hits.
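A minimal sketch of the ring idea in Python. The slot count is chosen for illustration; the real ring is 256 kB, i.e. 32 pages of 8 kB.

```python
# A sequential scan cycles through a tiny fixed set of buffer slots
# instead of flooding the whole pool.
RING_SLOTS = 4

def seq_scan(n_blocks, ring_slots=RING_SLOTS):
    ring = [None] * ring_slots
    touched = set()
    for blkno in range(n_blocks):
        slot = blkno % ring_slots   # reuse the same few buffers round-robin
        ring[slot] = blkno
        touched.add(slot)
    return touched

# Scanning 10,000 blocks dirties only 4 buffer slots:
print(len(seq_scan(10_000)))  # → 4
```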
5. Inspecting what is really cached
Use the pg_buffercache extension to see which tables and indexes occupy shared_buffers:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS size_in_buffer
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;

Filtering on reldatabase matters: relfilenode values can collide across databases, so without it the join may mis‑attribute buffers.
To estimate what is in the OS page cache, use tools like vmtouch or the pgfincore extension (they use mincore() to inspect residency, and posix_fadvise() to prefetch or evict).
Summary – key takeaways
Double‑layer caching isn’t a flaw – it’s a conscious design choice that gives you two chances to avoid a disk read. Understand it, size it, and your PostgreSQL queries will fly.
(Side note: can the page cache be switched off with an fsync flag? Not quite – fsync only forces dirty pages to disk. Bypassing the page cache requires direct I/O (the O_DIRECT flag), which PostgreSQL exposes only through the experimental debug_io_direct setting added in PostgreSQL 16.)