The Physics of Software Time: The Hidden Complexity of Linux Timekeeping

There is a moment most engineers experience at least once.

Two services.

Two log files.

One sequence of events.

Event B happened six milliseconds before Event A.

Impossible.

Except it isn’t.

The logs are not lying.

The clocks are.

And every process on that host - every service, every application, every line of code - relies on those same clocks.

When one asks the time, they all hear the same approximation.

This article is about that approximation.

And about why trusting clocks blindly is one of the most subtle design mistakes in distributed systems.

By the end, you will not trust timestamps the way you used to.

That loss of trust will make you a better engineer.


The Assumption Nobody Questions

We grow up with a simple model of time:

  • It moves forward.
  • It moves at the same rate for everyone.
  • It is a number.

Engineering inherits that assumption.

Physics disagrees.

Every computer keeps time using a crystal oscillator. The crystal vibrates. The chip counts vibrations.

But crystals drift.

Temperature changes them. Age changes them. Manufacturing variance changes them.

Two machines started at the same moment will not stay synchronized. They can be milliseconds apart within hours: a typical crystal tolerance of tens of parts per million translates to tens of milliseconds of drift per hour.

Now scale that across continents.

Then ask those machines to agree on:

  • Payment ordering
  • Log sequencing
  • Conflict resolution
  • Cache invalidation

You are asking drifting hardware to produce a coherent history.

Linux knows this. That is why it does not give you one clock. It gives you four.


Four Clocks, Four Different Questions

Time is not one thing.

It depends on what you’re measuring.

1. CLOCK_REALTIME

Wall clock time.

Seconds since Jan 1, 1970.

This is what the date command shows. It is what humans mean by "time."

It normally ticks forward.

It can jump backward if NTP steps the clock.

It can jump forward instantly for the same reason.

An admin can change it.

Leap seconds affect it.

Never use this for durations.

Here is what happens when you do:

// Using REALTIME for timeout
struct timespec start, now;
clock_gettime(CLOCK_REALTIME, &start);  // Bug!

while (!packet_received()) {
    clock_gettime(CLOCK_REALTIME, &now);
    
    // If NTP stepped the clock backward during this loop:
    // now.tv_sec - start.tv_sec could be 0, or even negative
    // Your 5-second timeout might wait 5 minutes, or forever
    
    if ((now.tv_sec - start.tv_sec) >= 5) {
        printf("Timeout\n");
        break;
    }
}        

Why can it go backward?

NTP (Network Time Protocol) can step the clock - an instant jump, like twisting the crown on a watch to set a new time - when the error is too large to correct gradually. We'll explore exactly how this works later.

This is not theoretical. It happens in production.

2. CLOCK_MONOTONIC

Elapsed time since boot (roughly).

It never moves backward.

It's unaffected by NTP adjustments or admin changes.

Perfect for:

  • Timeouts
  • Intervals
  • Measuring durations

But it ignores suspend.

A laptop sleeps for 8 hours. CLOCK_MONOTONIC pretends nothing happened.

What every engineer actually uses it for:

// Using MONOTONIC for a network timeout
struct timespec start, now;
clock_gettime(CLOCK_MONOTONIC, &start);

while (!packet_received()) {
    clock_gettime(CLOCK_MONOTONIC, &now);
    
    // Has 5 seconds passed?
    if ((now.tv_sec - start.tv_sec) >= 5) {
        printf("Network timeout: no response in 5 seconds\n");
        break;
    }
    
    usleep(100); // Don't busy-wait
}        

3. CLOCK_MONOTONIC_RAW

It reads directly from the hardware counter (TSC).

No NTP adjustments. No smoothing.

It's always increasing.

Useful for:

  • Hardware profiling with calibration tools
  • Scientists measuring oscillator behavior

What breaks when you use it wrong:

struct timespec start, end;

// BUG: Using RAW for application timing
clock_gettime(CLOCK_MONOTONIC_RAW, &start);
process_video_frame();
clock_gettime(CLOCK_MONOTONIC_RAW, &end);

// If CPU frequency changed or temperature drifted:
// The time might not match real-world elapsed time
// Your performance measurements are skewed        

Almost never correct for application logic.

4. CLOCK_BOOTTIME

Like MONOTONIC, but includes suspend time.

It's always increasing. Not affected by NTP.

Useful when uptime must include sleep. Rarely correct otherwise.

When you actually need it:

// Using BOOTTIME for a backup scheduler
struct timespec start, now;
clock_gettime(CLOCK_BOOTTIME, &start);

while (1) {
    sleep(3600);  // Wait 1 hour
    clock_gettime(CLOCK_BOOTTIME, &now);
    
    // BOOTTIME keeps counting through suspend, so even if the machine
    // slept for part of the day, the 24-hour check below still fires on time
    if ((now.tv_sec - start.tv_sec) >= 86400) {
        run_daily_backup();
        start = now;
    }
}        

So why not always use BOOTTIME? Because most code has no reason to count time spent suspended, and it has historically been slightly slower to read (more kernel work than MONOTONIC).

In practice, that means:

  • Services that must run after suspend (cron jobs, backup schedulers)
  • Uptime monitoring that should include sleep
  • Anything that needs "total time since boot including naps"

Use CLOCK_BOOTTIME only when your code must account for system sleep. For most timeouts and intervals, MONOTONIC is what you want.


Four clocks. Four different answers to “what time is it?”

Most code never asks which question it is asking.


The Kernel’s Shortcut (vDSO)

Reading time millions of times per second is expensive.

A syscall costs roughly one to four microseconds on modern hardware. For most operations this is fine. For reading time millions of times per second, it adds up fast.

So Linux cheats.

The cheat is called the vDSO (virtual Dynamic Shared Object).

When a Linux process starts, the kernel maps a small region of its own memory into the process's address space. This region is read-only from the process's perspective. Inside it, the kernel continuously writes updated time information.

When your code calls clock_gettime(), the C library does not make a system call. Instead, it reads from this mapped region directly. The time is already there, in your process's own memory, maintained by the kernel behind the scenes.

50–100x faster.

Elegant.
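
If you want to see the gap yourself, here is a rough benchmark sketch. It times the ordinary clock_gettime() call (which glibc serves from the vDSO) against the same request forced through syscall(). The iteration count and the exact speedup are illustrative; results depend on your kernel, CPU, and mitigations.

// Rough benchmark: vDSO fast path vs. forced syscall path.
// Numbers are illustrative and vary across machines.
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

static double elapsed_ns(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void) {
    struct timespec t0, t1, ts;
    const long N = 1000000;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        clock_gettime(CLOCK_REALTIME, &ts);               // vDSO: no kernel entry
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("vDSO path:    %.1f ns/call\n", elapsed_ns(t0, t1) / N);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        syscall(SYS_clock_gettime, CLOCK_REALTIME, &ts);  // forced system call
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("syscall path: %.1f ns/call\n", elapsed_ns(t0, t1) / N);
    return 0;
}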

But the value inside that page is built on something else.

The TSC.


The Crystal That Drifts (TSC)

The Time Stamp Counter (TSC) is a special register in your CPU that increments on every clock cycle. On a 3 GHz processor, it increments three billion times per second. Reading it takes a single instruction.

Fast. Simple. But dangerous.

Early CPUs had a problem: The TSC frequency was tied to the CPU's current clock speed.

Speed up CPU → time moves faster.

Slow down CPU → time moves slower.

If the CPU slowed down to save power, the TSC would also slow down.

This made it unreliable for measuring real-world time.

Modern CPUs solved this with invariant TSC. This is a hardware feature that guarantees the TSC increments at a fixed frequency, usually tied to the processor's maximum speed or the system's reference clock. It no longer fluctuates with the CPU's dynamic clock speed.

This constant rate holds true whether the CPU is:

  • Running at full speed (P-state)
  • Idle in a sleep state (C-state)
  • Throttling for thermal management (T-state)

This means you can trust it for timekeeping. Mostly.
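
On Linux you can check whether your CPU advertises this by looking for the constant_tsc and nonstop_tsc flags in /proc/cpuinfo. And if you want to poke at the counter itself, here is a minimal sketch using the compiler's __rdtsc() intrinsic (x86 only). The 3 GHz frequency is a made-up placeholder; real code would calibrate the TSC against CLOCK_MONOTONIC instead of hard-coding it.

// Minimal sketch: read the TSC directly with the rdtsc intrinsic (x86 only).
// The 3.0 GHz frequency below is an assumed placeholder, not a measured value.
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   // __rdtsc()

int main(void) {
    const double assumed_tsc_hz = 3.0e9;    // placeholder frequency

    uint64_t start = __rdtsc();
    for (volatile long i = 0; i < 10000000; i++)
        ;                                    // some work to time
    uint64_t end = __rdtsc();

    uint64_t cycles = end - start;
    printf("cycles: %llu  (~%.3f ms at the assumed frequency)\n",
           (unsigned long long)cycles, cycles / assumed_tsc_hz * 1e3);
    return 0;
}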


The Art of Controlled Correction (NTP)

Even if every machine's clock were set perfectly today, the machines would not stay synchronized. Quartz crystals and oscillators are physical objects in a physical environment. They age. They respond to temperature. They simply do not stay exactly right forever.

NTP is the service that corrects them.

It runs on every production server, quietly doing its job that most engineers forget exists.

NTP periodically asks a trusted time server:

"What time is it, really?"

It measures how long the question and answer took to travel across the network, accounts for that delay, and then compares the answer to its own clock.

If your machine is 50 milliseconds ahead, NTP knows. Then it must decide how to fix it.

Two ways to correct time: Slewing and Stepping.

Normally, NTP uses slewing. It gradually speeds up or slows down your system clock - just a tiny bit - until it matches the real time.

From your application's perspective:

Time never jumps backward.

Time never jumps forward.

It just runs slightly faster or slower for a while.

A clock running at 110% speed for a few minutes to catch up is invisible to your code. Timeouts still fire. Logs stay ordered. Everything works.

This is the safe way to correct time.

But slewing has limits.

The kernel will not slew a clock by more than about 500 milliseconds by default. Why? Because fixing a large error by slewing would take too long. A clock that's 5 seconds slow would need to run at 110% speed for 50 seconds to catch up. That's 50 seconds of weirdness.

If the error is larger than ~500ms, NTP does something more drastic: it steps the clock.

That means instantly jumping forward or backward to the correct time.

// What happens during a step:
// 10:00:05.000 → (NTP says you're 200ms ahead) → 10:00:04.800
// Time just went backward 200ms in an instant.        

This is the dangerous mode.

Which means: CLOCK_REALTIME can jump. Even in production. Even while your code is running.
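
You can watch the kernel's correction machinery from user space. The sketch below is a read-only query through the adjtimex(2) interface: with modes set to zero nothing is changed, you just see whether the clock is considered synchronized and how it is currently being nudged.

// Read-only peek at the kernel's NTP correction state via adjtimex(2).
// modes = 0 means "query only": nothing is modified.
#include <stdio.h>
#include <sys/timex.h>

int main(void) {
    struct timex tx = { .modes = 0 };
    int state = adjtimex(&tx);

    printf("clock state    : %s\n",
           state == TIME_ERROR ? "not synchronized" : "synchronized");
    printf("slew frequency : %.3f ppm\n", tx.freq / 65536.0);   // freq is ppm scaled by 2^16
    printf("est. error     : %ld us\n", tx.esterror);
    printf("max  error     : %ld us\n", tx.maxerror);
    return 0;
}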

When the Network Disappears

NTP needs to talk to its time servers. It needs the network.

Network partition is when the network breaks into pieces that cannot communicate.

Imagine your office building has two wings. At 2:00 PM, someone locks the door between them.

  • You can still talk to people in your wing
  • You cannot reach anyone in the other wing
  • Both wings continue working, but they cannot share updates

That is a network partition.

During a network partition:

  • Your machine is running fine
  • NTP clients cannot reach their servers
  • No corrections happen at all
  • Your clock continues drifting, uncorrected
  • If the partition lasts an hour, your clock may be seconds off

When connectivity returns, NTP discovers the error. If it's large enough, it may step the clock - causing that sudden jump we just discussed.


When Clocks Travel (Distributed Systems)

So far, we've talked about time on a single machine.

But modern systems are not single machines. They are dozens, hundreds, sometimes thousands of machines working together. And they all need to agree on what happened when.

This is where time really starts to break.

Imagine two machines in the same data center:

  • Machine A in rack 1
  • Machine B in rack 2

Both run NTP. Both are synchronized to within 3 milliseconds of the true time. This is considered good. Many data centers operate with 1–5ms skew between machines.

Now, what does "within 3ms of true time" actually mean?

It means:

  • Machine A's clock might be 2ms ahead of reality
  • Machine B's clock might be 1ms behind reality
  • The difference between them: 3ms

They disagree by 3 milliseconds.

For a human, this is irrelevant. You cannot perceive 3ms. Your brain doesn't register it.

For software that orders events by comparing timestamps, though, 3ms of skew is more than enough to cause serious issues.

Let's take a banking example.

Two transactions hit the same account at nearly the same moment. The account starts with a balance of ₹50:

  • Transaction X: Check if balance ≥ ₹100, then withdraw ₹100
  • Transaction Y: Deposit ₹100

Now, the order matters:

If Y happens first (deposit then withdraw):

  1. Y deposits ₹100 → balance becomes ₹50 + ₹100 = ₹150
  2. X checks: ₹150 ≥ ₹100 → withdraws ₹100 → balance becomes ₹50
  3. Result: X succeeds, final balance ₹50

If X happens first (withdraw then deposit):

  1. X checks: ₹50 < ₹100 → withdrawal fails, balance remains ₹50
  2. Y deposits ₹100 → balance becomes ₹150
  3. Result: X fails, final balance ₹150

Same two transactions, different order, completely different outcomes.

This is a race condition. And when machines disagree on timestamps due to clock skew, the database may apply the transactions in the wrong order - leading to incorrect business logic.
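
To make the clock-skew half of this concrete, here is a toy model. The skew values are invented and there is no real database involved; the only point is that comparing timestamps written by two different machines can reverse the true order of events.

// Toy model: two machines stamp events with their own skewed clocks.
// The skews are made up; the reversal of order is the point.
#include <stdio.h>
#include <stdint.h>

int main(void) {
    const int64_t skew_a_us = +2000;   // machine A runs 2 ms ahead of true time
    const int64_t skew_b_us = -1000;   // machine B runs 1 ms behind true time

    // True times (microseconds): the withdrawal really happens first.
    const int64_t real_withdraw_us = 1000000;   // X, handled on machine A
    const int64_t real_deposit_us  = 1000500;   // Y, handled on machine B, 0.5 ms later

    const int64_t stamp_x = real_withdraw_us + skew_a_us;   // timestamp written by A
    const int64_t stamp_y = real_deposit_us  + skew_b_us;   // timestamp written by B

    printf("true order    : X (withdraw) then Y (deposit)\n");
    printf("by timestamps : %s\n",
           stamp_x < stamp_y ? "X then Y (correct)" : "Y then X (reversed!)");
    return 0;
}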

How Google Solved This

Google runs some of the largest distributed systems in the world. Spanner is their globally distributed database. It spans data centers across continents.

They needed a way to order transactions correctly, even with clock skew.

Their solution is called TrueTime.

TrueTime does not return a single number. It returns an interval.

Instead of saying: "The current time is 10:00:05.000"

TrueTime says: "The current time is somewhere between 10:00:04.997 and 10:00:05.003"

The width of that interval is the uncertainty. Google knows how much their clocks might be wrong (thanks to GPS and atomic clocks in each data center), and they expose that uncertainty directly.

TrueTime interval:
  earliest = now - uncertainty
  latest   = now + uncertainty

Example:
  earliest = 10:00:04.997
  latest   = 10:00:05.003
  width    = 6 milliseconds        

How TrueTime Orders Events:

When Spanner wants to commit a transaction, it does something clever:

  1. Read the current TrueTime interval
  2. Use the latest possible time as the commit timestamp
  3. Wait until that timestamp is guaranteed to be in the past

The rule: a new commit's earliest possible time must be greater than the previous commit's latest - earliest(new) > latest(previous) - and the wait is what guarantees it.

Let's see this in action:

Transaction 1 commits:

  • Reads TrueTime: [10:00:04.997, 10:00:05.003]
  • Chooses commit timestamp: 10:00:05.003 (the latest)
  • Waits until real time > 10:00:05.003
  • Then commits

Transaction 2 starts after Transaction 1 finishes:

  • Reads TrueTime: [10:00:05.004, 10:00:05.010]
  • Earliest possible time (10:00:05.004) is greater than Transaction 1's latest (10:00:05.003)
  • Guaranteed: Transaction 2 happened after Transaction 1

The waiting ensures that even if clocks are wrong, the ordering is correct.
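
Here is a toy sketch of that commit-wait idea in plain C. It is not Spanner's API: the 5 ms uncertainty bound is an invented constant, and a real system would sleep rather than spin. What matters is the shape of the logic - choose the latest possible time, then wait until even the earliest possible "now" has moved past it.

// Toy commit-wait sketch, assuming a known bound on clock error (made up: 5 ms).
// Not Spanner's API; just the shape of the idea.
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define UNCERTAINTY_NS 5000000LL   // assumed worst-case clock error: 5 ms

typedef struct { int64_t earliest; int64_t latest; } tt_interval;

static int64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static tt_interval tt_now(void) {
    int64_t n = now_ns();
    return (tt_interval){ n - UNCERTAINTY_NS, n + UNCERTAINTY_NS };
}

// Pick the latest possible time as the commit timestamp, then wait until
// even the earliest possible "now" is past it.
static int64_t commit_with_wait(void) {
    int64_t commit_ts = tt_now().latest;
    while (tt_now().earliest <= commit_ts)
        ;   // a real system would sleep here, not spin
    return commit_ts;
}

int main(void) {
    int64_t t1 = commit_with_wait();
    int64_t t2 = commit_with_wait();
    printf("second commit strictly after first: %s\n", t2 > t1 ? "yes" : "no");
    return 0;
}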

TrueTime forces you to think about time differently.

Now time is not a point, it is a range.

You probably don't have atomic clocks in your data centers. You probably can't implement TrueTime exactly.

But you can absorb its lesson:

If your system orders events by comparing timestamps from different machines, you are relying on luck.

Because even in the best-managed data centers, clocks disagree by milliseconds. And milliseconds are enough to corrupt order.


The Timestamp Fallacy

The most dangerous assumption in distributed systems:

“A timestamp represents reality.”

It does not.

A timestamp represents what one machine believed at one moment, based on its own imperfect clock, after whatever NTP corrections happened to have been applied, with whatever drift it had accumulated since the last sync.

Timestamps are evidence, not truth.

Correct systems treat time as:

  • Approximate - Never assume it's exact
  • Local - Don't compare across machines without accounting for uncertainty
  • Bounded by uncertainty - Know how wrong your clocks might be

If you need strict ordering across machines, don't use wall-clock timestamps. Use tools designed for ordering:

  • Logical clocks (Lamport timestamps, vector clocks) that track causality, not wall time
  • Consensus algorithms (Paxos, Raft) that agree on order through communication
  • Sequencers that assign monotonically increasing IDs
  • CRDTs that make order irrelevant

These tools do not trust the clock. They trust math and communication.
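
To give a flavor of what trusting math instead of the clock looks like, here is a minimal Lamport clock sketch. The two processes are illustrative; notice that no wall-clock value appears anywhere, yet the receive timestamp is always greater than the send timestamp.

// Minimal Lamport clock: ordering comes from counters and messages,
// never from the wall clock. Process names are illustrative.
#include <stdio.h>
#include <stdint.h>

typedef struct { uint64_t counter; } lamport_clock;

// Local event (including sending a message): tick and return the stamp.
static uint64_t lamport_tick(lamport_clock *c) {
    return ++c->counter;
}

// Receiving a message: jump to max(local, remote), then tick.
static uint64_t lamport_receive(lamport_clock *c, uint64_t remote_ts) {
    if (remote_ts > c->counter)
        c->counter = remote_ts;
    return ++c->counter;
}

int main(void) {
    lamport_clock a = {0}, b = {0};

    uint64_t send_ts = lamport_tick(&a);               // A does work, sends to B
    uint64_t recv_ts = lamport_receive(&b, send_ts);   // B receives A's message

    // recv_ts > send_ts regardless of what either machine's clock says
    printf("send=%llu  recv=%llu\n",
           (unsigned long long)send_ts, (unsigned long long)recv_ts);
    return 0;
}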


The Four Clocks Cheat Sheet

// Rotate your phone if table is distorted.
┌──────────────────────────────────────────────────────┐
│             THE FOUR CLOCKS CHEAT SHEET              │
├──────────────────────────────────────────────────────┤
│ CLOCK_REALTIME      │ Wall time (adjustable)         │
│                     │ Use for: logs, displays        │
│                     │ Never use for: durations       │
├──────────────────────────────────────────────────────┤
│ CLOCK_MONOTONIC     │ Elapsed time (never backwards) │
│                     │ Use for: timeouts, intervals   │
│                     │ Never use for: cross-machine   │
├──────────────────────────────────────────────────────┤
│ CLOCK_MONOTONIC_RAW │ Raw hardware (no NTP adj)      │
│                     │ Use for: hardware profiling    │
│                     │ Never use for: application     │
├──────────────────────────────────────────────────────┤
│ CLOCK_BOOTTIME      │ Includes suspend time          │
│                     │ Use for: uptime across sleep   │
│                     │ Never use for: most things     │
└──────────────────────────────────────────────────────┘

The number your system returns when you ask for the time is not the time.

It is a carefully maintained estimate. Built from:

  • Crystal physics
  • CPU counters
  • Kernel math
  • Network corrections
  • Distributed compromise

The clock is almost always right.

Almost always is not always.

And "almost always" is not good enough for systems that depend on ordering events across machines.

The clock is always running.

It is just not always telling the truth.

If you enjoyed this, I write about systems engineering, Linux internals, and the evolving relationship between software and hardware. Follow for more deep dives on operating system architecture.






