The Physics of Software Time: The Hidden Complexity of Linux Timekeeping
There is a moment most engineers experience at least once.
Two services.
Two log files.
One sequence of events.
Event B happened six milliseconds before Event A.
Impossible.
Except it isn’t.
The logs are not lying.
The clocks are.
And every process on that host - every service, every application, every line of code - relies on those same clocks.
When one asks the time, they all hear the same approximation.
This article is about that approximation.
And about why trusting clocks blindly is one of the most subtle design mistakes in distributed systems.
By the end, you will not trust timestamps the way you used to.
That loss of trust will make you a better engineer.
The Assumption Nobody Questions
We grow up with a simple model of time: one universal number, ticking forward at a steady rate, the same for everyone.
Engineering inherits that assumption.
Physics disagrees.
Every computer keeps time using a crystal oscillator. The crystal vibrates. The chip counts vibrations.
But crystals drift.
Temperature changes them. Age changes them. Manufacturing variance changes them.
Two machines started at the same time will not stay synchronized. They can be milliseconds apart within hours.
Now scale that across continents.
Then ask those machines to agree on the order of events.
You are asking drifting hardware to produce a coherent history.
Linux knows this. That is why it does not give you one clock. It gives you four.
Four Clocks, Four Different Questions
Time is not one thing.
It depends on what you’re measuring.
1. CLOCK_REALTIME
Wall clock time.
Seconds since Jan 1, 1970.
This is what the date command shows. What humans mean by "time."
It can move forward.
It can jump backward if NTP steps.
It can jump forward instantly if NTP steps.
An admin can change it.
Leap seconds affect it.
Never use this for durations.
Here is what happens when you do:
// Using REALTIME for timeout
struct timespec start, now;
clock_gettime(CLOCK_REALTIME, &start); // Bug!
while (!packet_received()) {
clock_gettime(CLOCK_REALTIME, &now);
// If NTP stepped the clock backward during this loop:
// now.tv_sec - start.tv_sec could be 0, or even negative
// Your 5-second timeout might wait 5 minutes, or forever
if ((now.tv_sec - start.tv_sec) >= 5) {
printf("Timeout\n");
break;
}
}
Why can it go backward?
NTP (Network Time Protocol) can step the clock - jumping it instantly, the way you reset a watch by turning the dial - to correct large errors. We'll explore exactly how this works later.
This is not theoretical. It happens in production.
2. CLOCK_MONOTONIC
Elapsed time since boot (roughly).
It never moves backward.
It's unaffected by NTP adjustments or admin changes.
Perfect for timeouts, intervals, and measuring how long things take.
But it ignores suspend.
A laptop sleeps for 8 hours. CLOCK_MONOTONIC pretends nothing happened.
What every engineer actually uses it for:
// Using MONOTONIC for a network timeout
struct timespec start, now;
clock_gettime(CLOCK_MONOTONIC, &start);
while (!packet_received()) {
clock_gettime(CLOCK_MONOTONIC, &now);
// Has 5 seconds passed?
if ((now.tv_sec - start.tv_sec) >= 5) {
printf("Network timeout: no response in 5 seconds\n");
break;
}
usleep(100); // Don't busy-wait
}
3. CLOCK_MONOTONIC_RAW
It reads directly from the hardware counter (TSC).
No NTP adjustments. No smoothing.
It's always increasing.
Useful for hardware-level profiling and for measuring how much NTP is adjusting your other clocks.
What breaks when you use it wrong:
struct timespec start, end;
// BUG: Using RAW for application timing
clock_gettime(CLOCK_MONOTONIC_RAW, &start);
process_video_frame();
clock_gettime(CLOCK_MONOTONIC_RAW, &end);
// If CPU frequency changed or temperature drifted:
// The time might not match real-world elapsed time
// Your performance measurements are skewed
Almost never correct for application logic.
4. CLOCK_BOOTTIME
Like MONOTONIC, but includes suspend time.
It's always increasing. Not affected by NTP.
Useful when uptime must include sleep. Rarely correct otherwise.
When you actually need it:
// Using BOOTTIME for a backup scheduler
struct timespec start, now;
clock_gettime(CLOCK_BOOTTIME, &start);
while (1) {
sleep(3600); // Wait 1 hour
clock_gettime(CLOCK_BOOTTIME, &now);
// Even if system suspended, this correctly shows
// that 1+ hours have passed
if ((now.tv_sec - start.tv_sec) >= 86400) {
run_daily_backup();
start = now;
}
}
So why not always use BOOTTIME? Because most timeouts should not count time the system spent asleep - a 5-second network timeout that straddles a suspend should not fire the instant the machine wakes. (And on older kernels, BOOTTIME required a real syscall, making it slower to read.)
Use CLOCK_BOOTTIME only when your code must account for system sleep. For most timeouts and intervals, MONOTONIC is what you want.
Four clocks. Four different answers to “what time is it?”
Most code never asks which question it is asking.
The Kernel’s Shortcut (vDSO)
Reading time millions of times per second is expensive.
A system call costs on the order of a microsecond on modern hardware - more with certain security mitigations enabled. For most operations this is fine. For reading time millions of times per second, it adds up fast.
So Linux cheats.
The trick is called the vDSO (virtual Dynamic Shared Object).
When a Linux process starts, the kernel maps a small region of its own memory into the process's address space. This region is read-only from the process's perspective. Inside it, the kernel continuously writes updated time information.
When your code calls clock_gettime(), the C library does not make a system call. Instead, it reads from this mapped region directly. The time is already there, in your process's own memory, maintained by the kernel behind the scenes.
50–100x faster.
Elegant.
But the value inside that page is built on something else.
The TSC.
The Crystal That Drifts (TSC)
The Time Stamp Counter (TSC) is a special register in your CPU that increments on every clock cycle. On a 3 GHz processor, it increments three billion times per second. Reading it takes a single instruction.
Fast. Simple. But dangerous.
Early CPUs had a problem: the TSC frequency was tied to the CPU's current clock speed.
Speed up CPU → time moves faster.
Slow down CPU → time moves slower.
If the CPU slowed down to save power, the TSC would also slow down.
This made it unreliable for measuring real-world time.
Modern CPUs solved this with invariant TSC. This is a hardware feature that guarantees the TSC increments at a fixed frequency, usually tied to the processor's maximum speed or the system's reference clock. It no longer fluctuates with the CPU's dynamic clock speed.
This constant rate holds whether the CPU is boosting, idling, or thermally throttled.
This means you can trust it for timekeeping. Mostly.
The Art of Controlled Correction (NTP)
Even machines that start perfectly synchronized will not stay that way. Quartz crystals and oscillators are physical objects in a physical environment. They age. They respond to temperature. They simply do not stay exactly right forever.
NTP is the service that corrects them.
It runs on every production server, quietly doing a job most engineers forget exists.
NTP periodically asks a trusted time server:
"What time is it, really?"
It measures how long the question and answer took to travel across the network, accounts for that delay, and then compares the answer to its own clock.
If your machine is 50 milliseconds ahead, NTP knows. Then it must decide how to fix it.
Two ways to correct time: Slewing and Stepping.
Normally, NTP uses slewing. It gradually speeds up or slows down your system clock - just a tiny bit - until it matches the real time.
From your application's perspective:
Time never jumps backward.
Time never jumps forward.
It just runs slightly faster or slower for a while.
A clock running slightly fast for a while to catch up is invisible to your code. Timeouts still fire. Logs stay ordered. Everything works.
This is the safe way to correct time.
But slewing has limits.
The kernel caps the slew rate at roughly 500 parts per million - about half a millisecond of correction per second. Why? Because applications assume the clock runs at very nearly one second per second. But the cap makes slewing slow: a clock that's 5 seconds off would need hours to slew back into line.
If the error is too large to slew away quickly (classic ntpd steps by default once the offset exceeds 128 milliseconds), NTP does something more drastic: it steps the clock.
That means instantly jumping forward or backward to the correct time.
// What happens during a step:
// 10:00:05.000 → (NTP says you're 200ms ahead) → 10:00:04.800
// Time just went backward 200ms in an instant.
This is the dangerous mode.
Which means: CLOCK_REALTIME can jump. Even in production. Even while your code is running.
When the Network Disappears
NTP needs to talk to its time servers. It needs the network.
A network partition is when the network breaks into pieces that cannot communicate with each other.
Imagine your office building has two wings. At 2:00 PM, someone locks the door between them.
That is a network partition.
During a network partition, NTP cannot reach its servers. The local clock is on its own, silently accumulating drift with nothing to correct it.
When connectivity returns, NTP discovers the error. If it's large enough, it may step the clock - causing that sudden jump we just discussed.
When Clocks Travel (Distributed Systems)
So far, we've talked about time on a single machine.
But modern systems are not single machines. They are dozens, hundreds, sometimes thousands of machines working together. And they all need to agree on what happened when.
This is where time really starts to break.
Imagine two machines in the same data center: call them Machine A and Machine B.
Both run NTP. Both are synchronized to within 3 milliseconds of the true time. This is considered good. Many data centers operate with 1–5ms skew between machines.
Now, what does "within 3ms of true time" actually mean?
It means each clock can be up to 3 milliseconds fast or slow. Machine A might read 10:00:00.003 at the instant Machine B reads 10:00:00.000.
They disagree by 3 milliseconds.
For a human, this is irrelevant. You cannot perceive 3ms. Your brain doesn't register it.
In practice, 3ms of skew can cause serious issues.
Let's take a banking example.
Two transactions hit the same account at nearly the same moment, in an account with an initial balance of ₹50 (the amounts here are illustrative):
Transaction X: withdraw ₹100.
Transaction Y: deposit ₹100.
Now, the order matters:
If Y happens first (deposit then withdraw): ₹50 becomes ₹150, the withdrawal succeeds, and the balance ends at ₹50.
If X happens first (withdraw then deposit): the withdrawal of ₹100 from ₹50 is declined for insufficient funds, and the balance ends at ₹150.
Same two transactions, different order, completely different outcomes.
This is a race condition. And when machines disagree on timestamps due to clock skew, the database may apply the transactions in the wrong order - leading to incorrect business logic.
How Google Solved This
Google runs some of the largest distributed systems in the world. Spanner is their globally distributed database. It spans data centers across continents.
They needed a way to order transactions correctly, even with clock skew.
Their solution is called TrueTime.
TrueTime does not return a single number. It returns an interval.
Instead of saying: "The current time is 10:00:05.000"
TrueTime says: "The current time is somewhere between 10:00:04.997 and 10:00:05.003"
The width of that interval is the uncertainty. Google knows how much their clocks might be wrong (thanks to GPS and atomic clocks in each data center), and they expose that uncertainty directly.
TrueTime interval:
earliest = now - uncertainty
latest = now + uncertainty
Example:
earliest = 10:00:04.997
latest = 10:00:05.003
width = 6 milliseconds
How TrueTime Orders Events:
When Spanner wants to commit a transaction, it does something clever: it assigns the commit a timestamp at the latest edge of its TrueTime interval, then deliberately waits until that moment has definitely passed - on every clock in the system - before making the commit visible. Google calls this commit wait.
The rule is: Wait until earliest(commit) > latest(previous)
Let's see this in action:
Transaction 1 commits: its interval is [10:00:04.997, 10:00:05.003]. It takes timestamp 10:00:05.003 and waits roughly 6 milliseconds before reporting success.
Transaction 2 starts after Transaction 1 finishes: by then, even the most pessimistic clock agrees the true time is past 10:00:05.003, so Transaction 2 is guaranteed a larger timestamp.
The waiting ensures that even if clocks are wrong, the ordering is correct.
TrueTime forces you to think about time differently.
Now time is not a point, it is a range.
You probably don't have atomic clocks in your data centers. You probably can't implement TrueTime exactly.
But you can absorb its lesson:
If your system orders events by comparing timestamps from different machines, you are relying on luck.
Because even in the best-managed data centers, clocks disagree by milliseconds. And milliseconds are enough to corrupt order.
The Timestamp Fallacy
The most dangerous assumption in distributed systems:
“A timestamp represents reality.”
It does not.
A timestamp represents what one machine believed at one moment, based on its own imperfect clock, after whatever NTP corrections happened to have been applied, with whatever drift it had accumulated since the last sync.
Timestamps are evidence, not truth.
Correct systems treat time as an estimate with error bars, not as ground truth.
If you need strict ordering across machines, don't use wall-clock timestamps. Use tools designed for ordering: logical clocks such as Lamport timestamps, vector clocks, or consensus protocols like Raft and Paxos.
These tools do not trust the clock. They trust math and communication.
The Four Clocks Cheat Sheet
// Rotate your phone if table is distorted.
┌─────────────────────────────────────────────────────┐
│ THE FOUR CLOCKS CHEAT SHEET │
├─────────────────────────────────────────────────────┤
│ CLOCK_REALTIME │ Wall time (adjustable) │
│ │ Use for: logs, displays │
│ │ Never use for: durations │
├─────────────────────────────────────────────────────┤
│ CLOCK_MONOTONIC │ Elapsed time (never backwards) │
│ │ Use for: timeouts, intervals │
│ │ Never use for: cross-machine │
├─────────────────────────────────────────────────────┤
│ CLOCK_MONOTONIC_RAW│ Raw hardware (no NTP adj) │
│ │ Use for: hardware profiling │
│ │ Never use for: application │
├─────────────────────────────────────────────────────┤
│ CLOCK_BOOTTIME │ Includes suspend time │
│ │ Use for: uptime across sleep │
│ │ Never use for: most things │
└─────────────────────────────────────────────────────┘
The number your system returns when you ask for the time is not the time.
It is a carefully maintained estimate. Built from a drifting crystal, a counting register, kernel bookkeeping, NTP corrections, and a page of shared memory.
The clock is almost always right.
Almost always is not always.
And "almost always" is not good enough for systems that depend on ordering events across machines.
The clock is always running.
It is just not always telling the truth.
If you enjoyed this, I write about systems engineering, Linux internals, and the evolving relationship between software and hardware. Follow for more deep dives on operating system architecture.