A Subtle but Intriguing Detail of Nested Virtualization
Nested virtualization is a cool feature of Linux KVM that can, in theory, host infinitely many layers of nested VMs. KVM virtualizes the hardware so faithfully that a hypervisor running inside a VM doesn't even know it is in a VM, and behaves as if it were hosting VMs on a bare-metal machine. Imagine a Matrix inside a Matrix. What a cool technology! Nested virtualization was first introduced by the Turtles project [1], and it is now fully functional in KVM.
I have been studying the KVM source code for a while, and there are still many parts I don't understand. To keep a record of what I have learned, I wrote this article about a small but interesting implementation detail of nested virtualization: nested virtual memory management.
As you may know, a hypervisor uses Second Level Address Translation (SLAT) to translate Guest Physical Addresses (GPAs) to Host Physical Addresses (HPAs). On Intel CPUs, the Extended Page Table (EPT) implements SLAT. Once the hypervisor builds the EPT and registers it with the CPU, the MMU seamlessly translates GPAs to HPAs while the guest VM runs.
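To make this concrete, here is a minimal sketch (in C, not actual KVM code) of what a 4-level EPT walk conceptually does; read_phys() is a hypothetical helper, and large pages and permission checks are ignored for brevity:

    #include <stdint.h>

    #define PAGE_SHIFT       12
    #define ADDR_MASK        0x000FFFFFFFFFF000ULL
    #define EPT_PRESENT_MASK 0x7ULL   /* any of the read/write/execute bits */

    /* Hypothetical helper: read an 8-byte entry from host physical memory. */
    uint64_t read_phys(uint64_t hpa);

    /* Resolve a GPA to an HPA, or return 0 where real hardware would raise
     * an EPT violation because an entry is missing. */
    uint64_t ept_walk(uint64_t eptp, uint64_t gpa)
    {
        uint64_t table = eptp & ADDR_MASK;

        /* PML4 -> PDPT -> PD -> PT: each level consumes 9 bits of the GPA. */
        for (int level = 3; level >= 0; level--) {
            uint64_t index = (gpa >> (PAGE_SHIFT + 9 * level)) & 0x1FF;
            uint64_t entry = read_phys(table + index * 8);

            if (!(entry & EPT_PRESENT_MASK))
                return 0;   /* EPT violation: entry not present */
            table = entry & ADDR_MASK;
        }
        return table | (gpa & ((1ULL << PAGE_SHIFT) - 1));
    }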
However, things get quite complicated with nested virtualization. The hypervisor running inside a VM (i.e., the L1 hypervisor) builds an EPT in its own physical address space and tries to run its nested VM (i.e., the L2 guest) with it. That EPT therefore translates the nested VM's physical addresses to the L1 VM's physical addresses (i.e., L2 GPA to L1 GPA), so it cannot work on bare metal. Let me call this EPT the L1 EPT.
To handle this, KVM (i.e., the L0 hypervisor) internally builds a new EPT for the nested VM, called the shadow EPT. This EPT translates L2 GPAs directly to HPAs so that the MMU can seamlessly handle the L2 VM's memory accesses. Upon an EPT violation caused by a missing shadow EPT entry, KVM walks the L1 EPT that the L1 hypervisor (virtually) registered, finds the relevant entry (which translates the L2 GPA to an L1 GPA), then looks up the corresponding HPA and populates the shadow EPT entry (i.e., L2 GPA to HPA).
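In code, the violation handler conceptually performs this two-step translation. The following is a hedged sketch with hypothetical helper names (walk_l1_ept, l1_gpa_to_hpa, install_shadow_entry), not KVM's real functions:

    #include <stdint.h>

    uint64_t walk_l1_ept(uint64_t l1_eptp, uint64_t l2_gpa);  /* L2 GPA -> L1 GPA */
    uint64_t l1_gpa_to_hpa(uint64_t l1_gpa);                  /* L1 GPA -> HPA, via L0's own mapping */
    void install_shadow_entry(uint64_t l2_gpa, uint64_t hpa); /* populate the shadow EPT */

    void handle_shadow_ept_violation(uint64_t l1_eptp, uint64_t l2_gpa)
    {
        /* Step 1: walk the EPT that the L1 hypervisor built. */
        uint64_t l1_gpa = walk_l1_ept(l1_eptp, l2_gpa);

        /* Step 2: translate the L1 GPA to a real HPA using L0's mapping of the L1 VM. */
        uint64_t hpa = l1_gpa_to_hpa(l1_gpa);

        /* Now the MMU can translate this L2 GPA directly to the HPA. */
        install_shadow_entry(l2_gpa, hpa);
    }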
Finally, I've come to what I really want to say: how should the invept (invalidate EPT) instruction be emulated? invept flushes all EPT translations cached in hardware (i.e., the EPT TLB) so that the MMU walks the (possibly modified) EPT again. What should happen when the L1 hypervisor executes invept? Naively re-executing invept in the L0 hypervisor doesn't work, because the MMU would only re-walk the shadow EPT, not the modified L1 EPT.
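For reference, this is roughly how a hypervisor issues invept, shown with GCC inline assembly for illustration (the descriptor layout and type values follow the Intel SDM; the function name is mine, and the instruction faults unless executed in VMX root mode):

    #include <stdint.h>

    /* 128-bit invept descriptor: the EPTP to invalidate, plus a reserved field. */
    struct invept_desc {
        uint64_t eptp;
        uint64_t reserved;
    };

    static inline void invept_single_context(uint64_t eptp)
    {
        struct invept_desc desc = { .eptp = eptp, .reserved = 0 };
        uint64_t type = 1;   /* 1 = single-context, 2 = all-context */

        asm volatile("invept %0, %1"
                     : : "m"(desc), "r"(type) : "cc", "memory");
    }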
KVM is, of course, able to emulate this instruction. The key idea, in its simplest form, is that KVM just discards the shadow EPT when invept is executed and rebuilds it through the EPT violations that occur again while the nested VM runs. This way, the updated L1 EPT is naturally reflected in the rebuilt shadow EPT.
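In pseudocode-level C, the naive emulation is just the following; struct nested_vm, free_shadow_ept(), and flush_hw_ept_tlb() are hypothetical stand-ins for KVM's real zapping and flushing machinery:

    struct nested_vm;

    void free_shadow_ept(struct nested_vm *vm);  /* drop every shadow EPT page */
    void flush_hw_ept_tlb(struct nested_vm *vm); /* real invept on the shadow EPTP */

    void emulate_invept_naive(struct nested_vm *vm)
    {
        free_shadow_ept(vm);    /* the stale L2 GPA -> HPA mappings are gone */
        flush_hw_ept_tlb(vm);   /* and so are the hardware-cached translations */
        /* Subsequent L2 memory accesses fault into L0 as EPT violations,
         * which repopulate the shadow EPT from the updated L1 EPT. */
    }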
The remaining question is how to optimize this. For that, KVM introduces the notion of synced and unsynced shadow EPT pages: a synced shadow EPT page is one that reflects the latest contents of its L1 EPT page. When KVM creates a shadow EPT page for a specific L1 EPT page, it marks the shadow page synced and makes the L1 EPT page read-only. KVM therefore traps whenever the L1 hypervisor writes to an L1 EPT page, and it unsyncs the corresponding shadow EPT page. When invept is executed, only the unsynced shadow EPT pages are removed, and they are repopulated from the fresh L1 EPT pages upon subsequent EPT violations. That's all. This keeps the performance overhead minimal. What a fascinating solution!
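Here is a simplified model of that bookkeeping with hypothetical names; in real KVM, this role is played by struct kvm_mmu_page, which carries an unsync flag:

    #include <stdbool.h>
    #include <stdint.h>

    /* One shadow EPT page and the L1 EPT page it mirrors. */
    struct shadow_ept_page {
        uint64_t l1_ept_gpa;           /* GPA of the L1 EPT page being shadowed */
        bool     unsync;               /* L1 page modified since the last sync? */
        struct shadow_ept_page *next;
    };

    /* Hypothetical helpers standing in for KVM's page-protection machinery. */
    void write_protect_l1_page(uint64_t l1_ept_gpa);
    void unprotect_l1_page(uint64_t l1_ept_gpa);
    void zap_shadow_page(struct shadow_ept_page *sp);

    /* Creating a shadow page: mark it synced and write-protect the L1 page,
     * so any later modification by the L1 hypervisor traps into L0. */
    void on_shadow_page_created(struct shadow_ept_page *sp)
    {
        sp->unsync = false;
        write_protect_l1_page(sp->l1_ept_gpa);
    }

    /* Trap on a write to a protected L1 EPT page: mark the shadow page
     * unsynced and let L1 write freely; nothing needs flushing yet. */
    void on_l1_ept_page_write(struct shadow_ept_page *sp)
    {
        sp->unsync = true;
        unprotect_l1_page(sp->l1_ept_gpa);
    }

    /* Emulated invept: only the unsynced shadow pages are zapped; synced
     * pages still match their L1 counterparts and survive the flush. */
    void emulate_invept(struct shadow_ept_page *shadow_pages)
    {
        for (struct shadow_ept_page *sp = shadow_pages; sp; sp = sp->next)
            if (sp->unsync)
                zap_shadow_page(sp);
    }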
KVM has lots of fascinating stuff. Next time, I hope to talk about KVM's interrupt virtualization.
[1] Ben-Yehuda, Muli, et al. "The Turtles Project: Design and Implementation of Nested Virtualization." 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI 10). 2010.