A Subtle but Intriguing Detail of Nested Virtualization
Nested virtualization is a cool feature of Linux KVM that can, in theory, host infinitely many layers of nested VMs. KVM virtualizes the hardware so faithfully that a hypervisor running inside a VM doesn't even know it is in a VM, and behaves as if it were hosting VMs on a bare-metal machine. Imagine a Matrix inside a Matrix. What a cool technology! Nested virtualization was first introduced by the Turtles project [1], and it is now fully functional in KVM.
I have been studying the KVM source code for a while, and there are still many parts I don't understand. To keep a record of what I have learned, I wrote this article about a small but interesting implementation detail of nested virtualization: nested virtual memory management.
As you may know, a hypervisor uses Second Level Address Translation (SLAT) to translate Guest Physical Addresses (GPAs) to Host Physical Addresses (HPAs). On Intel CPUs, the Extended Page Table (EPT) implements SLAT. Once the hypervisor builds the EPT and registers it with the CPU, the MMU seamlessly translates GPAs to HPAs while the guest VM runs.
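To make this concrete, here is a minimal sketch (in C, not actual KVM code) of what a 4-level EPT walk conceptually does; read_phys() is a hypothetical helper, and large pages and permission checks are ignored for brevity:

    #include <stdint.h>

    #define PAGE_SHIFT       12
    #define ADDR_MASK        0x000FFFFFFFFFF000ULL
    #define EPT_PRESENT_MASK 0x7ULL   /* any of the read/write/execute bits */

    /* Hypothetical helper: read an 8-byte entry from host physical memory. */
    uint64_t read_phys(uint64_t hpa);

    /* Resolve a GPA to an HPA, or return 0 where real hardware would raise
     * an EPT violation because an entry is missing. */
    uint64_t ept_walk(uint64_t eptp, uint64_t gpa)
    {
        uint64_t table = eptp & ADDR_MASK;

        /* PML4 -> PDPT -> PD -> PT: each level consumes 9 bits of the GPA. */
        for (int level = 3; level >= 0; level--) {
            uint64_t index = (gpa >> (PAGE_SHIFT + 9 * level)) & 0x1FF;
            uint64_t entry = read_phys(table + index * 8);

            if (!(entry & EPT_PRESENT_MASK))
                return 0;   /* EPT violation: entry not present */
            table = entry & ADDR_MASK;
        }
        return table | (gpa & ((1ULL << PAGE_SHIFT) - 1));
    }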
However, things get quite complicated with nested virtualization. The hypervisor running inside a VM (i.e., the L1 hypervisor) builds an EPT in its own physical address space and tries to run its nested VM (i.e., the L2 guest) with it. That EPT therefore translates the nested VM's physical addresses to the L1 VM's physical addresses (i.e., L2 GPA to L1 GPA), so it cannot work on bare metal. Let me call this EPT the L1 EPT.
To handle this, KVM (i.e., the L0 hypervisor) internally builds a new EPT for the nested VM, called the shadow EPT. This EPT translates L2 GPAs directly to HPAs so that the MMU can seamlessly handle the L2 VM's memory accesses. Upon an EPT violation caused by a missing shadow EPT entry, KVM walks the L1 EPT that the L1 hypervisor (virtually) registered, finds the relevant entry (which translates the L2 GPA to an L1 GPA), then looks up the corresponding HPA and populates the shadow EPT entry (i.e., L2 GPA to HPA).
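In code, the violation handler conceptually performs this two-step translation. The following is a hedged sketch with hypothetical helper names (walk_l1_ept, l1_gpa_to_hpa, install_shadow_entry), not KVM's real functions:

    #include <stdint.h>

    uint64_t walk_l1_ept(uint64_t l1_eptp, uint64_t l2_gpa);  /* L2 GPA -> L1 GPA */
    uint64_t l1_gpa_to_hpa(uint64_t l1_gpa);                  /* L1 GPA -> HPA, via L0's own mapping */
    void install_shadow_entry(uint64_t l2_gpa, uint64_t hpa); /* populate the shadow EPT */

    void handle_shadow_ept_violation(uint64_t l1_eptp, uint64_t l2_gpa)
    {
        /* Step 1: walk the EPT that the L1 hypervisor built. */
        uint64_t l1_gpa = walk_l1_ept(l1_eptp, l2_gpa);

        /* Step 2: translate the L1 GPA to a real HPA using L0's mapping of the L1 VM. */
        uint64_t hpa = l1_gpa_to_hpa(l1_gpa);

        /* Now the MMU can translate this L2 GPA directly to the HPA. */
        install_shadow_entry(l2_gpa, hpa);
    }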
Finally, I've come to what I really want to say: how should the invept (invalidate EPT) instruction be emulated? invept flushes all EPT translations cached in hardware (i.e., the EPT TLB) so that the MMU walks the (possibly modified) EPT again. What should happen when the L1 hypervisor executes invept? Naively re-executing invept in the L0 hypervisor doesn't work, because the MMU would only re-walk the shadow EPT, not the modified L1 EPT.
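For reference, this is roughly how a hypervisor issues invept, shown with GCC inline assembly for illustration (the descriptor layout and type values follow the Intel SDM; the function name is mine, and the instruction faults unless executed in VMX root mode):

    #include <stdint.h>

    /* 128-bit invept descriptor: the EPTP to invalidate, plus a reserved field. */
    struct invept_desc {
        uint64_t eptp;
        uint64_t reserved;
    };

    static inline void invept_single_context(uint64_t eptp)
    {
        struct invept_desc desc = { .eptp = eptp, .reserved = 0 };
        uint64_t type = 1;   /* 1 = single-context, 2 = all-context */

        asm volatile("invept %0, %1"
                     : : "m"(desc), "r"(type) : "cc", "memory");
    }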
KVM is, of course, able to emulate this instruction. The key idea, in its simplest form, is that KVM just discards the shadow EPT when invept is executed and rebuilds it through the EPT violations that occur again while the nested VM runs. This way, the updated L1 EPT is naturally reflected in the rebuilt shadow EPT.
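In pseudocode-level C, the naive emulation is just the following; struct nested_vm, free_shadow_ept(), and flush_hw_ept_tlb() are hypothetical stand-ins for KVM's real zapping and flushing machinery:

    struct nested_vm;

    void free_shadow_ept(struct nested_vm *vm);  /* drop every shadow EPT page */
    void flush_hw_ept_tlb(struct nested_vm *vm); /* real invept on the shadow EPTP */

    void emulate_invept_naive(struct nested_vm *vm)
    {
        free_shadow_ept(vm);    /* the stale L2 GPA -> HPA mappings are gone */
        flush_hw_ept_tlb(vm);   /* and so are the hardware-cached translations */
        /* Subsequent L2 memory accesses fault into L0 as EPT violations,
         * which repopulate the shadow EPT from the updated L1 EPT. */
    }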
The remaining question is how to optimize this. For that, KVM introduces the notion of synced and unsynced shadow EPT pages: a synced shadow EPT page is one that reflects the latest contents of its L1 EPT page. When KVM creates a shadow EPT page for a specific L1 EPT page, it marks the shadow page synced and makes the L1 EPT page read-only. KVM therefore traps whenever the L1 hypervisor writes to an L1 EPT page, and it unsyncs the corresponding shadow EPT page. When invept is executed, only the unsynced shadow EPT pages are removed, and they are repopulated from the fresh L1 EPT pages upon subsequent EPT violations. That's all. This keeps the performance overhead minimal. What a fascinating solution!
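Here is a simplified model of that bookkeeping with hypothetical names; in real KVM, this role is played by struct kvm_mmu_page, which carries an unsync flag:

    #include <stdbool.h>
    #include <stdint.h>

    /* One shadow EPT page and the L1 EPT page it mirrors. */
    struct shadow_ept_page {
        uint64_t l1_ept_gpa;           /* GPA of the L1 EPT page being shadowed */
        bool     unsync;               /* L1 page modified since the last sync? */
        struct shadow_ept_page *next;
    };

    /* Hypothetical helpers standing in for KVM's page-protection machinery. */
    void write_protect_l1_page(uint64_t l1_ept_gpa);
    void unprotect_l1_page(uint64_t l1_ept_gpa);
    void zap_shadow_page(struct shadow_ept_page *sp);

    /* Creating a shadow page: mark it synced and write-protect the L1 page,
     * so any later modification by the L1 hypervisor traps into L0. */
    void on_shadow_page_created(struct shadow_ept_page *sp)
    {
        sp->unsync = false;
        write_protect_l1_page(sp->l1_ept_gpa);
    }

    /* Trap on a write to a protected L1 EPT page: mark the shadow page
     * unsynced and let L1 write freely; nothing needs flushing yet. */
    void on_l1_ept_page_write(struct shadow_ept_page *sp)
    {
        sp->unsync = true;
        unprotect_l1_page(sp->l1_ept_gpa);
    }

    /* Emulated invept: only the unsynced shadow pages are zapped; synced
     * pages still match their L1 counterparts and survive the flush. */
    void emulate_invept(struct shadow_ept_page *shadow_pages)
    {
        for (struct shadow_ept_page *sp = shadow_pages; sp; sp = sp->next)
            if (sp->unsync)
                zap_shadow_page(sp);
    }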
KVM has lots of fascinating stuff. Next time, I hope to talk about KVM's interrupt virtualization.
[1] Ben-Yehuda, Muli, et al. "The Turtles Project: Design and Implementation of Nested Virtualization." 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI 10). 2010.