Implementing an x64 virtual machine to fit your needs...
Since my early days in college I have been interested in the implementation of machines that are able to run other machines. I was fascinated both by the concept of matryoshkas and by the urge to master something up to the point of making it work. My first implementation of a virtual machine dates back to the '90s, when I successfully implemented an 8051 virtual processor architecture. I was thrilled by the idea of being able to run compiled 8051 code on a PC (an 80486 at the time). Years later, by the time I was doing my PhD research, I faced the challenge of figuring out a way to locate errors and vulnerabilities in Windows device drivers. I still remember it as if it were today: I was waiting for a flight to my country at Charles de Gaulle airport when I realised that the only way to succeed was to control the execution of the device driver code - this meant building an x64 virtual machine.
I considered using existing virtual platforms such as Bochs or QEMU, but in the end I opted to develop my own emulation platform because of the complexity of stripping out all the unnecessary components (e.g., BIOS, IO bus, bridges) just to execute the code. Instead of investing time in understanding how these architectures would fit my needs and the changes they would require, I built something more suitable for my purpose. So I grabbed my laptop, opened Visual Studio and started coding a virtual platform.
My virtual machine implements a simplified x86-64 platform. It follows a modified Harvard architecture and contains a processing unit (T_CPU) with an arithmetic logic unit and processor registers, a control unit with an instruction register and a program counter, a memory (T_Memory) that stores both data and instructions, input and output mechanisms, a hardware stack to support the push, pop, call and ret instructions, an OS Emulator, and a Loader component to load drivers into the platform.
Next, I describe the T_Memory, T_CPU, Hardware Stack, OS Emulator and Loader components.
T_Memory
In a conventional x86-64 system, the memory can be seen as an array of consecutive cells distinguished from each other by their address. During code execution, the CPU reads the memory contents pointed to by the instruction register, decodes the instruction and executes the associated algorithm. In CISC CPU architectures, such as x86-64, instructions may occupy more than one memory cell, which may require the CPU to perform multiple memory accesses to complete the execution of a single instruction.
In my implementation, I defined the T_Memory structure to represent the executable memory of the platform. Each T_Memory cell contains the address where the first byte of the machine instruction would be located in conventional memory, the assembly instruction already decoded into text form, the number of parameters of that instruction and the characterisation of each parameter.
typedef struct {
    um64 address;             // address of the first byte of the instruction
    char asmInstruction[50];  // decoded instruction in text form
    char byteCodes[20];       // original machine-code bytes
    int nbrParams;            // number of parameters
    TTValue param[MAX_PARAM]; // characterisation of each parameter
    int execCounter;          // number of times the instruction was executed
} T_Memory;
During the loading of a binary file into the platform, the binary code is pre-processed and transformed into assembly instructions using NASM. A representation of that information is then stored in T_Memory cells.
This organisation was chosen for the following main reasons: it reduces the effort of interpreting the CPU instruction set during code analysis and emulated execution; it maintains metadata about each instruction, its parameters and its number of executions; and it can detect attempts to execute different instruction sequences by landing in the middle of variable-sized instructions (a technique used to exploit the variable-length encoding of CISC architectures).
I am aware that, from the point of view of memory space efficiency, T_Memory is far less efficient than the conventional x86-64 memory organisation, but my objectives are quite different from merely running the executable program. Additionally, since the code I intend to run is usually small, this memory organisation is not a concern.
T_CPU
The T_CPU is an emulation of an x64 CPU organised in two main components. The first is a C structure where each field holds the state of an individual T_CPU register, something like this:
typedef struct {
    um64 rax, rbx, rcx, rdx, r8, r9, ... // general-purpose registers
    um64 cpuflags;                       // flags
    um64 rbp, rsp;                       // stack pointers
    um64 rdi, rsi, rip;                  // index registers and instruction pointer
    int cpuMode;                         // operation mode
    ...
} T_CPU;
The second component is the Instruction Execution Engine (IEE), which implements the T_CPU internal mechanics and the machine instructions according to their respective algorithms. The instructions follow the descriptions in the “Intel 64 and IA-32 Architectures Software Developer’s Manual, Vol. 2 (2A, 2B & 2C): Instruction Set Reference, A-Z”, and although the current instruction set is not complete (for instance, the MMX instructions were not implemented), it has proved sufficient to execute the off-the-shelf device drivers used in my experiments.
At each step, the IEE uses the value of the instruction pointer (the rip register) to locate the next instruction stored in a T_Memory cell and executes the corresponding machine instruction algorithm.
Hardware Stack
In most computer architectures, the hardware stack is an area of memory with a fixed origin and variable size that supports the execution of functions, the passing of parameters and the allocation of local variables. In my platform, the Hardware Stack is implemented detached from the executable memory held in T_Memory cells: it is simply an array of bytes managed through the rsp and rbp registers of the T_CPU. As in a conventional x86-64 architecture, the Hardware Stack supports the push, pop, call and ret instructions.
OS Emulator
The OS Emulator (OSE) is the functional interface to the device drivers. The OSE is in charge of: i) loading the device driver and maintaining the data structures that support its execution; ii) providing all the imported functions called by the driver; and iii) calling the device driver entry points with the appropriate function signatures and parameters. These three tasks are handled, respectively, by the Loader, the Windows Function Emulator and the Driver Manager components. The following subsections explain the implementation of these components and the role they play in the overall architecture.
Loader
In the Windows OS, the executable file contains the machine instructions that must be loaded into memory to execute and control the device. It follows the PE format, which gathers in a single file the machine code of the application and its dependencies on other software modules, organised in the form of tables. The imported functions table contains the names of the functions and of the external modules (DLLs or other software modules) on which the code depends. The OS uses this information during loading to link the code to the other modules needed for correct execution. In some cases, the required modules may not yet be present in the system; when this happens, the OS has to perform additional loads, which may result in a recursive process.
In the virtual machine, the Loader is the component responsible for the loading process of device drivers in the virtual machine. The loading process is performed in two phases:
• Phase 1 – File read and preparation: the Loader reserves temporary regular memory and reads the driver file into it. Following the PE format specification, the Loader interprets the contents of the temporary memory and locates the various sections. The code section is prepared for execution by fixing the relocation addresses and linking the imported functions listed in the import section table to the functions provided by the Windows Function Emulator. At the end of phase 1, the Loader has an image of the driver in temporary memory with all the imported functions already linked to the functions provided by the framework;
• Phase 2 – Building the executable memory contents: the Loader walks through the temporary memory, disassembling the machine instructions in the code section. For each instruction, the Loader allocates and fills a T_Memory cell with the corresponding metadata. During this process, internal functions are identified by matching the processed instructions against the prologue and epilogue instruction sequences that mark the start and end of functions. To complement this identification, call instructions are interpreted and their destination addresses identified. Destination addresses embedded in the instruction (e.g., call dword [dword 0x0800ABCD]) are easy to check against internal or imported functions, and previously unidentified internal functions are then formed dynamically. Indirect calls (e.g., call esi) must be checked later.
Windows Function Emulator
A Windows device driver depends on functions provided by the OS. These functions are described in the DDK and form the API that the OS provides to device drivers. They are used, for instance, to request and free OS resources and to perform various other operations.
In the virtual platform the Windows Function Emulator (WFE) is the module that implements the functions listed in the “.import” section of the device driver.
The WFE defines the TFuncTranslation structure to establish the correspondence between the name of an imported function (fxName) and the address of the corresponding function implemented in the WFE (*_My_fxAddr). Other attributes, such as the calling convention (callingConvention) and the number of parameters (nbrParams) of the function, are also represented in the structure. All the Windows functions implemented in the WFE are arranged in an array of TFuncTranslation elements, which is used during the linkage process of the DUT to locate the address of each imported function and connect the imported functions of the application to the virtual platform.
Emulation Execution Mechanisms
This section describes a few mechanisms that glue together all the components of the framework, enabling the execution of code in the virtual platform.
Execution Context Switch
The virtual platform has two main modes of operation, distinguished by the code being executed. The platform runs in emulation mode when a driver function (loaded into the virtual platform) is being executed, and in true mode when a driver function calls a WFE function. Whenever a change from true mode to emulation mode (or vice versa) occurs, an execution context switch is said to have occurred.
Calling DD Interface Functions
Device drivers comply with a defined structure, and the DriverEntry function is the entry point to the driver code. The device driver exposes other functions either by filling in the addresses of call-back functions in the DRIVER_OBJECT data structure (when DriverEntry returns) or by registering call-back functions with the OS using the appropriate registration functions, such as NdisMRegisterMiniportDriver in the case of NDIS device drivers.
The Driver Manager is the component of the platform that directly calls the device driver interface functions. When a device driver function is called, the execution mode of the virtual machine switches from true mode to emulation mode. The switching algorithm can be described as follows:
• Determine which function of the device driver to call and obtain the signature of the DD function;
• Prepare the parameter values and pass them to the virtual machine according to the type of execution platform (i.e., 32-bit or 64-bit);
• Force the return address in the Hardware Stack to point to a Driver Manager function, ensuring that when the device driver function ends, the virtual machine switches back to true mode in a controlled way;
• Set the rip register of the T_CPU to the address of the device driver function to be executed;
• Enter emulation mode by transferring execution control to the T_CPU with a call to the cpu_run() function.
The virtual platform continues to run in emulation mode until one of the following events occurs:
• The device driver calls a WFE function;
• The device driver code execution finishes by returning the execution to the address of the Driver Manager entry function;
• A flaw is detected by one of the validators during the computation of a binary instruction.
Executing WFE Functions
The virtual platform executes the driver code in emulation mode. Whenever a jmp, call or ret instruction targets the address of a WFE function, execution changes from emulation mode to true mode. The algorithm for this context switch is implemented in the cpu_step function and can be described as follows:
• Obtain the next instruction address and verify whether it refers to a T_Memory cell:
- If it does, continue the execution at that address;
- Otherwise, verify whether the address belongs to a WFE function. If so, perform the context switch by calling cpu_executeWFEFunction; otherwise, raise a flaw exception.
Returning Control to Driver Manager
Under normal circumstances, when the execution of a device driver function ends, it returns control to the virtual platform. If, on the contrary, a flaw is detected, the emulation (or the execution of a WFE function) ends because a fault event has been triggered.