31 Dec 2016

QEMU memory emulation

Author: derek0883@gmail.com

The basic concept of memory emulation is the same as Device IO space emulation. But because Virtual Addresses, Physical Addresses, Page tables, and the TLB get involved, it is much more complicated than Device IO space emulation.

Memory emulation

There are 4 types of addresses: 1. Host Physical Address (HPA). 2. Host Virtual Address (HVA). 3. Guest Physical Address (GPA). 4. Guest Virtual Address (GVA).

QEMU emulates physical memory for the Guest OS; the user can specify the size of physical memory with the “-m” command line option. For example, if you start QEMU with -m 128M, the Guest OS will detect 128M of physical memory.
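For instance, the boot command used later in this article can be given an explicit memory size just by adding the standard -m option (the image paths are the same ones used below):

$ ./x86_64-softmmu/qemu-system-x86_64 -m 128M -kernel images/bzImage \
    -initrd images/cirros-x86_64-initramfs -append "console=ttyS0" -nographic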

What happens in the Guest OS is still the same as in a non-virtualized system: the guest uses a GVA to access memory, and the (emulated) MMU translates the GVA to a GPA. But the Guest OS’s physical memory is allocated by QEMU from the Host OS, so QEMU needs to translate the GPA to an HVA and access that HVA; the Host MMU then translates the HVA to an HPA.
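To make the chain concrete, here is a small conceptual sketch in C. It is not QEMU code: guest_mmu_translate, guest_ram_hva_base and emulate_guest_load32 are made-up names, used only to illustrate the two translations QEMU performs in software; the final HVA-to-HPA step is done transparently by the Host MMU.

#include <stdint.h>

/* Hypothetical names, for illustration only. */
extern uint64_t guest_mmu_translate(uint64_t gva);  /* walk guest page tables: GVA -> GPA */
extern uint8_t *guest_ram_hva_base;                 /* HVA where guest RAM was allocated
                                                       (assuming one contiguous mapping) */

/* Conceptually, a 32-bit guest load becomes: */
static uint32_t emulate_guest_load32(uint64_t gva)
{
    uint64_t gpa = guest_mmu_translate(gva);        /* GVA -> GPA (emulated MMU)         */
    uint8_t *hva = guest_ram_hva_base + gpa;        /* GPA -> HVA (offset into host map) */
    return *(uint32_t *)hva;                        /* Host MMU translates HVA -> HPA    */
}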

GVA to GPA to HVA example

In order to check how the Guest accesses memory, I made the following change to the Linux kernel:

diff --git a/fs/proc/cmdline.c b/fs/proc/cmdline.c
index cbd82df..c7833f0 100644
--- a/fs/proc/cmdline.c
+++ b/fs/proc/cmdline.c
@@ -5,6 +5,9 @@
 static int cmdline_proc_show(struct seq_file *m, void *v)
 {
+       seq_printf(m, "GVA: %p, GPA: %lx, Val: %x\n", 
+               saved_command_line, __phys_addr((long unsigned int)saved_command_line),
+               *((int*)saved_command_line));
        seq_printf(m, "%s\n", saved_command_line);
        return 0;

I intentionally access saved_command_line as an integer here, to trigger a memory read access inside the cmdline_proc_show function.

Then build the kernel for QEMU and boot the new kernel with the following command:

$ ./x86_64-softmmu/qemu-system-x86_64 -kernel images/bzImage \
    -initrd images/cirros-x86_64-initramfs -append "console=ttyS0" -nographic

After the system boots up:

$ cat /proc/cmdline 
GVA: ffff880007b1e980, GPA: 7b1e980, Val: 736e6f63

As you can see, the GVA of saved_command_line is 0xffff880007b1e980, the GPA is 0x7b1e980, and the first 4 bytes are 0x736e6f63, which is “cons” read as a little-endian integer (the command line starts with “console=ttyS0”).
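As a quick sanity check of that byte order, here is a tiny standalone C program (independent of QEMU) showing that the first four characters of “console=ttyS0”, assembled as a little-endian 32-bit value, are exactly 0x736e6f63:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const unsigned char *p = (const unsigned char *)"console=ttyS0";

    /* Assemble 'c' 'o' 'n' 's' as a little-endian 32-bit value. */
    uint32_t val = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                   ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);

    printf("%x\n", val);   /* prints 736e6f63 */
    return 0;
}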

Having studied the TCG frontend and TCG backend, I know that Guest memory read/write instructions are translated into memory load/store operations, i.e. a set of helper functions. After some close study of the source code, I made the following change. Note: 0xffff880007b1e980 is the GVA and should match the output above.

diff --git a/softmmu_template.h b/softmmu_template.h
index 4a2b665..d3d4175 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -168,6 +168,12 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr,
     res = glue(glue(ld, LSUFFIX), _le_p)((uint8_t *)haddr);
+#ifdef TARGET_X86_64
+    if (addr == (target_ulong)0xffff880007b1e980)
+        printf("%s eip: %lx, vaddr: %lx haddr: %lx, Val: %x\n",
+            __func__, (uint64_t)env->eip, addr, haddr, (uint32_t)res);
+#endif
     return res;

Now rebuild and start QEMU.

$ ./x86_64-softmmu/qemu-system-x86_64 -kernel images/bzImage \
    -initrd images/cirros-x86_64-initramfs -append "console=ttyS0" -nographic

Inside the Guest OS, run cat /proc/cmdline and you will get:

$ cat /proc/cmdline 
helper_le_ldul_mmu eip: ffffffff8120a900, vaddr: ffff880007b1e980 haddr: 7f084311e980, Val: 736e6f63
GVA: ffff880007b1e980, GPA: 7b1e980, Val: 736e6f63

The helper_le_ldul_mmu line is printed by QEMU on the host console; the GVA line is printed by the Guest OS. eip: ffffffff8120a900 is the Guest OS’s EIP, and it lies inside the body of the cmdline_proc_show function:

$ nm ../obj/x86_64/vmlinux | grep cmdline_proc_show
ffffffff8120a900 t cmdline_proc_show

$ nm ../obj/x86_64/vmlinux | grep ffffffff8120a900
ffffffff8120a900 t cmdline_proc_show

vaddr: ffff880007b1e980 is the address of saved_command_line in the Guest OS; its GPA is 0x7b1e980.

haddr: 7f084311e980 is the HVA that QEMU translated it to; QEMU accesses this address and returns the value. As you can see, the value is 0x736e6f63, and the values printed on the two consoles are the same.

Now you have the idea: GVA to GPA, then to HVA inside QEMU’s address space.

QEMU MMU details

Now we have figured out that Guest memory access instructions are translated by TCG into calls to helper_xx_ldxx_mmu or helper_xx_stxx_mmu. QEMU emulates an MMU that works the same way as the real MMU on a CPU: it has a TLB, walks page tables, etc.

When one of those functions is called with a GVA, QEMU checks whether the TLB hits or misses. If the TLB hits, life is much easier: QEMU just adds an offset to the GVA to get the HVA. If the TLB misses, QEMU calls tlb_fill, which walks the Guest OS’s page tables to get the GPA and then fills in a TLB entry; after it returns, the TLB is ready, and again the HVA is just the GVA plus an offset.

WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr,
                            TCGMemOpIdx oi, uintptr_t retaddr)
{
    ...
    /* If the TLB entry is for a different page, reload and try again.  */
    if ((addr & TARGET_PAGE_MASK)
         != (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
        if (!VICTIM_TLB_HIT(ADDR_READ, addr)) {
            tlb_fill(ENV_GET_CPU(env), addr, READ_ACCESS_TYPE,
                     mmu_idx, retaddr);
        }
        tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
    }
    ...
    // When we get here, the TLB is ready: GVA + an offset is the HVA.
    haddr = addr + env->tlb_table[mmu_idx][index].addend;

    // Access by HVA.
#if DATA_SIZE == 1
    res = glue(glue(ld, LSUFFIX), _p)((uint8_t *)haddr);
#else
    res = glue(glue(ld, LSUFFIX), _le_p)((uint8_t *)haddr);
#endif
#ifdef TARGET_X86_64
    if (addr == (target_ulong)0xffff880007b1e980)
        printf("%s eip: %lx, vaddr: %lx haddr: %lx, Val: %x\n",
            __func__, (uint64_t)env->eip, addr, haddr, (uint32_t)res);
#endif
    return res;
}

tlb_fill calls x86_cpu_handle_mmu_fault, which walks the Guest OS’s page tables to translate the GVA to a GPA; once it has the GPA, QEMU knows the offset from GPA to HVA.

void tlb_fill(CPUState *cs, target_ulong addr, MMUAccessType access_type,
              int mmu_idx, uintptr_t retaddr)
{
    int ret;

    ret = x86_cpu_handle_mmu_fault(cs, addr, access_type, mmu_idx);
    if (ret) {
        X86CPU *cpu = X86_CPU(cs);
        CPUX86State *env = &cpu->env;
        raise_exception_err_ra(env, cs->exception_index, env->error_code, retaddr);
    }
}

The code glue(glue(ld, LSUFFIX), _p) might confuse you; it is just a macro:

// file: include/qemu/compiler.h

#ifndef glue
#define xglue(x, y) x ## y
#define glue(x, y) xglue(x, y)
#define stringify(s)    tostring(s)
#define tostring(s) #s
#endif

glue just concatenates two tokens. For example, take glue(glue(ld, LSUFFIX), _p): when the template is “instantiated” with LSUFFIX defined as ub, then after the preprocessor runs the code becomes:

    haddr = addr + env->tlb_table[mmu_idx][index].addend;
    res = ldub_p((uint8_t *)haddr);
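If you want to see the token pasting in isolation, here is a minimal standalone sketch (plain C, not taken from QEMU) that uses the same glue/xglue trick to build a function name; ldub_p and ldl_le_p here are simple stand-ins for QEMU’s accessors of the same names:

#include <stdio.h>
#include <stdint.h>

#define xglue(x, y) x ## y
#define glue(x, y) xglue(x, y)

/* Stand-ins for QEMU's ldub_p / ldl_le_p style accessors. */
static uint8_t ldub_p(const uint8_t *p)
{
    return *p;
}

static uint32_t ldl_le_p(const uint8_t *p)
{
    /* Assemble a little-endian 32-bit value regardless of host byte order. */
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

#define LSUFFIX ub

int main(void)
{
    uint8_t buf[4] = { 'c', 'o', 'n', 's' };

    /* glue(glue(ld, LSUFFIX), _p) expands to ldub_p. */
    printf("%x\n", glue(glue(ld, LSUFFIX), _p)(buf));  /* prints 63 ('c')     */
    printf("%x\n", ldl_le_p(buf));                     /* prints 736e6f63     */
    return 0;
}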

Now you know how GVA to GPA and GPA to HVA translation works.

As you can see, a simple memory access instruction is translated into a bunch of function calls. That is one of the motivations for CPU vendors to provide hardware virtualization support, e.g. Intel’s EPT (Extended Page Tables).