26 Nov 2016

Device IO space emulation

文章欢迎转载,但转载时请保留本段文字,并置于文章的顶部
Author: derek0883@gmail.com
本文原文地址:http://www.gorecursion.com/virtualization/2016/11/26/qemu-devio1.html

In “real world”, CPU IO read/write operation will access to physical device, how QEMU handle IO read/write access in “virtual world”? I know how PCI works in kernel, so I start explore this question from PCI config read/write.

Concept

QEMU will creat some virtual device for Guest OS, The device I/O space register are actually a piece of memory inside Host OS allocate by QEMU. When Guest OS try to read/write those I/O space register, e.g. with special CPU instruction in/out on X86. how QEMU handle that access? How to redirect it to corresponding virtual device’s memory access. The answer agian is TCG. You should have read my previous post before you read this one.

  1. TCG frontend
  2. TCG backend

The idea is, Virtual device driver will allocate a piece of memory to emulate I/O registers. When guest os try to excute in/out instruction TCG will translate it to memory access. Read/Write to the that piece of memeory, IN/OUT instruction handler will call virtual device driver’s callback function. Callback function then will response to the instruction.

Device IO space

I already know on X86 platform, CPU detect PCI device by accessing 0xCF8/0xCFC through IN/OUT instruction. In previous post TCG frontend and TCG backend, I found out for x86, the “in 0xcf8” instruction will transfered by TCG. and it is easy to find the TCG backend helper function is:

target_ulong helper_inl(CPUX86State *env, uint32_t port)
{
#ifdef CONFIG_USER_ONLY
    fprintf(stderr, "inl: port=0x%04x\n", port);
    return 0;
#else
    return address_space_ldl(&address_space_io, port,
                             cpu_get_mem_attrs(env), NULL);
#endif
}

Start QEUM with a x86_64 linux, ./x86_64-softmmu/qemu-system-x86_64 ./images/cirros-d161007-x86_64-disk.img, from dmesg I know the pci host controller is piix, with gdb I found this device was crated in pc_init1:

/* PC hardware initialisation */
static void pc_init1(MachineState *machine,
                     const char *host_type, const char *pci_type)
{
	...
    if (pcmc->pci_enabled) {
        pci_bus = i440fx_init(host_type,
                              pci_type,
                              &i440fx_state, &piix3_devfn, &isa_bus, pcms->gsi,
                              system_memory, system_io, machine->ram_size,
                              pcms->below_4g_mem_size,
                              pcms->above_4g_mem_size,
                              pci_memory, ram_memory);
        pcms->bus = pci_bus;
	}
	...
}

The I/O address 0xCF8 and 0xCFC was registered in hw/pci-host/piix.c

static void i440fx_pcihost_initfn(Object *obj)
{
    PCIHostState *s = PCI_HOST_BRIDGE(obj);

    memory_region_init_io(&s->conf_mem, obj, &pci_host_conf_le_ops, s,
                          "pci-conf-idx", 4);
    memory_region_init_io(&s->data_mem, obj, &pci_host_data_le_ops, s,
                          "pci-conf-data", 4);
}

From docs/memory.txt, I know pci_host_config_(read/write) is the handler for this address.

// hw/pci/pci_host.c
const MemoryRegionOps pci_host_conf_le_ops = {
    .read = pci_host_config_read,
    .write = pci_host_config_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};

Now I’m ready to restart GDB.

$ gdb ./x86_64-softmmu/qemu-system-x86_64
Reading symbols from ./x86_64-softmmu/qemu-system-x86_64...done.

(gdb) b pci_host_config_read
(gdb) r images/cirros-d161007-x86_64-disk.img -monitor stdio -D /tmp/qemu.log -d in_asm,op,out_asm

Thread 5 "qemu-system-x86" hit Breakpoint 1, pci_host_config_read (opaque=0x5555569e46b0, addr=0, len=4) at hw/pci/pci_host.c:138
138	    PCIHostState *s = opaque;
(gdb) bt
#0  pci_host_config_read at hw/pci/pci_host.c:138
#1  0x00005555557ad949 in memory_region_read_accessor at qemu-2.8.0-rc0/memory.c:435
#2  0x00005555557adf58 in access_with_adjusted_size at qemu-2.8.0-rc0/memory.c:592
#3  0x00005555557b02e2 in memory_region_dispatch_read1 at qemu-2.8.0-rc0/memory.c:1242
#4  0x00005555557b03e7 in memory_region_dispatch_read at qemu-2.8.0-rc0/memory.c:1273
#5  0x000055555575dd06 in address_space_ldl_internal at qemu-2.8.0-rc0/exec.c:3094
#6  0x000055555575ddfa in address_space_ldl at qemu-2.8.0-rc0/exec.c:3133
#7  0x000055555589dc81 in helper_inl at qemu-2.8.0-rc0/target-i386/misc_helper.c:85
#8  0x00007fffea02a22a in code_gen_buffer ()
#9  0x0000555555762921 in cpu_tb_exec at qemu-2.8.0-rc0/cpu-exec.c:164
#10 0x00005555557635c3 in cpu_loop_exec_tb at qemu-2.8.0-rc0/cpu-exec.c:544
#11 0x0000555555763856 in cpu_exec at qemu-2.8.0-rc0/cpu-exec.c:638
#12 0x0000555555792825 in tcg_cpu_exec at qemu-2.8.0-rc0/cpus.c:1117
#13 0x0000555555792a69 in qemu_tcg_cpu_thread_fn at qemu-2.8.0-rc0/cpus.c:1197
#14 0x00007ffff65746fa in start_thread (arg=0x7fffe9318700) at pthread_create.c:333
#15 0x00007ffff62aab5d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Now it is very clear that when GuestOS code like this: tmp = inl(0xCF8), the binary instruction(inl 0xCF8)will be translated by TCG.

target_ulong helper_inl(CPUX86State *env, uint32_t port)
{
#ifdef CONFIG_USER_ONLY
    fprintf(stderr, "inl: port=0x%04x\n", port);
    return 0;
#else
    return address_space_ldl(&address_space_io, port,
                             cpu_get_mem_attrs(env), NULL);
#endif
}

A close look at /tmp/qemu.log

----------------
IN: 
0x00000000000edbd9:  push   %ebx
0x00000000000edbda:  mov    %edx,%ebx
0x00000000000edbdc:  mov    %edx,%ecx
0x00000000000edbde:  and    $0xfc,%ecx
0x00000000000edbe4:  or     $0x80000000,%ecx
0x00000000000edbea:  movzwl %ax,%eax
0x00000000000edbed:  shl    $0x8,%eax
0x00000000000edbf0:  or     %eax,%ecx
0x00000000000edbf2:  mov    $0xcf8,%edx    # 0xcf8 => dx
0x00000000000edbf7:  mov    %ecx,%eax
0x00000000000edbf9:  out    %eax,(%dx)     # outl
0x00000000000edbfa:  and    $0x2,%ebx
0x00000000000edbfd:  lea    0xcfc(%ebx),%edx
0x00000000000edc03:  in     (%dx),%ax       # inl(0xCF8) 
0x00000000000edc05:  pop    %ebx
0x00000000000edc06:  ret

OUT: [size=492]
0x7fffea02a1e0:  mov    -0x8(%r14),%ebp	0x418b6ef8
0x7fffea02a1e4:  test   %ebp,%ebp	0x85ed
0x7fffea02a1e6:  jne    0x7fffea02a28d	0x0f85a1000000
0x7fffea02a1ec:  mov    $0xcf8,%ebp	0xbdf80c0000
0x7fffea02a1f1:  mov    %rbp,0x10(%r14)	0x49896e10
0x7fffea02a1f5:  mov    $0x80000000,%ebx	0xbb00000080
0x7fffea02a1fa:  mov    %rbx,(%r14)	0x49891e
0x7fffea02a1fd:  mov    %r14,%rdi	0x498bfe
0x7fffea02a200:  mov    %ebp,%esi	0x8bf5
0x7fffea02a202:  mov    %ebx,%edx	0x8bd3
0x7fffea02a204:  mov    $0x55555589dc0f,%r10	0x49ba0fdc895555550000
0x7fffea02a20e:  callq  *%r10	0x41ffd2        # call outl here
0x7fffea02a211:  mov    0x10(%r14),%rbp	0x498b6e10
0x7fffea02a215:  movzwl %bp,%ebp	0x0fb7ed
0x7fffea02a218:  mov    %r14,%rdi	0x498bfe
0x7fffea02a21b:  mov    %ebp,%esi	0x8bf5
0x7fffea02a21d:  mov    $0x55555589dc4d,%r10	0x49ba4ddc895555550000
0x7fffea02a227:  callq  *%r10	0x41ffd2       # call inl here

At the end of above code segment, it set r10 to 0x55555589dc4d, then call this address, Now go back to gdb.

(gdb) info line *0x55555589dc4d
Line 80 of "qemu-2.8.0-rc0/target-i386/misc_helper.c" 
   starts at address 0x55555589dc4d <helper_inl>
   and ends at 0x55555589dc5c <helper_inl+15>.

Device memory space

Now I know IO space accessing works in “virtual world”. It is not hard to find out how device memory accessing works in “virtual world”. By default, QEMU will create NIC card e1000.

static const MemoryRegionOps e1000_mmio_ops = {
    .read = e1000_mmio_read,
    .write = e1000_mmio_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
    .impl = {
        .min_access_size = 4,
        .max_access_size = 4,
    },
};

static void
e1000_mmio_setup(E1000State *d)
{
    int i;
    const uint32_t excluded_regs[] = {
        E1000_MDIC, E1000_ICR, E1000_ICS, E1000_IMS,
        E1000_IMC, E1000_TCTL, E1000_TDT, PNPMMIO_SIZE
    };

    memory_region_init_io(&d->mmio, OBJECT(d), &e1000_mmio_ops, d,
                          "e1000-mmio", PNPMMIO_SIZE);
    memory_region_add_coalescing(&d->mmio, 0, excluded_regs[0]);
    for (i = 0; excluded_regs[i] != PNPMMIO_SIZE; i++)
        memory_region_add_coalescing(&d->mmio, excluded_regs[i] + 4,
                                     excluded_regs[i+1] - excluded_regs[i] - 4);
    memory_region_init_io(&d->io, OBJECT(d), &e1000_io_ops, d, "e1000-io", IOPORT_SIZE);
}

Now got back to GDB.

(gdb) clear
(gdb) b e1000_io_read
Breakpoint 4 at 0x5555559c1bfa: file hw/net/e1000.c, line 1321.
(gdb) c
Continuing.

Thread 5 "qemu-system-x86" hit Breakpoint 1, e1000_mmio_read at hw/net/e1000.c:1287
1287	    E1000State *s = opaque;
(gdb) bt
#0  e1000_mmio_read at hw/net/e1000.c:1287
#1  0x00005555557ad949 in memory_region_read_accessor at qemu-2.8.0-rc0/memory.c:435
#2  0x00005555557adf58 in access_with_adjusted_size at qemu-2.8.0-rc0/memory.c:592
#3  0x00005555557b02e2 in memory_region_dispatch_read1 at qemu-2.8.0-rc0/memory.c:1242
#4  0x00005555557b03e7 in memory_region_dispatch_read at qemu-2.8.0-rc0/memory.c:1273
#5  0x00005555557b6261 in io_readx at qemu-2.8.0-rc0/cputlb.c:515
#6  0x00005555557b7af5 in io_readl at qemu-2.8.0-rc0/softmmu_template.h:104
#7  0x00005555557b7c9c in helper_le_ldul_mmu at qemu-2.8.0-rc0/softmmu_template.h:141
#8  0x00007fffebaf2743 in code_gen_buffer ()
#9  0x0000555555762921 in cpu_tb_exec at qemu-2.8.0-rc0/cpu-exec.c:164
#10 0x00005555557635c3 in cpu_loop_exec_tb at qemu-2.8.0-rc0/cpu-exec.c:544
#11 0x0000555555763856 in cpu_exec qemu-2.8.0-rc0/cpu-exec.c:638
#12 0x0000555555792825 in tcg_cpu_exec at qemu-2.8.0-rc0/cpus.c:1117
#13 0x0000555555792a69 in qemu_tcg_cpu_thread_fn at qemu-2.8.0-rc0/cpus.c:1197
#14 0x00007ffff65746fa in start_thread at pthread_create.c:333
#15 0x00007ffff62aab5d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

The concepts are same in both case, but the code inside GuestOS are different, this is why previous one call helper_inl, second one call helper_le_ldul_mmu.

Kernel source code for 1st case:

// file: arch/x86/pci/direct.c
static int pci_conf1_read(unsigned int seg, unsigned int bus,
              unsigned int devfn, int reg, int len, u32 *value)
{
...
    outl(PCI_CONF1_ADDRESS(bus, devfn, reg), 0xCF8);
    switch (len) {
    case 1:
        *value = inb(0xCFC + (reg & 3));
        break;
    case 2:
        *value = inw(0xCFC + (reg & 2));
        break;
    case 4:
        *value = inl(0xCFC);
        break;
    }
...
}

Kernel source code for 2nd case:

// file: drivers/net/ethernet/intel/e1000/e1000_main.c
static irqreturn_t e1000_intr(int irq, void *data)
{
...
    u32 icr = er32(ICR);      
...
}