27 Nov 2016

QEMU Bridge network study -- inside source code

文章欢迎转载,但转载时请保留本段文字,并置于文章的顶部
Author: derek0883@gmail.com
本文原文地址:http://www.gorecursion.com/virtualization/2016/11/27/net3.html

previous post QEMU Bridge network study, I have setup a bridge on host os, the bridge has 4 member ports , 1st one is physical port eth0, and virtual NIC port tap0 and tap1, another one is virtual NIC br0, it connect to host os. bridge works as same same L2 switch in “real world”. I’m using this setup to study related source code.

This experiment will send ping request from a physical PC which connected to physical device eth0 to QEMU guest OS’s eth0. Following 2 secion, 1. Receiving path means: outside PC to guest OS. 2. Transmit path means: guest OS to outside PC.

Receiving path

  1. Physical NIC(eth0) in host OS received the icmp request.
  2. Because eth0 is a member port of bridge, So it will pass the packet to bridge module. When we add the port(eth0) to bridge(br0), kernel will call br_add_if, it will regist Rx handler: br_handle_frame.
// net/bridge/br_if.c
int br_add_if(struct net_bridge *br, struct net_device *dev)
{
	err = netdev_rx_handler_register(dev, br_handle_frame, p);
}

After that when eth0 received packet, it will call br_handle_frame, br_handle_frame process incoming packets as same as L2 switch in “real world”.

  1. In this example, it will call br_forward, to forward the packets based on its MAC learning table, to tap0
/**
 * br_forward - forward a packet to a specific port
 * @to: destination port
 * @skb: packet being forwarded
 * @local_rcv: packet will be received locally after forwarding
 * @local_orig: packet is locally originated
 *
 * Should be called with rcu_read_lock.
 */
void br_forward(const struct net_bridge_port *to,
        struct sk_buff *skb, bool local_rcv, bool local_orig)
{
    if (to && should_deliver(to, skb)) {
        if (local_rcv)
            deliver_clone(to, skb, local_orig);
        else
            __br_forward(to, skb, local_orig);
        return;
    }

    if (!local_rcv)
        kfree_skb(skb);
}
  1. br_forward will call dev_queue_xmit to send out the packet through virtual NIC device tap0, put it to tap0’s Tx Queue.
  2. Now the packets has been passed to tap moudle.
// drivers/net/tun.c
static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
{
    if (skb_queue_len(&tfile->socket.sk->sk_receive_queue) * numqueues
              >= dev->tx_queue_len)
        goto drop;
}
  1. Now the packet has been transmited through virtual NIC device tap0. It just put this packet to socket Rx Queue.
  2. Remember My setup, tap0 connected to QEMU’s guest os device eth0. This is done by QEMU. In my QEMU User mode network post, I already learned net_client_init_fun in QEMU source code net/net.c. when start QEMU whis this cmdline: “-netdev tap,id=net0,ifname=tap0 -device e1000,netdev=net0” QEMU will call net_init_tap, it will open /dev/net/tun device, and wait on NIC device “tap0”. Now step 6, put packet to tap0’s Rx Queue, QEMU’s reading thread will be wake up.
// net/net.c
static int (* const net_client_init_fun[NET_CLIENT_DRIVER__MAX])(
    const Netdev *netdev,
    const char *name,
    NetClientState *peer, Error **errp) = {
        [NET_CLIENT_DRIVER_NIC]       = net_init_nic,
#ifdef CONFIG_SLIRP
        [NET_CLIENT_DRIVER_USER]      = net_init_slirp,
#endif
        [NET_CLIENT_DRIVER_TAP]       = net_init_tap,
  1. QEMU’s reading thread will receive the packets, and call tap_send to put the packet into virtual NIC(e1000 in my step) device’s Rx queue. Then e1000’s Rx handler will receive the packets. finally guest OS got the packets.

Transmit path

  1. Guest OS call dev_queue_xmit to send packet to to e1000 NIC.
  2. QEMU call tap_receive -> tap_write_packet to send that packet to tap0 device on Host.
  3. Host kernel will call tun_chr_write_iter -> netif_rx_ni. Now virtual NIC device tap0 received the packets.
  4. As tap0 is member of bridge br0, Rx handler br_handle_frame will be called, bridge will forward the packet as same as L2 switch in real world.
  5. The packet will be send out through physical device eth0.
  6. Physical PC connect to eth0 will received the packet.

code tracing

I’m going to using SystemTap to trace the function call. SystemTap is a great tool to study kernel code. Check SystemTap Beginners Guide

Run following script with “sudo stap -g bridgeHook.stp” It will be converted to C, then compiled as a kernel module, and “patch” kernel function. e.g. when kernel call

static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
{
}

After the bridgeHook.stp installed, it will call the hook function before original tun_net_xmit called. So it will print dev->name and srcMac -> dstMac, without rebuild and reboot kernel, I can modify kernel code run time, and observe behavior.

// bridgeHook.stp
#! /usr/bin/env stap 
%{
#include <linux/skbuff.h>
%}
function getMac:string(pkt : long)
%{
    struct sk_buff *skb;
    unsigned char *p;
    skb = (struct sk_buff *)STAP_ARG_pkt;
    p = skb_mac_header(skb);
    snprintf(STAP_RETVALUE, MAXSTRINGLEN, "%02x:%02x:%02x:%02x:%02x:%02x -> %02x:%02x:%02x:%02x:%02x:%02x",
            p[6],p[7],p[8],p[9],p[10],p[11], p[0],p[1],p[2],p[3],p[4],p[5]);
%}
probe kernel.function("netif_receive_skb_internal") {
    printf("netif_receive_skb_internal dev: %s %s\n", kernel_string($skb->dev->name), getMac($skb));
}

probe kernel.function("tun_net_xmit") {
    printf("tun_net_xmit dev: %s %s\n", kernel_string($dev->name), getMac($skb));
}

probe kernel.function("netif_rx_ni") {
    printf("netif_rx_ni dev: %s %s\n", kernel_string($skb->dev->name), getMac($skb));
}

probe module("bridge").function("br_forward") {
    printf("br_forward to dev: %s %s\n", kernel_string($to->dev->name), getMac($skb));
}
  1. 52:54:00:12:34:57 is MAC address of eth0 in Guest OS.
  2. 00:25:9c:19:b8:e9 is MAC address of outside PC.

With bridgeHook.stp running, I start ping from Guest OS to outside PC connect to physical port eth0. I can see:

// Guest OS Tx to outside PC. 
netif_rx_ni dev: tap1 52:54:00:12:34:57 -> 00:25:9c:19:b8:e9
br_forward to dev: eth0 52:54:00:12:34:57 -> 00:25:9c:19:b8:e9

// Outside PC Tx to Guest OS. 
netif_receive_skb_internal dev: eth0 00:25:9c:19:b8:e9 -> 52:54:00:12:34:57
br_forward to dev: tap1 00:25:9c:19:b8:e9 -> 52:54:00:12:34:57
tun_net_xmit dev: tap1 00:25:9c:19:b8:e9 -> 52:54:00:12:34:57