New KVM deployment bugs and recommendations (Ubuntu 14.04: qemu 2.0, libvirt 1.2.4, Linux 3.10)

Notes from a new Linux KVM/QEMU deployment running on Ubuntu 14.04 with the Linux 3.10 kernel and openvswitch. The hardware is two SSDs in RAID1 and two 7200 RPM HDDs in RAID1 using mdadm, with bcache on the SSDs used as the cache in front of the HDDs.

Bugs

  • hv_vapic ("vapic state='on'" in libvirt) causes Windows 2008 R2 and later VMs not to boot if the CPU is an Intel Ivy Bridge or newer (check /sys/module/kvm_intel/parameters/enable_apicv) – Red Hat Bugzilla
  • Linux 3.12 or greater (Ubuntu 14.04 ships with 3.13) has issues with virtio-net NIC offloading (TSO and RX/TX checksumming): TCP sessions can't be established across virtual machines in certain situations (think of a virtual machine acting as a firewall) – Debian Bugreport
  • Windows virtual machines still freeze up or suffer high latency if you use a virtio NIC, even with the latest signed drivers available from the Fedora Project
  • Still have issues with the "Russian roulette" of network interfaces with openvswitch – Blog post
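
The APICv check from the first bug can be scripted before deciding whether to put vapic into a Windows guest definition. This is a minimal sketch, assuming the kvm_intel module exposes the parameter as Y or N; the function name and the optional path argument are my own and only exist so the check can be pointed at a test file:

```sh
# Return success if the host reports APICv as enabled.
# $1 (optional) overrides the sysfs path; it is an illustrative hook, not a real tool option.
apicv_enabled() {
    param="${1:-/sys/module/kvm_intel/parameters/enable_apicv}"
    # A missing file means kvm_intel is not loaded or does not have the parameter.
    [ -r "$param" ] || return 1
    case "$(cat "$param")" in
        Y|y|1) return 0 ;;
        *)     return 1 ;;
    esac
}

if apicv_enabled; then
    echo "APICv is enabled: leave vapic out of Windows guest definitions"
else
    echo "APICv is not enabled (or kvm_intel is not loaded)"
fi
```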

Recommendations

Installed Packages

System
apt-get install haveged ntp sysstat irqbalance acpid
Linux KVM, openvswitch, virt-install, virt-top
apt-get install qemu-kvm libvirt-bin virtinst virt-top openvswitch-switch sysfsutils iotop gdisk iftop
bcache
apt-get install python-software-properties
add-apt-repository ppa:g2p/storage && apt-get update && apt-get install bcache-tools

Tuning memory, scheduler I/O subsystems for Linux KVM

Taken from RHEL 6 tuned (virtual-host)

/etc/sysctl.conf
kernel.sched_min_granularity_ns=10000000
kernel.sched_wakeup_granularity_ns=15000000
vm.dirty_ratio=10
vm.dirty_background_ratio=5
vm.swappiness=10
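
After running sysctl -p it is worth confirming that the running kernel really has these values. The following is a rough sketch that compares a sysctl.conf-style file against /proc/sys; the function names and the second "root" argument are made up for illustration, and the simple dot-to-slash mapping only suits plain keys like the ones above:

```sh
# Map a sysctl key to its file below /proc/sys (dots become slashes).
# $2 (optional) replaces /proc/sys; it is a hypothetical hook for testing.
sysctl_path() {
    printf '%s/%s\n' "${2:-/proc/sys}" "$(printf '%s' "$1" | tr . /)"
}

# Compare every "key=value" line of a config file with the running value.
check_sysctl_file() {
    conf="${1:-/etc/sysctl.conf}"
    root="${2:-/proc/sys}"
    [ -r "$conf" ] || { echo "cannot read $conf" >&2; return 1; }
    # Drop comments and blank lines, then strip whitespace around the key and value.
    grep -Ev '^[[:space:]]*([#;]|$)' "$conf" |
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        -e 's/[[:space:]]*=[[:space:]]*/=/' |
    while IFS='=' read -r key want; do
        file=$(sysctl_path "$key" "$root")
        if [ -r "$file" ]; then
            have=$(cat "$file")
        else
            have="(missing)"
        fi
        if [ "$have" = "$want" ]; then
            echo "ok       $key=$have"
        else
            echo "MISMATCH $key: running=$have wanted=$want"
        fi
    done
}

# Usage on the host:
#   check_sysctl_file /etc/sysctl.conf
```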

Disable experimental virtio-net zero copy transmit

RHEL 7 has experimental_zcopytx disabled by default.

/etc/modprobe.d/vhost-net.conf
options vhost_net  experimental_zcopytx=0
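
The modprobe.d option only applies the next time vhost_net is loaded, so it is worth reading the value back from sysfs. A small sketch, assuming the parameter file holds 0 or 1; the function name and the optional path argument are illustrative:

```sh
# Print the zero copy transmit state of the loaded vhost_net module.
# $1 (optional) overrides the sysfs path so the function can be tried on a plain file.
zcopytx_state() {
    param="${1:-/sys/module/vhost_net/parameters/experimental_zcopytx}"
    if [ ! -r "$param" ]; then
        echo "not-loaded"
    elif [ "$(cat "$param")" = "0" ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}

zcopytx_state
```

If this still prints "enabled", the module was probably loaded before the option file existed; reloading vhost_net with no guests running (or rebooting) picks up the new setting.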

Use virtio-blk for guests, and enable Multiqueue virtio-net (except Windows)

Linux KVM page describing Multiqueue

libvirt
<devices>
  <interface type='network'>
    <model type='virtio'/>
    <driver name='vhost' queues='4'/>
  </interface>
</devices>

The number of queues should equal the number of virtual processors assigned to the virtual machine. Don't forget to enable the vhost_net kernel module: edit /etc/default/qemu-kvm and set VHOST_NET_ENABLED=1.

Make sure to enable Multiqueue support in the guest

ethtool -L eth0 combined 4
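
Rather than hard-coding 4, the guest can derive the queue count from its vCPU count and the maximum the device advertises. A sketch under the assumption that ethtool -l prints the usual "Pre-set maximums" and "Current hardware settings" sections; both function names are my own:

```sh
# Read `ethtool -l <iface>` output on stdin and print the pre-set maximum
# number of combined channels.
max_combined() {
    awk '/^Pre-set maximums:/          { in_max = 1; next }
         /^Current hardware settings:/ { in_max = 0 }
         in_max && /^Combined:/        { print $2; exit }'
}

# Print the smaller of the vCPU count ($1) and the device maximum ($2).
pick_queues() {
    if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Usage inside the guest, as root (eth0 as in the example above):
#   max=$(ethtool -l eth0 | max_combined)
#   ethtool -L eth0 combined "$(pick_queues "$(nproc)" "$max")"
```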

Use deadline scheduler, and enable transparent hugepages for KVM

/etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="elevator=deadline transparent_hugepage=always"

Don't forget to run update-grub to apply the changes; they take effect on the next boot.
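
After the reboot, both settings can be read back from sysfs, where the active choice is shown in square brackets (for example "noop [deadline] cfq"). A minimal sketch; the helper name is made up, and the sd* glob assumes SATA/SAS-style device names:

```sh
# Print the bracketed (active) entry from a sysfs file such as
# /sys/block/sda/queue/scheduler.
active_choice() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# I/O scheduler of each disk
for f in /sys/block/sd*/queue/scheduler; do
    if [ -r "$f" ]; then
        echo "$f: $(active_choice "$f")"
    fi
done

# Transparent hugepage mode
thp=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp" ]; then
    echo "$thp: $(active_choice "$thp")"
fi
```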

For Windows guests, take advantage of Hyper-V enlightenments and use the e1000 Ethernet adapter

Linux KVM presentation on Hyper-V enlightenment (slightly outdated)

  • hv_vapic (for "supported processors") for Virtual APIC
  • hv_time (aka "hypervclock") for TSC invariant timestamps passed to guest
  • hv_relaxed to prevent BSOD under high load (when a timer can't be serviced when expected)
  • hv_spinlocks lets the guest know when a virtual processor is trying to acquire a lock on the same resource as another processor
libvirt
<features>
  <acpi/>
  <apic/>
  <hyperv>
    <relaxed state='on'/>
    <vapic state='on'/>
    <spinlocks state='on' retries='4096'/>
  </hyperv>
</features>
<clock offset='localtime'>
  <timer name='hypervclock' present='yes'/>
  <timer name='hpet' present='no'/>
</clock>

Build and install longterm Linux 3.10 kernel for stability (and working openvswitch with virtio-net)

apt-get -y install build-essential
cd /usr/local/src
wget https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.10.44.tar.xz
tar -Jxf linux-3.10.44.tar.xz
cd linux-3.10.44
cp /boot/config-`uname -r` .config
make olddefconfig
make -j`nproc` INSTALL_MOD_STRIP=1 deb-pkg
dpkg -i ../*.deb
apt-mark hold linux-libc-dev

Time keeping is king on FreeBSD – TSC and "how not to have time go backwards in guest"

/etc/sysctl.conf
kern.timecounter.hardware=ACPI-fast
/boot/loader.conf
virtio_load="YES"
virtio_pci_load="YES"
virtio_blk_load="YES"
if_vtnet_load="YES"
virtio_balloon_load="YES"
kern.timecounter.smp_tsc="1"
kern.timecounter.invariant_tsc="1"
libvirt
<clock offset='localtime'>
  <timer name='rtc' tickpolicy='catchup'/>
  <timer name='pit' tickpolicy='delay'/>
  <timer name='hpet' present='no'/>
</clock>

openvswitch and libvirt: vnet port "russian roulette" on restart (solution)

Update: This issue has been resolved in the libvirt 1.2.7 release, or commit. The instructions below are no longer required if your distribution has updated the package.

libvirt has openvswitch integration. When a virtual machine that uses openvswitch for its network port is started, libvirt creates a vnetX interface (where X is a number incrementing from 0) and destroys it again on shutdown. openvswitch's configuration is persistent: the vnetX port created by libvirt is saved to a database and is still there after the next reboot.

As outlined in my bug report submitted in September 2013, this quickly breaks down if libvirtd is shut down after openvswitch (libvirt can't delete the ports it created) or if the machine is restarted or shut down incorrectly. If you have virtual machines on different VLANs or interfaces, ports can quickly end up assigned to the wrong virtual machine, because libvirt doesn't error out if the interface already exists when it tries to create it (imagine swapping around the LAN and WAN ports on a firewall).

I solved this by creating an upstart job override on the Ubuntu LTS releases in /etc/init/openvswitch-switch.override:

post-start script
    ovs-vsctl show | grep 'Port "vnet[0-9]*"' | awk -F\" '{print $2}' | xargs -I {} ovs-vsctl del-port {} || :
end script

I've tested this issue and proven its existence in openSUSE 12.3 (Dartmouth), Debian (stable) and Ubuntu 12.04/14.04 (LTS) distributions.
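
Before wiring the cleanup into an init job, it can help to preview which ports it would remove. The sketch below only parses ovs-vsctl show output from stdin; the function name is my own, and it assumes port lines look like 'Port "vnet0"' (quoted, as on the versions I tested) or 'Port vnet0' (unquoted):

```sh
# Read `ovs-vsctl show` output on stdin and print the names of vnetN ports.
stale_vnet_ports() {
    sed -n 's/^[[:space:]]*Port "\{0,1\}\(vnet[0-9][0-9]*\)"\{0,1\}$/\1/p'
}

# Preview only (run while no guests are started):
#   ovs-vsctl show | stale_vnet_ports
#
# Delete what the preview listed:
#   ovs-vsctl show | stale_vnet_ports | while read -r port; do
#       ovs-vsctl --if-exists del-port "$port"
#   done
```

Ports such as eth0 or the bridge's internal port never match the pattern, so only libvirt's leftovers are touched.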

I/O caching under QEMU KVM virtualization on Linux

Caching modes in QEMU

Mode          Host page cache   Guest disk write cache
none          off               on
writethrough  on                off
writeback     on                on
unsafe        on                ignored

Considerations

  • device.virtio-disk0.config-wce=off (qemu) or config-wce=off (libvirt) prevents guest from setting the write cache
  • Use cache=none for local RAW storage, cache=writethrough for NFS/iSCSI backed storage
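
As an illustration of the second point, here is a small sketch that builds a qemu -drive argument with the cache mode matching the storage type. The function name, the "local"/"network" labels and the example paths are all hypothetical; only the file, if, format and cache keys are real -drive options:

```sh
# Build a -drive argument with the cache mode suggested above.
# $1 = image or block device path, $2 = "local" or "network" (labels made up for this sketch)
drive_arg() {
    case "$2" in
        local)   cache="none" ;;          # local raw storage
        network) cache="writethrough" ;;  # NFS/iSCSI backed storage
        *) echo "unknown storage kind: $2" >&2; return 1 ;;
    esac
    printf 'file=%s,if=virtio,format=raw,cache=%s\n' "$1" "$cache"
}

# Hypothetical paths, for illustration only
drive_arg /dev/vg0/guest1 local
drive_arg /mnt/nfs/guest2.img network
```

The output of each call would be passed to qemu as the value of a -drive option.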

Networking with a gateway not on the local subnet on NetBSD at OVH

NetBSD has a networking FAQ that outlines how to do "Networking with a gateway not on the local subnet". Unfortunately, the recipe they provide doesn't actually work "in the real world": the route command they give does not make the network stack send an ARP who-has for the gateway's IP address, so you have to set the gateway's MAC address statically.

I figured out a work-around for this, based on some insight from people on the NetBSD tech-talk mailing list. This allows you to use NetBSD as a guest operating system on providers such as OVH and Hetzner:

# ifconfig fxp0 inet 10.0.0.1 
# route add -net 192.168.0.1/32 -cloning -link fxp0 -iface 
# route add default -ifa 10.0.0.1 192.168.0.1

The trick was to use route cloning, and to use a net definition instead of a host definition. Now NetBSD will send an ARP who-has request for the gateway IP address.

To supplement the OVH bridge client guide that is available on their Wiki, the same recipe fits into the following template:

# ifconfig fxp0 inet Fail.over.IP netmask 255.255.255.255 broadcast Fail.over.IP 
# route add -net Your.Server.IP.254/32 -cloning -link fxp0 -iface 
# route add default -ifa Fail.over.IP Your.Server.IP.254

This should allow you to use NetBSD as a guest and not get blocked by OVH robots that check for too many ARP requests.
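
To avoid typos when filling in the template, the three commands can be generated from the interface, fail-over IP and gateway. This is only a sketch: the function name is made up, and the addresses in the example are documentation addresses rather than real OVH values:

```sh
# Print the NetBSD commands from the template above.
# $1 = interface, $2 = fail-over IP, $3 = gateway (your server IP ending in .254)
ovh_route_cmds() {
    printf 'ifconfig %s inet %s netmask 255.255.255.255 broadcast %s\n' "$1" "$2" "$2"
    printf 'route add -net %s/32 -cloning -link %s -iface\n' "$3" "$1"
    printf 'route add default -ifa %s %s\n' "$2" "$3"
}

# Example with placeholder addresses; review the output, then run it as root on the guest.
ovh_route_cmds fxp0 192.0.2.10 198.51.100.254
```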

What Linux/*BSD distributions have Syncookies enabled by default?

In light of the recently published article on Quick Blind TCP Connection Spoofing with SYN Cookies, I wanted to see what operating systems and distributions have Syncookies enabled by default.

Distribution        Sysctl                    Default
Ubuntu Linux 12.04  net.ipv4.tcp_syncookies   On
Debian Linux 6      net.ipv4.tcp_syncookies   Off
Debian Linux 7      net.ipv4.tcp_syncookies   On
CentOS 5            net.ipv4.tcp_syncookies   On
CentOS 6            net.ipv4.tcp_syncookies   On
FreeBSD 8           net.inet.tcp.syncookies   On
Solaris 10          Not implemented           Off
OpenBSD 5.3         Not implemented           Off

I'm not sure that turning off Syncookies is the best idea, due to the potential DoS effects from disabling them – applications should use something besides IP addresses for authentication.
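
To check a Linux machine yourself, the value can be read straight from /proc. A minimal sketch with a helper name of my own; on Linux, 0 means off and 1 means cookies are sent once the SYN backlog overflows, while newer kernels also accept 2 for sending them unconditionally:

```sh
# Translate the numeric value of net.ipv4.tcp_syncookies into words.
describe_syncookies() {
    case "$1" in
        0) echo "off" ;;
        1) echo "on (when the SYN backlog overflows)" ;;
        2) echo "on (unconditionally)" ;;
        *) echo "unknown value: $1" ;;
    esac
}

f=/proc/sys/net/ipv4/tcp_syncookies
if [ -r "$f" ]; then
    describe_syncookies "$(cat "$f")"
fi
```

On FreeBSD the equivalent check is sysctl -n net.inet.tcp.syncookies.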

KVM PCI Passthrough of an AHCI SATA controller to a guest causing data corruption

I recently migrated from VMware ESXi to Linux KVM, where I was using PCI Passthrough under VMware ESXi to pass through an Intel AHCI SATA controller to a guest. I implemented the same setup by enabling IOMMU on the KVM host, and passed through the AHCI SATA controller to the guest.

After a week or two, I started seeing the following messages in /var/log/syslog on the guest:

Aug  6 13:25:28 yama kernel: [78351.258573] XFS (md0): Corruption detected. Unmount and run xfs_repair
Aug  6 13:25:28 yama kernel: [78351.259102] XFS (md0): Corruption detected. Unmount and run xfs_repair
Aug  6 13:25:28 yama kernel: [78351.259616] XFS (md0): metadata I/O error: block 0x31214bd0 ("xfs_trans_read_buf_map") error 117 numblks 16
Aug  6 13:25:28 yama kernel: [78351.260203] XFS (md0): xfs_imap_to_bp: xfs_trans_read_buf() returned error 117.
Aug  6 13:29:10 yama kernel: [78573.533933] XFS (md0): Invalid inode number 0xfeffffffffffffff
Aug  6 13:29:10 yama kernel: [78573.533940] XFS (md0): Internal error xfs_dir_ino_validate at line 160 of file /build/buildd/linux-lts-raring-3.8.0/fs/xfs/xfs_dir2.c.  Caller 0xffffffffa045cd96
Aug  6 13:29:10 yama kernel: [78573.533940]
Aug  6 13:29:10 yama kernel: [78573.538440] Pid: 1723, comm: kworker/0:1H Tainted: GF            3.8.0-27-generic #40~precise3-Ubuntu
Aug  6 13:29:10 yama kernel: [78573.538443] Call Trace:
Aug  6 13:29:10 yama kernel: [78573.538496]  [<ffffffffa042316f>] xfs_error_report+0x3f/0x50 [xfs]
Aug  6 13:29:10 yama kernel: [78573.538537]  [<ffffffffa045cd96>] ? __xfs_dir2_data_check+0x1e6/0x4a0 [xfs]
Aug  6 13:29:10 yama kernel: [78573.538560]  [<ffffffffa045a150>] xfs_dir_ino_validate+0x90/0xe0 [xfs]
Aug  6 13:29:10 yama kernel: [78573.538579]  [<ffffffffa045cd96>] __xfs_dir2_data_check+0x1e6/0x4a0 [xfs]
Aug  6 13:29:10 yama kernel: [78573.538598]  [<ffffffffa045d0ca>] xfs_dir2_data_verify+0x7a/0x90 [xfs]
Aug  6 13:29:10 yama kernel: [78573.538637]  [<ffffffff810135aa>] ? __switch_to+0x12a/0x4a0
Aug  6 13:29:10 yama kernel: [78573.538664]  [<ffffffffa045d195>] xfs_dir2_data_reada_verify+0x95/0xa0 [xfs]
Aug  6 13:29:10 yama kernel: [78573.538675]  [<ffffffff8108e2aa>] ? finish_task_switch+0x4a/0xf0
Aug  6 13:29:10 yama kernel: [78573.538697]  [<ffffffffa042133f>] xfs_buf_iodone_work+0x3f/0xa0 [xfs]
Aug  6 13:29:10 yama kernel: [78573.538706]  [<ffffffff81078c21>] process_one_work+0x141/0x490
Aug  6 13:29:10 yama kernel: [78573.538710]  [<ffffffff81079be8>] worker_thread+0x168/0x400
Aug  6 13:29:10 yama kernel: [78573.538714]  [<ffffffff81079a80>] ? manage_workers+0x120/0x120
Aug  6 13:29:10 yama kernel: [78573.538721]  [<ffffffff8107f0f0>] kthread+0xc0/0xd0
Aug  6 13:29:10 yama kernel: [78573.538726]  [<ffffffff8107f030>] ? flush_kthread_worker+0xb0/0xb0
Aug  6 13:29:10 yama kernel: [78573.538730]  [<ffffffff816fc6ac>] ret_from_fork+0x7c/0xb0
Aug  6 13:29:10 yama kernel: [78573.538735]  [<ffffffff8107f030>] ? flush_kthread_worker+0xb0/0xb0

I initially used xfs_repair on the file system, thinking that the issue was caused by a number of power failures that happened when the machine was running ESXi. However, this did not resolve the issue and made the problem worse. Eventually I decided to scrap the file system, and pulled a drive from the array to back up the data and re-create the file system.

The drive that I pulled from the array for backups started showing the same issues with XFS corruption.

After further investigation via trial-and-error, I determined that KVM PCI Passthrough was causing the issue and decided to just pass through an array to the guest using virtio-blk. This solved the corruption problem and I haven't had any issues (knock on wood) since!

Accessing USB devices as non-root: writing udev rules the easy way

I recently purchased a TEMPer USB thermometer, which I wanted to use as non-root with an open source utility called TEMPered. All the recipes I found required root to access the /dev/hidraw0 device that this particular TEMPer USB device exposes, which of course was not acceptable.

systemd (and udev, in general – I believe) has a handy utility called udevadm. You can use this tool to query a device on your system, for example:

udevadm info --query=all --name=/dev/hidraw0 --attribute-walk

This lets you retrieve all the attributes required to craft a rules file to put in /etc/udev/rules.d. I created the following to expose the PCsensor TEMPerV1.4 to any user who is part of the group temper:

# TEMPer1.4 USB thermometer
SUBSYSTEM=="hidraw", ATTRS{idVendor}=="0c45", ATTRS{idProduct}=="7401", GROUP="temper", MODE="0660"

I placed this in a file called /etc/udev/rules.d/60-temper.rules. You can now use TEMPered as a non-root user who is a member of the group in question!
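
The same pattern works for other hidraw devices once you have the vendor and product IDs from udevadm. A small sketch that generates the rule line; the function name and the user name "alice" are hypothetical, while the udevadm and group commands in the comments are the standard ones:

```sh
# Print a udev rule granting a group read/write access to a hidraw device.
# $1 = idVendor, $2 = idProduct, $3 = group name
hidraw_rule() {
    printf 'SUBSYSTEM=="hidraw", ATTRS{idVendor}=="%s", ATTRS{idProduct}=="%s", GROUP="%s", MODE="0660"\n' "$1" "$2" "$3"
}

hidraw_rule 0c45 7401 temper

# As root, to put it into effect ("alice" is a placeholder user):
#   groupadd temper
#   usermod -a -G temper alice
#   hidraw_rule 0c45 7401 temper > /etc/udev/rules.d/60-temper.rules
#   udevadm control --reload-rules
#   udevadm trigger
```

The user needs to log in again before the new group membership applies.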

Hetzner Online requests copy of passport, or ID card for VPS and dedicated server orders

I signed up for a VPS at Hetzner Online to use as a secondary name server for my hosting. I provided them with valid personal information for signup, and opted to pay using PayPal. I received the following email from them:

Dear Mr. Kieser,

thank you very much for your order!

Since you're a new customer with Hetzner, we ask you for a scan of your passport or ID card (authenticity check).
It's only necessary for your first order with us.

Please send the scan by fax or as an email attachment.

We are going to save the document submitted for a period of 3 weeks.

Sincerely yours,

Hetzner Online AG

Considering that their services have been compromised and their users' data has been copied, would you provide a copy of your passport to them? I certainly would not. I responded asking them to either cancel my order or accept my S/MIME signature, which has been verified by a certificate authority as me being me.

Experience migrating from VMware ESXi to KVM in a production environment

My notes from setting up a production KVM environment, after migrating from VMware ESXi 5.1 to Ubuntu 12.04.2 64-bit with Linux Kernel 3.2 and QEMU 1.4.2, and open vSwitch.

General

Disk I/O throughput and performance characteristics

  • Always use LVM backed storage (which is aligned), with cache='none' and io='native' (aio) for guests. Disabling cache allows the host system to properly schedule disk reads and writes
  • Use the deadline I/O scheduler for host systems, and vm.swappiness = 0 on the host (or equivalent) to reduce pressure on I/O resources and make use of host memory
  • Use virtio for bus type to allow direct access to storage instead of going through QEMU, if supported by guest operating system drivers

Processor

  • Pass through CPU flags to guest to take advantage of newer instruction sets, assuming host hardware is the same or migration is not going to be used (-cpu host)

Network

  • Use virtio Network adapters (except with Windows) to realise full throughput and lower latency on guest operating systems, where support is available (Linux 2.6+, FreeBSD)
  • Load vhost_net kernel module on host, which permits direct access to network devices skipping QEMU (libvirt will detect if vhost_net is enabled, and add vhost=on to qemu command line by default)

Software

  • Linux Kernel 3.5, distributed with Ubuntu 12.04.2 does not support building open-vswitch – you must install Kernel 3.5 for the DKMS to properly build
  • Build QEMU from source to include new functionality and Hyper-V enhancements for Windows guests, using the 1.4 stable branch – 1.5 does not work with libvirt due to the way QEMU help parameters are parsed by the library

Guest

Linux

  • vm.swappiness = 0

FreeBSD

  • vm.defer_swapspace_pageouts = 1
  • kern.timecounter.hardware=ACPI-fast
  • kern.timecounter.smp_tsc="1"
  • kern.timecounter.invariant_tsc="1"
  • Use prebuilt virtio drivers, or compile from ports under emulators/virtio-kmod after each system upgrade

Windows

  • Disable HPET via qemu (-no-hpet) or libvirt configuration, force use of TSC to reduce time drift in guest
  • If you are using qemu 1.4 or greater, enable CPU flags (Hyper-V shims) hv_vapic,hv_relaxed,hv_spinlocks=0xffff on top of disabling HPET
  • Install the Memory Ballooning service and drivers, and the SCSI (virtio) drivers, using the stable branch from Red Hat
  • Use e1000 Ethernet for Windows 2008 R2 to avoid high latency/freezing of guest operating system

libvirt

  • Add xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0' to the <domain> element if you are going to add custom qemu command-line arguments
  • If you are using Ubuntu, and want to change the version of qemu you are going to use – you will either need to disable AppArmor, or update the profile to include the directory you've installed the alternative qemu version to

Migration

  • Make sure to uninstall VMware Tools in the guest environment after you have migrated, and install acpid on the guest to allow graceful shutdowns (if Linux)
  • Take a snapshot of the virtual machine – transfer the VMDK, and use a utility such as vmdksync to merge the deltas after shutting down the VM for the file migration to reduce downtime
  • For Windows VMs, make sure to add a dummy virtio SCSI and Ethernet device so you can install drivers and then switch the root drive to virtio

Sophos AntiVirus (SAVDI) and amavisd-new for AntiVirus on email

Update: Based on an email I received, I've updated this post with more relevant information regarding setting up SAVDI with amavisd-new.

I recently migrated to using Postfix with amavisd-new on Ubuntu Linux, and was looking at integrating Sophos AntiVirus with amavisd-new. The amavisd-new shipping with the LTS release of Ubuntu is 2.6.5, which does not include SPPP functionality for communicating with savdi, so you must use the Sophie protocol.

The following components were used for setting up this functionality with amavisd-new from MySophos Download & Updates:

This post is assuming that you have setup amavisd-new on your system, and have it integrated with Postfix or equivalent MTA already.

savdid.conf:

channel {
commprotocol {
type: UNIX
socket: /var/run/savdid/savdid.sock
user: amavis
group: amavis
requesttimeout: 120
sendtimeout: 2
recvtimeout: 5
}

scanprotocol {
type: SOPHIE
allowscandir: SUBDIR
maxscandata: 500000
maxmemorysize: 250000
tmpfilestub: /tmp/savid_tmp
}

scanner {
type: SAVI
inprocess: YES
maxscantime: 3
maxrequesttime: 10
deny: /dev
deny: /home
savigrp: GrpArchiveUnpack 0
savigrp: GrpInternet 1
savists: Xml 1
}
}

This should permit amavisd-new to communicate with the savdid interface. Don't forget to create the appropriate init.d script to start savdid on boot, and make sure to create the /var/run/savdid directory, as Debian/Ubuntu clean /var/run on system startup. Please download the init script from here and place it in /etc/init.d/savdid with the executable bit set.

amavisd-new communicates with SAVI using the Sophie protocol. To enable this support, edit /etc/amavis/conf.d/15-av_scanners and add the following lines:

  ['Sophie',
    \&ask_daemon, ["{}/\n", '/var/run/savdid/savdid.sock'],
    qr/(?x)^ 0+ ( : | [\000\r\n]* $)/m,  qr/(?x)^ 1 ( : | [\000\r\n]* $)/m,
    qr/(?x)^ [-+]? \d+ : (.*?) [\000\r\n]* $/m ],

Please be aware that the socket line is being pointed at the place where we have SAVI listening for connections, based on our previous post.