Running macOS as guest in kvm

There are various approaches to run macOS as a guest under kvm. One is to add Apple-specific features to OVMF, as described by Gabriel L. Somlo. I’ve chosen to use the Clover EFI bootloader instead. Here is what my setup looks like.

What you need

First, a bootable installer disk image. You can create a bootable usb stick using the createinstallmedia tool shipped with the installer, then dd the usb stick to a raw disk image.
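For example (the installer app name depends on the macOS version, and the usb stick device node will differ on your system, so treat the paths as placeholders):

# on the mac: write the installer to the usb stick
sudo "/Applications/Install macOS Sierra.app/Contents/Resources/createinstallmedia" \
    --volume /Volumes/MyUSB \
    --applicationpath "/Applications/Install macOS Sierra.app"

# on the linux box: copy the usb stick to a raw disk image
# (assuming the stick shows up as /dev/sdb)
sudo dd if=/dev/sdb of=macos-installer.raw bs=1M status=progress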

Next, a clover disk image. I’ve created a script which uses guestfish to generate a disk image from a clover iso image, with a custom config file. The advantage of having a separate disk just for clover is that you can easily update, downgrade and tweak the clover configuration without booting the virtual machine. So, if something goes wrong, recovering is a lot easier.

Qemu. Version 2.10 (or newer) is strongly recommended. macOS versions up to 10.12.3 work fine with qemu 2.9. macOS 10.12.4 requires fixes for the qemu applesmc emulation which were merged for qemu 2.10.

OVMF. Latest git works fine for me. Older OVMF versions trip over a recent qemu update and then provide broken ACPI tables to the OS, with the result that macOS doesn’t boot, even though ovmf itself shows no signs of trouble.

Configuring your virtual machine

Here are snippets of my libvirt config, with comments explaining the important things:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

The xmlns:qemu entry is needed for qemu-specific tweaks; that way we can ask libvirt to add extra arguments to the qemu command line.

  <os>
    <type arch='x86_64' machine='pc-q35-2.9'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram template='/usr/share/edk2.git/ovmf-x64/OVMF_VARS-pure-efi.fd'>/var/lib/libvirt/qemu/nvram/macos-test-org-base_VARS.fd</nvram>
    <bootmenu enable='yes'/>
  </os>

Using the q35 machine type here, and the cutting edge edk2 builds from my firmware repo.

  <cpu mode='custom' match='exact' check='partial'>
    <model fallback='allow'>Penryn</model>
    <feature policy='require' name='invtsc'/>
  </cpu>

CPU. Penryn is known-good. The invtsc feature is needed because macOS uses the TSC for timekeeping. When asked to provide a fixed TSC frequency, qemu stores the TSC frequency in a hypervisor cpuid leaf, and macOS picks it up there. Without that macOS takes a wild guess, likely gets it very wrong, and the wall clock in your guest runs either way too fast or way too slow.

    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
      <source dev='/dev/path/to/lvm/volume'/>
      <target dev='sda' bus='sata'/>
    </disk>

This is the system disk macOS is installed on, attached as a sata disk to the q35 ahci controller.

You also need the installmedia image. On a real macintosh you’d typically use a usb stick. macOS doesn’t care much where the installer is stored though, so you can attach the image as a sata disk too and it’ll work fine.

Finally you need the clover disk image. edk2 alone isn’t able to boot from the system or installer disk, so it’ll start clover. clover in turn will load the hfs+ filesystem driver so the other disks can be accessed, offer a boot menu, and allow booting macOS.
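Both images are attached the same way as the system disk. A minimal sketch (paths and target devices are just examples):

    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/macos-installer.raw'/>
      <target dev='sdb' bus='sata'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/clover.raw'/>
      <target dev='sdc' bus='sata'/>
    </disk>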

    <interface type='network'>
      <source network='default'/>
      <model type='e1000-82545em'/>
    </interface>

Ethernet card. macOS has drivers for this model. It seems to have problems with link detection now and then. Setting the link status to down for a moment, then up again (using virsh domif-setlink) gets the virtual machine online.
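A quick sketch of that link cycle; domain and interface names are examples, virsh domiflist shows the actual interface name:

virsh domif-setlink macos vnet0 down
virsh domif-setlink macos vnet0 up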

    <input type='tablet' bus='usb'/>
    <input type='keyboard' bus='usb'/>

USB tablet and keyboard as input devices. The tablet allows operating the mouse without pointer grabs, which is much more convenient than using a virtual mouse.

    <video>
      <model type='vga' vram='65536'/>
    </video>

Qemu standard vga.

  <qemu:commandline>
    <qemu:arg value='-readconfig'/>
    <qemu:arg value='/path/to/macintosh.cfg'/>
  </qemu:commandline>

This is the extra configuration item for stuff not supported by libvirt. The macintosh.cfg file looks like this, adding the emulated smc device:

[device "smc"]
driver = "isa-applesmc"
osk = "<insert-real-osk-here>"

You can run Gabriel’s SmcDumpKey tool on a macintosh to figure out what the osk is.

Configuring clover

I’m using config.plist.stripped.qemu as starting point. Here are the most important settings:

Boot/Arguments
Kernel command line. There are lots of options you can start the kernel with. You might want to try "-v" to start the kernel in verbose mode where it prints boot messages to the screen (like the linux kernel without "quiet"). For trouble-shooting, or to impress your friends.
Boot/DefaultVolume
Name of the volume clover should boot from by default. Put your system disk name here, otherwise clover will wait forever for you to pick a boot menu item.
GUI/ScreenResolution
As the name says, the display resolution. Note that OVMF has a display resolution setting too. Hit ESC at the splash screen to enter the setup, then go to Device Manager / OVMF Platform Configuration / Change Preferred Resolution. The two settings must match, otherwise macOS will boot with a scrambled display.
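Put together, the relevant fragment of config.plist looks roughly like this; the key names are the ones listed above, the volume name and resolution are just examples:

<key>Boot</key>
<dict>
        <key>Arguments</key>
        <string>-v</string>
        <key>DefaultVolume</key>
        <string>MacHDD</string>
</dict>
<key>GUI</key>
<dict>
        <key>ScreenResolution</key>
        <string>1280x1024</string>
</dict>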

Go!

That should be it. Boot the virtual machine. Installing and using macOS should work as usual.

Final comment

Gabriel also has some comments on the legal side of this. Summary: probably allowed by Apple on macintosh hardware, i.e. you can run linux on your macintosh, then run macOS as a guest there. If in doubt check with your lawyer.

Fresh Fedora 26 images uploaded

Fedora 26 is out of the door, and here are fresh fedora 26 images.

There are raspberry pi images. The aarch64 image requires a model 3, the armv7 image boots on both model 2 and model 3. Unlike the images for previous fedora releases, the new images use the standard fedora kernels instead of a custom kernel. So, the kernel update service for the older images will stop within the next few weeks.

There are efi images for qemu. The i386 and x86_64 images use systemd-boot as bootloader; grub2 doesn’t work due to bug 1196114 (unless you create a boot menu entry manually in uefi setup). The arm images use grub2 as bootloader; armv7 isn’t supported by systemd-boot in the first place, and the aarch64 version throws an exception. The efi images can also be booted as containers, using "systemd-nspawn --boot --image <file>", but you have to convert them to raw first.
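The conversion is a one-liner; a sketch, assuming the image was downloaded in qcow2 format:

qemu-img convert -O raw <image>.qcow2 <image>.raw
systemd-nspawn --boot --image <image>.raw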

The images don’t have a root password. You have to set one using virt-customize -a <image> --root-password "password:<secret>", otherwise you can’t login after boot.

The images have been created with imagefish.

meson experiments

Seems the world of build systems is changing. The traditional approach is make, pimped up with autoconf and automake. But some newcomers provide new approaches to building your projects.

First, there is ninja-build. It’s a workhorse, roughly comparable to make. It isn’t really designed to be used standalone though. Typically the low-level ninja build files are generated by some high-level build tool, similar to how Makefiles are generated by autotools.

Second, there is meson, a build tool which (on unix) uses ninja as backend by default. meson appears to be getting pretty popular.

So, let’s have a closer look at it. I’m working on drminfo right now, a tool to dump information about drm devices, which also comes with a simple test tool rendering a test image to the display. It is pretty small and doesn’t even use autotools, so it is perfect for trying out something new. Also nice for this post, as the build files are pretty small.

So, here is the Makefile:

CC      ?= gcc
CFLAGS  ?= -Os -g -std=c99
CFLAGS  += -Wall

TARGETS := drminfo drmtest gtktest

drminfo : CFLAGS += $(shell pkg-config --cflags libdrm cairo pixman-1)
drminfo : LDLIBS += $(shell pkg-config --libs libdrm cairo pixman-1)

drmtest : CFLAGS += $(shell pkg-config --cflags libdrm gbm epoxy cairo cairo-gl pixman-1)
drmtest : LDLIBS += $(shell pkg-config --libs libdrm gbm epoxy cairo cairo-gl pixman-1)
drmtest : LDLIBS += -ljpeg

gtktest : CFLAGS += $(shell pkg-config --cflags gtk+-3.0 cairo pixman-1)
gtktest : LDLIBS += $(shell pkg-config --libs gtk+-3.0 cairo pixman-1)
gtktest : LDLIBS += -ljpeg

all: $(TARGETS)

clean:
        rm -f $(TARGETS)
        rm -f *~ *.o

drminfo: drminfo.o drmtools.o
drmtest: drmtest.o drmtools.o render.o image.o
gtktest: gtktest.o render.o image.o

Thanks to pkg-config there is no need to use autotools just to figure out the cflags and libraries needed, and the Makefile is short and easy to read. The only thing here you might not be familiar with are target-specific variables.

Now, compare with the meson.build file:

project('drminfo', 'c')

# pkg-config deps
libdrm_dep    = dependency('libdrm')
gbm_dep       = dependency('gbm')
epoxy_dep     = dependency('epoxy')
cairo_dep     = dependency('cairo')
cairo_gl_dep  = dependency('cairo-gl')
pixman_dep    = dependency('pixman-1')
gtk3_dep      = dependency('gtk+-3.0')

# libjpeg dep
jpeg_dep      = declare_dependency(link_args : '-ljpeg')

drminfo_srcs  = [ 'drminfo.c', 'drmtools.c' ]
drmtest_srcs  = [ 'drmtest.c', 'drmtools.c', 'render.c', 'image.c' ]
gtktest_srcs  = [ 'gtktest.c', 'render.c', 'image.c' ]

drminfo_deps  = [ libdrm_dep, cairo_dep, pixman_dep ]
drmtest_deps  = [ libdrm_dep, gbm_dep, epoxy_dep,
                  cairo_dep, cairo_gl_dep, pixman_dep, jpeg_dep ]
gtktest_deps  = [ gtk3_dep,
                  cairo_dep, cairo_gl_dep, pixman_dep, jpeg_dep ]

executable('drminfo',
           sources      : drminfo_srcs,
           dependencies : drminfo_deps)
executable('drmtest',
           sources      : drmtest_srcs,
           dependencies : drmtest_deps)
executable('gtktest',
           sources      : gtktest_srcs,
           dependencies : gtktest_deps,
           install      : false)

Pretty straightforward translation. So, what are the differences?

First, meson and ninja have built-in support for a bunch of features. No need to put anything into your build files to use them, they are just there:

  • Automatic header dependencies. When a header file changes all source files which include it get rebuilt.
  • Automatic build system dependencies: When meson.build changes the ninja build files are updated.
  • Rebuilds on command changes: When the build command line for a target changes the target is rebuilt.

Sure, you can do all that with make too, the linux kernel build system does it for example. But then your Makefiles will be an order of magnitude larger than the one shown above, because all the clever stuff is in the build files instead of the build tool.

Second, meson keeps the object files strictly separated by target. The project has some source files shared by multiple executables; drmtools.c for example is used by both drminfo and drmtest. With the Makefile above it gets built once. meson builds it separately for each target, with the cflags for that specific target.

Another nice feature is that ninja automatically does parallel builds. It figures out the number of processors available and runs (by default) that many jobs.

Overall I’m pretty pleased, and I’ll probably use meson more frequently in the future. If you want to try it out too, I’d suggest starting with the tutorial.
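For the record, configuring and building with meson boils down to this (run from the source directory):

meson build        # configure, generates the ninja files in build/
ninja -C build     # compile, in parallel by default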

Fedora 25 images for qemu and raspberry pi 3 uploaded

I’ve uploaded three new images to https://www.kraxel.org/repos/rpi2/images/.

The fedora-25-rpi3 image is for the raspberry pi 3.
The fedora-25-efi images are for qemu (virt machine type with edk2 firmware).

The images don’t have a root password set. You must use libguestfs-tools to set the root password …

virt-customize -a <image> --root-password "password:<your-password-here>"

… otherwise you can’t login after boot.

The rpi3 image is partitioned similar to the official (armv7) fedora 25 images: the firmware and uboot are on a separate vfat partition (mounted at /boot/fw), and /boot is an ext2 filesystem now which holds only the kernels. Well, for compatibility with the f24 images (all images use the same kernel rpms) firmware files are in /boot too, but they are not used. So, if you want to tweak something in config.txt, go to /boot/fw, not /boot.

The rpi3 images also have swap commented out in /etc/fstab. The reason is that the swap partition must be reinitialized: swap partitions created while running a 64k pages kernel (CONFIG_ARM64_64K_PAGES=y) are not compatible with 4k pages (CONFIG_ARM64_4K_PAGES=y). This is fixed by running “swapon --fixpgsz <device>” once, then you can uncomment the swap line in /etc/fstab.
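A sketch of the whole fix, with the device name as a placeholder (check /etc/fstab for the real one):

swapon --fixpgsz /dev/mmcblk0p3
# now uncomment the swap line in /etc/fstab, then:
swapon -a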

local display for intel vgpu starts working

The intel vgpu guest display shows up in the qemu gtk window. This is a linux kernel booted to the dracut emergency shell prompt, with the framebuffer console running @ inteldrmfb. Booting a full linux guest with X11 and/or wayland is not tested yet. There are rendering glitches too, clearly visible when running “dmesg”.

So, a bunch of issues to solve before this is ready for users, but it’s a nice start.

For the brave:

host kernel: https://www.kraxel.org/cgit/linux/log/?h=intel-vgpu
qemu: https://www.kraxel.org/cgit/qemu/log/?h=work/intel-vgpu

Take care, both branches are moving targets (aka: rebasing at times).

tweak arm images with libguestfs-tools

So, when using the official fedora arm images on your raspberry pi (or any other arm board) you might have faced the problem that it is not easy to use them for a headless machine (i.e. no keyboard and display connected). There is no default password; fedora asks you to set one on the first boot instead. Which is surely better from a security point of view than shipping with a fixed password. But for headless machines it is quite inconvenient …

Luckily there is an easy way out: libguestfs-tools. The tools were created to configure virtual machine images (this is where the name comes from), but they work fine with sdcards too.

I’m using a usb sdcard reader which shows up as /dev/sdc on my system. I can just pass /dev/sdc as image to the tools (take care, the device is probably something else for you). For example, to set a root password:

virt-customize -a /dev/sdc --root-password "password:<your-password-here>"

The initial setup on the first boot is a systemd service, and it can be turned off by simply removing the symlinks which enable the service:

virt-customize -a /dev/sdc \
  --delete /etc/systemd/system/multi-user.target.wants/initial-setup.service \
  --delete /etc/systemd/system/graphical.target.wants/initial-setup.service

You can use virt-copy-in (or virt-tar-in) to copy config files to the disk image. Small (or empty) configuration files can also be created with the write command:

virt-customize -a /dev/sdc --write "/.autorelabel:"

Adding the .autorelabel file will force selinux relabeling on the first boot (takes a while). It is a good idea to do that in case you copy files to the sdcard, to make sure the new files are labeled correctly. Especially in case you copy security-sensitive things like ssh keys or ssh config files. Without relabeling selinux will not allow sshd to access those files, which in turn can break remote logins.
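For example, installing a ssh key for root could look like this (assuming the authorized_keys file sits in the current directory):

virt-customize -a /dev/sdc --mkdir /root/.ssh --chmod 0700:/root/.ssh
virt-copy-in -a /dev/sdc authorized_keys /root/.ssh
virt-customize -a /dev/sdc --write "/.autorelabel:"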

There is a lot more the virt-* tools can do for you. Check out the manual pages for more info. And you can easily script things: virt-customize has a --commands-from-file switch which accepts a file with a list of commands.
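So the tweaks from this post could go into a single command file, using the same syntax as the command line switches, just without the leading dashes:

root-password password:<secret>
delete /etc/systemd/system/multi-user.target.wants/initial-setup.service
delete /etc/systemd/system/graphical.target.wants/initial-setup.service
write /.autorelabel:

Then run everything in one go with "virt-customize -a /dev/sdc --commands-from-file <file>".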

virtual gpu support landing upstream

The upstreaming process of virtual gpu support (vgpu) made a big step forward with the 4.10 merge window. Two important pieces have been merged:

First, the mediated device framework (mdev). Basically this allows kernel drivers to present virtual pci devices, using the vfio framework and interfaces. Both nvidia and intel will use mdev to partition the physical gpu of the host into multiple virtual devices which then can be assigned to virtual machines.

Second, intel landed initial mdev support for the i915 driver. There is quite some work left for future kernel releases though. Accessing the guest display is not supported yet, so you must run x11vnc or similar tools in the guest to see the screen. Also there are some stability issues left to find and fix.

If you want to play with this nevertheless, here is how to do it. But be prepared for crashes, and better don’t try this on a production machine.

On the host: create virtual devices

On the host machine you obviously need a 4.10 kernel. Also the intel graphics device (igd) must be broadwell or newer. In the kernel configuration enable vfio and mdev (all CONFIG_VFIO_* options). Enable CONFIG_DRM_I915_GVT and CONFIG_DRM_I915_GVT_KVMGT for intel vgpu support. Building the mtty sample driver (CONFIG_SAMPLE_VFIO_MDEV_MTTY, a virtual serial port) can be useful too, for testing.

Boot the new kernel. Load all modules: vfio-pci, vfio-mdev, optionally mtty. Also i915 and kvmgt of course, but that probably happened during boot already.
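Something like this should do (modprobe resolves dependencies, so kvmgt pulls in i915 in case it isn’t loaded yet):

modprobe vfio-pci
modprobe vfio-mdev
modprobe kvmgt
modprobe mtty          # optional, for testing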

Go to the /sys/class/mdev_bus directory. This should look like this:

kraxel@broadwell ~# cd /sys/class/mdev_bus
kraxel@broadwell .../class/mdev_bus# ls -l
total 0
lrwxrwxrwx. 1 root root 0 17. Jan 10:51 0000:00:02.0 -> ../../devices/pci0000:00/0000:00:02.0
lrwxrwxrwx. 1 root root 0 17. Jan 11:57 mtty -> ../../devices/virtual/mtty/mtty

Each driver with mdev support has a directory there. Go to $device/mdev_supported_types to check what kind of virtual devices you can create.

kraxel@broadwell .../class/mdev_bus# cd 0000:00:02.0/mdev_supported_types
kraxel@broadwell .../0000:00:02.0/mdev_supported_types# ls -l
total 0
drwxr-xr-x. 3 root root 0 17. Jan 11:59 i915-GVTg_V4_1
drwxr-xr-x. 3 root root 0 17. Jan 11:57 i915-GVTg_V4_2
drwxr-xr-x. 3 root root 0 17. Jan 11:59 i915-GVTg_V4_4

As you can see, intel supports three different configurations on my machine. The configurations differ in the amount of video memory (basically) and in the number of instances you can create. Check the description and available_instance files in the directories:

kraxel@broadwell .../0000:00:02.0/mdev_supported_types# cd i915-GVTg_V4_2
kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# cat description 
low_gm_size: 64MB
high_gm_size: 192MB
fence: 4
kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# cat available_instance 
2

Now it is possible to create virtual devices by writing a UUID into the create file:

kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# uuid=$(uuidgen)
kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# echo $uuid
f321853c-c584-4a6b-b99a-3eee22a3919c
kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# sudo sh -c "echo $uuid > create"

The new vgpu device will show up as subdirectory of the host gpu:

kraxel@broadwell .../mdev_supported_types/i915-GVTg_V4_2# cd ../../$uuid
kraxel@broadwell .../0000:00:02.0/f321853c-c584-4a6b-b99a-3eee22a3919c# ls -l
total 0
lrwxrwxrwx. 1 root root    0 17. Jan 12:31 driver -> ../../../../bus/mdev/drivers/vfio_mdev
lrwxrwxrwx. 1 root root    0 17. Jan 12:35 iommu_group -> ../../../../kernel/iommu_groups/10
lrwxrwxrwx. 1 root root    0 17. Jan 12:35 mdev_type -> ../mdev_supported_types/i915-GVTg_V4_2
drwxr-xr-x. 2 root root    0 17. Jan 12:35 power
--w-------. 1 root root 4096 17. Jan 12:35 remove
lrwxrwxrwx. 1 root root    0 17. Jan 12:31 subsystem -> ../../../../bus/mdev
-rw-r--r--. 1 root root 4096 17. Jan 12:35 uevent

You can see the device landed in iommu group 10. We’ll need that in a moment.

On the host: configure guests

Ideally this would be as simple as adding a <hostdev> entry to your guest’s libvirt xml config. The mdev devices don’t have a pci address on the host though, and because of that they must be passed to qemu using the sysfs device path instead of the pci address. libvirt doesn’t (yet) support sysfs paths though, so it is a bit more complicated for now. A lot of the setup libvirt does automatically for hostdevs must be done manually instead.

First, we must allow qemu to access /dev. By default libvirt uses control groups to restrict access, so that must be turned off: edit /etc/libvirt/qemu.conf, uncomment the cgroup_controllers line, remove "devices" from the list, then restart libvirtd.
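The line should look something like this afterwards, i.e. the default list minus "devices" (double check against the comments in your qemu.conf):

cgroup_controllers = [ "cpu", "memory", "blkio", "cpuset", "cpuacct" ]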

Second, we must allow qemu to access the iommu group (10 in my case). A simple chmod will do:

kraxel@broadwell ~# chmod 666 /dev/vfio/10

Third, we must update the guest configuration:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  [ ... ]
  <currentMemory unit='KiB'>1048576</currentMemory>
  <memoryBacking>
    <locked/>
  </memoryBacking>
  [ ... ]
  <qemu:commandline>
    <qemu:arg value='-device'/>
    <qemu:arg value='vfio-pci,addr=05.0,sysfsdev=/sys/class/mdev_bus/0000:00:02.0/f321853c-c584-4a6b-b99a-3eee22a3919c'/>
  </qemu:commandline>
</domain>

There is a special qemu namespace which can be used to pass extra command line arguments to qemu. We use it here for a qemu feature not yet supported by libvirt (sysfs paths for vfio-pci). Also we must explicitly allow qemu to lock guest memory.

Now we are ready to go:

kraxel@broadwell ~# virsh start --console $guest

In the guest

It is a good idea to prepare the guest a bit before adding the vgpu to the guest configuration. Set up a serial console, so you can talk to it even in case graphics are broken. Blacklist the i915 module and load it manually, at least until you have a known-working configuration. Also, booting to runlevel 3 (aka multi-user.target) instead of 5 (aka graphical.target) and starting the xorg server manually is better for now.
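A sketch of that preparation, as shell commands run in the guest:

# don't load i915 automatically, we'll load it by hand while testing
echo "blacklist i915" > /etc/modprobe.d/blacklist-i915.conf
# boot to multi-user.target instead of graphical.target
systemctl set-default multi-user.target
# later, when ready to test, load the driver manually:
modprobe i915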

For the guest machine intel recommends the 4.8 kernel. In theory newer kernels should work too; in practice they didn’t the last time I tested (4.10-rc2). Also make sure the xorg server uses the modesetting driver; the intel driver didn’t work in my testing. This config file will do:

root@guest ~# cat /etc/X11/xorg.conf.d/intel.conf 
Section "Device"
        Identifier  "Card0"
#       Driver      "intel"
        Driver      "modesetting"
        BusID       "PCI:0:5:0"
EndSection

I’m starting the xorg server with x11vnc, xterm and mwm (motif window manager) using this little script:

#!/bin/sh

# debug
echo "# $0: DISPLAY=$DISPLAY"

# start server
if test "$DISPLAY" != ":4"; then
        echo "# $0: starting Xorg server"
        exec startx $0 -- /usr/bin/Xorg :4
        exit 1
fi
echo "# $0: starting session"

# configure session
xrdb $HOME/.Xdefaults

# start clients
x11vnc -rfbport 5904 &
xterm &
exec mwm

The session runs on display 4, so you should be able to connect from the host this way:

kraxel@broadwell ~# vncviewer $guest_ip:4

Have fun!

raspberry pi status update

You might have noticed meanwhile that Fedora 25 ships with raspberry pi support and might have wondered what this means for my packages and images.

The fedora images use a different partition layout than my images. Specifically, the fedora images have a separate vfat partition for the firmware and uboot, and the /boot partition with the linux kernels lives on ext2. My images have a vfat /boot partition with everything (firmware, uboot, kernels), and the rpms in my repo will only work properly on such a sdcard. You can’t mix & match stuff, and there is no easy way to switch from my sdcard layout to the fedora one.

Current plan forward:

I will continue to build rpms for armv7 (32bit) for a while, for existing installs. There will be no new fedora 25 images though. For new devices or reinstalls I recommend using the official fedora images instead.

Fedora 25 has no aarch64 (64bit) support, although it is expected to land in one of the next releases. Most likely I’ll create new Fedora 25 images for aarch64 (after final release), and of course I’ll continue to build kernel updates too.

Finally some words on the upstream kernel status:

The 4.8 dwc2 usb host adapter driver has some serious problems on the raspberry pi. 4.7 works ok, and so do the 4.9-rc kernels. But 4.7 doesn’t get stable updates any more, so I jumped straight to the 4.9-rc kernels for mainline. You might have noticed already if you updated your rpi recently. The raspberry pi foundation kernels don’t suffer from that issue as they use a different (not upstream) driver for the dwc.

advanced network booting

I use network booting a lot. Very convenient, I rarely have to fiddle with boot iso images these days. My setup has evolved over the years, and I’m going to describe here how it looks today.

general intro

The first thing needed is a machine which runs the dhcp server. In a typical setup the home internet router also provides the dhcp service, but usually you can’t configure that dhcp server much. So I turned off the dhcp server in the router and configured a linux machine with dhcpd instead. These days this is a raspberry pi, running fedora 24, serving dhcp and dns.

I used to run this on my x86 server acting as NAS, but that turned out to be a bad idea. We have power failures now and then, the NAS checks the filesystems at boot, which can easily take half an hour, and there was no dhcp and dns service during that time. In contrast, the raspberry pi is back in service in less than a minute.

Obviously the raspberry pi itself can’t get an ip address via dhcp, it needs a static address assigned instead. systemd-networkd does the job with this ethernet.network config file:

[Match]
Name=eth0

[Network]
Address=192.168.2.10/24
Gateway=192.168.2.1
DNS=127.0.0.1
Domains=home.kraxel.org

Address=, Gateway= and Domains= must be adjusted according to your local network setup of course. The same applies to all other config file snippets following below.

dhcp setup

Ok, let’s walk through the dhcpd.conf file:

option arch code 93 = unsigned integer 16;      # RFC4578

Defines option arch which will be used later.

default-lease-time 86400;                       # 1d
max-lease-time 604800;                          # 7d

With long lease times there will be less chatter in the log file, and nobody will lose network connectivity in case the dhcpd server is down for a moment (for example when fiddling with the config and dhcpd not restarting properly due to syntax errors).

ddns-update-style none;
authoritative;

No ddns used here. I assign static ip addresses instead (more on this below).

subnet 192.168.2.0 netmask 255.255.255.0 {
        # network
        range 192.168.2.200 192.168.2.249;
        option routers 192.168.2.1;

Default ip address and network of my router (Fritzbox). The small range configured here is used for dynamic ip addresses only; the room below 200 is left for static ip addresses.

        # dns, ntp
        option domain-name "home.kraxel.org";
        option domain-name-servers 192.168.2.10;
        option ntp-servers ntp.home.kraxel.org;

Oh, right, almost forgot to mention: the raspberry pi also runs ntp, so not each and every machine has to talk to the ntp pool servers. We announce that here (together with the dns server), so the dhcp clients can pick up that information.

Nothing tricky so far, now the network boot configuration starts:

        if (option arch = 00:0b) {
                # EFI aarch64
                next-server 192.168.2.14;
                filename "efi-a64/grubaa64.efi";

Here I use the option arch to figure out which client is booting, to pick a matching boot file. This is for 64bit arm machines, loading the grub2 efi binary. grub in turn will look for a grub.cfg file in the efi-a64/ directory, so placing stuff in sub-directories separates things nicely.

        } else if (option arch = 00:09) {
                # EFI x64 -- ipxe
                next-server 192.168.2.14;
                filename "efi-x64/shim.efi";
        } else if (option arch = 00:07) {
                # EFI x64 -- ovmf
                next-server 192.168.2.14;
                filename "efi-x64/shim.efi";

Same for x86_64. For some reason ovmf (with the builtin virtio-net driver) and the ipxe efi drivers use different arch ids to signal x86_64. I simply list both here to get things going no matter what.

Update (Nov 9th): 7 is the correct value for x86_64. Looks like RFC 4578 got this wrong initially; there is an errata for this. ipxe is fixed meanwhile.

        } else if ((exists vendor-class-identifier) and
                   (option vendor-class-identifier = "U-Boot.armv8")) {
                # rpi 3
                next-server 192.168.2.108;
                filename "rpi3/boot.conf";

This is uboot (64bit) on a raspberry pi 3 trying to netboot. Well, any aarch64 uboot to be exact, but I don’t own any other such device. The boot.conf file is a dummy and doesn’t exist. uboot will pick up the rpi3/ sub-directory though, and after failing to load boot.conf it will try to boot pxelinux-style, i.e. look for config files in the rpi3/pxelinux.cfg/ directory.

        } else if ((exists vendor-class-identifier) and
                   (option vendor-class-identifier = "U-Boot.armv7")) {
                # rpi 2
                next-server 192.168.2.108;
                filename "rpi2/boot.conf";

Same for armv7 (32bit) uboot, a raspberry pi 2 in my case. A raspberry pi 3 with 32bit uboot will land here too.

        } else if ((exists user-class) and ((option user-class = "iPXE") or
                                            (option user-class = "gPXE"))) {
                # bios -- gpxe/ipxe
                next-server 192.168.2.14;
                filename "http://192.168.2.14/tftpboot/pxelinux.0";

This is for ipxe/gpxe on classic bios-based x86 machines. ipxe can load files over http, and pxelinux running with ipxe as network driver can do this too. So we can specify a http url as filename, and it’ll work fine and faster than using tftp.

        } else {
                # bios -- chainload gpxe
                next-server 192.168.2.14;
                filename "gpxe-undionly.kpxe";
        }

Everything else chainloads gpxe. I think this has been unused for ages; new physical machines have EFI these days, and virtual machines already use gpxe/ipxe roms when running with seabios.

}
include "/etc/dhcp/home.4.dhcp.inc";
include "/etc/dhcp/xeni.4.dhcp.inc";

Here the static ip addresses are assigned. This is in an include file because these files are generated by a script. I basically have a file with a table listing hostname, mac address and ip address, and my script generates include files for dhcpd.conf and the dns zone files from that. Inside the include there are lots of entries looking like this:

host photosmart {
        hardware ethernet 10:60:4b:11:71:34;
        fixed-address 192.168.2.53;
}

So all machines get a static ip address assigned, based on the mac address. Together with the dns configuration this allows connecting to the machines by hostname.

In case you want a different netboot configuration for a specific host it is also possible to add next-server and filename entries here to override the network defaults configured above.

That’s it for dhcp.

tftp setup

Ok, a tftp server is needed too. That can run on the same machine, but it doesn’t have to. You can even have multiple machines serving tftp. You might have noticed that not all entries above have the same next-server. The raspberry pi entries point to my workstation, where I cross-compile arm kernels now and then, so I can boot them directly. All other entries point to the NAS where all the install trees are located. Getting tftp running is easy these days:

  1. dnf install tftp-server
  2. systemctl enable tftp
  3. Place the boot files in /var/lib/tftpboot

Placing the boot files is left as an exercise to the reader, otherwise this becomes too long. Maybe I’ll do a separate post on the topic later.

For now just a general hint: in.tftpd has a --verbose switch. Turning that on and watching the log is a good way to see which files the clients are trying to load. Helpful for trouble-shooting.
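On fedora the tftp service is managed by systemd, so turning on --verbose means overriding the unit; something like this should work ("systemctl edit tftp" creates the override, the ExecStart line is copied from the stock unit, double check yours):

[Service]
ExecStart=
ExecStart=/usr/sbin/in.tftpd --verbose -s /var/lib/tftpboot

Then watch the log with "journalctl -u tftp -f".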

httpd setup

This is only needed if you want to boot over http as mentioned above. Unless you have apache httpd running already you have to install and enable it of course. Then you can drop this tftpboot.conf file into /etc/httpd/conf.d to make /var/lib/tftpboot available over http too:

<Directory "/var/lib/tftpboot">
        Options Indexes FollowSymLinks Includes
        AllowOverride None
        Require ip 127.0.0.1/8 192.168.0.0/16
</Directory>

Alias   "/tftpboot"      "/var/lib/tftpboot"
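In case httpd isn’t installed and running yet, that is quickly done:

dnf install httpd
systemctl enable --now httpd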

libvirt setup

So, of course I want to netboot my virtual machines too, even if they are on a NAT-ed libvirt network. In that case they are not using the dhcp server running on my raspberry pi, but the dnsmasq server started by libvirt on the virtualization host. Luckily it is possible to configure network booting in the network configuration, like this:

<network>
  <name>default</name>
  <forward mode='nat'/>
  <ip address='192.168.123.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.123.10' end='192.168.123.99'/>
      <bootp file='http://192.168.2.14/tftpboot/pxelinux.0'/>
    </dhcp>
  </ip>
</network>

That’ll work just fine for seabios guests, but not when running ovmf. Unfortunately libvirt doesn’t support serving different boot files depending on the client architecture. But there is an easy way out: just define a separate network for ovmf guests:

<network>
  <name>efi-x64</name>
  <forward mode='nat'/>
  <ip address='192.168.132.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.132.10' end='192.168.132.99'/>
      <bootp file='efi-x64/shim.efi' server='192.168.2.14'/>
    </dhcp>
  </ip>
</network>

Ok, almost there. While the tcp-based http protocol goes through the NAT forwarding without a hitch, the udp-based tftp protocol doesn’t; it needs some extra help. The nf_nat_tftp kernel module handles that. You can use the systemd module load service to make sure it gets loaded on each boot.
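A one-line config file in /etc/modules-load.d does the trick (the file name doesn’t matter, only the .conf suffix does):

echo nf_nat_tftp > /etc/modules-load.d/nf_nat_tftp.conf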

using qemu directly

The qemu user network stack has netboot support too; you point it at the boot server this way:

qemu-system-x86_64 \
        -netdev user,id=net0,bootfile=http://192.168.2.14/tftpboot/pxelinux.0 \
        -device virtio-net-pci,netdev=net0 \
        [ more args follow here ]

In case the tftpboot directory is available on the local machine it is also possible to use the builtin tftp server instead:

qemu-system-x86_64 \
        -netdev user,id=net0,tftp=/var/lib/tftpboot,bootfile=pxelinux.0 \
        -device virtio-net-pci,netdev=net0 \
        [ more args follow here ]