If this sounds familiar to you, it probably is. It means that memory should be either writable ("W", typically data) or executable ("X", typically code), but not both. Elsewhere in the software industry this has been standard security practice for ages. Now it is starting to take off for UEFI firmware too.
This is a deep dive into recent changes, in both code (firmware) and administration (secure boot signing), the consequences this has for linux, and the current state of affairs.
All UEFI memory allocations carry a memory type (EFI_MEMORY_TYPE). UEFI has tracked since day one whether a memory allocation is meant for code or data, among a bunch of other properties such as boot service vs. runtime service memory.
For a long time it didn't matter much in practice. The concept of virtual memory does not exist for UEFI. IA32 builds even run with paging disabled (and this is unlikely to change until the architecture disappears into irrelevance). Other architectures use identity mappings.
While UEFI does not use address translation, nowadays it can use page tables to enforce memory attributes, including (but not limited to) write and execute permissions. When configured to do so it will set code pages to R-X and data pages to RW- instead of using RWX everywhere, so code using memory types incorrectly will trigger page faults.
New in the UEFI spec (added in version 2.10) is the EFI_MEMORY_ATTRIBUTE_PROTOCOL. Sometimes properties of memory regions need to change, and this protocol can be used to do so. One example is a self-uncompressing binary, where the memory region the binary gets unpacked to must initially be writable. Later (parts of) the memory region must be flipped from writable to executable.
As of today (Dec 2023) edk2 has an EFI_MEMORY_ATTRIBUTE_PROTOCOL implementation for the ARM and AARCH64 architectures, so this is present in the ArmVirt firmware builds but not in the OVMF builds.
In an effort to improve firmware security in general, and especially for secure boot, Microsoft changed the requirements for binaries they are willing to sign with their UEFI CA key.
One key requirement added is that the binary layout must allow enforcing memory attributes with page tables, i.e. PE binary sections must be aligned to page size (4k). Sections also can't be both writable and executable. And the application must be able to deal with data sections being mapped as not executable (NX_COMPAT).
These requirements apply to the binary itself (i.e. shim.efi for linux systems) and everything loaded by the binary (i.e. grub.efi, fwupd.efi and the linux kernel).
We had, and partly still have, a bunch of problems in all components involved in the linux boot process, i.e. shim.efi, grub.efi and the efi stub of the linux kernel.
Some are old bugs, such as memory types not being used correctly, which are starting to cause problems due to the firmware becoming stricter. Some are new problems due to Microsoft raising the bar for PE binaries, typically sections not being page-aligned. The latter are easily fixed in most cases; often it is just a matter of adding alignment to the right places in the linker scripts.
Let's have a closer look at the components one by one:
shim.efi
shim added code to use the new EFI_MEMORY_ATTRIBUTE_PROTOCOL before it was actually implemented by any firmware. Then this was released completely untested. That did not work out very well: we got a nice time bomb, and edk2 implementing EFI_MEMORY_ATTRIBUTE_PROTOCOL for arm triggered it ... Fixed in the main branch, no release yet.
Getting new shim.efi binaries signed by Microsoft depends on the complete boot chain being compliant with the new requirements, which prevents shim bugfixes from being shipped to users right now.
That should be solved soon though, see the kernel section below.
grub.efi
grub.efi used to use memory types incorrectly.
Fixed upstream years ago, case closed.
Well, in theory. Upstream grub development moves at glacial speed, so all distros carry a big stack of downstream patches. Not surprisingly, that leads to upstream fixes being absorbed slowly and also to bugs getting reintroduced.
So, in practice we still have buggy grub versions in the wild. It is getting better though.
The linux kernel efi stub had its fair share of bugs too. On non-x86 architectures (arm, riscv, ...) all issues were fixed a few releases ago. They all share much of the efi stub code base and also use the same self-decompressing method (CONFIG_EFI_ZBOOT=y).
On x86 this all took a bit longer to sort out. For historical reasons x86 can't use the zboot approach used by the other architectures, at least as long as we need hybrid BIOS/UEFI kernels, which most likely will be the case for a number of years still.
The final x86 patch series has been merged during the 6.7 merge window. So we should have a fixed stable kernel in early January 2024, and distros picking up the new kernel in the following weeks or months. Which in turn should finally unblock shim updates.
There should be enough time to get everything sorted for the spring distro releases (Fedora 40, Ubuntu 24.04).
edk2 has a bunch of config options to fine tune the firmware behavior, both compile time and runtime. The relevant ones for the problems listed above are:
PcdDxeNxMemoryProtectionPolicy
Compile time option. Use the --pcd switch of the edk2 build script to set it. It's a bitmask, with one bit for each memory type, specifying whether the firmware should apply memory protections for that particular memory type, by setting the flags in the page tables accordingly.
The strict configuration is PcdDxeNxMemoryProtectionPolicy = 0xC000000000007FD5. This is also the default for ArmVirt builds.
The bug compatible configuration is PcdDxeNxMemoryProtectionPolicy = 0xC000000000007FD1. This excludes the EfiLoaderData memory type from memory protections, so using EfiLoaderData allocations for code will not trigger page faults, which is a very common pattern seen in boot loader bugs.
PcdUninstallMemAttrProtocol
Compile time option, for ArmVirt only. Brand new, committed to the edk2 repo this week (Dec 12th 2023). When set to TRUE the EFI_MEMORY_ATTRIBUTE_PROTOCOL will be uninstalled. Default is FALSE. Setting this to TRUE will work around the shim bug.
opt/org.tianocore/UninstallMemAttrProtocol
Runtime option, for ArmVirt only. Also new. Can be set using -fw_cfg on the qemu command line: -fw_cfg name=opt/org.tianocore/UninstallMemAttrProtocol,string=y|n. This is a runtime override for PcdUninstallMemAttrProtocol and works for both enabling and disabling the shim bug workaround.
In the future PcdDxeNxMemoryProtectionPolicy will probably disappear in favor of memory profiles, which will allow configuring the same settings (plus a few more) at runtime.
The default builds in the edk2-ovmf and edk2-aarch64 packages are configured to be bug compatible, so VMs should boot fine even in case the guests are using a buggy boot chain.
While this is great for end users it doesn't help much for
bootloader development and testing, so there are alternatives.
The edk2-experimental package comes with a collection of builds better suited for that use case, configured with strict memory protections and (on aarch64) EFI_MEMORY_ATTRIBUTE_PROTOCOL enabled, so you can see buggy builds actually crash and burn. 🔥
For AARCH64 this is /usr/share/edk2/experimental/QEMU_EFI-strictnx-pflash.raw.
The magic words for libvirt are:
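A minimal sketch, assuming the file layout of the Fedora edk2-experimental package (hand-picked loader, no varstore shown):

<os>
  <loader readonly='yes' type='pflash'>/usr/share/edk2/experimental/QEMU_EFI-strictnx-pflash.raw</loader>
</os>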
If a page fault happens you will get this line ...
Synchronous Exception at 0x00000001367E6578
... on the serial console, followed by a stack trace and register dump.
For X64 this is /usr/share/edk2/experimental/OVMF_CODE_4M.secboot.strictnx.qcow2. Needs edk2-20231122-12.fc39 or newer. The magic words for libvirt are:
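Roughly like this; the image is in qcow2 format, so the format attribute (needs a reasonably recent libvirt) is required, and the secboot build additionally wants SMM enabled in the domain features:

<os>
  <loader readonly='yes' secure='yes' type='pflash' format='qcow2'>/usr/share/edk2/experimental/OVMF_CODE_4M.secboot.strictnx.qcow2</loader>
</os>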
It is also a good idea to add a debug console to capture the firmware log:
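For OVMF the firmware log goes to the isa-debugcon device; on the plain qemu command line that is:

qemu-system-x86_64 \
    -global isa-debugcon.iobase=0x402 \
    -debugcon file:firmware.log \
    [ ... ]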
If you are lucky the page fault is logged there, also with a register dump. If you are not so lucky the VM will just reset and reboot.
The virt-firmware project is a collection of python modules and scripts for working with efi variables, efi varstores and also pe binaries. In case your distro doesn't package it you can install it using pip like most python packages.
The virt-fw-vars utility can work with efi varstores. For example it is used to create the OVMF_VARS*secboot* files, enrolling the secure boot certificates into the efi security databases.
The simplest operation is to print the variable store:
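Assuming the Fedora file layout:

virt-fw-vars --input /usr/share/edk2/ovmf/OVMF_VARS.secboot.fd --print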
When updating edk2 varstores virt-fw-vars always needs both input and output files. If you want to change an existing variable store, both input and output can point to the same file. For example you can turn on shim logging for an existing libvirt guest this way:
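A sketch, assuming the varstore lives in the usual libvirt location and assuming the --set-shim-verbose switch (which sets the SHIM_VERBOSE variable shim checks for):

virt-fw-vars \
    --input  /var/lib/libvirt/qemu/nvram/guest_VARS.fd \
    --output /var/lib/libvirt/qemu/nvram/guest_VARS.fd \
    --set-shim-verbose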
The next virt-firmware version will get a new --inplace switch to avoid listing the file twice on the command line for this use case.
If you want to start from scratch you can use an empty variable store from /usr/share/edk2 as input, for example when creating a new variable store template with the test CA certificate (shipped with pesign.rpm) enrolled additionally:
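A sketch only; --add-db takes an owner guid and a certificate file, and the certificate path is a placeholder, check where your pesign package installs the test CA:

virt-fw-vars \
    --input  /usr/share/edk2/ovmf/OVMF_VARS.fd \
    --output OVMF_VARS.testca.fd \
    --enroll-redhat --secure-boot \
    --add-db <owner-guid> /path/to/pesign-test-ca.pem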
The test CA is used by the Fedora, CentOS Stream and RHEL build infrastructure to sign unofficial builds, for example when doing scratch builds in koji or when building rpms locally on your developer workstation. If you want to test such builds in a VM, with secure boot enabled, this is a convenient way to do it.
Useful for having a look at EFI binaries is pe-inspect. If this isn't present try pe-listsigs. Initially the utility only listed the signatures, but it was extended over time to show more information, so I added the pe-inspect alias later on.
Below is the output for a 6.6 x86 kernel; you can see it does not have the patches to page-align the sections:
# file: /boot/vmlinuz-6.6.4-200.fc39.x86_64
# section: file 0x00000200 +0x00003dc0 virt 0x00000200 +0x00003dc0 r-x (.setup)
# section: file 0x00003fc0 +0x00000020 virt 0x00003fc0 +0x00000020 r-- (.reloc)
# section: file 0x00003fe0 +0x00000020 virt 0x00003fe0 +0x00000020 r-- (.compat)
# section: file 0x00004000 +0x00df6cc0 virt 0x00004000 +0x05047000 r-x (.text)
# sigdata: addr 0x00dfacc0 +0x00000d48
#    signature: len 0x5da, type 0x2
#       certificate
#          subject CN: Fedora Secure Boot Signer
#          issuer CN: Fedora Secure Boot CA
#    signature: len 0x762, type 0x2
#       certificate
#          subject CN: kernel-signer
#          issuer CN: fedoraca
pe-inspect also knows the names of a number of special sections and supports decoding and pretty-printing them, for example here:
# file: /usr/lib/systemd/boot/efi/systemd-bootx64.efi
# section: file 0x00000400 +0x00011a00 virt 0x00001000 +0x0001191f r-x (.text)
# section: file 0x00011e00 +0x00003a00 virt 0x00013000 +0x00003906 r-- (.rodata)
# section: file 0x00015800 +0x00000400 virt 0x00017000 +0x00000329 rw- (.data)
# section: file 0x00015c00 +0x00000200 virt 0x00018000 +0x00000030 r-- (.sdmagic)
#    #### LoaderInfo: systemd-boot 254.7-1.fc39 ####
# section: file 0x00015e00 +0x00000200 virt 0x00019000 +0x00000049 r-- (.osrel)
# section: file 0x00016000 +0x00000200 virt 0x0001a000 +0x000000de r-- (.sbat)
#    sbat,1,SBAT Version,sbat,1,https://github.com/rhboot/shim/blob/main/SBAT.md
#    systemd,1,The systemd Developers,systemd,254,https://systemd.io/
#    systemd.fedora,1,Fedora Linux,systemd,254.7-1.fc39,https://bugzilla.redhat.com/
# section: file 0x00016200 +0x00000200 virt 0x0001b000 +0x00000084 r-- (.reloc)
The last utility I want to introduce is virt-fw-sigdb, which can create, parse and modify signature databases. The signature database format is used by the firmware to store certificates and hashes in EFI variables, but sometimes the format is used for files too. virt-firmware has the functionality anyway, so I've added a small frontend utility to work with those files.
One file in signature database format is /etc/pki/ca-trust/extracted/edk2/cacerts.bin, which contains the list of trusted CAs. It can be used to pass the CA list to the VM firmware for TLS connections (https network boot).
Shim also uses that format when compiling multiple certificates into the built-in VENDOR_DB or VENDOR_DBX databases.
That's it for today folks. Hope you find this useful.
On your linux machine you can use lscpu to see the size of the physical address space:
$ lscpu
Architecture:      x86_64
CPU op-mode(s):    32-bit, 64-bit
Address sizes:     39 bits physical, 48 bits virtual
                   ^^^^^^^^^^^^^^^^
[ ... ]
In /proc/iomem you can see how the address space is used. Note that the actual addresses are only shown to root.
The very first x86_64 processor (AMD Opteron) shipped with a physical address space of 40 bits (aka one TeraByte). So when qemu added support for the (back then) new architecture the qemu vcpu likewise got 40 bits of physical address space, probably assuming that this would be a safe baseline. It is still the default in qemu (version 8.1 as of today) for backward compatibility reasons.
Enter Intel. The first 64-bit processors shipped by Intel featured only 36 bits of physical address space. More recent Intel processors have 39, 42 or more physical address bits. The problem is that this limit applies not only to the real physical address space, but also to Extended Page Tables (EPT). Which means the physical address space of virtual machines is limited too.
So, the problem is the virtual machine firmware does not know how much physical address space it actually has. When checking CPUID it gets back 40 bits, but it could very well be it actually has only 36 bits.
To address that problem the virtual machine firmware was very conservative with address space usage, to avoid crossing the unknown limit.
OVMF used to have an MMIO window with a fixed size (32GB), placed at the first multiple of 32GB above normal RAM. So a typical, smallish virtual machine had 0 -> 32GB for RAM and 32GB -> 64GB for IO, staying below the limit for 36 bits of physical address space (which equals 64GB).
VMs with more than 30GB of RAM need address space above 32GB for RAM, which pushes the IO window above the 64GB limit. The assumption that hosts with enough physical memory to run such big virtual machines also have a physical address space larger than 64GB seems to have worked well enough.
Nevertheless the fixed 32GB IO window became increasingly problematic. Memory sizes are growing, not only for main memory but also for device memory. GPUs have gigabytes of memory these days.
Qemu has three -cpu options to control the physical address space advertised to the guest, and has had them for quite a while already:

host-phys-bits={on,off}
When enabled qemu passes the physical address space size of the host through to the guest. Default is off (except for -cpu host where it is on); some downstream distro builds turn it on by default.

host-phys-bits-limit=bits
Works together with host-phys-bits=on. Can be used to reduce the number of physical address space bits communicated to the guest. Useful for live migration compatibility in case your machine cluster has machines with different physical address space sizes.

phys-bits=bits
Works together with host-phys-bits=off. Can be used to set the number of physical address space bits to any value you want, including non-working values. Use only if you know what you are doing, it's easy to shoot yourself in the foot with this one.
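For a typical setup the first option is all you need:

qemu-system-x86_64 \
    -accel kvm \
    -cpu host,host-phys-bits=on \
    [ ... ]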
Recent OVMF versions (edk2-stable202211 and newer) try to figure out the size of the physical address space using a heuristic: in case the physical address space bits value received via CPUID is 40 or below it is checked against known-good values, which are 36 and 39 for Intel processors and 40 for AMD processors. If that check passes, or if the number of bits is 41 or higher, OVMF assumes qemu is configured with host-phys-bits=on and the value can be trusted.
In case there is no trustworthy phys-bits value OVMF will continue with the traditional behavior described above.
In case OVMF trusts the phys-bits value it will apply some OVMF-specific limitations before actually using it.
The final phys-bits value will be used to calculate the size of the physical address space available. The 64-bit IO window will be placed as high as possible, i.e. at the end of the physical address space. The size of the IO window and also the size of the PCI bridge windows (for prefetchable 64-bit bars) will be scaled up with the physical address space, i.e. on machines with a larger physical address space you will also get larger IO windows.
Starting with version 1.16.3 SeaBIOS uses a heuristic similar to OVMF's to figure out whether there is a trustworthy phys-bits value.
If that is the case SeaBIOS will enable the 64-bit IO window by default and place it at the end of the address space like OVMF does. SeaBIOS will also scale the size of the IO window with the size of the address space.
Although the overall behavior is similar, there are some noteworthy differences.
Starting with release 8.2 the firmware images bundled with upstream qemu are new enough to include the OVMF and SeaBIOS changes described above.
The new firmware behavior triggered a few bugs elsewhere ...
When doing live migration the vcpu configuration on source and target host must be identical. That includes the size of the physical address space.
libvirt can calculate the cpu baseline for a given cluster, i.e. create a vcpu configuration which is compatible with all cluster hosts. That calculation did not include the size of the physical address space though.
With the traditional, very conservative firmware behavior this bug did not cause problems in practice, but with OVMF starting to use the full physical address space live migrations in heterogeneous clusters started to fail because of that.
In libvirt 9.5.0 and newer this has been fixed.
In general, it is a good idea to set the qemu config option host-phys-bits=on.
In case guests can't deal with PCI bars being mapped at high addresses, the host-phys-bits-limit=bits option can be used to limit the address space usage. I'd suggest sticking to values seen in actual processors, so 40 for AMD and 39 for Intel are good candidates.
In case you are running 32-bit guests with a lot of memory (which btw isn't a good idea performance-wise) you might need to turn off long mode support to force the PCI bars being mapped below 4G. This can be done by simply using qemu-system-i386 instead of qemu-system-x86_64, or by explicitly setting lm=off in the -cpu options.
Some people already noticed and asked questions. So I guess I better write things down in my blog, so I don't have to answer the questions over and over again, and I hope to also clarify some things about distro firmware builds.
So, yes, the jenkins autobuilder creating the firmware repository at https://www.kraxel.org/repos/jenkins/ has been shut down yesterday (Jul 19th 2022). The repository will stay online for the time being, so your established workflows will not instantly break. But the repository will not get updates any more, so it is wise to start looking for alternatives now.
The obvious primary choice would be to just use the firmware builds provided by your distribution. I'll cover edk2 only, which seems to be by far the most popular use, even though there are also builds for other firmware projects.
Given I'm quite familiar with the RHEL / Fedora world I can give some advice here. The edk2-ovmf package comes with multiple images for the firmware code and the varstore template which allow for various combinations. The most important ones are:
- OVMF_CODE.secboot.fd and OVMF_VARS.secboot.fd: secure boot capable build, with the secure boot certificates enrolled in the varstore.
- OVMF_CODE.secboot.fd and OVMF_VARS.fd: secure boot capable build with a blank varstore, so secure boot is not active.
- OVMF_CODE.fd and OVMF_VARS.fd: build without SMM and without secure boot support.
The classic way to set this up in libvirt looks like this:
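With the Fedora paths, roughly (the secboot build additionally needs SMM enabled in the domain features):

<os>
  <loader readonly='yes' secure='yes' type='pflash'>/usr/share/edk2/ovmf/OVMF_CODE.secboot.fd</loader>
  <nvram template='/usr/share/edk2/ovmf/OVMF_VARS.secboot.fd'>/var/lib/libvirt/qemu/nvram/guest_VARS.fd</nvram>
</os>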
To make this easier the firmware builds come with json files describing their capabilities and requirements. You can find these files in /usr/share/qemu/firmware/. libvirt can use them to automatically find suitable firmware images, so you don't have to write the firmware image paths into the domain configuration. You can simply use this instead:
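That boils down to a single attribute in the domain XML:

<os firmware='efi'>
  <type arch='x86_64' machine='q35'>hvm</type>
</os>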
libvirt also allows asking for specific firmware features. If you don't want to use secure boot, for example, you can ask for the blank varstore template (no secure boot keys enrolled) this way:
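Assuming a libvirt new enough to support firmware feature selection (7.2.0 or later):

<os firmware='efi'>
  <firmware>
    <feature enabled='no' name='enrolled-keys'/>
  </firmware>
</os>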
In case you change the configuration of an existing virtual machine you might (depending on the kind of change) have to run virsh start --reset-nvram domain once to start over with a fresh copy of the varstore template.
The world has moved forward. UEFI isn't a niche use case any more. Linux distributions all provide good packages these days. The edk2 project got good CI coverage (years ago it was my autobuilder raising the flag when a commit broke the gcc build). The edk2 project got a regular release process distros can (and do) follow.
All in all the effort to maintain the autobuilder doesn't look justified any more.
To build edk2 you need to have a bunch of tools installed. A compiler and make are required of course, but also iasl, nasm and libuuid. So install them first (package names are for centos/fedora).
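On Fedora and CentOS the iasl binary is shipped in the acpica-tools package:

dnf install gcc make acpica-tools nasm libuuid-devel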
If you want to cross-build arm firmware on an x86 machine you also need cross compilers. While being at it, also set the environment variables needed to make the build system use the cross compilers:
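For the GCC5 toolchain profile used below the variables are named like this (Fedora cross compiler package names):

dnf install gcc-aarch64-linux-gnu gcc-arm-linux-gnu
export GCC5_AARCH64_PREFIX=aarch64-linux-gnu-
export GCC5_ARM_PREFIX=arm-linux-gnu-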
Next clone the tianocore/edk2 repository and also fetch the git submodules.
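The usual git dance:

git clone https://github.com/tianocore/edk2.git
cd edk2
git submodule update --init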
The edksetup script will prepare the build environment for you. The script must be sourced because it sets some environment variables (WORKSPACE being the most important one). This must be done only once (as long as you keep the shell with the configured environment variables open).
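In the toplevel edk2 directory:

source edksetup.sh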
Next step is building the BaseTools (also needed only once):
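That is a single make call:

make -C BaseTools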
Note: Currently (April 2022) BaseTools are being rewritten in Python, so most likely this step will not be needed any more at some point in the future.
Finally the build (for x64 qemu) can be kicked off:
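A plain default build, using the GCC5 toolchain profile:

build -t GCC5 -a X64 -p OvmfPkg/OvmfPkgX64.dsc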
The firmware volumes built can be found in Build/OvmfX64/DEBUG_GCC5/FV.
Building the aarch64 firmware instead:
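Same game, different platform description file:

build -t GCC5 -a AARCH64 -p ArmVirtPkg/ArmVirtQemu.dsc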
The build results land in Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/FV.
Qemu expects the aarch64 firmware images to be 64M in size. The firmware images can't be used as-is because of that; some padding is needed to create an image which can be used for pflash:
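One way to do it (the output file names are mine):

dd if=/dev/zero of=QEMU_EFI-pflash.raw bs=1M count=64
dd if=Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/FV/QEMU_EFI.fd of=QEMU_EFI-pflash.raw conv=notrunc
dd if=/dev/zero of=QEMU_VARS-pflash.raw bs=1M count=64
dd if=Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/FV/QEMU_VARS.fd of=QEMU_VARS-pflash.raw conv=notrunc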
There are a bunch of compile time options, typically enabled using -D NAME or -D NAME=TRUE. Options which are enabled by default can be turned off using -D NAME=FALSE. Available options are defined in the *.dsc files referenced by the build command. So a feature-complete build looks more like this:
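For example (pick your own feature set; these options exist in the OvmfPkg dsc files):

build -t GCC5 -a X64 -p OvmfPkg/OvmfPkgX64.dsc \
    -D FD_SIZE_4MB \
    -D NETWORK_IP6_ENABLE \
    -D NETWORK_HTTP_BOOT_ENABLE \
    -D TPM2_ENABLE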
Secure boot support (on x64) requires SMM mode. Well, it builds and works without SMM, but it's not secure then. Without SMM nothing prevents the guest OS writing directly to flash, bypassing the firmware, so protected UEFI variables are not actually protected.
Also suspend (S3) support works with SMM enabled only in case parts of the firmware (PEI specifically, see below for details) run in 32bit mode. So the secure boot variant must be compiled this way:
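The Ia32X64 platform runs PEI in 32bit mode and DXE in 64bit mode:

build -t GCC5 -a IA32 -a X64 -p OvmfPkg/OvmfPkgIa32X64.dsc \
    -D FD_SIZE_4MB \
    -D SMM_REQUIRE \
    -D SECURE_BOOT_ENABLE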
The FD_SIZE_4MB option creates a larger firmware image, 4MB instead of 2MB (default) in size, offering more space for both code and vars. The RHEL/CentOS builds use that. The Fedora builds are 2MB in size, for historical reasons.
If you need 32-bit firmware builds for some reason, here is how to do it:
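Same build script, 32-bit platforms:

build -t GCC5 -a ARM  -p ArmVirtPkg/ArmVirtQemu.dsc
build -t GCC5 -a IA32 -p OvmfPkg/OvmfPkgIa32.dsc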
The build results will be in Build/ArmVirtQemu-ARM/DEBUG_GCC5/FV and Build/OvmfIa32/DEBUG_GCC5/FV.
The x86 firmware builds create three different images:

- OVMF_VARS.fd: the varstore template. Every virtual machine needs a private, writable copy; libvirt typically stores these copies in /var/lib/libvirt/qemu/nvram.
- OVMF_CODE.fd: the firmware code. Read-only, so it can be shared by all virtual machines.
- OVMF.fd: the complete firmware image, both CODE and VARS. This can be loaded as ROM using -bios, with two drawbacks: (a) UEFI variables are not persistent, and (b) it does not work for SMM_REQUIRE=TRUE builds.
qemu handles pflash storage as block devices, so we have to create block devices for the firmware images:
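Firmware code goes into pflash unit 0, the writable varstore copy into unit 1:

qemu-system-x86_64 \
    -drive if=pflash,format=raw,unit=0,readonly=on,file=OVMF_CODE.fd \
    -drive if=pflash,format=raw,unit=1,file=my-guest-VARS.fd \
    [ ... ]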
Here is the arm version of that (using the padded files created using dd, see above):
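Same pattern:

qemu-system-aarch64 -M virt \
    -drive if=pflash,format=raw,unit=0,readonly=on,file=QEMU_EFI-pflash.raw \
    -drive if=pflash,format=raw,unit=1,file=QEMU_VARS-pflash.raw \
    [ ... ]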
The core edk2 repo holds a number of packages; each package has its own toplevel directory. The most interesting ones are MdePkg and MdeModulePkg (core code), OvmfPkg (x86 firmware for qemu), ArmVirtPkg (arm firmware for qemu), SecurityPkg (secure boot, TPM), NetworkPkg (network stack) and CryptoPkg (crypto support, using openssl).
The firmware modules in the edk2 repo are often named after the boot phase they are running in. Most drivers are named SomeThingDxe for example.
The tools can be installed using pip3 install ovmfctl. The project is hosted at gitlab. Usage: ovmfctl --input file.fd. It's a debugging tool which just prints the structure and content of firmware volumes.
This is a tool to print and modify variable store volumes. Main focus has been on certificate handling so far.
Enrolling certificates for secure boot support in virtual machines has been a rather painful process. It's handled by EnrollDefaultKeys.efi, which needs to be started inside a virtual machine to enroll the certificates and enable secure boot mode.
With ovmfctl it is dead simple:
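No guest boot needed, the varstore is updated directly on the host:

ovmfctl --input  OVMF_VARS.fd \
        --output OVMF_VARS.secboot.fd \
        --enroll-redhat --secure-boot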
This enrolls the Red Hat Secure Boot certificate, which is used by Fedora, CentOS and RHEL, as platform key. The usual Microsoft certificates are added to the certificate database too, so windows guests and shim.efi work as expected.
If you want more fine-grained control you can use the --set-pk, --add-kek, --add-db and --add-mok switches instead.
The --enroll-redhat switch above is actually just a shortcut for:
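A sketch of the expansion; the owner guid and the certificate file names are placeholders, the certificates must be available as files:

ovmfctl --input OVMF_VARS.fd --output out.fd --secure-boot \
    --set-pk  <owner-guid> RedHatSecureBootPKKEKkey1.pem \
    --add-kek <owner-guid> MicrosoftCorporationKEKCA2011.pem \
    --add-db  <owner-guid> MicrosoftWindowsProductionPCA2011.pem \
    --add-db  <owner-guid> MicrosoftCorporationUEFICA2011.pem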
If you just want the variable store printed use ovmfctl --input file.fd --print. Add --hexdump for more details.
Extract all certificates: ovmfctl --input file.fd --extract-certs.
Try ovmfctl --help for a complete list of command line switches. Note that input and output file can be identical for inplace updates.
That's it. Enjoy!
Most of my machines have a local postfix configured for outgoing mail. My workstation and my laptop forward all mail (over vpn) to the company internal email server. All I need for this to work is a relayhost line in /etc/postfix/main.cf:
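Something like this; the hostname is made up, and the brackets tell postfix to skip the MX lookup:

relayhost = [smtp.corp.example.com]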
Most unix utilities (including git send-email) try to send mails using /usr/sbin/sendmail by default. This tool will place the mail in the postfix queue for processing. The name of the binary is a convention dating back to the days when sendmail was the one and only unix mail processing daemon.
All my mail is synced to local maildir storage. I'm using offlineimap for the job. Plenty of other tools exist, isync is another popular choice.
Local mail storage has the advantage that reading mail is faster, especially in case you have a slow internet link. Local mail storage also makes it easy to index and search all your mail with notmuch.
I'm using server side filtering. The major advantage is that I always have the same view on all my mail. I can use a mail client on my workstation, the web interface or a mobile phone. Doesn't matter, I always see the same folder structure.
All modern email clients should be able to use maildir folders. I'm using neomutt. I also have used thunderbird and evolution in the past. All working fine.
The reason I use neomutt is that it is simply faster than GUI-based mailers, which matters when you have to handle a lot of email. It is also very easy to hook up scripts, which is very useful when it comes to patch processing.
I'm using git send-email for the simple cases and git-publish for the more complex ones. Where "simple" typically means single changes (not a patch series) where it is unlikely that I have to send another version addressing review comments.
git publish keeps track of the revisions you have sent by storing a git tag in your repo. It also stores the cover letter and the list of people Cc'ed on the patch, so sending out a new revision of a patch series is much easier than with plain git send-email.
git publish also features config profiles. This is helpful for larger projects where different subsystems use different mailing lists (and possibly different development branches too).
So, here comes the more interesting part: hooking scripts into neomutt for patch processing. Let's start with the config (~/.muttrc) snippet:
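A minimal version, assuming the scripts live in ~/bin:

bind  index,pager  p   noop
macro index,pager  pa  "<pipe-message>~/bin/patch-apply.sh<enter>"
macro index,pager  pl  "<pipe-message>~/bin/patch-lore.sh<enter>"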
First I map the 'p' key to noop (instead of print, which is the default configuration), which allows using two-key combinations starting with 'p' for patch processing. Then 'pa' is configured to run my patch-apply.sh script, and 'pl' runs patch-lore.sh.
Let's have a look at the patch-apply.sh script which applies a single patch:
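A sketch of the idea, not the literal script (error handling trimmed):

#!/bin/sh
# neomutt pipes the mail to stdin, store it in a temp file
mail=$(mktemp /tmp/patch-mail-XXXXXX)
cat > "$mail"

# shared logic, sets $project (see below)
. ~/bin/patch-find-project.sh

# try to apply the patch
cd "$project" || exit 1
if ! git am --message-id "$mail"; then
    # keep the decoded patch and the complete mail for inspection
    git am --show-current-patch=diff > /tmp/patch-failed.diff
    cp "$mail" /tmp/patch-failed.mail
    git am --abort
fi
rm -f "$mail"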
The mail is passed to the script on stdin, so the first thing the script does is store that mail in a temporary file. Next it tries to figure out which project the patch is for. The logic for that is in a separate file so other scripts can share it, see below. Finally it tries to apply the patch using git am. In case of a failure it stores both the decoded patch and the complete email before cleaning up and exiting.
Now for patch-find-project.sh. This script snippet tries to figure out the project by checking which mailing list the mail was sent to:
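Roughly like this; the list ids and paths are examples:

# sets $project, override with the PATCH_PROJECT environment variable
if [ -n "$PATCH_PROJECT" ]; then
    project="$PATCH_PROJECT"
else
    case "$(grep -m1 -i '^List-Id:' "$mail")" in
        *qemu-devel*)   project="$HOME/projects/qemu"  ;;
        *linux-kernel*) project="$HOME/projects/linux" ;;
        *) echo "ERROR: unknown project"; exit 1 ;;
    esac
fi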
The PATCH_PROJECT environment variable can be used to override the autodetect logic if needed.
Last script is patch-lore.sh. That one tries to apply a complete patch series, with the help of the b4 tool. b4 makes patch series management an order of magnitude simpler. It will find the latest revision of a patch series, bring the patches into the correct order, pick up tags (Reviewed-by, Tested-by etc.) from replies, check signatures and more.
The first part (store mail, find project) of the script is the same as in patch-apply.sh. Then the script gets the message id of the mail passed in and feeds that into b4. b4 will try to find the email thread on lore.kernel.org. In case this doesn't return results the script will query notmuch for the email thread instead and feed that into b4 using the --use-local-mbox switch.
Finally it tries to apply the complete patch series prepared by b4 with git am.
So, with all that in place, applying a patch series is just two key strokes in neomutt. Well, almost. I still need a terminal on the side which I use to make sure the correct branch is checked out, to run build tests etc.
This article is not about the basics of setting up a boot server. The internet has tons of tutorials on how to install a tftp server and how to boot your favorite OS from tftp. This article will focus on configuring network boot for libvirt-managed virtual machines.
The config file snippets are examples from my home network; home.kraxel.org is the local domain and 192.168.2.14 is the machine acting as boot server here. You have to replace those to match your setup of course. The same is true for the boot file names.
The default libvirt network uses 192.168.122.0/24. In case you use that unmodified these addresses will work fine for you, and in fact they should already be in your libvirt network configuration. If you have changed the default libvirt network I expect you know what you have to do 😎.
That is pretty simple. libvirt has support for that, so all you have to do is add a bootp tag with the ip address of your tftp server and the boot file name to the network config:
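The bootp element goes into the dhcp section:

<network>
  <name>default</name>
  [ ... ]
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.2' end='192.168.122.254'/>
      <bootp file='/pxelinux.0' server='192.168.2.14'/>
    </dhcp>
  </ip>
</network>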
You can edit the network configuration using virsh net-edit name. The default libvirt network is simply named default. The network needs a restart to apply any changes (virsh net-destroy name; virsh net-start name).
That was easy, right? Well, maybe not. In case this is not working for you, try running modprobe nf_nat_tftp. tftp uses udp, which means there are no connections at ip level, so the kernel has to look into the tftp packets to figure out how to route them correctly for a masqueraded network. The nf_nat_tftp kernel module does exactly that.
Note: Recent libvirt versions seem to take care to load nf_nat_tftp if needed, so there is a chance this works out-of-the-box for you.
Nevertheless that leads straight to the question: do we actually need tftp?
As you might have guessed the answer is no.
The ipxe boot roms support booting from http, by simply specifying a URL instead of a filename as bootfile. This was never formally specified though, so unfortunately you can't expect this to work with every boot rom. For qemu-powered virtual machines this isn't a problem at all because the qemu boot roms are built from ipxe. With physical machines you might have to jump through some extra hoops to chainload ipxe (not covered here).
The easiest way to get this going is to install apache on your tftp boot server, then configure a virtual host with the tftproot as document root. You can do so by dropping a snippet like this into /etc/httpd/conf.d/:
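Assuming the tftproot is /var/lib/tftpboot:

<VirtualHost *:80>
    ServerName boot.home.kraxel.org
    DocumentRoot /var/lib/tftpboot
    <Directory "/var/lib/tftpboot">
        Options Indexes FollowSymLinks
        Require all granted
    </Directory>
</VirtualHost>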
Enabling indexing is not needed for boot server functionality, but it might be handy if you want to access the boot server with your web browser for trouble-shooting.
Using the tftproot as document root has the advantage that the paths are identical for both tftp and http boot, so your pxelinux and grub configuration files should continue to work unmodified.
Now you can go edit your libvirt network config and replace the bootp configuration with this:
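The file name turns into a URL and the server attribute goes away:

<bootp file='http://boot.home.kraxel.org/pxelinux.0'/>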
Done. Don't forget to restart the network to apply the changes. Booting should be noticeably faster now (especially when fetching larger initrds), and any NAT traversal problems should be gone too.
When using http you can boot from pretty much any server on the internet, there is no need to set up your own. You can use for example the boot server provided by netboot.xyz, with a large collection of operating systems available as live systems and for install. Here is the bootp snippet for this:
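The boot file name is an assumption taken from the netboot.xyz documentation, check there for the current one:

<bootp file='http://boot.netboot.xyz/ipxe/netboot.xyz.kpxe'/>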
In most cases you probably want a local boot server for faster installs. But for a one-time test install of a new distro this might be more handy than downloading the install iso.
For EFI guests pxelinux.0 is pretty much useless indeed, so we must do something else for them. First question: how do we figure out that it is an EFI guest asking for a boot file? Let's have a look at the dhcp request, BIOS guest goes first. Captured using tcpdump -i virbr0 -v port bootps:
[ ... ]
0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:89:32:47
[ ... ]
    Client-Ethernet-Address 52:54:00:89:32:47 (oui Unknown)
    Vendor-rfc1048 Extensions
      [ ... ]
      ARCH Option 93, length 2: 0
      Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
Now a request from a (x64) EFI guest:
[ ... ]
0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:89:32:47
[ ... ]
    Client-Ethernet-Address 52:54:00:89:32:47 (oui Unknown)
    Vendor-rfc1048 Extensions
      [ ... ]
      ARCH Option 93, length 2: 7
      Vendor-Class Option 60, length 32: "PXEClient:Arch:00007:UNDI:003001"
See? The EFI guest uses arch 7 instead of 0, in both option 93 and option 60. So we will use that.
Unfortunately libvirt has no direct support for that. But libvirt uses dnsmasq as dhcp (and dns) server for the virtual networks. dnsmasq has support for this, and starting with libvirt version 5.6.0 it is possible to specify any dnsmasq config option in your libvirt network configuration using the dnsmasq xml namespace.
dnsmasq uses the concept of tags to implement this. Requests can be tagged using matches, and configuration directives can be applied to requests with certain tags. So, here is how it looks, using the efi-x64-pxe tag for x64 efi guests and /arch-x86_64/grubx64.efi as bootfile:
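A sketch of the network XML; the dnsmasq option syntax is real, the exact lines may differ from my actual config:

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  [ ... ]
  <dnsmasq:options>
    <dnsmasq:option value='# efi x64 (pxe)'/>
    <dnsmasq:option value='dhcp-match=set:efi-x64-pxe,option:client-arch,7'/>
    <dnsmasq:option value='dhcp-boot=tag:efi-x64-pxe,/arch-x86_64/grubx64.efi,,192.168.2.14'/>
  </dnsmasq:options>
</network>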
dnsmasq uses '#' for comments; it is here only to visually separate entries a bit. It will also be in the dnsmasq config files created by libvirt (in /var/lib/libvirt/dnsmasq/).
Sure. You might have already noticed that the UEFI boot manager has both UEFI PXEv4 and UEFI HTTPv4 entries. Here is what happens when you pick the latter:
[ ... ]
0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:89:32:47
[ ... ]
    Client-Ethernet-Address 52:54:00:89:32:47 (oui Unknown)
    Vendor-rfc1048 Extensions
      [ ... ]
      ARCH Option 93, length 2: 16
      Vendor-Class Option 60, length 33: "HTTPClient:Arch:00016:UNDI:003001"
It's arch 16 now. Also option 60 starts with HTTPClient instead of PXEClient.
So we can simply add another arch match to identify http clients.
Another detail we need to take care of is that the UEFI http boot client expects a reply with option 60 set to HTTPClient, otherwise the reply will be ignored. So we need to take care of that too, using dhcp-option-force. Here we go, using tag efi-x64-http for http clients:
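More lines for the dnsmasq:options block, same pattern:

<dnsmasq:option value='# efi x64 (http)'/>
<dnsmasq:option value='dhcp-match=set:efi-x64-http,option:client-arch,16'/>
<dnsmasq:option value='dhcp-boot=tag:efi-x64-http,http://boot.home.kraxel.org/arch-x86_64/grubx64.efi'/>
<dnsmasq:option value='dhcp-option-force=tag:efi-x64-http,60,HTTPClient'/>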
Complete example, defining a new libvirt network named netboot.xyz. You can store that in some file, then use virsh net-define file to create the network:
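A condensed version; addresses and file names are assumptions, adjust to taste:

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  <name>netboot.xyz</name>
  <forward mode='nat'/>
  <ip address='192.168.123.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.123.10' end='192.168.123.99'/>
      <bootp file='http://boot.netboot.xyz/ipxe/netboot.xyz.kpxe'/>
    </dhcp>
  </ip>
  <dnsmasq:options>
    <dnsmasq:option value='dhcp-match=set:efi-x64-http,option:client-arch,16'/>
    <dnsmasq:option value='dhcp-boot=tag:efi-x64-http,https://boot.netboot.xyz/ipxe/netboot.xyz.efi'/>
    <dnsmasq:option value='dhcp-option-force=tag:efi-x64-http,60,HTTPClient'/>
  </dnsmasq:options>
</network>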
Then, in your guest domain configuration, use <source network='netboot.xyz'/> to use the new network. With this both BIOS and UEFI guests can netboot from netboot.xyz. With UEFI you have to take care to pick the UEFI HTTPv4 entry from the firmware boot menu.
There is a world beyond x86. The arch field does not only specify the system architecture (bios vs. uefi) or the boot protocol (pxe vs. http), but also the cpu architecture. Here are the ones relevant for qemu:
| Code | Architecture |
|------|--------------|
| 0x00 | BIOS pxeboot (both i386 and x86_64) |
| 0x06 | EFI pxeboot, IA32 (i386) |
| 0x07 | EFI pxeboot, X64 (x86_64) |
| 0x0a | EFI pxeboot, ARM (v7) |
| 0x0b | EFI pxeboot, AA64 (v8 / aarch64) |
| 0x12 | powerpc64 |
| 0x16 | EFI httpboot, X64 |
| 0x18 | EFI httpboot, ARM |
| 0x19 | EFI httpboot, AA64 |
| 0x31 | s390x |
So, if you want to play with arm or powerpc without owning such a machine you can let qemu emulate it with tcg. If you want to netboot it, no problem, just add a few more lines to your network configuration. Here is an example for aarch64:
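Arch 0x0b is 11 in decimal; tag name and paths follow the scheme above:

<dnsmasq:option value='# efi aarch64 (pxe)'/>
<dnsmasq:option value='dhcp-match=set:efi-aa64-pxe,option:client-arch,11'/>
<dnsmasq:option value='dhcp-boot=tag:efi-aa64-pxe,/arch-aarch64/grubaa64.efi,,192.168.2.14'/>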
In case you are wondering why I place the grub binaries in subdirectories: grub tries to fetch the config file from the same directory, so that way I get per-arch config files and they are named /arch-aarch64/grub.cfg, /arch-x86_64/grub.cfg and so on. A nice side effect is that the toplevel directory is a bit less cluttered with files.
Well, the fundamental idea doesn't change. Look at arch option, then send different replies depending on what you find there. With other dhcp servers the syntax is different, but the pattern is the same. Here is a sample snippet for the isc dhcp server shipped with most linux distributions:
option arch code 93 = unsigned integer 16;

subnet 192.168.2.0 netmask 255.255.255.0 {
    [ ... ]
    if (option arch = 00:16) {
        option vendor-class-identifier "HTTPClient";
        filename "http://boot.home.kraxel.org/arch-x86_64/grubx64.efi";
    } else if (option arch = 00:07) {
        next-server 192.168.2.14;
        filename "/arch-x86_64/grubx64.efi";
    } else {
        next-server 192.168.2.14;
        filename "/pxelinux.0";
    }
}
After mentioning it here and there some people asked for details, so here we go. I'll go describe my setup, with some kubernetes and container basics sprinkled in.
This is part one of an article series and will cover cluster node installation and basic cluster setup.
Most cluster nodes are dual-core virtual machines. The control-plane node (formerly known as master node) has 8G of memory, most worker nodes have 4G of memory. It is a mix of x86_64 and aarch64 nodes. Kubernetes names these architectures amd64 and arm64, which is easily confused, so take care 😎.
The virtual nodes use bridged networking. So no separate network; they simply show up on my 192.168.2.0/24 home network like the physical machines connected. They get a static IP address assigned by the DHCP server, and I can easily ssh into each node.
All cluster nodes run Fedora 34, Server Edition.
I have a git repository with some config files, to simplify rebuilding a cluster node from scratch. The repository also has some shell scripts with the commands listed later in this blog post.
Let's go over the config files one by one.
This is needed for kubernetes networking.
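A typical /etc/sysctl.d/ snippet for that (the file name is mine):

# kubernetes networking
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1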
Load some kernel modules needed at boot. Again for kubernetes networking. Also vhost support which is needed by kata containers.
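Along the lines of this /etc/modules-load.d/ snippet (the exact module list is an assumption):

br_netfilter
# vhost support, needed by kata containers
vhost
vhost_net
vhost_vsock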
The upstream kubernetes rpm repository. Note this is not enabled (enabled=0) because I don't want normal fedora system updates to also update the kubernetes packages. For installing/updating kubernetes packages I can enable the repo using dnf --enablerepo=kubernetes ....
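The repo file looked like this back then (the packages have since moved to pkgs.k8s.io, so treat it as historical):

[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-$basearch
enabled=0
gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg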
Given I want to play with different container runtimes I've decided to use cri-o, which allows exactly that. Fedora has packages. They are in a module though, so that must be enabled first.
The cri-o version should match the kubernetes version you want to run. That is not the case in my cluster right now because I've learned that only after setting up the cluster; obviously the sky isn't falling in case they don't match. The next time I update the cluster I'll bring them into sync.
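Enabling the module pins the version stream; 1.22 is just a stand-in, pick the stream matching your kubernetes version:

dnf module enable cri-o:1.22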
Now we can go install the packages from the fedora repos. cri-o, runc (default container runtime), and a handful of useful utilities.
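Roughly; the exact utility list is a matter of taste (jq will be needed later):

dnf install cri-o runc podman jq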
Next in line are the kubernetes packages from the google repo. The repo has all versions, not only the most recent, so you can ask for the version you want and you'll get it. As mentioned above the repo must be enabled on the command line.
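A version-pinned install, again with 1.22 as a stand-in:

dnf --enablerepo=kubernetes install 'kubelet-1.22.*' 'kubeadm-1.22.*' 'kubectl-1.22.*'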
kubelet needs some configuration, my git repo with the config files has this:
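Namely the cgroup driver setting, assuming the /etc/sysconfig/kubelet mechanism:

KUBELET_EXTRA_ARGS=--cgroup-driver=systemd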
Asking kubelet to delegate all cgroups work to systemd is needed to make kubelet work with cgroups v2. With that in place we can reload the configuration and start the services:
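That is:

systemctl daemon-reload
systemctl enable --now crio kubelet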
Kubernetes cluster nodes need a few firewall entries so the nodes can speak to each other. I was too lazy to set all that up and just turned off the firewall. The cluster isn't reachable from the internet anyway, so 🤷.
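Which on Fedora means:

systemctl disable --now firewalld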
All the preparing steps up to this point are the same for all cluster nodes. Now we go initialize the control plane node.
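The init call, with the pod network discussed below:

kubeadm init --pod-network-cidr=10.85.0.0/16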
I picked the 10.85.0.0/16 network because that happens to be the default network used by cri-o, see /etc/cni/net.d/100-crio-bridge.conf.
This command will take a while. It will pull kubernetes container images from the internet, start them using the kubelet service, and finally initialize the cluster.
kubeadm will write the config file needed to access the cluster with kubectl to /etc/kubernetes/admin.conf. It'll make you cluster root, which kubernetes names the cluster-admin role in the rbac (role based access control) scheme.
For my devel cluster I simply use that file as-is instead of setting up some more advanced user authentication and access control. I place a copy of the file at $HOME/.kube/config (the default location used by kubectl). Copying the file to other machines works too, so I can also run kubectl on my laptop or workstation instead of ssh'ing into the control plane node.
Time to run the first kubectl command to see whether everything worked:
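The classic:

kubectl get nodes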
Yay! First milestone.
By default kubeadm init adds a taint to the control plane node so kubernetes wouldn't schedule pods there:
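It shows up in the node description; the taint spelling depends on the kubernetes version, the node name is made up:

kubectl describe node control-plane | grep -i taints
    Taints:  node-role.kubernetes.io/master:NoSchedule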
If you want to go for a single node cluster, all you have to do is remove that taint so kubernetes will schedule and run your pods directly on your new and shiny control plane node. The magic words for that are:
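The trailing dash removes the taint:

kubectl taint nodes --all node-role.kubernetes.io/master-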
Done. You can start playing with the cluster now.
If you want add one or more worker nodes to the cluster instead, then watch kubernetes distribute the load, read on ...
The worker nodes need a bootstrap token to authenticate when they want to join the cluster. The kubeadm init command creates a token and will also print the kubeadm join command needed to join. If you don't have that any more, no problem, you can always get the token later using kubeadm token list. In case the token did expire (they are valid for a day or so) you can create a new one using kubeadm token create. Beside the token, kubeadm also needs the hostname and port to be used to connect to the control plane node. The default port for the kubernetes API is 6443, so ...
... and check results:
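Same command as before, now listing the worker nodes too:

kubectl get nodes -o wide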
The node may show up in "NotReady" state for a while, when it has registered already but hasn't completed initialization yet.
Now repeat that procedure on every node you want to add to the cluster.
Both kubeadm and kubectl can return the data you ask for in various formats. By default they print a nice, human-readable table to the terminal. But you can also ask for yaml, json and others using the -o or --output switch. Specifically json is very useful for scripting: you can pipe the output through the jq utility (you might have noticed this in the list of packages to install at the start of this blog post) to fish out the items you actually need.
For starters two simple examples. You can get the raw bootstrap token this way:
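Presumably something like (structured output support in kubeadm is version dependent):

kubeadm token list --output json | jq -r '.token'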
Or check out some node details:
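For example (node name made up):

kubectl get node control-plane --output json | jq '.status.nodeInfo'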
There are way more possible use cases. When reading config and patch files kubectl likewise accepts both yaml and json as input.
There is one more basic thing to setup: Install a network fabric to get the pod network going. This is needed to allow pods running on different cluster nodes to talk to each other. When running a single node cluster this can be skipped.
There are a bunch of different solutions out there; I've settled for flannel in "host-gw" mode. First download kube-flannel.yml from github. Then tweak the configuration: make sure the network matches the pod network passed to kubeadm init, and change the backend. Here are the changes I've made:
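The relevant piece of the flannel ConfigMap after editing:

  net-conf.json: |
    {
      "Network": "10.85.0.0/16",
      "Backend": {
        "Type": "host-gw"
      }
    }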
Now apply the yaml file to install flannel:
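One apply call:

kubectl apply -f kube-flannel.yml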
The flannel pods are created in the kube-system namespace, you can check the status this way:
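For example:

kubectl get pods -n kube-system | grep flannel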
Once all pods are up and running your pod network should be working. One nice thing with "host-gw" mode is that this uses standard network routing of the cluster nodes and you can inspect the state with standard linux tools:
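The routing table on a node then looks roughly like this (addresses made up):

$ ip route
[ ... ]
10.85.0.0/24 dev cni0 proto kernel scope link src 10.85.0.1
10.85.1.0/24 via 192.168.2.121 dev eth0
10.85.2.0/24 via 192.168.2.122 dev eth0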
Each cluster node gets a /24 subnet of the pod network assigned. The cni0 device handles the subnet of the local node. The other subnets are routed to the other cluster nodes. Pretty straightforward.
So, that's it for part one. The internet has tons of kubernetes tutorials and examples which you can try on the cluster now. One good starting point is Kubernetes by example.
My plan for part two of this article series is installing and configuring some useful cluster services, with one of them being ingress which is needed to access your cluster services with a web browser.
So, what are the choices for implementing cut+paste support? Without guest cooperation the only possible way would be to send text as keystrokes to the guest, which has a number of drawbacks.
So, this is not something to consider seriously. Instead we need help from the guest, which is typically implemented with some agent process running inside the guest. The options are writing a new agent or reusing an existing one.
Reusing the spice agent has some major advantages. For starters there is no need to write any new guest code for this. Less work for developers and maintainers. Also the agent has been packaged for years by most distributions (typically the package is named spice-vdagent). So it is easily available, making things easier for users, and guest images with the agent installed work out-of-the-box.
Downside is that this is a bit confusing as you need the spice agent in the guest even when not using spice on the host. So I'm writing this article to address that ...
The spice guest agent is not a single process but two: one global daemon running as system service (spice-vdagentd) and one process (spice-vdagent) running in desktop session context.
The desktop process will handle everything which needs access to your display server. That includes cut+paste support. It will also talk to the system service. The system service in turn connects to the host using a virtio-serial port. It will relay data messages between desktop process and host and also process some of the requests (mouse messages for example) directly.
On the host side qemu simply forwards the agent data stream to the spice client and visa versa. So effectively the spice guest agent can communicate directly with the spice client. It's configured this way:
The virtio-serial port shows up as /dev/virtio-ports/com.redhat.spice.0 inside the guest.
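On the host side that is (the guest sees the port as the device file just mentioned):

-device virtio-serial-pci \
-chardev spicevmc,id=vdagent,name=vdagent \
-device virtserialport,chardev=vdagent,name=com.redhat.spice.0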
The central piece of code is the new qemu clipboard manager (ui/clipboard.c). Initially it supports only plain text. The interfaces are designed for multiple data types though, so adding support for more data types later on is possible.
There are three peers which can talk to the qemu clipboard manager:

- The vnc server (ui/vnc-clipboard.c), so vnc clients with cut+paste support can exchange data with the qemu clipboard.
- The gtk ui (ui/gtk-clipboard.c), which connects the qemu clipboard manager with your desktop clipboard.
- The qemu vdagent (ui/vdagent.c), which connects the guest to the qemu clipboard.

This landed in the qemu upstream repo a few days ago and will be shipped with the qemu 6.1 release.
The qemu vdagent is implemented as a chardev. It is a drop-in replacement for the spicevmc chardev; instead of forwarding everything to the spice client it implements the spice agent protocol and parses the messages itself. So only the chardev configuration changes, the virtserialport stays as-is:
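Like this:

-chardev qemu-vdagent,id=vdagent,name=vdagent,clipboard=on \
-device virtserialport,chardev=vdagent,name=com.redhat.spice.0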
The vdagent has two options to enable/disable vdagent protocol features: mouse=on|off controls wheel mouse support (default on), and clipboard=on|off controls clipboard support (default off).
No immediate plans right now, but I have some ideas what could be done.
Maybe I look into them when I find some time. No promise though. Patches are welcome.