If this sounds familiar to you, it probably is. It means that memory should be either writable ("W", typically data) or executable ("X", typically code), but not both. Elsewhere in the software industry this has been standard security practice for ages. Now it is starting to take off for UEFI firmware too.
This is a deep dive into recent changes, in both code (firmware) and administration (secure boot signing), the consequences this has for linux, and the current state of affairs.
All UEFI memory allocations carry a memory type (EFI_MEMORY_TYPE). UEFI has tracked since day one whether a memory allocation is meant for code or data, among a bunch of other properties such as boot service vs. runtime service memory.
For a long time it didn't matter much in practice. The concept of virtual memory does not exist for UEFI. IA32 builds even run with paging disabled (and this is unlikely to change until the architecture disappears into irrelevance). Other architectures use identity mappings.
While UEFI does not use address translation, nowadays it can use page tables to enforce memory attributes, including (but not limited to) write and execute permissions. When configured to do so it will set code pages to R-X and data pages to RW- instead of using RWX everywhere, so code using memory types incorrectly will trigger page faults.
New in the UEFI spec (added in version 2.10) is the EFI_MEMORY_ATTRIBUTE_PROTOCOL. Sometimes properties of memory regions need to change, and this protocol can be used to do so. One example is a self-uncompressing binary, where the memory region the binary gets unpacked to must initially be writable. Later (parts of) the memory region must be flipped from writable to executable.
As of today (Dec 2023) edk2 has an EFI_MEMORY_ATTRIBUTE_PROTOCOL implementation for the ARM and AARCH64 architectures, so this is present in the ArmVirt firmware builds but not in the OVMF builds.
In an effort to improve firmware security in general, and especially for secure boot, Microsoft changed the requirements for binaries they are willing to sign with their UEFI CA key.
One key requirement added is that the binary layout must allow enforcing memory attributes with page tables, i.e. PE binary sections must be aligned to page size (4k). Sections also can't be both writable and executable. And the application must be able to deal with data sections being mapped as not executable (NX_COMPAT).
These requirements apply to the binary itself (i.e. shim.efi for linux systems) and everything loaded by the binary (i.e. grub.efi, fwupd.efi and the linux kernel).
We had, and partly still have, a bunch of problems in all components involved in the linux boot process, i.e. shim.efi, grub.efi and the efi stub of the linux kernel.
Some are old bugs, such as memory types not being used correctly, which are starting to cause problems due to the firmware becoming stricter. Some are new problems due to Microsoft raising the bar for PE binaries, typically sections not being page-aligned. The latter are easily fixed in most cases; often it is just a matter of adding alignment to the right places in the linker scripts.
Let's have a closer look at the components one by one:
shim.efi
shim added code to use the new EFI_MEMORY_ATTRIBUTE_PROTOCOL before it was actually implemented by any firmware. Then this was released completely untested. That did not work out very well: we got a nice time bomb, and edk2 implementing EFI_MEMORY_ATTRIBUTE_PROTOCOL for arm triggered it ... Fixed in the main branch, no release yet.
Getting new shim.efi binaries signed by Microsoft depends on the complete boot chain being compliant with the new requirements, which prevents shim bugfixes from being shipped to users right now.
That should be solved soon though, see the kernel section below.
grub.efi
grub.efi used to use memory types incorrectly.
Fixed upstream years ago, case closed.
Well, in theory. Upstream grub development moves at glacial speed, so all distros carry a big stack of downstream patches. Not surprisingly, that leads to upstream fixes being absorbed slowly and also to bugs getting reintroduced.
So, in practice we still have buggy grub versions in the wild. It is getting better though.
The linux kernel efi stub had its fair share of bugs too. On non-x86 architectures (arm, riscv, ...) all issues were fixed a few releases ago. They all share much of the efi stub code base and also use the same self-decompressing method (CONFIG_EFI_ZBOOT=y).
On x86 this all took a bit longer to sort out. For historical reasons x86 can't use the zboot approach used by the other architectures, at least as long as we need hybrid BIOS/UEFI kernels, which most likely will be the case for a number of years still.
The final x86 patch series has been merged during the 6.7 merge window. So we should have a fixed stable kernel in early January 2024, and distros picking up the new kernel in the following weeks or months. Which in turn should finally unblock shim updates.
There should be enough time to get everything sorted for the spring distro releases (Fedora 40, Ubuntu 24.04).
edk2 has a bunch of config options to fine tune the firmware behavior, both compile time and runtime. The relevant ones for the problems listed above are:
PcdDxeNxMemoryProtectionPolicy
Compile time option. Use the --pcd switch of the edk2 build script to set it. It's a bitmask, with one bit for each memory type, specifying whether the firmware should apply memory protections for that particular memory type, by setting the flags in the page tables accordingly.
The strict configuration is PcdDxeNxMemoryProtectionPolicy = 0xC000000000007FD5. This is also the default for ArmVirt builds.
The bug compatible configuration is PcdDxeNxMemoryProtectionPolicy = 0xC000000000007FD1. This excludes the EfiLoaderData memory type from memory protections, so using EfiLoaderData allocations for code will not trigger page faults, which is a very common pattern seen in boot loader bugs.
PcdUninstallMemAttrProtocol
Compile time option, for ArmVirt only. Brand new, committed to the edk2 repo this week (Dec 12th 2023). When set to TRUE the EFI_MEMORY_ATTRIBUTE_PROTOCOL will be uninstalled. Default is FALSE. Setting this to TRUE will work around the shim bug.
opt/org.tianocore/UninstallMemAttrProtocol
Runtime option, for ArmVirt only. Also new. Can be set using -fw_cfg on the qemu command line: -fw_cfg name=opt/org.tianocore/UninstallMemAttrProtocol,string=y|n. This is a runtime override for PcdUninstallMemAttrProtocol and works for both enabling and disabling the shim bug workaround.
In the future PcdDxeNxMemoryProtectionPolicy will probably disappear in favor of memory profiles, which will allow configuring the same settings (plus a few more) at runtime.
The default builds in the edk2-ovmf and edk2-aarch64 packages are configured to be bug compatible, so VMs should boot fine even in case the guests are using a buggy boot chain.
While this is great for end users it doesn't help much for
bootloader development and testing, so there are alternatives.
The edk2-experimental package comes with a collection of builds better suited for that use case, configured with strict memory protections and (on aarch64) EFI_MEMORY_ATTRIBUTE_PROTOCOL enabled, so you can see buggy builds actually crash and burn. 🔥
For AARCH64 this is /usr/share/edk2/experimental/QEMU_EFI-strictnx-pflash.raw.
The magic words for libvirt are:
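A minimal sketch, assuming the file layout of the Fedora edk2-experimental package (hand-picked loader, no varstore shown):

<os>
  <loader readonly='yes' type='pflash'>/usr/share/edk2/experimental/QEMU_EFI-strictnx-pflash.raw</loader>
</os>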
If a page fault happens you will get this line ...
Synchronous Exception at 0x00000001367E6578
... on the serial console, followed by a stack trace and register dump.
For X64 this is /usr/share/edk2/experimental/OVMF_CODE_4M.secboot.strictnx.qcow2. Needs edk2-20231122-12.fc39 or newer. The magic words for libvirt are:
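Roughly like this; the image is in qcow2 format, so the format attribute (needs a reasonably recent libvirt) is required, and the secboot build additionally wants SMM enabled in the domain features:

<os>
  <loader readonly='yes' secure='yes' type='pflash' format='qcow2'>/usr/share/edk2/experimental/OVMF_CODE_4M.secboot.strictnx.qcow2</loader>
</os>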
It is also a good idea to add a debug console to capture the firmware log:
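For OVMF the firmware log goes to the isa-debugcon device; on the plain qemu command line that is:

qemu-system-x86_64 \
    -global isa-debugcon.iobase=0x402 \
    -debugcon file:firmware.log \
    [ ... ]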
If you are lucky the page fault is logged there, also with a register dump. If you are not so lucky the VM will just reset and reboot.
The virt-firmware project is a collection of python modules and scripts for working with efi variables, efi varstores and also pe binaries. In case your distro doesn't package it you can install it using pip like most python packages.
The virt-fw-vars utility can work with efi varstores. For example it is used to create the OVMF_VARS*secboot* files, enrolling the secure boot certificates into the efi security databases.
The simplest operation is to print the variable store:
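Assuming the Fedora file layout:

virt-fw-vars --input /usr/share/edk2/ovmf/OVMF_VARS.secboot.fd --print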
When updating edk2 varstores virt-fw-vars always needs both input and output files. If you want to change an existing variable store, both input and output can point to the same file. For example you can turn on shim logging for an existing libvirt guest this way:
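A sketch, assuming the varstore lives in the usual libvirt location and assuming the --set-shim-verbose switch (which sets the SHIM_VERBOSE variable shim checks for):

virt-fw-vars \
    --input  /var/lib/libvirt/qemu/nvram/guest_VARS.fd \
    --output /var/lib/libvirt/qemu/nvram/guest_VARS.fd \
    --set-shim-verbose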
The next virt-firmware version will get a new --inplace switch to avoid listing the file twice on the command line for this use case.
If you want to start from scratch you can use an empty variable store from /usr/share/edk2 as input, for example when creating a new variable store template with the test CA certificate (shipped with pesign.rpm) enrolled additionally:
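A sketch only; --add-db takes an owner guid and a certificate file, and the certificate path is a placeholder, check where your pesign package installs the test CA:

virt-fw-vars \
    --input  /usr/share/edk2/ovmf/OVMF_VARS.fd \
    --output OVMF_VARS.testca.fd \
    --enroll-redhat --secure-boot \
    --add-db <owner-guid> /path/to/pesign-test-ca.pem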
The test CA is used by the Fedora, CentOS Stream and RHEL build infrastructure to sign unofficial builds, for example when doing scratch builds in koji or when building rpms locally on your developer workstation. If you want to test such builds in a VM, with secure boot enabled, this is a convenient way to do it.
Useful for having a look at EFI binaries is pe-inspect. If this isn't present try pe-listsigs. Initially the utility only listed the signatures, but it was extended over time to show more information, so I added the pe-inspect alias later on.
Below is the output for a 6.6 x86 kernel; you can see it does not have the patches to page-align the sections:
# file: /boot/vmlinuz-6.6.4-200.fc39.x86_64
# section: file 0x00000200 +0x00003dc0 virt 0x00000200 +0x00003dc0 r-x (.setup)
# section: file 0x00003fc0 +0x00000020 virt 0x00003fc0 +0x00000020 r-- (.reloc)
# section: file 0x00003fe0 +0x00000020 virt 0x00003fe0 +0x00000020 r-- (.compat)
# section: file 0x00004000 +0x00df6cc0 virt 0x00004000 +0x05047000 r-x (.text)
# sigdata: addr 0x00dfacc0 +0x00000d48
#    signature: len 0x5da, type 0x2
#       certificate
#          subject CN: Fedora Secure Boot Signer
#          issuer CN: Fedora Secure Boot CA
#    signature: len 0x762, type 0x2
#       certificate
#          subject CN: kernel-signer
#          issuer CN: fedoraca
pe-inspect also knows the names of a number of special sections and supports decoding and pretty-printing them, for example here:
# file: /usr/lib/systemd/boot/efi/systemd-bootx64.efi
# section: file 0x00000400 +0x00011a00 virt 0x00001000 +0x0001191f r-x (.text)
# section: file 0x00011e00 +0x00003a00 virt 0x00013000 +0x00003906 r-- (.rodata)
# section: file 0x00015800 +0x00000400 virt 0x00017000 +0x00000329 rw- (.data)
# section: file 0x00015c00 +0x00000200 virt 0x00018000 +0x00000030 r-- (.sdmagic)
#    #### LoaderInfo: systemd-boot 254.7-1.fc39 ####
# section: file 0x00015e00 +0x00000200 virt 0x00019000 +0x00000049 r-- (.osrel)
# section: file 0x00016000 +0x00000200 virt 0x0001a000 +0x000000de r-- (.sbat)
#    sbat,1,SBAT Version,sbat,1,https://github.com/rhboot/shim/blob/main/SBAT.md
#    systemd,1,The systemd Developers,systemd,254,https://systemd.io/
#    systemd.fedora,1,Fedora Linux,systemd,254.7-1.fc39,https://bugzilla.redhat.com/
# section: file 0x00016200 +0x00000200 virt 0x0001b000 +0x00000084 r-- (.reloc)
The last utility I want to introduce is virt-fw-sigdb, which can create, parse and modify signature databases. The signature database format is used by the firmware to store certificates and hashes in EFI variables, but sometimes the format is used for files too. virt-firmware has the functionality anyway, so I've added a small frontend utility to work with those files.
One file in signature database format is /etc/pki/ca-trust/extracted/edk2/cacerts.bin, which contains the list of trusted CAs. It can be used to pass the CA list to the VM firmware for TLS connections (https network boot).
Shim also uses that format when compiling multiple certificates into the built-in VENDOR_DB or VENDOR_DBX databases.
That's it for today folks. Hope you find this useful.
On your linux machine you can use lscpu to see the size of the physical address space:
$ lscpu
Architecture:      x86_64
CPU op-mode(s):    32-bit, 64-bit
Address sizes:     39 bits physical, 48 bits virtual
                   ^^^^^^^^^^^^^^^^
[ ... ]
In /proc/iomem you can see how the address space is used. Note that the actual addresses are only shown to root.
The very first x86_64 processor (AMD Opteron) shipped with a physical address space of 40 bits (aka one TeraByte). So when qemu added support for the (back then) new architecture the qemu vcpu likewise got 40 bits of physical address space, probably assuming that this would be a safe baseline. It is still the default in qemu (version 8.1 as of today) for backward compatibility reasons.
Enter Intel. The first 64-bit processors shipped by Intel featured only 36 bits of physical address space. More recent Intel processors have 39, 42 or more physical address bits. The problem is that this limit applies not only to the real physical address space, but also to Extended Page Tables (EPT). Which means the physical address space of virtual machines is limited too.
So, the problem is the virtual machine firmware does not know how much physical address space it actually has. When checking CPUID it gets back 40 bits, but it could very well be it actually has only 36 bits.
To address that problem the virtual machine firmware was very conservative with address space usage, to avoid crossing the unknown limit.
OVMF used to have an MMIO window with a fixed size (32GB), placed at the first multiple of 32GB above normal RAM. So a typical, smallish virtual machine had 0 -> 32GB for RAM and 32GB -> 64GB for IO, staying below the limit for 36 bits of physical address space (which equals 64GB).
VMs with more than 30GB of RAM need address space above 32GB for RAM, which pushes the IO window above the 64GB limit. The assumption that hosts with enough physical memory to run such big virtual machines also have a physical address space larger than 64GB seems to have worked well enough.
Nevertheless the fixed 32GB IO window became increasingly problematic. Memory sizes are growing, not only for main memory but also for device memory. GPUs have gigabytes of memory these days.
Qemu has three -cpu options to control the physical address space advertised to the guest, and has had them for quite a while already:

host-phys-bits={on,off}
When enabled qemu passes the physical address space size of the host through to the guest. Default is off (except for -cpu host where it is on); some downstream distro builds turn it on by default.

host-phys-bits-limit=bits
Works together with host-phys-bits=on. Can be used to reduce the number of physical address space bits communicated to the guest. Useful for live migration compatibility in case your machine cluster has machines with different physical address space sizes.

phys-bits=bits
Works together with host-phys-bits=off. Can be used to set the number of physical address space bits to any value you want, including non-working values. Use only if you know what you are doing, it's easy to shoot yourself in the foot with this one.
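For a typical setup the first option is all you need:

qemu-system-x86_64 \
    -accel kvm \
    -cpu host,host-phys-bits=on \
    [ ... ]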
Recent OVMF versions (edk2-stable202211 and newer) try to figure out the size of the physical address space using a heuristic: in case the physical address space bits value received via CPUID is 40 or below it is checked against known-good values, which are 36 and 39 for Intel processors and 40 for AMD processors. If that check passes, or if the number of bits is 41 or higher, OVMF assumes qemu is configured with host-phys-bits=on and the value can be trusted.
In case there is no trustworthy phys-bits value OVMF will continue with the traditional behavior described above.
In case OVMF trusts the phys-bits value it will apply some OVMF-specific limitations before actually using it.
The final phys-bits value will be used to calculate the size of the physical address space available. The 64-bit IO window will be placed as high as possible, i.e. at the end of the physical address space. The size of the IO window and also the size of the PCI bridge windows (for prefetchable 64-bit bars) will be scaled up with the physical address space, i.e. on machines with a larger physical address space you will also get larger IO windows.
Starting with version 1.16.3 SeaBIOS uses a heuristic similar to OVMF's to figure out whether there is a trustworthy phys-bits value.
If that is the case SeaBIOS will enable the 64-bit IO window by default and place it at the end of the address space like OVMF does. SeaBIOS will also scale the size of the IO window with the size of the address space.
Although the overall behavior is similar, there are some noteworthy differences.
Starting with release 8.2 the firmware images bundled with upstream qemu are new enough to include the OVMF and SeaBIOS changes described above.
The new firmware behavior triggered a few bugs elsewhere ...
When doing live migration the vcpu configuration on source and target host must be identical. That includes the size of the physical address space.
libvirt can calculate the cpu baseline for a given cluster, i.e. create a vcpu configuration which is compatible with all cluster hosts. That calculation did not include the size of the physical address space though.
With the traditional, very conservative firmware behavior this bug did not cause problems in practice, but with OVMF starting to use the full physical address space live migrations in heterogeneous clusters started to fail because of that.
In libvirt 9.5.0 and newer this has been fixed.
In general, it is a good idea to set the qemu config option host-phys-bits=on.
In case guests can't deal with PCI bars being mapped at high addresses, the host-phys-bits-limit=bits option can be used to limit the address space usage. I'd suggest sticking to values seen in actual processors, so 40 for AMD and 39 for Intel are good candidates.
In case you are running 32-bit guests with a lot of memory (which btw isn't a good idea performance-wise) you might need to turn off long mode support to force the PCI bars being mapped below 4G. This can be done by simply using qemu-system-i386 instead of qemu-system-x86_64, or by explicitly setting lm=off in the -cpu options.
Some people already noticed and asked questions. So I guess I better write things down in my blog, so I don't have to answer the questions over and over again, and I hope to also clarify some things about distro firmware builds.
So, yes, the jenkins autobuilder creating the firmware repository at https://www.kraxel.org/repos/jenkins/ has been shut down yesterday (Jul 19th 2022). The repository will stay online for the time being, so your established workflows will not instantly break. But the repository will not get updates any more, so it is wise to start looking for alternatives now.
The obvious primary choice would be to just use the firmware builds provided by your distribution. I'll cover edk2 only, which seems to be by far the most popular use, even though there are also builds for other firmware projects.
Given I'm quite familiar with the RHEL / Fedora world I can give some advice here. The edk2-ovmf package comes with multiple images for the firmware code and the varstore template which allow for various combinations. The most important ones are:
- OVMF_CODE.secboot.fd and OVMF_VARS.secboot.fd: secure boot capable build, with the secure boot certificates enrolled in the varstore.
- OVMF_CODE.secboot.fd and OVMF_VARS.fd: secure boot capable build with a blank varstore, so secure boot is not active.
- OVMF_CODE.fd and OVMF_VARS.fd: build without SMM and without secure boot support.
The classic way to set this up in libvirt looks like this:
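With the Fedora paths, roughly (the secboot build additionally needs SMM enabled in the domain features):

<os>
  <loader readonly='yes' secure='yes' type='pflash'>/usr/share/edk2/ovmf/OVMF_CODE.secboot.fd</loader>
  <nvram template='/usr/share/edk2/ovmf/OVMF_VARS.secboot.fd'>/var/lib/libvirt/qemu/nvram/guest_VARS.fd</nvram>
</os>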
To make this easier the firmware builds come with json files describing their capabilities and requirements. You can find these files in /usr/share/qemu/firmware/. libvirt can use them to automatically find suitable firmware images, so you don't have to write the firmware image paths into the domain configuration. You can simply use this instead:
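That boils down to a single attribute in the domain XML:

<os firmware='efi'>
  <type arch='x86_64' machine='q35'>hvm</type>
</os>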
libvirt also allows asking for specific firmware features. If you don't want to use secure boot, for example, you can ask for the blank varstore template (no secure boot keys enrolled) this way:
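Assuming a libvirt new enough to support firmware feature selection (7.2.0 or later):

<os firmware='efi'>
  <firmware>
    <feature enabled='no' name='enrolled-keys'/>
  </firmware>
</os>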
In case you change the configuration of an existing virtual machine you might (depending on the kind of change) have to run virsh start --reset-nvram domain once to start over with a fresh copy of the varstore template.
The world has moved forward. UEFI isn't a niche use case any more. Linux distributions all provide good packages these days. The edk2 project got good CI coverage (years ago it was my autobuilder raising the flag when a commit broke the gcc build). The edk2 project got a regular release process distros can (and do) follow.
All in all the effort to maintain the autobuilder doesn't look justified any more.
To build edk2 you need to have a bunch of tools installed. A compiler and make are required of course, but also iasl, nasm and libuuid. So install them first (package names are for centos/fedora).
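On Fedora and CentOS the iasl binary is shipped in the acpica-tools package:

dnf install gcc make acpica-tools nasm libuuid-devel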
If you want to cross-build arm firmware on an x86 machine you also need cross compilers. While being at it, also set the environment variables needed to make the build system use the cross compilers:
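For the GCC5 toolchain profile used below the variables are named like this (Fedora cross compiler package names):

dnf install gcc-aarch64-linux-gnu gcc-arm-linux-gnu
export GCC5_AARCH64_PREFIX=aarch64-linux-gnu-
export GCC5_ARM_PREFIX=arm-linux-gnu-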
Next clone the tianocore/edk2 repository and also fetch the git submodules.
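The usual git dance:

git clone https://github.com/tianocore/edk2.git
cd edk2
git submodule update --init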
The edksetup script will prepare the build environment for you. The script must be sourced because it sets some environment variables (WORKSPACE being the most important one). This must be done only once (as long as you keep the shell with the configured environment variables open).
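In the toplevel edk2 directory:

source edksetup.sh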
Next step is building the BaseTools (also needed only once):
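That is a single make call:

make -C BaseTools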
Note: Currently (April 2022) BaseTools are being rewritten in Python, so most likely this step will not be needed any more at some point in the future.
Finally the build (for x64 qemu) can be kicked off:
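A plain default build, using the GCC5 toolchain profile:

build -t GCC5 -a X64 -p OvmfPkg/OvmfPkgX64.dsc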
The firmware volumes built can be found in Build/OvmfX64/DEBUG_GCC5/FV.
Building the aarch64 firmware instead:
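Same game, different platform description file:

build -t GCC5 -a AARCH64 -p ArmVirtPkg/ArmVirtQemu.dsc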
The build results land in Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/FV.
Qemu expects the aarch64 firmware images to be 64M in size. The firmware images can't be used as-is because of that; some padding is needed to create an image which can be used for pflash:
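One way to do it (the output file names are mine):

dd if=/dev/zero of=QEMU_EFI-pflash.raw bs=1M count=64
dd if=Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/FV/QEMU_EFI.fd of=QEMU_EFI-pflash.raw conv=notrunc
dd if=/dev/zero of=QEMU_VARS-pflash.raw bs=1M count=64
dd if=Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/FV/QEMU_VARS.fd of=QEMU_VARS-pflash.raw conv=notrunc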
There are a bunch of compile time options, typically enabled using -D NAME or -D NAME=TRUE. Options which are enabled by default can be turned off using -D NAME=FALSE. Available options are defined in the *.dsc files referenced by the build command. So a feature-complete build looks more like this:
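For example (pick your own feature set; these options exist in the OvmfPkg dsc files):

build -t GCC5 -a X64 -p OvmfPkg/OvmfPkgX64.dsc \
    -D FD_SIZE_4MB \
    -D NETWORK_IP6_ENABLE \
    -D NETWORK_HTTP_BOOT_ENABLE \
    -D TPM2_ENABLE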
Secure boot support (on x64) requires SMM mode. Well, it builds and works without SMM, but it's not secure then. Without SMM nothing prevents the guest OS writing directly to flash, bypassing the firmware, so protected UEFI variables are not actually protected.
Also suspend (S3) support works with SMM enabled only in case parts of the firmware (PEI specifically, see below for details) run in 32bit mode. So the secure boot variant must be compiled this way:
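The Ia32X64 platform runs PEI in 32bit mode and DXE in 64bit mode:

build -t GCC5 -a IA32 -a X64 -p OvmfPkg/OvmfPkgIa32X64.dsc \
    -D FD_SIZE_4MB \
    -D SMM_REQUIRE \
    -D SECURE_BOOT_ENABLE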
The FD_SIZE_4MB option creates a larger firmware image, 4MB instead of 2MB (default) in size, offering more space for both code and vars. The RHEL/CentOS builds use that. The Fedora builds are 2MB in size, for historical reasons.
If you need 32-bit firmware builds for some reason, here is how to do it:
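Same build script, 32-bit platforms:

build -t GCC5 -a ARM  -p ArmVirtPkg/ArmVirtQemu.dsc
build -t GCC5 -a IA32 -p OvmfPkg/OvmfPkgIa32.dsc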
The build results will be in Build/ArmVirtQemu-ARM/DEBUG_GCC5/FV and Build/OvmfIa32/DEBUG_GCC5/FV.
The x86 firmware builds create three different images:

- OVMF_VARS.fd: the varstore template. Every virtual machine needs a private, writable copy; libvirt typically stores these copies in /var/lib/libvirt/qemu/nvram.
- OVMF_CODE.fd: the firmware code. Read-only, so it can be shared by all virtual machines.
- OVMF.fd: the complete firmware image, both CODE and VARS. This can be loaded as ROM using -bios, with two drawbacks: (a) UEFI variables are not persistent, and (b) it does not work for SMM_REQUIRE=TRUE builds.
qemu handles pflash storage as block devices, so we have to create block devices for the firmware images:
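Firmware code goes into pflash unit 0, the writable varstore copy into unit 1:

qemu-system-x86_64 \
    -drive if=pflash,format=raw,unit=0,readonly=on,file=OVMF_CODE.fd \
    -drive if=pflash,format=raw,unit=1,file=my-guest-VARS.fd \
    [ ... ]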
Here is the arm version of that (using the padded files created using dd, see above):
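Same pattern:

qemu-system-aarch64 -M virt \
    -drive if=pflash,format=raw,unit=0,readonly=on,file=QEMU_EFI-pflash.raw \
    -drive if=pflash,format=raw,unit=1,file=QEMU_VARS-pflash.raw \
    [ ... ]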
The core edk2 repo holds a number of packages; each package has its own toplevel directory. The most interesting ones are MdePkg and MdeModulePkg (core code), OvmfPkg (x86 firmware for qemu), ArmVirtPkg (arm firmware for qemu), SecurityPkg (secure boot, TPM), NetworkPkg (network stack) and CryptoPkg (crypto support, using openssl).
The firmware modules in the edk2 repo are often named after the boot phase they are running in. Most drivers are named SomeThingDxe for example.
The tools can be installed using pip3 install ovmfctl. The project is hosted at gitlab. Usage: ovmfctl --input file.fd. It's a debugging tool which just prints the structure and content of firmware volumes.
This is a tool to print and modify variable store volumes. Main focus has been on certificate handling so far.
Enrolling certificates for secure boot support in virtual machines has been a rather painful process. It's handled by EnrollDefaultKeys.efi, which needs to be started inside a virtual machine to enroll the certificates and enable secure boot mode.
With ovmfctl it is dead simple:
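No guest boot needed, the varstore is updated directly on the host:

ovmfctl --input  OVMF_VARS.fd \
        --output OVMF_VARS.secboot.fd \
        --enroll-redhat --secure-boot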
This enrolls the Red Hat Secure Boot certificate, which is used by Fedora, CentOS and RHEL, as platform key. The usual Microsoft certificates are added to the certificate database too, so windows guests and shim.efi work as expected.
If you want more fine-grained control you can use the --set-pk, --add-kek, --add-db and --add-mok switches instead.
The --enroll-redhat switch above is actually just a shortcut for:
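A sketch of the expansion; the owner guid and the certificate file names are placeholders, the certificates must be available as files:

ovmfctl --input OVMF_VARS.fd --output out.fd --secure-boot \
    --set-pk  <owner-guid> RedHatSecureBootPKKEKkey1.pem \
    --add-kek <owner-guid> MicrosoftCorporationKEKCA2011.pem \
    --add-db  <owner-guid> MicrosoftWindowsProductionPCA2011.pem \
    --add-db  <owner-guid> MicrosoftCorporationUEFICA2011.pem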
If you just want the variable store printed use ovmfctl --input file.fd --print. Add --hexdump for more details.
Extract all certificates: ovmfctl --input file.fd --extract-certs.
Try ovmfctl --help for a complete list of command line switches. Note that input and output file can be identical for inplace updates.
That's it. Enjoy!
Most of my machines have a local postfix configured for outgoing mail. My workstation and my laptop forward all mail (over vpn) to the company internal email server. All I need for this to work is a relayhost line in /etc/postfix/main.cf:
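Something like this; the hostname is made up, and the brackets tell postfix to skip the MX lookup:

relayhost = [smtp.corp.example.com]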
Most unix utilities (including git send-email) try to send mails using /usr/sbin/sendmail by default. This tool will place the mail in the postfix queue for processing. The name of the binary is a convention dating back to the days when sendmail was the one and only unix mail processing daemon.
All my mail is synced to local maildir storage. I'm using offlineimap for the job. Plenty of other tools exist, isync is another popular choice.
Local mail storage has the advantage that reading mail is faster, especially in case you have a slow internet link. Local mail storage also makes it easy to index and search all your mail with notmuch.
I'm using server side filtering. The major advantage is that I always have the same view on all my mail. I can use a mail client on my workstation, the web interface or a mobile phone. Doesn't matter, I always see the same folder structure.
All modern email clients should be able to use maildir folders. I'm using neomutt. I also have used thunderbird and evolution in the past. All working fine.
The reason I use neomutt is that it is simply faster than GUI-based mailers, which matters when you have to handle a lot of email. It is also very easy to hook up scripts, which is very useful when it comes to patch processing.
I'm using git send-email for the simple cases and git-publish for the more complex ones. Where "simple" typically means single changes (not a patch series) where it is unlikely that I have to send another version addressing review comments.
git publish keeps track of the revisions you have sent by storing a git tag in your repo. It also stores the cover letter and the list of people Cc'ed on the patch, so sending out a new revision of a patch series is much easier than with plain git send-email.
git publish also features config profiles. This is helpful for larger projects where different subsystems use different mailing lists (and possibly different development branches too).
So, here comes the more interesting part: hooking scripts into neomutt for patch processing. Let's start with the config (~/.muttrc) snippet:
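A minimal version, assuming the scripts live in ~/bin:

bind  index,pager  p   noop
macro index,pager  pa  "<pipe-message>~/bin/patch-apply.sh<enter>"
macro index,pager  pl  "<pipe-message>~/bin/patch-lore.sh<enter>"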
First I map the 'p' key to noop (instead of print, which is the default configuration), which allows using two-key combinations starting with 'p' for patch processing. Then 'pa' is configured to run my patch-apply.sh script, and 'pl' runs patch-lore.sh.
Let's have a look at the patch-apply.sh script which applies a single patch:
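A sketch of the idea, not the literal script (error handling trimmed):

#!/bin/sh
# neomutt pipes the mail to stdin, store it in a temp file
mail=$(mktemp /tmp/patch-mail-XXXXXX)
cat > "$mail"

# shared logic, sets $project (see below)
. ~/bin/patch-find-project.sh

# try to apply the patch
cd "$project" || exit 1
if ! git am --message-id "$mail"; then
    # keep the decoded patch and the complete mail for inspection
    git am --show-current-patch=diff > /tmp/patch-failed.diff
    cp "$mail" /tmp/patch-failed.mail
    git am --abort
fi
rm -f "$mail"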
The mail is passed to the script on stdin, so the first thing the script does is store that mail in a temporary file. Next it tries to figure out which project the patch is for. The logic for that is in a separate file so other scripts can share it, see below. Finally it tries to apply the patch using git am. In case of a failure it stores both the decoded patch and the complete email before cleaning up and exiting.
Now for patch-find-project.sh. This script snippet tries to figure out the project by checking which mailing list the mail was sent to:
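Roughly like this; the list ids and paths are examples:

# sets $project, override with the PATCH_PROJECT environment variable
if [ -n "$PATCH_PROJECT" ]; then
    project="$PATCH_PROJECT"
else
    case "$(grep -m1 -i '^List-Id:' "$mail")" in
        *qemu-devel*)   project="$HOME/projects/qemu"  ;;
        *linux-kernel*) project="$HOME/projects/linux" ;;
        *) echo "ERROR: unknown project"; exit 1 ;;
    esac
fi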
The PATCH_PROJECT environment variable can be used to override the autodetect logic if needed.
Last script is patch-lore.sh. That one tries to apply a complete patch series, with the help of the b4 tool. b4 makes patch series management an order of magnitude simpler. It will find the latest revision of a patch series, bring the patches into the correct order, pick up tags (Reviewed-by, Tested-by etc.) from replies, check signatures and more.
The first part (store mail, find project) of the script is the same as in patch-apply.sh. Then the script gets the message id of the mail passed in and feeds that into b4. b4 will try to find the email thread on lore.kernel.org. In case this doesn't return results the script will query notmuch for the email thread instead and feed that into b4 using the --use-local-mbox switch.
Finally it tries to apply the complete patch series prepared by b4 with git am.
So, with all that in place, applying a patch series is just two key strokes in neomutt. Well, almost. I still need a terminal on the side which I use to make sure the correct branch is checked out, to run build tests etc.
This article is not about the basics of setting up a boot server. The internet has tons of tutorials on how to install a tftp server and how to boot your favorite OS from tftp. This article will focus on configuring network boot for libvirt-managed virtual machines.
The config file snippets are examples from my home network; home.kraxel.org is the local domain and 192.168.2.14 is the machine acting as boot server here. You have to replace those to match your setup of course. The same is true for the boot file names.
The default libvirt network uses 192.168.122.0/24. In case you use that unmodified these addresses will work fine for you, and in fact they should already be in your libvirt network configuration. If you have changed the default libvirt network I expect you know what you have to do 😎.
That is pretty simple. libvirt has support for that, so all you have to do is add a bootp tag with the ip address of your tftp server and the boot file name to the network config:
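The bootp element goes into the dhcp section:

<network>
  <name>default</name>
  [ ... ]
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.2' end='192.168.122.254'/>
      <bootp file='/pxelinux.0' server='192.168.2.14'/>
    </dhcp>
  </ip>
</network>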
You can edit the network configuration using virsh net-edit name. The default libvirt network is simply named default. The network needs a restart to apply any changes (virsh net-destroy name; virsh net-start name).
That was easy, right? Well, maybe not. In case this is not working for you, try running modprobe nf_nat_tftp. tftp uses udp, which means there are no connections at ip level, so the kernel has to look into the tftp packets to figure out how to route them correctly for a masqueraded network. The nf_nat_tftp kernel module does exactly that.
Note: Recent libvirt versions seem to take care to load nf_nat_tftp if needed, so there is a chance this works out-of-the-box for you.
Nevertheless that leads straight to the question: do we actually need tftp?
As you might have guessed the answer is no.
The ipxe boot roms support booting from http, by simply specifying a URL instead of a filename as bootfile. This was never formally specified though, so unfortunately you can't expect this to work with every boot rom. For qemu-powered virtual machines this isn't a problem at all because the qemu boot roms are built from ipxe. With physical machines you might have to jump through some extra hoops to chainload ipxe (not covered here).
The easiest way to get this going is to install apache on your tftp boot server, then configure a virtual host with the tftproot as document root. You can do so by dropping a snippet like this into /etc/httpd/conf.d/:
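Assuming the tftproot is /var/lib/tftpboot:

<VirtualHost *:80>
    ServerName boot.home.kraxel.org
    DocumentRoot /var/lib/tftpboot
    <Directory "/var/lib/tftpboot">
        Options Indexes FollowSymLinks
        Require all granted
    </Directory>
</VirtualHost>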
Enabling indexing is not needed for boot server functionality, but it might be handy if you want to access the boot server with your web browser for trouble-shooting.
Using the tftproot as document root has the advantage that the paths are identical for both tftp and http boot, so your pxelinux and grub configuration files should continue to work unmodified.
Now you can go edit your libvirt network config and replace the bootp configuration with this:
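The file name turns into a URL and the server attribute goes away:

<bootp file='http://boot.home.kraxel.org/pxelinux.0'/>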
Done. Don't forget to restart the network to apply the changes. Booting should be noticeably faster now (especially when fetching larger initrds), and any NAT traversal problems should be gone too.
When using http you can boot from pretty much any server on the internet, there is no need to set up your own. You can use for example the boot server provided by netboot.xyz, with a large collection of operating systems available as live systems and for install. Here is the bootp snippet for this:
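The boot file name is an assumption taken from the netboot.xyz documentation, check there for the current one:

<bootp file='http://boot.netboot.xyz/ipxe/netboot.xyz.kpxe'/>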
In most cases you probably want a local boot server for faster installs. But for a one-time test install of a new distro this might be more handy than downloading the install iso.
For EFI guests pxelinux.0 is pretty much useless indeed, so we must do something else for them. First question: how do we figure out that it is an EFI guest asking for a boot file? Let's have a look at the dhcp request, BIOS guest goes first. Captured using tcpdump -i virbr0 -v port bootps:
[ ... ]
0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:89:32:47
[ ... ]
    Client-Ethernet-Address 52:54:00:89:32:47 (oui Unknown)
    Vendor-rfc1048 Extensions
      [ ... ]
      ARCH Option 93, length 2: 0
      Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
Now a request from a (x64) EFI guest:
[ ... ]
0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:89:32:47
[ ... ]
    Client-Ethernet-Address 52:54:00:89:32:47 (oui Unknown)
    Vendor-rfc1048 Extensions
      [ ... ]
      ARCH Option 93, length 2: 7
      Vendor-Class Option 60, length 32: "PXEClient:Arch:00007:UNDI:003001"
See? The EFI guest uses arch 7 instead of 0, in both option 93 and option 60. So we will use that.
Unfortunately libvirt has no direct support for that. But libvirt uses dnsmasq as dhcp (and dns) server for the virtual networks. dnsmasq has support for this, and starting with libvirt version 5.6.0 it is possible to specify any dnsmasq config option in your libvirt network configuration using the dnsmasq xml namespace.
dnsmasq uses the concept of tags to implement this. Requests can be tagged using matches, and configuration directives can be applied to requests with certain tags. So, here is how it looks, using the efi-x64-pxe tag for x64 efi guests and /arch-x86_64/grubx64.efi as bootfile:
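A sketch of the network XML; the dnsmasq option syntax is real, the exact lines may differ from my actual config:

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  [ ... ]
  <dnsmasq:options>
    <dnsmasq:option value='# efi x64 (pxe)'/>
    <dnsmasq:option value='dhcp-match=set:efi-x64-pxe,option:client-arch,7'/>
    <dnsmasq:option value='dhcp-boot=tag:efi-x64-pxe,/arch-x86_64/grubx64.efi,,192.168.2.14'/>
  </dnsmasq:options>
</network>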
dnsmasq uses '#' for comments; it is here only to visually separate entries a bit. It will also be in the dnsmasq config files created by libvirt (in /var/lib/libvirt/dnsmasq/).
Sure. You might have already noticed that the UEFI boot manager has both UEFI PXEv4 and UEFI HTTPv4 entries. Here is what happens when you pick the latter:
[ ... ]
0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:89:32:47
[ ... ]
    Client-Ethernet-Address 52:54:00:89:32:47 (oui Unknown)
    Vendor-rfc1048 Extensions
      [ ... ]
      ARCH Option 93, length 2: 16
      Vendor-Class Option 60, length 33: "HTTPClient:Arch:00016:UNDI:003001"
It's arch 16 now. Also option 60 starts with HTTPClient instead of PXEClient.
So we can simply add another arch match to identify http clients.
Another detail we need to take care of is that the UEFI http boot client expects a reply with option 60 set to HTTPClient, otherwise the reply will be ignored. So we need to take care of that too, using dhcp-option-force. Here we go, using tag efi-x64-http for http clients:
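More lines for the dnsmasq:options block, same pattern:

<dnsmasq:option value='# efi x64 (http)'/>
<dnsmasq:option value='dhcp-match=set:efi-x64-http,option:client-arch,16'/>
<dnsmasq:option value='dhcp-boot=tag:efi-x64-http,http://boot.home.kraxel.org/arch-x86_64/grubx64.efi'/>
<dnsmasq:option value='dhcp-option-force=tag:efi-x64-http,60,HTTPClient'/>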
Complete example, defining a new libvirt network named netboot.xyz. You can store that in some file, then use virsh net-define file to create the network:
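A condensed version; addresses and file names are assumptions, adjust to taste:

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  <name>netboot.xyz</name>
  <forward mode='nat'/>
  <ip address='192.168.123.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.123.10' end='192.168.123.99'/>
      <bootp file='http://boot.netboot.xyz/ipxe/netboot.xyz.kpxe'/>
    </dhcp>
  </ip>
  <dnsmasq:options>
    <dnsmasq:option value='dhcp-match=set:efi-x64-http,option:client-arch,16'/>
    <dnsmasq:option value='dhcp-boot=tag:efi-x64-http,https://boot.netboot.xyz/ipxe/netboot.xyz.efi'/>
    <dnsmasq:option value='dhcp-option-force=tag:efi-x64-http,60,HTTPClient'/>
  </dnsmasq:options>
</network>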
Then, in your guest domain configuration, use <source network='netboot.xyz'/> to use the new network. With this both BIOS and UEFI guests can netboot from netboot.xyz. With UEFI you have to take care to pick the UEFI HTTPv4 entry from the firmware boot menu.
There is a world beyond x86. The arch field does not only specify the system architecture (bios vs. uefi) or the boot protocol (pxe vs. http), but also the cpu architecture. Here are the ones relevant for qemu:
| Code | Architecture |
|------|--------------|
| 0x00 | BIOS pxeboot (both i386 and x86_64) |
| 0x06 | EFI pxeboot, IA32 (i386) |
| 0x07 | EFI pxeboot, X64 (x86_64) |
| 0x0a | EFI pxeboot, ARM (v7) |
| 0x0b | EFI pxeboot, AA64 (v8 / aarch64) |
| 0x12 | powerpc64 |
| 0x16 | EFI httpboot, X64 |
| 0x18 | EFI httpboot, ARM |
| 0x19 | EFI httpboot, AA64 |
| 0x31 | s390x |
So, if you want to play with arm or powerpc without owning such a machine you can let qemu emulate it with tcg. If you want to netboot it, no problem, just add a few more lines to your network configuration. Here is an example for aarch64:
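Arch 0x0b is 11 in decimal; tag name and paths follow the scheme above:

<dnsmasq:option value='# efi aarch64 (pxe)'/>
<dnsmasq:option value='dhcp-match=set:efi-aa64-pxe,option:client-arch,11'/>
<dnsmasq:option value='dhcp-boot=tag:efi-aa64-pxe,/arch-aarch64/grubaa64.efi,,192.168.2.14'/>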
In case you are wondering why I place the grub binaries in subdirectories: grub tries to fetch the config file from the same directory, so that way I get per-arch config files and they are named /arch-aarch64/grub.cfg, /arch-x86_64/grub.cfg and so on. A nice side effect is that the toplevel directory is a bit less cluttered with files.
Well, the fundamental idea doesn't change. Look at arch option, then send different replies depending on what you find there. With other dhcp servers the syntax is different, but the pattern is the same. Here is a sample snippet for the isc dhcp server shipped with most linux distributions:
option arch code 93 = unsigned integer 16;

subnet 192.168.2.0 netmask 255.255.255.0 {
    [ ... ]
    if (option arch = 00:16) {
        option vendor-class-identifier "HTTPClient";
        filename "http://boot.home.kraxel.org/arch-x86_64/grubx64.efi";
    } else if (option arch = 00:07) {
        next-server 192.168.2.14;
        filename "/arch-x86_64/grubx64.efi";
    } else {
        next-server 192.168.2.14;
        filename "/pxelinux.0";
    }
}
After mentioning it here and there some people asked for details, so here we go. I'll go describe my setup, with some kubernetes and container basics sprinkled in.
This is part one of an article series and will cover cluster node installation and basic cluster setup.
Most cluster nodes are dual-core virtual machines. The control-plane node (formerly known as master node) has 8G of memory, most worker nodes have 4G of memory. It is a mix of x86_64 and aarch64 nodes. Kubernetes names these architectures amd64 and arm64, which is easily confused, so take care 😎.
The virtual nodes use bridged networking. So no separate network; they simply show up on my 192.168.2.0/24 home network like the physical machines connected. They get a static IP address assigned by the DHCP server, and I can easily ssh into each node.
All cluster nodes run Fedora 34, Server Edition.
I have a git repository with some config files, to simplify rebuilding a cluster node from scratch. The repository also has some shell scripts with the commands listed later in this blog post.
Let's go over the config files one by one.
This is needed for kubernetes networking.
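A typical /etc/sysctl.d/ snippet for that (the file name is mine):

# kubernetes networking
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1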
Load some kernel modules needed at boot. Again for kubernetes networking. Also vhost support which is needed by kata containers.
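Along the lines of this /etc/modules-load.d/ snippet (the exact module list is an assumption):

br_netfilter
# vhost support, needed by kata containers
vhost
vhost_net
vhost_vsock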
The upstream kubernetes rpm repository. Note this is not enabled (enabled=0) because I don't want normal fedora system updates to also update the kubernetes packages. For installing/updating kubernetes packages I can enable the repo using dnf --enablerepo=kubernetes ....
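The repo file looked like this back then (the packages have since moved to pkgs.k8s.io, so treat it as historical):

[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-$basearch
enabled=0
gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg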
Given I want to play with different container runtimes I've decided to use cri-o, which allows exactly that. Fedora has packages. They are in a module though, so that must be enabled first.
The cri-o version should match the kubernetes version you want to run. That is not the case in my cluster right now because I've learned that only after setting up the cluster; obviously the sky isn't falling in case they don't match. The next time I update the cluster I'll bring them into sync.
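Enabling the module pins the version stream; 1.22 is just a stand-in, pick the stream matching your kubernetes version:

dnf module enable cri-o:1.22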
Now we can go install the packages from the fedora repos. cri-o, runc (default container runtime), and a handful of useful utilities.
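Roughly; the exact utility list is a matter of taste (jq will be needed later):

dnf install cri-o runc podman jq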
Next in line are the kubernetes packages from the google repo. The repo has all versions, not only the most recent, so you can ask for the version you want and you'll get it. As mentioned above the repo must be enabled on the command line.
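A version-pinned install, again with 1.22 as a stand-in:

dnf --enablerepo=kubernetes install 'kubelet-1.22.*' 'kubeadm-1.22.*' 'kubectl-1.22.*'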
kubelet needs some configuration, my git repo with the config files has this:
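Namely the cgroup driver setting, assuming the /etc/sysconfig/kubelet mechanism:

KUBELET_EXTRA_ARGS=--cgroup-driver=systemd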
Asking kubelet to delegate all cgroups work to systemd is needed to make kubelet work with cgroups v2. With that in place we can reload the configuration and start the services:
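That is:

systemctl daemon-reload
systemctl enable --now crio kubelet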
Kubernetes cluster nodes need a few firewall entries so the nodes can speak to each other. I was too lazy to set all that up and just turned off the firewall. The cluster isn't reachable from the internet anyway, so 🤷.
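Which on Fedora means:

systemctl disable --now firewalld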
All the preparing steps up to this point are the same for all cluster nodes. Now we go initialize the control plane node.
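The init call, with the pod network discussed below:

kubeadm init --pod-network-cidr=10.85.0.0/16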
I picked the 10.85.0.0/16 network because that happens to be the default network used by cri-o, see /etc/cni/net.d/100-crio-bridge.conf.
This command will take a while. It will pull kubernetes container images from the internet, start them using the kubelet service, and finally initialize the cluster.
kubeadm will write the config file needed to access the cluster with kubectl to /etc/kubernetes/admin.conf. It'll make you cluster root, which kubernetes names the cluster-admin role in the rbac (role based access control) scheme.
For my devel cluster I simply use that file as-is instead of setting up some more advanced user authentication and access control. I place a copy of the file at $HOME/.kube/config (the default location used by kubectl). Copying the file to other machines works too, so I can also run kubectl on my laptop or workstation instead of ssh'ing into the control plane node.
Time to run the first kubectl command to see whether everything worked:
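The classic:

kubectl get nodes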
Yay! First milestone.
By default kubeadm init adds a taint to the control plane node so kubernetes wouldn't schedule pods there:
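It shows up in the node description; the taint spelling depends on the kubernetes version, the node name is made up:

kubectl describe node control-plane | grep -i taints
    Taints:  node-role.kubernetes.io/master:NoSchedule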
If you want to go for a single node cluster, all you have to do is remove that taint so kubernetes will schedule and run your pods directly on your new and shiny control plane node. The magic words for that are:
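The trailing dash removes the taint:

kubectl taint nodes --all node-role.kubernetes.io/master-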
Done. You can start playing with the cluster now.
If you want add one or more worker nodes to the cluster instead, then watch kubernetes distribute the load, read on ...
The worker nodes need a bootstrap token to authenticate when they want to join the cluster. The kubeadm init command creates a token and will also print the kubeadm join command needed to join. If you don't have that any more, no problem, you can always get the token later using kubeadm token list. In case the token did expire (they are valid for a day or so) you can create a new one using kubeadm token create. Beside the token, kubeadm also needs the hostname and port to be used to connect to the control plane node. The default port for the kubernetes API is 6443, so ...
... and check results:
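Same command as before, now listing the worker nodes too:

kubectl get nodes -o wide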
The node may show up in "NotReady" state for a while, when it has registered already but hasn't completed initialization yet.
Now repeat that procedure on every node you want to add to the cluster.
Both kubeadm and kubectl can return the data you ask for in various formats. By default they print a nice, human-readable table to the terminal. But you can also ask for yaml, json and others using the -o or --output switch. Specifically json is very useful for scripting: you can pipe the output through the jq utility (you might have noticed this in the list of packages to install at the start of this blog post) to fish out the items you actually need.
For starters two simple examples. You can get the raw bootstrap token this way:
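Presumably something like (structured output support in kubeadm is version dependent):

kubeadm token list --output json | jq -r '.token'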
Or check out some node details:
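For example (node name made up):

kubectl get node control-plane --output json | jq '.status.nodeInfo'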
There are way more possible use cases. When reading config and patch files kubectl likewise accepts both yaml and json as input.
There is one more basic thing to setup: Install a network fabric to get the pod network going. This is needed to allow pods running on different cluster nodes to talk to each other. When running a single node cluster this can be skipped.
There are a bunch of different solutions out there; I've settled for flannel in "host-gw" mode. First download kube-flannel.yml from github. Then tweak the configuration: make sure the network matches the pod network passed to kubeadm init, and change the backend. Here are the changes I've made:
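The relevant piece of the flannel ConfigMap after editing:

  net-conf.json: |
    {
      "Network": "10.85.0.0/16",
      "Backend": {
        "Type": "host-gw"
      }
    }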
Now apply the yaml file to install flannel:
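One apply call:

kubectl apply -f kube-flannel.yml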
The flannel pods are created in the kube-system namespace, you can check the status this way:
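For example:

kubectl get pods -n kube-system | grep flannel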
Once all pods are up and running your pod network should be working. One nice thing with "host-gw" mode is that this uses standard network routing of the cluster nodes and you can inspect the state with standard linux tools:
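The routing table on a node then looks roughly like this (addresses made up):

$ ip route
[ ... ]
10.85.0.0/24 dev cni0 proto kernel scope link src 10.85.0.1
10.85.1.0/24 via 192.168.2.121 dev eth0
10.85.2.0/24 via 192.168.2.122 dev eth0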
Each cluster node gets a /24 subnet of the pod network assigned. The cni0 device handles the subnet of the local node. The other subnets are routed to the other cluster nodes. Pretty straightforward.
So, that's it for part one. The internet has tons of kubernetes tutorials and examples which you can try on the cluster now. One good starting point is Kubernetes by example.
My plan for part two of this article series is installing and configuring some useful cluster services, with one of them being ingress which is needed to access your cluster services with a web browser.
So, what are the choices for implementing cut+paste support? Without guest cooperation the only possible way would be to send text as keystrokes to the guest, which has a number of drawbacks.
So, this is not something to consider seriously. Instead we need help from the guest, which is typically implemented with some agent process running inside the guest. The options are writing a new agent or reusing an existing one.
Reusing the spice agent has some major advantages. For starters there is no need to write any new guest code for this. Less work for developers and maintainers. Also the agent has been packaged for years by most distributions (typically the package is named spice-vdagent). So it is easily available, making things easier for users, and guest images with the agent installed work out-of-the-box.
Downside is that this is a bit confusing as you need the spice agent in the guest even when not using spice on the host. So I'm writing this article to address that ...
The spice guest agent is not a single process but two: one global daemon running as system service (spice-vdagentd) and one process (spice-vdagent) running in desktop session context.
The desktop process will handle everything which needs access to your display server. That includes cut+paste support. It will also talk to the system service. The system service in turn connects to the host using a virtio-serial port. It will relay data messages between desktop process and host and also process some of the requests (mouse messages for example) directly.
On the host side qemu simply forwards the agent data stream to the spice client and visa versa. So effectively the spice guest agent can communicate directly with the spice client. It's configured this way:
The virtio-serial port shows up as /dev/virtio-ports/com.redhat.spice.0 inside the guest.
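On the host side that is (the guest sees the port as the device file just mentioned):

-device virtio-serial-pci \
-chardev spicevmc,id=vdagent,name=vdagent \
-device virtserialport,chardev=vdagent,name=com.redhat.spice.0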
The central piece of code is the new qemu clipboard manager (ui/clipboard.c). Initially it supports only plain text. The interfaces are designed for multiple data types though, so adding support for more data types later on is possible.
There are three peers which can talk to the qemu clipboard manager:

- The vnc server (ui/vnc-clipboard.c), so vnc clients with cut+paste support can exchange data with the qemu clipboard.
- The gtk ui (ui/gtk-clipboard.c), which connects the qemu clipboard manager with your desktop clipboard.
- The qemu vdagent (ui/vdagent.c), which connects the guest to the qemu clipboard.

This landed in the qemu upstream repo a few days ago and will be shipped with the qemu 6.1 release.
The qemu vdagent is implemented as a chardev. It is a drop-in replacement for the spicevmc chardev; instead of forwarding everything to the spice client it implements the spice agent protocol and parses the messages itself. So only the chardev configuration changes, the virtserialport stays as-is:
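Like this:

-chardev qemu-vdagent,id=vdagent,name=vdagent,clipboard=on \
-device virtserialport,chardev=vdagent,name=com.redhat.spice.0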
The vdagent has two options to enable/disable vdagent protocol features: mouse=on|off controls wheel mouse support (default on), and clipboard=on|off controls clipboard support (default off).
No immediate plans right now, but I have some ideas what could be done.
Maybe I look into them when I find some time. No promise though. Patches are welcome.