In version 4.2 the microvm machine type was added to qemu. The initial commit describes it this way:
It's a minimalist machine type without PCI nor ACPI support, designed for short-lived guests. microvm also establishes a baseline for benchmarking and optimizing both QEMU and guest operating systems, since it is optimized for both boot time and footprint.
The initial code uses the minimal qboot firmware to initialize the guest, to load a linux kernel and boot it. For network/storage/etc devices virtio-mmio is used. The configuration is passed to the linux kernel on the command line so the guest is able to find the devices.
That works for direct kernel boot, using vmlinuz, because qemu can easily patch the kernel command line then. But what if you want to boot, for example, the Fedora Cloud image, using the Fedora kernel stored within the image?
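To make the command-line mechanism described above concrete, here is a minimal sketch of how such parameters can be put together. It assumes the linux kernel's `virtio_mmio.device=<size>@<baseaddr>:<irq>` parameter syntax; the helper name, the base addresses, the region size and the IRQ numbers are made up for illustration and are not taken from qemu.

```python
def virtio_mmio_param(size: int, base: int, irq: int) -> str:
    """Format one virtio_mmio.device= kernel parameter (<size>@<baseaddr>:<irq>)."""
    return f"virtio_mmio.device={size}@{base:#x}:{irq}"


# Hypothetical layout: two devices, 512 bytes of registers each,
# placed back to back, each with its own interrupt line.
FIRST_BASE = 0xFEB00E00  # assumption, for illustration only
FIRST_IRQ = 12           # assumption, for illustration only

params = [
    virtio_mmio_param(512, FIRST_BASE + index * 0x200, FIRST_IRQ + index)
    for index in range(2)
]

# The parameters are simply appended to whatever command line was requested.
cmdline = " ".join(["console=ttyS0", *params])
print(cmdline)
```

This only works when whoever builds the command line also controls the kernel being booted, which is exactly the limitation discussed here.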
A better plan for device discovery
When not using direct kernel boot, patching the kernel command line for device discovery isn't going to fly, so we need something else. There are two established standard ways to do that in modern systems. One is device trees. The other is ACPI. Both have support for virtio-mmio.
A device tree entry for virtio-mmio looks like this:
And this is the ACPI DSDT version:
Both carry essentially the same information: what kind of device it is and which resources (registers & interrupt) it uses.
On the arm platform both are established, with device trees being common for small board computers like the raspberry pi and ACPI being used in the arm server space. On the x86 platform we don't have much of a choice though. There are some niche attempts to establish device trees, the google android goldfish platform for example. But for widespread support there is no way around using ACPI.
The nice thing about arm server using ACPI too is that this paved the way for us. The linux kernel supports both device tree and ACPI for the discovery of virtio-mmio devices:
root@fedora-bios ~/acpi# modinfo virtio-mmio | grep alias
alias: of:N*T*Cvirtio,mmioC*
alias: of:N*T*Cvirtio,mmio
alias: acpi*:LNRO0005:*
So linux kernel support is a solved problem already. Yay!
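As a small illustration of how to read the modinfo output above: aliases starting with `of:` are device tree matches, the one starting with `acpi` is the ACPI match. The sketch below classifies the alias lines accordingly; the function name is made up for this example.

```python
SAMPLE = """\
alias: of:N*T*Cvirtio,mmioC*
alias: of:N*T*Cvirtio,mmio
alias: acpi*:LNRO0005:*
"""


def discovery_methods(modinfo_output: str) -> set[str]:
    """Return the discovery mechanisms advertised by a module's alias lines."""
    methods = set()
    for line in modinfo_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() != "alias":
            continue
        value = value.strip()
        if value.startswith("of:"):
            methods.add("device tree")
        elif value.startswith("acpi"):
            methods.add("ACPI")
    return methods


print(sorted(discovery_methods(SAMPLE)))
```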
virtio-mmio support for seabios
If we want to load a kernel from the disk image, the firmware must be able to find and read the disk. We already have virtio-pci support for blk and scsi in seabios. The differences between virtio-pci and virtio-mmio transports are not that big. Also some infrastructure for different transport modes was already there to deal with legacy vs. modern virtio-pci. So adding virtio-mmio support to the drivers wasn't much of a problem.
But of course seabios also has the problem that it must discover the devices before it can initialize the driver. Various approaches to find virtio-mmio devices using the available information sources were tried. All of them had one non-working corner case or another, except using ACPI. So seabios ended up getting a simple DSDT parser for device discovery.
While at it, some other small fixes were added to seabios too, to make it work better with microvm. The hard dependency on the RTC CMOS has been removed, for example, so the latest seabios works fine with qemu -M microvm,rtc=off.
This ships with seabios version 1.14.
While speaking about seabios: when using a serial console I'd strongly recommend running with microvm,graphics=off. That will enable serial console support in seabios. This is one of the tweaks done by the qemu -nographic shortcut. The machine option works for q35 machine types too.
ACPI cleanups in qemu
Hooking up ACPI support for microvm on the qemu side turned out to be surprisingly difficult due to some historical baggage.
Years ago qemu used to have a static ACPI DSDT table. All ISA devices (serial & parallel ports, floppy, ...) are declared there, but they might not actually be present depending on qemu configuration. The LPC/ISA bridge has some bits in pci config space saying whether a device is actually present or not (qemu emulation follows physical hardware behavior here). So the devices have a _STA method looking up those bits and returning the device status. The guest had to run the method using its AML interpreter to figure out whether the declared device is actually there.
The microvm machine type simply has no PCI support, so that approach isn't going to fly. Also these days all ACPI tables are dynamically generated anyway, so there is no reason to have the guest's AML interpreter go dig into pci config space. Instead we can handle that in qemu when generating the DSDT table. Disabled devices are simply not listed. For enabled devices this is enough:
So I've ended up reorganizing and simplifying the code which creates the DSDT entries for ISA devices. This landed in qemu version 5.1.
ACPI support for microvm
Now with the roadblocks out of the way it was finally possible to add ACPI support to microvm. There is little reason to worry about backward compatibility with historic x86 platforms here; old guests wouldn't be able to handle virtio-mmio anyway. So this takes a rather modern approach and looks more like an arm virt machine than an x86 q35 machine. Like arm it uses the generic event device for power management support.
ACPI support for microvm is switchable, similar to the other machine types, using the acpi=on|off machine option. The -no-acpi switch works too. By default ACPI support is enabled.
With ACPI enabled qemu uses virtio-mmio enabled seabios as firmware and doesn't bother patching the linux kernel command line for device discovery.
With ACPI disabled qemu continues to use qboot as firmware like older qemu versions do. Likewise it continues to add virtio-mmio devices to the linux kernel command line.
This will be available in qemu version 5.2. It is already merged in the master branch.
Number one is device discovery, obviously; this is why we started all this in the first place. seabios and the linux kernel find virtio-mmio devices automatically. You can boot Fedora cloud images in microvm without needing any tricks. Probably other distros too, even though I didn't try that. Compiling the linux kernel with CONFIG_VIRTIO_MMIO=y (or =m & adding the module to initramfs) is pretty much the only requirement for this to work.
Number two is device discovery too. ACPI will also tell the kernel which devices are not there. So with acpi=on the kernel simply skips the PS/2 probe in case the DSDT doesn't list a keyboard controller. With acpi=off the kernel assumes legacy hardware, goes into probe-harder mode and needs one second to figure out that there really is no keyboard controller:
[ 0.414840] i8042: PNP: No PS/2 controller found.
[ 0.415287] i8042: Probing ports directly.
[ 1.454723] i8042: No controller found
We have a similar effect with the real time clock. With acpi=off the kernel goes and registers an IRQ line for the RTC even if the device isn't there.
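The cost of the direct i8042 probe can be read off the timestamps in the kernel log quoted above. A minimal sketch that extracts them and computes the time spent between "Probing ports directly" and "No controller found":

```python
import re

LOG = """\
[ 0.414840] i8042: PNP: No PS/2 controller found.
[ 0.415287] i8042: Probing ports directly.
[ 1.454723] i8042: No controller found
"""


def timestamps(log: str) -> list[float]:
    """Return the leading kernel timestamps (in seconds) of each log line."""
    pattern = re.compile(r"^\[\s*(\d+\.\d+)\]", re.MULTILINE)
    return [float(match.group(1)) for match in pattern.finditer(log)]


stamps = timestamps(LOG)
# Time between starting the direct probe and giving up.
probe_seconds = stamps[2] - stamps[1]
print(f"direct probe took {probe_seconds:.3f}s")
```

That is the roughly one second mentioned above, spent looking for a device that the DSDT could have ruled out right away.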
Number three is (basic) power management. ACPI provides a virtual power button, so the guest will honor shutdown requests sent that way. ACPI also provides S5 (aka poweroff) support, so qemu gets a notification from the guest when the shutdown is done and can exit.
Number four is better IRQ routing. The linux kernel finds the IO-APIC declared in the APIC table and uses it for IRQ routing. It is possible to use lines 16-23 for the virtio-mmio devices, avoiding IRQ sharing. Also we can refine the configuration using IRQ flags in the DSDT table.
With acpi=off this does not work reliably. I've seen the kernel ignore the IO-APIC in the past. It doesn't always happen though. It is not clear which factors play a role here; I didn't investigate that in detail. Maybe newer kernel versions are a bit more clever here and find the IO-APIC even without ACPI.
Bottom line: ACPI helps move the microvm machine type forward towards a world without legacy x86 platform devices.
But isn't ACPI bloated and slow?
Well, on my microvm test guest all ACPI tables combined are less than 1k in size:
root@fedora-bios /sys/firmware/acpi/tables# find -type f | xargs ls -l
-r--------. 1 root root 78 Oct 2 09:36 ./APIC
-r--------. 1 root root 482 Oct 2 09:36 ./DSDT
-r--------. 1 root root 268 Oct 2 09:36 ./FACP
I wouldn't call that bloated. This is a rather small virtual machine, with larger configurations (more CPUs, more devices) the tables will grow a bit of course.
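To check the "less than 1k" claim against the listing above, the sizes can simply be added up. The sketch below parses the ls output as quoted; the function name is made up for this example. On a live guest the same numbers could be read from the files under /sys/firmware/acpi/tables instead.

```python
import os

LISTING = """\
-r--------. 1 root root 78 Oct 2 09:36 ./APIC
-r--------. 1 root root 482 Oct 2 09:36 ./DSDT
-r--------. 1 root root 268 Oct 2 09:36 ./FACP
"""


def table_sizes(listing: str) -> dict[str, int]:
    """Map ACPI table name to size in bytes, parsed from `ls -l` output."""
    sizes = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 9:
            continue  # not a regular `ls -l` entry
        sizes[os.path.basename(fields[-1])] = int(fields[4])
    return sizes


sizes = table_sizes(LISTING)
total = sum(sizes.values())
print(sizes, total)
```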
When testing boot times I figured it is pretty hard to find any
differences due to ACPI initialization. The noise (differences when
doing 2-3 runs with identical configuration) is larger than the
acpi=on/off difference. Seems to be at most a handful
When trying that yourself, take care to boot the kernel with 'quiet'. This is a good idea anyway if you want to boot as fast as possible. The kernel prints more boot information with acpi=on, so slow console logging can skew your numbers if you let the kernel print out everything.
Runtime differences should be zero. There is only one AML method in the DSDT table. It toggles the power button when a notification comes in from the generic event device. It runs only on generic event device interrupts.
USB support for microvm
qemu just got a sysbus (non-pci) version of the xhci host adapter. It is needed for some arm boards. The nice thing is that, now that we have ACPI, we can just wire that up in microvm too, add it to the DSDT table, and linux will find and use it:
USB support will be disabled by default; it can be enabled using the usual machine option: qemu -M microvm,usb=on.
Patches for qemu are in flight, should land in version 5.2
Patches for seabios are merged, will be available in version 1.15
PCIe support for microvm
There is one more arm platform thing we can reuse in microvm: The PCI Express host bridge. Again the same approach: Wire everything up, declare it in the ACPI DSDT. Linux kernel finds and uses it.
Not adding an asl snippet this time. The PCIe host bridge is a complex device so the description is a bit larger. It has GSI subdevices, IRQ routing information for each PCI slot, mmconfig configuration etc. Also shows in the DSDT size (even though that is still less than half the size q35 has):
root@fedora-bios /sys/firmware/acpi/tables# ll DSDT
-r--------. 1 root root 3130 Oct 2 10:20 DSDT
PCIe support will be disabled by default; it can be enabled using the new pcie machine option: qemu -M microvm,pcie=on.
This will be available in qemu version 5.2. It is already merged in the master branch.
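The machine options mentioned throughout this post (acpi, rtc, graphics, usb, pcie) all go into the same -M argument. A minimal sketch of assembling it, for example in a test script; the helper name is made up, and the bare "qemu" binary name just follows the examples above.

```python
def microvm_machine_arg(**options: bool) -> list[str]:
    """Build the -M argument for microvm from on/off machine options."""
    parts = ["microvm"]
    for name, enabled in options.items():
        parts.append(f"{name}={'on' if enabled else 'off'}")
    return ["-M", ",".join(parts)]


# Serial console guest with ACPI, USB and PCIe enabled and no RTC.
args = ["qemu", *microvm_machine_arg(
    acpi=True, rtc=False, graphics=False, usb=True, pcie=True)]
print(" ".join(args))
```

The resulting list can be passed to subprocess.run() together with the usual disk, network and console options.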
My TODO list for qemu isn't very long:
Add second IO-APIC, allowing more IRQ lines for more virtio-mmio devices. Experimental patches exist.
IOMMU support, using virtio-iommu. Depends on ACPI spec update for virtio-iommu being finalized and support being merged into qemu and linux kernel. The actual wireup for microvm should be easy once all this is done.
Outside qemu there are a few more items:
Investigate microvm PCIe support in seabios. Experimental patches exist. I'm not sure yet whether seabios should care though.
So, is it worth the effort? The benefit would be that seabios could support booting from pci devices on microvm then.
Maybe add microvm support to edk2/ovmf.
It does not look that easy at a quick glance. ArmVirtPkg depends on device trees for virtio-mmio detection, so while we can re-use the virtio-mmio drivers we cannot re-use the device discovery code. Unless we maybe have qemu provide both ACPI tables and a device tree, even if ovmf happens to be the only device tree user.
It also is not clear what other dragons (dependencies on classic x86 platform devices) are lurking in the ovmf codebase.
Support the new microvm features (possibly adding microvm support first) in other projects.
Candidate number one is of course libvirt because it is the foundation for many other projects. Besides that, microvm support is probably mostly useful for cloud/container-style workloads, i.e. kata and kubevirt.