In version 4.2 the microvm machine type was added to qemu. The initial commit describes it this way:

It's a minimalist machine type without PCI nor ACPI support, designed for short-lived guests. microvm also establishes a baseline for benchmarking and optimizing both QEMU and guest operating systems, since it is optimized for both boot time and footprint.

The initial code uses the minimal qboot firmware to initialize the guest, to load a linux kernel and boot it. For network/storage/etc devices virtio-mmio is used. The configuration is passed to the linux kernel on the command line so the guest is able to find the devices.

That works for direct kernel boot, using qemu -kernel vmlinuz, because qemu can easily patch the kernel command line then. But what if you want - for example - boot the Fedora Cloud image? Using the Fedora kernel stored within the image?

A better plan for device discovery

When not using direct kernel boot patching the kernel command line for device discovery isn't going to fly, so we need something else. There are two established standard ways to do that in modern systems. One is device trees. The other is ACPI. Both have support for virtio-mmio.

A device tree entry for virtio-mmio looks like this:

virtio_block@3000 {
	compatible = "virtio,mmio";
	reg = <0x3000 0x100>;
	interrupts = <41>;
}

And this is the ACPI DSDT version:

Device (VR06)
{
    Name (_HID, "LNRO0005")  // _HID: Hardware ID
    Name (_UID, 0x06)  // _UID: Unique ID
    Name (_CCA, One)  // _CCA: Cache Coherency Attribute
    Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
    {
        Memory32Fixed (ReadWrite,
            0xFEB00C00,         // Address Base
            0x00000200,         // Address Length
            )
        Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, ,, )
        {
            0x00000016,
        }
    })
}

Both carry essentially the same information: What kind of device that is and which resources (registers & interrupt) it uses.

On the arm platform both are established, with device trees being common for small board computers like the raspberry pi and ACPI being used in the arm server space. On the x86 platform we don't have much of a choice though. There are some niche attempts to establish device trees, the google android goldfish platform for example. But for widespread support there is no way around using ACPI.

The nice thing about arm server using ACPI too is that this paved the way for us. The linux kernel supports both device tree and ACPI for the discovery of virtio-mmio devices:

root@fedora-bios ~/acpi# modinfo virtio-mmio | grep alias
alias:          of:N*T*Cvirtio,mmioC*
alias:          of:N*T*Cvirtio,mmio
alias:          acpi*:LNRO0005:*

So linux kernel support is a solved problem already. Yay!

virtio-mmio support for seabios

If we want load a kernel from the disk image the firmware must be able to find and read the disk. We already have virtio-pci support for blk and scsi in seabios. The differences between virtio-pci and virtio-mmio transports are not that big. Also some infrastructure for different transport modes was already there to deal with legacy vs. modern virtio-pci. So adding virtio-mmio support to the drivers wasn't much of a problem.

But of course seabios also has the problem that it must discover the devices before it can initialize the driver. Various approaches to find virtio-mmio devices using the available information sources where tried. All of them had the one or the other non-working corner case, except using ACPI. So seabios ended up getting a simple DSDT parser for device discovery.

While being at it some other small fixes where added to seabios too to make it work better with microvm. The hard dependency on the RTC CMOS has been removed for example, so latest seabios works fine with qemu -M microvm,rtc=off.

This ships with seabios version 1.14.

While speaking about seabios: When using a serial console I'd strongly recommend to run with qemu -M microvm,graphics=off. That will enable serial console support in seabios. This is one of the tweaks done by the qemu -nographic shortcut. The machine option works with the pc and q35 machine types too.

ACPI cleanups in qemu

Hooking up ACPI support for microvm on the qemu side turned out to be surprisingly difficuilt due to some historical baggage.

Years ago qemu used to have a static ACPI DSDT table. All ISA devices (serial & parallel ports, floppy, ...) are declared there, but they might not be actually present depending on qemu configuration. The LPC/ISA bride has some bits in pci config space saying whenever a device is actually present or not (qemu emulation follows physical hardware behavior here). So the devices have a _STA method looking up those bits and returning the device status. The guest had to run the method using AML interpreter to figure whenever the declared device is actually there.

// this is the qemu q35 ISA bridge at 00:1f.0
Device (ISA)
{
    Name (_ADR, 0x001F0000)  // _ADR: Address
    // ... snip ...
    OperationRegion (LPCE, PCI_Config, 0x82, 0x02)
    Field (LPCE, AnyAcc, NoLock, Preserve)
    {
        CAEN,   1,  // serial port #1 enable bit
        CBEN,   1,  // serial port #2 enable bit
        LPEN,   1   // parallel port enable bit
    }
}

// serial port #1
Device (COM1)
{
    Name (_HID, EisaId ("PNP0501") /* 16550A-compatible COM Serial Port */)
    Name (_UID, One)  // _UID: Unique ID
    Method (_STA, 0, NotSerialized)  // _STA: Status
    {
        Local0 = CAEN /* \_SB_.PCI0.ISA_.CAEN */
        If ((Local0 == Zero))
        {
            // serial port #1 is disabled
            Return (Zero)
        }
        Else
        {
            // serial port #1 is enabled
            Return (0x0F)
        }
    }
    // ... snip ...
}

The microvm machine type simply has no PCI support, so that approach isn't going to fly. Also these days all ACPI tables are dynamically generated anyway, so there is no reason to have the guests AML interpreter go dig into pci config space. Instead we can handle that in qemu when generating the DSDT table. Disabled devices are simply not listed. For enabled devices this is enough:

// serial port #1
Device (COM1)
{
    Name (_HID, EisaId ("PNP0501") /* 16550A-compatible COM Serial Port */)
    Name (_UID, One)  // _UID: Unique ID
    Name (_STA, 0x0F)  // _STA: Status
    // ... snip ...
}

So I've ended up reorganizing and simplifying the code which creates the DSDT entries for ISA devices. This landed in qemu version 5.1.

ACPI support for microvm

Now with the roadblocks out of the way it was finally possible to add acpi support to microvm. There is little reason to worry about backward compatibility to historic x86 platforms here, old guests wouldn't be able to handle virtio-mmio anyway. So this takes a rather modern approach and looks more like an arm virt machines than a x86 q35 machine. Like arm it uses the generic event device for power management support.

ACPI support for microvm is switchable, simliar to the other machine types, using the acpi=on|off machine option. The -no-acpi switch works too. By default ACPI support is enabled.

With ACPI enabled qemu uses virtio-mmio enabled seabios as firmware and doesn't bother patching the linux kernel command line for device discovery.

With ACPI disabled qemu continues to use qboot as firmware like older qemu versions do. Likewise it continues to add virtio-mmio devices to the linux kernel command line.

This will be available in qemu version 5.2. It is already merged in the master branch.

ACPI advantages

  • Number one is device discovery obviously, this is why we started all this in the first place. seabios and linux kernel find virtio-mmio devices automatically. You can boot Fedora cloud images in microvm without needing any tricks. Probably other distros too, even though I didn't try that. Compiling the linux kernel with CONFIG_VIRTIO_MMIO=y (or =m & adding the module to initramfs) is pretty much the only requirement for this to work.

  • Number two is device discovery too. ACPI will also tell the kernel which devices are not there. So with acpi=on the kernel simply skips the PS/2 probe in case the DSDT doesn't list a keyboard controller. With acpi=off the kernel assumes legacy hardware, goes into probe-harder mode and needs one second to figure that there really is no keyboard controller:

    [    0.414840] i8042: PNP: No PS/2 controller found.
    [    0.415287] i8042: Probing ports directly.
    [    1.454723] i8042: No controller found
    

    We have an simliar effect with the real time clock. With acpi=off the kernel goes register an IRQ line for the RTC even in case the device isn't there.

  • Number three is (basic) power management. ACPI provides a virtual power button, so the guest will honor shutdown requests sent that way. ACPI also provides S5 (aka poweroff) support, so qemu gets a notification from the guest when the shutdown is done and can exit.

  • Number four is better IRQ routing. The linux kernel finds the IO-APIC declared in the APIC table and uses it for IRQ routing. It is possible to use lines 16-23 for the virtio-mmio devices, avoiding IRQ sharing. Also we can refine the configuration using IRQ flags in the DSDT table.

    With acpi=off this does not work not reliable. I've seen the kernel ignore the IO-APIC in the past. Doesn't always happen though. Not clear which factors play a role here, I didn't investigate that in detail. Maybe newer kernel versions are a bit more clever here and find the IO-APIC even without ACPI.

Bottom line: ACPI helps moving the microvm machine type forward towards a world without legacy x86 platform devices.

But isn't ACPI bloated and slow?

Well, on my microvm test guest all ACPI tables combined are less than 1k in size:

root@fedora-bios /sys/firmware/acpi/tables# find -type f | xargs ls -l
-r--------. 1 root root  78 Oct  2 09:36 ./APIC
-r--------. 1 root root 482 Oct  2 09:36 ./DSDT
-r--------. 1 root root 268 Oct  2 09:36 ./FACP

I wouldn't call that bloated. This is a rather small virtual machine, with larger configurations (more CPUs, more devices) the tables will grow a bit of course.

When testing boot times I figured it is pretty hard to find any differences due to ACPI initialization. The noise (differences when doing 2-3 runs with identical configuration) is larger than the acpi=on/off difference. Seems to be at most a handful of milliseconds.

When trying that yourself take care to boot the kernel with 'quiet'. This is a good idea anyway if you want boot as fast as possible. The kernel prints more boot information with acpi=on, so slow console logging can skew your numbers if you let the kernel print out everything.

Runtime differences should be zero. There is only one AML method in the DSDT table. It toggles the power button when a notification comes in from the generic event device. It runs only on generic event device interrupts.

USB support for microvm

qemu just got a sysbus (non-pci) version of the xhci host adapter. It is needed for some arm boards. Nice thing is now that we have ACPI we can just wire that up in microvm too, add it in the DSDT table, then linux will find and use it:

Device (XHCI)
{
    Name (_HID, EisaId ("PNP0D10") /* XHCI USB Controller */)
    Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
    {
        Memory32Fixed (ReadWrite,
            0xFE900000,         // Address Base
            0x00004000,         // Address Length
            )
        Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, ,, )
        {
            0x0000000A,
        }
    })
}

USB support will be disabled by default, it can be enabled using the usual machine option: qemu -M microvm,usb=on.

Patches for qemu are in flight, should land in version 5.2
Patches for seabios are merged, will be available in version 1.15

PCIe support for microvm

There is one more arm platform thing we can reuse in microvm: The PCI Express host bridge. Again the same approach: Wire everything up, declare it in the ACPI DSDT. Linux kernel finds and uses it.

Not adding an asl snippet this time. The PCIe host bridge is a complex device so the description is a bit larger. It has GSI subdevices, IRQ routing information for each PCI slot, mmconfig configuration etc. Also shows in the DSDT size (even though that is still less than half the size q35 has):

root@fedora-bios /sys/firmware/acpi/tables# ll DSDT
-r--------. 1 root root 3130 Oct  2 10:20 DSDT

PCIe support will be disabled by default, it can be enabled using the new pcie machine option: qemu -M microvm,pcie=on.

This will be available in qemu version 5.2. It is already merged in the master branch.

Future work

My TODO list for qemu isn't very long:

  • Add second IO-APIC, allowing more IRQ lines for more virtio-mmio devices. Experimental patches exist.

  • IOMMU support, using virtio-iommu. Depends on ACPI spec update for virtio-iommu being finalized and support being merged into qemu and linux kernel. The actual wireup for microvm should be easy once all this is done.

Outside qemu there are a few more items:

  • Investigate microvm PCIe support in seabios. Experimental patches exist. I'm not sure yet whenever seabios should care though.

    The linux kernel can initialize the PCIe bus just fine on its own, whereas proper seabios support has its challenges. When running in real mode the mmconfig space can not be reached. Legacy pci config space access via port 0xcf8 is not available. Which breaks pcibios support. Probably fixable with a temporary switch to 32bit mode, at the cost of triggering a bunch of extra vmexits. Beside that seabios will spend more time at boot, initializing the pci devices.

    So, is it worth the effort? The benefit would be that seabios could support booting from pci devices on microvm then.

  • Maybe add microvm support to edk2/ovmf.

    Looks not that easy on a quick glance. ArmVirtPkg depends on device trees for virtio-mmio detection, so while we can re-use the virtio-mmio drivers we can not re-use device discovery code. Unless we maybe have qemu provide both ACPI tables and a device tree, even if ovmf happens to be the only device tree user.

    It also is not clear what other dragons (dependencies on classic x86 platform devices) are lurking in the ovmf codebase.

  • Support the new microvm features (possibly adding microvm support first) in other projects.

    Candidate number one is of course libvirt because it is the foundation for many other projects. Beside that microvm support is probably mostly useful for cloud/container-style workloads, i.e. kata and kubevirt.