physical address space in qemu
The physical address space is where all memory and most IO resources are located: PCI memory bars, PCI MMIO bars, and platform devices like lapic, io-apic, hpet, tpm, ...
On your linux machine you can use lscpu to see the size of the physical address space:
    $ lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Address sizes:       39 bits physical, 48 bits virtual
                         ^^^^^^^^^^^^^^^^
    [ ... ]
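To turn the number of address bits into a size, a small sketch like the following can help. It assumes the "Address sizes" line has the format shown in the lscpu output; the function names are made up for illustration.

```python
import re


def parse_address_sizes(lscpu_output: str) -> tuple[int, int]:
    """Return (physical_bits, virtual_bits) from lscpu-style text."""
    match = re.search(
        r"Address sizes:\s*(\d+) bits physical,\s*(\d+) bits virtual",
        lscpu_output,
    )
    if match is None:
        raise ValueError("no 'Address sizes' line found")
    return int(match.group(1)), int(match.group(2))


def address_space_gib(bits: int) -> int:
    """Size in GiB of an address space that is `bits` wide (bits >= 30)."""
    return 1 << (bits - 30)


# Sample text mirroring the lscpu output shown in this article.
sample = (
    "Architecture: x86_64\n"
    "CPU op-mode(s): 32-bit, 64-bit\n"
    "Address sizes: 39 bits physical, 48 bits virtual\n"
)
phys_bits, virt_bits = parse_address_sizes(sample)
print(f"{phys_bits} bits physical = {address_space_gib(phys_bits)} GiB")
```

So 39 bits correspond to 512 GiB, 36 bits to 64 GiB and 40 bits to 1 TiB (1024 GiB).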
In /proc/iomem you can see how the address space is used. Note that the actual addresses are only shown to root.
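The "start-end : name" lines of /proc/iomem are easy to process with a script. The following is a minimal sketch; the sample text is illustrative and not taken from a real machine, and the helper names are hypothetical.

```python
def parse_iomem(text: str) -> list[tuple[int, int, str]]:
    """Parse /proc/iomem-style lines into (start, end, name) tuples."""
    entries = []
    for line in text.splitlines():
        if " : " not in line:
            continue  # skip anything that is not "range : name"
        addr_range, name = line.strip().split(" : ", 1)
        start_hex, end_hex = addr_range.split("-")
        entries.append((int(start_hex, 16), int(end_hex, 16), name))
    return entries


def highest_address(entries: list[tuple[int, int, str]]) -> int:
    """Highest end address used by any listed resource."""
    return max(end for _start, end, _name in entries)


# Illustrative sample only; nested resources are indented in the real file.
sample = (
    "00000000-00000fff : Reserved\n"
    "00100000-7ffdffff : System RAM\n"
    "  01000000-01ffffff : Kernel code\n"
    "fee00000-fee00fff : Local APIC\n"
)
entries = parse_iomem(sample)
print(hex(highest_address(entries)))
```

To run this against the real file, pass the contents of /proc/iomem (read as root, otherwise the addresses are hidden) to parse_iomem().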
The physical address space problem on x86_64
The very first x86_64 processor (AMD Opteron) shipped with a physical address space of 40 bits (one terabyte). So when qemu added support for the (back then) new architecture, the qemu vcpu likewise got 40 bits of physical address space, probably on the assumption that this would be a safe baseline. It is still the default in qemu (version 8.1 as of today) for backward compatibility reasons.
Enter Intel. The first 64-bit processors shipped by Intel featured only 36 bits of physical address space. More recent Intel processors have 39, 42 or more physical address bits. The problem is that this limit applies not only to the real physical address space, but also to Extended Page Tables (EPT), which means the physical address space of virtual machines is limited too.
So the problem is that the virtual machine firmware does not know how much physical address space it actually has. When checking CPUID it gets back 40 bits, but it could very well be that only 36 bits are actually available.
Traditional firmware behavior
To address that problem the virtual machine firmware was very conservative with address space usage, to avoid crossing the unknown limit.
OVMF used to have an MMIO window with a fixed size (32GB), which was placed at the first multiple of 32GB after normal RAM. So a typical, smallish virtual machine had 0 -> 32GB for RAM and 32GB -> 64GB for IO, staying below the limit for 36 bits of physical address space (which equals 64GB).
VMs having more than 30GB of RAM will need address space above 32GB for RAM, which pushes the IO window above the 64GB limit. The assumption that hosts which have enough physical memory to run such big virtual machines also have a physical address space larger than 64GB seems to have worked well enough.
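The placement rule described above can be sketched as follows. This is not the OVMF code, just the arithmetic; `ram_end` (the first address after normal RAM) and the function names are assumptions made for the example.

```python
GIB = 1 << 30


def legacy_mmio_window(ram_end: int, window_size: int = 32 * GIB) -> tuple[int, int]:
    """Return (base, end) of the fixed-size IO window, end exclusive.

    The base is the first multiple of window_size at or after ram_end.
    """
    base = (ram_end + window_size - 1) // window_size * window_size
    return base, base + window_size


def fits_in_phys_bits(end: int, phys_bits: int) -> bool:
    """True if an exclusive end address is addressable with phys_bits."""
    return end <= (1 << phys_bits)


# Small VM: RAM ends at 8 GiB, window is 32 GiB -> 64 GiB, fine for 36 bits.
small = legacy_mmio_window(8 * GIB)
# Big VM: RAM ends at 40 GiB, window is 64 GiB -> 96 GiB, too high for 36 bits.
big = legacy_mmio_window(40 * GIB)
print(fits_in_phys_bits(small[1], 36), fits_in_phys_bits(big[1], 36))
```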
Nevertheless the fixed 32G-sized IO window became increasingly problematic. Memory sizes are growing, not only for main memory, but also for device memory. GPUs have gigabytes of memory these days.
Config options in qemu
Qemu has three -cpu options to control the physical address space advertised to the guest, and has had them for quite a while already.
- host-phys-bits={on,off}
  When enabled qemu will use the host's physical address bits for the guest, i.e. the guest can see the actual limit. I recommend enabling this everywhere.
  Upstream default: off (except for -cpu host, where it is on). Some downstream linux distro builds flip this to on by default.
- host-phys-bits-limit=bits
  Is used only with host-phys-bits=on. Can be used to reduce the number of physical address space bits communicated to the guest. Useful for live migration compatibility in case your machine cluster has machines with different physical address space sizes.
- phys-bits=bits
  Is used only with host-phys-bits=off. Can be used to set the number of physical address space bits to any value you want, including non-working values. Use only if you know what you are doing; it's easy to shoot yourself in the foot with this one.
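How the three options combine can be sketched with a small helper that builds the value passed to -cpu. The helper itself is hypothetical (it is not part of qemu), and "qemu64" is used only as an example CPU model; the option names are the ones listed above.

```python
from typing import Optional


def build_cpu_option(
    model: str,
    host_phys_bits: bool,
    host_phys_bits_limit: Optional[int] = None,
    phys_bits: Optional[int] = None,
) -> str:
    """Build a -cpu value, rejecting combinations that have no effect."""
    if host_phys_bits and phys_bits is not None:
        raise ValueError("phys-bits is only used with host-phys-bits=off")
    if not host_phys_bits and host_phys_bits_limit is not None:
        raise ValueError("host-phys-bits-limit is only used with host-phys-bits=on")
    parts = [model, "host-phys-bits=" + ("on" if host_phys_bits else "off")]
    if host_phys_bits_limit is not None:
        parts.append(f"host-phys-bits-limit={host_phys_bits_limit}")
    if phys_bits is not None:
        parts.append(f"phys-bits={phys_bits}")
    return ",".join(parts)


print(build_cpu_option("host", True, host_phys_bits_limit=39))
print(build_cpu_option("qemu64", False, phys_bits=40))
```

The first call yields `host,host-phys-bits=on,host-phys-bits-limit=39`, the second `qemu64,host-phys-bits=off,phys-bits=40`.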
Changes in OVMF
Recent OVMF versions (edk2-stable202211 and newer) try to figure out the size of the physical address space using a heuristic: if the physical address space bits value received via CPUID is 40 or below, it is checked against known-good values, which are 36 and 39 for Intel processors and 40 for AMD processors. If that check passes, or if the number of bits is 41 or higher, OVMF assumes qemu is configured with host-phys-bits=on and the value can be trusted.
In case there is no trustworthy phys-bits value OVMF will continue with the traditional behavior described above.
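The heuristic can be written down in a few lines. This is a sketch of the logic as described above, not the actual OVMF source; the vendor strings and names are assumptions made for the example.

```python
# Known-good values for CPUID phys-bits of 40 or below, per CPU vendor.
KNOWN_GOOD_PHYS_BITS = {
    "intel": {36, 39},
    "amd": {40},
}


def phys_bits_trustworthy(phys_bits: int, vendor: str) -> bool:
    """Guess whether the CPUID phys-bits value reflects the real host limit."""
    if phys_bits >= 41:
        # Higher than the qemu default of 40, so presumably host-phys-bits=on.
        return True
    return phys_bits in KNOWN_GOOD_PHYS_BITS.get(vendor, set())


# The qemu default (40) on an Intel host is not a known-good Intel value.
print(phys_bits_trustworthy(40, "intel"))
print(phys_bits_trustworthy(39, "intel"))
```

With the qemu default of 40 bits on an Intel processor the check fails, which is exactly the case where the value may be wrong.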
In case OVMF trusts the phys-bits value it will apply some OVMF-specific limitations before actually using it:
- The concept of virtual memory does not exist in UEFI, so the firmware will identity-map everything. Without 5-level paging (which is not yet supported in OVMF) at most 128TB (phys-bits=47) can be identity-mapped, so OVMF cannot use more than that. The actual limit is phys-bits=46 (64TB) for now due to older linux kernels (4.15) having problems if OVMF uses phys-bits=47.
- In case gigabyte pages are not available OVMF will not use more than phys-bits=40 (1TB). This avoids high memory usage and long boot times due to OVMF creating lots of page tables for the identity mapping.
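Taken together, the two limitations amount to a simple clamp. A minimal sketch, with a hypothetical function name, assuming the trusted phys-bits value is already known:

```python
def ovmf_effective_phys_bits(phys_bits: int, has_gigabyte_pages: bool) -> int:
    """Apply the OVMF-specific limits to a trusted phys-bits value."""
    # Identity mapping without 5-level paging, plus old-kernel compatibility.
    phys_bits = min(phys_bits, 46)
    if not has_gigabyte_pages:
        # Avoid building huge numbers of page tables for the identity map.
        phys_bits = min(phys_bits, 40)
    return phys_bits


print(ovmf_effective_phys_bits(52, has_gigabyte_pages=True))
print(ovmf_effective_phys_bits(52, has_gigabyte_pages=False))
```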
The final phys-bits value will be used to calculate the size of the physical address space available. The 64-bit IO window will be placed as high as possible, i.e. at the end of the physical address space. The size of the IO window and also the size of the PCI bridge windows (for prefetchable 64-bit bars) will be scaled up with the physical address space, i.e. on machines with a larger physical address space you will also get larger IO windows.
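To illustrate the placement, here is a sketch which puts the window at the top of the address space and sizes it as a fixed fraction of that space. The fraction (one eighth, `fraction_shift=3`) is an assumption chosen for the example; the text above only says that the window scales with the address space.

```python
def dynamic_io_window(phys_bits: int, fraction_shift: int = 3) -> tuple[int, int]:
    """Return (base, size) of a 64-bit IO window at the end of the address space.

    fraction_shift is a hypothetical knob: size = address space >> fraction_shift.
    """
    address_space = 1 << phys_bits
    size = address_space >> fraction_shift
    base = address_space - size
    return base, size


GIB = 1 << 30
for bits in (39, 46):
    base, size = dynamic_io_window(bits)
    print(f"phys-bits={bits}: base={base // GIB} GiB, size={size // GIB} GiB")
```

Under this assumption a guest with phys-bits=39 would get a 64 GiB window starting at 448 GiB, and a guest with phys-bits=46 would get an 8 TiB window starting at 56 TiB.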
Changes in SeaBIOS
Starting with version 1.16.3 SeaBIOS uses a heuristic similar to OVMF to figure out whether there is a trustworthy phys-bits value.
If that is the case SeaBIOS will enable the 64-bit IO window by default and place it at the end of the address space like OVMF does. SeaBIOS will also scale the size of the IO window with the size of the address space.
Although the overall behavior is similar there are some noteworthy differences:
- SeaBIOS will not enable the 64-bit IO window in case there is no RAM above 4G, for better compatibility with old -- possibly 32-bit -- guests.
- SeaBIOS will not enable the 64-bit IO window in case the CPU has no support for long mode (i.e. it is a 32-bit processor), likewise for better compatibility with old guests.
- SeaBIOS will limit phys-bits to 46, similar to OVMF, likewise for better compatibility with old guests. SeaBIOS does not use paging though and does not care about support for gigabyte pages, so it will never limit phys-bits to 40.
- SeaBIOS has a list of devices which will never be placed in the 64-bit IO window. This list includes devices where SeaBIOS drivers must be able to access the PCI bars. SeaBIOS runs in 32-bit mode so these PCI bars must be mapped below 4GB.
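The SeaBIOS rules listed above can be summarized in a short sketch. It is a paraphrase of the described behavior, not SeaBIOS code (SeaBIOS is written in C), and it leaves out the per-device exclusion list; all names are hypothetical.

```python
from typing import Optional


def seabios_io_window_phys_bits(
    phys_bits: int,
    trustworthy: bool,
    ram_above_4g: bool,
    long_mode: bool,
) -> Optional[int]:
    """Return the phys-bits used for the 64-bit IO window, or None if disabled."""
    if not (trustworthy and ram_above_4g and long_mode):
        return None  # keep everything below 4G for old guests
    # Same cap as OVMF; no extra limit of 40 because SeaBIOS does not use paging.
    return min(phys_bits, 46)


print(seabios_io_window_phys_bits(52, True, True, True))
print(seabios_io_window_phys_bits(39, True, False, True))
```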
Changes in qemu
Starting with release 8.2 the firmware images bundled with upstream qemu are new enough to include the OVMF and SeaBIOS changes described above.
Live migration and changes in libvirt
The new firmware behavior triggered a few bugs elsewhere ...
When doing live migration the vcpu configuration on source and target host must be identical. That includes the size of the physical address space.
libvirt can calculate the cpu baseline for a given cluster, i.e. create a vcpu configuration which is compatible with all cluster hosts. That calculation did not include the size of the physical address space though.
With the traditional, very conservative firmware behavior this bug did not cause problems in practice. With OVMF starting to use the full physical address space, however, live migrations in heterogeneous clusters started to fail.
In libvirt 9.5.0 and newer this has been fixed.
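The idea behind including the address space size in the baseline is simply to use the smallest value found in the cluster. The sketch below is not libvirt code; the host names and numbers are hypothetical.

```python
def baseline_phys_bits(host_phys_bits: dict[str, int]) -> int:
    """Largest phys-bits value that every host in the cluster supports."""
    return min(host_phys_bits.values())


# Hypothetical cluster with mixed physical address space sizes.
cluster = {"host-a": 39, "host-b": 46, "host-c": 42}
limit = baseline_phys_bits(cluster)
print(f"host-phys-bits=on,host-phys-bits-limit={limit}")
```

A guest started with that limit sees the same physical address space size on every host of the cluster, which is what live migration needs.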
Troubleshooting tips
In general, it is a good idea to set the qemu config option host-phys-bits=on.
In case guests can't deal with PCI bars being mapped at high addresses the host-phys-bits-limit=bits option can be used to limit the address space usage. I'd suggest sticking to values seen in actual processors, so 40 for AMD and 39 for Intel are good candidates.
In case you are running 32-bit guests with a lot of memory (which btw isn't a good idea performance-wise) you might need to turn off long mode support to force the PCI bars to be mapped below 4G. This can be done by simply using qemu-system-i386 instead of qemu-system-x86_64, or by explicitly setting lm=off in the -cpu options.
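The tips above translate into -cpu values like the ones produced by this sketch. The helper names are made up, the vendor strings are an assumption, and "qemu64" is only an example CPU model.

```python
# Suggested limits taken from values seen in actual processors.
SUGGESTED_LIMIT = {"intel": 39, "amd": 40}


def limited_cpu_option(host_vendor: str) -> str:
    """-cpu value for guests that can't handle PCI bars at high addresses."""
    limit = SUGGESTED_LIMIT[host_vendor]
    return f"host,host-phys-bits=on,host-phys-bits-limit={limit}"


def cpu_option_32bit_guest(model: str) -> str:
    """-cpu value with long mode turned off, for 32-bit guests."""
    return f"{model},lm=off"


print(limited_cpu_option("intel"))
print(cpu_option_32bit_guest("qemu64"))
```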