Hey there, it's time to talk about virtualization again. Almost 3 years after my initial attempts at squashing my vidya into a VM, I figured that a follow-up post is more than overdue.
Enjoy this little writeup of last week's hardware woes and software headaches c:
Before we start: This is not a tutorial.
If you want to build a VM yourself, here are some helpful links.
My setup changed quite a bit since my last post.
Here's what I currently use, screenfetch-style.
My current mainboard (Asus PRIME B350 Plus) is a relic of old ideas.
It was recycled from a cheap NAS that I built in 2018.
Boy was that a bad idea.
As it turns out, this board is the worst choice you could possibly make for GPU passthrough. It has two "x16" PCIe slots, but the second one only has enough pins to reach x8. I'm not kidding you. They literally didn't solder the other pins on that slot, but still label and sell it as x16.
Well whatever, bad mainboards won't stop us, right?
Turns out if you disable "CSM" and boot once with cables only connected to the second GPU, it will remember that for the following boots. Nice, one step closer.
From here on I did the usual steps.
Blacklisted my Vega using vfio-pci:
Enabled AVIC and nested page tables:
And finally added some new kernel params:
This time I decided to use a more "manual" approach.
I ditched libvirt and wrote a simple script that launches QEMU.
You can see the whole thing at glitch.sh, but I'll also walk you through some of the more important content here.
In order of the file:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/...
The VM can't control the CPU clock, so we need to ensure
that Linux doesn't underestimate our workload.
echo 1 | sudo tee /sys/bus/pci/...
This is a workaround for Vega. More on that later.
Allow closing the terminal without killing the VM
The guest uses an AMD GPU so we don't need to lie to Windows about the VM.
Yay team red!
debug-threads is important because we'll need it later for manual CPU pinning.
+topoext is required to use SMT/HT on AMD.
host-cache-info=on will pass the CPU's cache topology instead of emulating something.
hv_* are the usual "HyperV Enlightenments".
This sets the CPU core topology.
The Ryzen has 8 cores and 16 threads.
I decided to pass half of it to Windows.
Usually people use more
threads=1, but here's the thing with Ryzen:
The R7 consists of two "CCX" which both house 4 cores and have their own cache.
They are glued together with "Infinity Fabric" and can exchange data at roughly 40GB/s.
So it makes sense to pass one complete CCX and expose its SMT topology to increase cache locality.
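As a rough sketch of what "one CCX with SMT exposed" could look like on the QEMU command line: the helper below only builds an `-smp` argument from a core and thread count. The function name is made up for illustration, and the exact flag used in glitch.sh may differ.

```shell
# Hypothetical helper: build a QEMU -smp argument for a single socket.
# Total vCPUs = cores * threads, so 4 cores with 2 threads each gives 8.
smp_arg() {
    cores=$1
    threads=$2
    printf -- '-smp %d,sockets=1,cores=%d,threads=%d\n' \
        "$((cores * threads))" "$cores" "$threads"
}

# One CCX: 4 cores, SMT exposed as 2 threads per core.
smp_arg 4 2
```

With `threads=2` the guest sees sibling threads as siblings, which only helps if the vCPUs are later pinned to matching host siblings.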
Or in a simple picture (Blue Host, Green VM):
-m 16G ... -mem-path /hugepages/...
Memory allocations are an important topic to think about. This ensures that QEMU utilizes memory that was allocated in contiguous 1GB blocks at boot time, instead of falling back to the default (possibly fragmented) 4KB pages.
The kernel params for this are:
Then mount them with:
hugetlbfs /hugepages hugetlbfs defaults 0 0
This permanently locks 16G away, but the remaining 16G are more than enough for the host.
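For reference, reserving 1GB pages at boot usually comes down to three kernel parameters. This is a minimal sketch, assuming the CPU supports 1GB pages. The helper name is hypothetical and the values are derived from the 16G figure above, not copied from my actual config.

```shell
# Hypothetical helper: print kernel parameters that reserve SIZE_GB
# gigabytes of memory as 1GB hugepages at boot.
hugepage_params() {
    size_gb=$1
    printf 'default_hugepagesz=1G hugepagesz=1G hugepages=%d\n' "$size_gb"
}

# 16 pages of 1GB each for a 16G guest.
hugepage_params 16
```

After a reboot, `grep Huge /proc/meminfo` should show whether the pages were actually reserved.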
This is simply the modern replacement for
This adds a PCI "root port" that the GPU attaches to. Otherwise it will seem to Windows like the GPU was connected directly to the root bus, which will cause QEMU to change the emulated configuration to "Integrated Endpoint".
This means it looks to Windows like the GPU was physically inside the PCIe controller, and (more importantly) in this mode QEMU will omit any link speed configuration.
So TLDR, without this your GPU will likely run much slower.
Not just "slightly slow". We're talking PCIe x1 vs x16.
Passes keyboard and mouse via PS/2 using evdev.
This allows switching between Guest/Host on the fly by pressing LCtrl-RCtrl.
Passes the keyboard and mouse using VirtIO.
Automatically takes priority over PS/2 in the guest.
This still uses evdev events but omits a lot of emulation overhead,
and - subjectively - works a lot better and perfectly stutter-free in games.
Ok so with the VM up and running, let's talk about CPU pinning.
In libvirt that's rather easy, but it's also doable manually.
Remember debug-threads? That flag adds some pretty useful information to QEMU's thread names, which makes spotting the virtualized CPUs very easy:
In that loop we can use taskset and/or cgroups to assign the "cpu process" to a fixed CPU.
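A minimal sketch of such a loop, assuming QEMU was started with debug-threads enabled so that vCPU threads are named like "CPU 0/KVM". Both function names are made up for illustration, and this is not the exact code from glitch.sh.

```shell
# Extract the vCPU index from a thread name such as "CPU 3/KVM".
# Returns non-zero for any other thread name.
vcpu_index_from_comm() {
    case "$1" in
        "CPU "*"/KVM")
            idx=${1#CPU }
            idx=${idx%/KVM}
            printf '%s\n' "$idx"
            ;;
        *)
            return 1
            ;;
    esac
}

# Print "<vcpu index> <thread id>" for every vCPU thread of a QEMU process.
list_vcpu_threads() {
    qemu_pid=$1
    for comm_file in /proc/"$qemu_pid"/task/*/comm; do
        [ -r "$comm_file" ] || continue
        tid=${comm_file%/comm}
        tid=${tid##*/}
        if idx=$(vcpu_index_from_comm "$(cat "$comm_file")"); then
            printf '%s %s\n' "$idx" "$tid"
        fi
    done
}
```

Each printed pair can then be handed to something like `taskset -pc "$host_cpu" "$tid"`, where `$host_cpu` is whatever host CPU you picked for that vCPU.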
Make sure to pin the CPUs in the correct order.
For a Ryzen 7 this means:
0=>4, 1=>12, 2=>5, 3=>13, 4=>6, 5=>14, 6=>7, 7=>15 (guest=>host)
Not doing this will mix the two CCXs and cause a lot of lag and generally degraded performance.
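The mapping above follows a simple pattern, sketched here as a small function. It assumes the host numbers SMT siblings as N and N+8 (as in the list above) and that the guest gets the CCX starting at host core 4. The function name and its defaults are only for illustration.

```shell
# Map a guest vCPU index to a host CPU, keeping SMT siblings together.
# Assumes guest vCPUs 2k and 2k+1 are siblings of the same guest core.
host_cpu_for_vcpu() {
    vcpu=$1
    first_core=${2:-4}      # first host core of the CCX given to the guest
    sibling_offset=${3:-8}  # distance between SMT siblings on the host
    echo $(( first_core + vcpu / 2 + (vcpu % 2) * sibling_offset ))
}

# Reproduces the guest=>host list for 4 cores / 8 threads.
for vcpu in 0 1 2 3 4 5 6 7; do
    printf '%d=>%d\n' "$vcpu" "$(host_cpu_for_vcpu "$vcpu")"
done
```

Sibling numbering differs between systems, so it's worth checking `/sys/devices/system/cpu/cpu4/topology/thread_siblings_list` before trusting the offset.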
Additionally, I isolated the VM CPUs from the rest of the system using kernel params.
This ensures that Linux doesn't consider putting any tasks on these cores, to reduce context switches.
You could theoretically also move any and all host PIDs into a "host" cgroup which only has access to the other 8 threads. This would allow you to utilize all 16 threads when the VM is off, but it's (imo) a lot more complicated, and I don't really need more than 4c/8t on Linux anyway.
Tip: To diagnose context switch problems use
perf record -e 'sched:sched_switch' -C 4-7,12-15
(Obviously, adapt the -C param to your system).
Vega has something called the "reset issue" where it cannot be used anymore after the VM shuts down or reboots, until the host power-cycles or goes into standby.
One workaround that works for me is to only pass through the GPU "function", and leave the sound device unmapped. That will print some QEMU warnings during startup but generally lasts for at least 6-10 VM resets without any noticeable side-effects.
This however required the /rescan "patch" on my system.
Otherwise the host would eventually lock up.
Don't ask me why, I don't have an answer yet.
I'll do a follow-up post if I ever find out how to get this working cleanly.
It works well. Pretty well.
Time Spy reports a graphics score of 7151. (https://www.3dmark.com/3dm/36575667)
Guru3D scored 7.5k with the exact same GPU model (not overclocked),
which means the VM is running at roughly 96% bare-metal GPU performance.
The CPU score reaches 4336.
Other Ryzen 7 benchmarks usually score ~90% higher, which is expected
when you consider that only half the cores are passed.
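For anyone who wants to redo the math: a tiny sketch of the ratio calculation, taking the rounded 7.5k reference score at face value (the helper name is hypothetical). With that rounded figure the result lands slightly under the number quoted above.

```shell
# Print SCORE as a percentage of REFERENCE, rounded to one decimal place.
percent_of() {
    awk -v score="$1" -v ref="$2" 'BEGIN { printf "%.1f\n", score / ref * 100 }'
}

# VM graphics score vs. the rounded bare-metal reference.
percent_of 7151 7500   # prints 95.3
```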
To finish up, here's a final pic of Linux running Windows running CoD Zombies :)