Migrating a vagrant-libvirt VM to a newer host: skip `vagrant package`, hand it to virsh

Moving a vagrant-libvirt VM between Linux hosts looks like a two-step job: vagrant package on the source, then vagrant up on the destination. That is exactly what I tried, and both steps died in the same place. fog-libvirt's streamed download/upload of a large qcow2 from/to libvirt's storage pool reset mid-flight (Cannot recv data: Connection reset by peer, hung at 0%).

The streaming bug is in fog-libvirt's vol upload/download. The fix is to bypass vagrant-libvirt's vol-upload/vol-download entirely: flatten the overlay qcow2 against its backing file, drop the result in a libvirt storage pool by hand, then virsh define + virsh start. Treat vagrant-libvirt as the boot-time scaffolding only; the running VM is plain libvirt after that.

1. Make the box self-contained

vagrant package exists to bake the VM's disk + metadata into a .box file. With a libvirt provider the disk is usually a qcow2 with a backing file (qemu-img info ... | grep "backing file"), and vol-download only streams the overlay — you'd ship an incomplete box. Skip it and flatten manually:

# on the source host
sudo qemu-img convert -O qcow2 -p \
  /var/lib/libvirt/images/<vm-name>.img \
  /path/to/box.img

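box.img came out at ~30 GB for a Windows guest with a 127 GiB virtual disk and about 20 GB in use (the overlay and its backing file are merged into one standalone qcow2, so the result is larger than the overlay alone). The source vol keeps working, since convert only reads its input. Run the convert with the guest shut off so the copy is consistent.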
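
Before shipping the file, it is worth confirming that the flatten actually removed the backing-file reference. A minimal sketch, assuming the usual "backing file: ..." line in the human-readable qemu-img info output; has_backing_file is a made-up helper name, not a qemu tool:

```shell
# Hypothetical helper: reads `qemu-img info` text on stdin.
# Exits 0 if the image still references a backing file, 1 otherwise.
has_backing_file() {
  grep -q '^backing file:'
}
```

On the source overlay, qemu-img info /var/lib/libvirt/images/<vm-name>.img | has_backing_file should succeed; on the flattened /path/to/box.img it should fail.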

2. Ship the box to the new host
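
The copy itself is not spelled out in this step. A minimal sketch, assuming SSH access to the destination under a hypothetical host alias dest, with rsync and sha256sum installed on both ends; digest_of and ship_box are illustrative names:

```shell
# Hypothetical helper: print only the SHA-256 digest of a local file.
digest_of() {
  sha256sum < "$1" | cut -d' ' -f1
}

# Copy a file into the remote home directory, then compare digests.
# --partial keeps an interrupted transfer so a re-run can resume it;
# --sparse avoids writing out long runs of zeros on the far side.
# "dest" is an assumed SSH host alias.
ship_box() {
  src=$1
  name=$(basename "$src")
  rsync --sparse --partial --progress "$src" dest: || return 1
  remote=$(ssh dest "sha256sum < '$name'" | cut -d' ' -f1)
  [ "$(digest_of "$src")" = "$remote" ]
}
```

Run on the source as ship_box /path/to/box.img; a non-zero exit means either the transfer or the digest comparison failed.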

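# on the destination host, make a non-default pool first if /var is tight
sudo mkdir -p /home/<user>/libvirt-pool
sudo chown libvirt-qemu:libvirt /home/<user>/libvirt-pool
sudo chmod 771 /home/<user>/libvirt-pool
# use the system URI explicitly; a non-root virsh usually defaults to qemu:///session
export LIBVIRT_DEFAULT_URI=qemu:///system
virsh pool-define-as homepool dir --target "/home/<user>/libvirt-pool"
virsh pool-build homepool && virsh pool-start homepool
virsh pool-autostart homepool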

# convert directly into the pool (bypasses fog-libvirt stream upload)
sudo qemu-img convert -O qcow2 -p ~/box.img \
  /home/<user>/libvirt-pool/<vm-name>.img

# build the box on the destination so the vagrant CLI recognises it
# (virtual_size is in GiB: take the byte count from the parentheses and round up)
cd ~ && cat > metadata.json <<EOF
{"provider":"libvirt","format":"qcow2","virtual_size":$(qemu-img info box.img | awk -F'[()]' '/virtual size/{split($2,b," "); print int((b[1]+2^30-1)/2^30)}')}
EOF
tar cf gitbash-local.box metadata.json box.img
vagrant box add gitbash-local gitbash-local.box

Two libvirt gotchas to remember:

  • The default pool path is /var/lib/libvirt/images. On a host with a small /var partition (Ubuntu 24.04 stock is 63 GB; apt and friends fill it fast), the default pool runs out of room long before you finish a 30 GB VM. Make a pool under /home and use that. df -h /var is part of pre-flight.
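  • Ubuntu's libvirt qemu runs as libvirt-qemu, not root. Pool directories need to be chown libvirt-qemu:libvirt and mode 771, or pool-build will fail with a permission error. Other distros may use a different account; check the user and group settings in /etc/libvirt/qemu.conf on the destination.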
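
The df -h /var pre-flight can be turned into a pass/fail check. A minimal sketch using POSIX df -Pk; has_free_gib is a made-up helper name, and the 40 GiB threshold in the usage note is an arbitrary example, not a measured requirement:

```shell
# Hypothetical helper: succeed if the filesystem holding directory $1 has
# at least $2 GiB available. `df -Pk` reports sizes in 1024-byte blocks,
# with the "Available" figure in column 4 of the second line.
has_free_gib() {
  avail_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}
```

For example, has_free_gib /var/lib/libvirt/images 40 || echo "use a pool under /home" would flag a tight /var before the convert starts.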

3. Hand the VM to virsh

Skip vagrant up. The virsh define of a saved dumpxml is enough:

# pull dumpxml from the source (and a network it depends on, e.g.
# vagrant-libvirt's management network)
virsh -c qemu:///system dumpxml <source-domain> > gitbash.xml
virsh -c qemu:///system net-dumpxml vagrant-libvirt > vagrant-libvirt.xml

# edit: change <name>, replace disk <source file=> with the new pool path,
# drop <uuid> from network (libvirt will generate a fresh one)
sed -i ... gitbash.xml
sed -i '/<uuid>/d' vagrant-libvirt.xml

# copy both files to the destination, then define and start there
scp gitbash.xml vagrant-libvirt.xml dest:/tmp/
ssh dest "set -e
          export LIBVIRT_DEFAULT_URI=qemu:///system
          virsh net-define /tmp/vagrant-libvirt.xml
          virsh net-start vagrant-libvirt
          virsh net-autostart vagrant-libvirt
          virsh define /tmp/gitbash.xml
          virsh start gitbash"
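
The sed -i ... edit above is left elided in the original notes. One possible shape for it, as a sketch only: it assumes a domain with a single file-backed disk, paths that contain no | or & characters, and a made-up helper name rewrite_domain_xml. It filters stdin to stdout rather than editing in place, so the original dump stays untouched:

```shell
# Hypothetical helper: rewrite a libvirt domain XML read from stdin.
# $1 = new domain name, $2 = new disk image path.
# Caveat: every <source file=...> is rewritten, so a domain with a second
# disk or an attached ISO needs a more selective edit.
rewrite_domain_xml() {
  sed -e "s|<name>[^<]*</name>|<name>$1</name>|" \
      -e "s|<source file=['\"][^'\"]*['\"]|<source file='$2'|"
}
```

Usage would look like rewrite_domain_xml gitbash "/home/<user>/libvirt-pool/<vm-name>.img" < dump.xml > gitbash.xml, where dump.xml is the unmodified dumpxml output.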

4. When the qemu versions don't match

The dumpxml's <type arch= machine=...> is host-specific. Pull the latest machine type the destination's qemu actually supports:

qemu-system-x86_64 -machine help | grep pc-q35

If the source was Debian trixie with qemu 10 (pc-q35-10.0) and the destination is Ubuntu 24.04 with qemu 8.2 (pc-q35-noble at most), the machine attribute of <type> has to be downgraded in the XML, and the Windows guest re-runs HAL detection on first boot: read 600 MB, write 130 MB, take 5–10 minutes through one or two auto-restarts. The disk I/O counters from virsh domblkstat <domain> keep climbing during that time; that is the HAL reconfig running, not a BSOD. Wait it out.
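
Picking the target machine type can be scripted. A minimal sketch, assuming GNU sort (for -V) and the usual one-machine-per-line layout of -machine help; newest_q35 is a made-up helper name. It only considers numbered pc-q35-X.Y types and deliberately skips distro aliases such as pc-q35-noble:

```shell
# Hypothetical helper: reads `qemu-system-x86_64 -machine help` on stdin
# and prints the highest numbered pc-q35-X.Y machine type it lists.
newest_q35() {
  awk '$1 ~ /^pc-q35-[0-9]+\.[0-9]+$/ {print $1}' | sort -V | tail -n 1
}
```

On the destination, qemu-system-x86_64 -machine help | newest_q35 prints a value that can go into the machine= attribute of gitbash.xml before virsh define.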

5. What survives the move on its own

  • Tailscale identity: the machine key lives on the disk; the IP (100.x.y.z) re-binds when the service comes back up.
  • Cloudflare DNS records: no change needed (the IP is preserved).

6. What doesn't survive

  • Service autostart types: HAL reconfig rewrites service-start graphs. The Tailscale service dropped to Manual after the move; first boot had no Tailscale IP until I ran Start-Service Tailscale from an elevated PowerShell. Verify on next reboot and Set-Service ... -StartupType Automatic if it didn't stick.
  • vagrant CLI control: once you virsh start from the hand-edited XML, vagrant on the new host has no record of the domain, neither in its ~/.vagrant.d index nor in the project's .vagrant directory. vagrant up from the project dir on the new host would create a second domain alongside this one. Either keep ownership of the domain via virsh and don't vagrant up, or rebuild the project's .vagrant/machines/<name>/libvirt/ state to satisfy vagrant-libvirt's bookkeeping.

The recovery path is also better now: the flattened box.img on the source host, the source's original vol, and a tar-able metadata.json are enough to rebuild on either host without going back to the original cloud box.
