veilor-os/test/run-vm.sh
veilor-org e83483a077 v0.5.30: broad error suppression + manual bootloader + virtio log capture
Three-layer fix for the persistent anaconda transaction failure that
killed v0.5.28 (gen_grub_cfgstub) and v0.5.29 (aggregate dnf5 error).

## Layer 1: broad error suppression in transaction_progress.py

dnf5 under RPM 6.0 + cmdline anaconda emits a final aggregate
`error("transaction process has ended with errors..")` at end of
transaction whenever its internal failure counter > 0, regardless of
whether we suppressed individual script_error events. Reproduced
twice. The narrow patch in v0.5.29 suppressed per-package errors but
the aggregate still raised PayloadInstallationError and aborted the
install before the bootloader phase ran.

The v0.5.30 patch turns the `elif token == 'error':` branch in
process_transaction_progress into a log.warning. All four producers
(cpio_error, script_error, unpack_error, generic error) now flow
through to a warning + continue. Pattern matches both the original
anaconda layout AND the v0.5.29 narrow-patched layout, so re-applying
on top of either is a no-op.

This brings us back to v0.5.28 broad-suppression behaviour. The
side effect that bit us in v0.5.28 (silent grub2-efi-x64 scriptlet
failure → empty /boot/efi/EFI/fedora/ → gen_grub_cfgstub fails)
is addressed by Layer 2 below.
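The Layer 1 rewrite can be sketched as a one-shot sed patch. This is a sketch only: the `self._error(message)` line and the surrounding layout are hypothetical stand-ins for whatever the shipped anaconda source actually contains; the real patch must match the on-ISO text.

```shell
# Sketch of the Layer 1 patch. ASSUMPTION: `self._error(message)` is a
# hypothetical stand-in for the error call under the
# `elif token == 'error':` branch in transaction_progress.py.
patch_transaction_progress() {
  # $1: path to anaconda's transaction_progress.py inside the ISO rootfs.
  # Downgrade the aggregate error to a warning so the transaction
  # continues instead of raising PayloadInstallationError.
  sed -i -E \
    's/self\._error\(message\)/log.warning(message)  # veilor v0.5.30/' "$1"
}
```

Because the substitution is idempotent (the replacement text no longer matches the pattern), re-applying it on an already-patched file is a no-op, mirroring the both-layouts property described above.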

## Layer 2: bootloader install moved out of anaconda

The generated install kickstart now has `bootloader --location=none`,
which tells anaconda NOT to invoke its own bootloader install code
path (and therefore NOT to call gen_grub_cfgstub). All grub work
moves into the chroot %post block:

  1. `dnf reinstall grub2-efi-x64 grub2-pc grub2-tools shim-x64
     efibootmgr` — re-runs scriptlets in the chroot with full
     PID 1 systemd state, so the systemd-run-style triggers that
     anaconda's chroot truncates actually execute.
  2. `grub2-install --target=x86_64-efi --efi-directory=/boot/efi
     --bootloader-id=fedora --no-nvram` — populates /boot/efi/EFI/fedora/
  3. `gen_grub_cfgstub /boot/grub2 /boot/efi/EFI/fedora` (or
     `grub2-mkconfig` fallback) — writes /boot/efi/EFI/fedora/grub.cfg.
  4. `efibootmgr -c -d <disk> -p <part> -L "veilor-os" -l \EFI\fedora\shimx64.efi`
     — registers the NVRAM boot entry pointing at the signed shim.

Each step logs to stdout and continues on failure (`set +e` block);
diagnostics surface in the install log without aborting the whole
%post.
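The four steps above can be sketched as a kickstart %post fragment. The disk (`/dev/vda`) and ESP partition number (`1`) are placeholders; the real installer script has to derive them from the installed layout.

```shell
%post
# Sketch of the manual bootloader block (kickstart fragment; runs in
# the installed chroot). Disk/partition values below are placeholders.
set +e   # log-and-continue: each failure surfaces in the install log
dnf -y reinstall grub2-efi-x64 grub2-pc grub2-tools shim-x64 efibootmgr
grub2-install --target=x86_64-efi --efi-directory=/boot/efi \
  --bootloader-id=fedora --no-nvram
gen_grub_cfgstub /boot/grub2 /boot/efi/EFI/fedora \
  || grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
efibootmgr -c -d /dev/vda -p 1 -L "veilor-os" -l '\EFI\fedora\shimx64.efi'
set -e
%end
```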

## Layer 3: virtio-serial log capture in run-vm.sh

Anaconda 43.x autodetects `/dev/virtio-ports/org.fedoraproject.anaconda.log.0`
and streams program/packaging/storage/anaconda logs through it in
real time, before any tmpfs / pivot, before networking, and the
stream survives a kernel panic. Wiring it into run-vm.sh gives the host a
tail-able log file at `test/anaconda-vm-YYYYMMDD-HHMMSS.log` for
every VM run.

We've lost logs three times in a row to anaconda failures + tmpfs
reboots. This breaks the loop.
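On the host side, the newest capture can be triaged with a small helper. The function name and the error patterns are illustrative assumptions, not a documented anaconda log format:

```shell
# Hypothetical triage helper: print only error-level lines from the
# newest anaconda VM log in a directory. The grep pattern is a guess
# at interesting lines, not an anaconda guarantee.
latest_anaconda_errors() {
  local dir="$1" log
  log="$(ls -t "$dir"/anaconda-vm-*.log 2>/dev/null | head -1)"
  [[ -n $log ]] || { echo "no anaconda logs in $dir"; return 0; }
  grep -E 'ERR|CRIT|Traceback' "$log" || echo "no errors in $log"
}
```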

## Diagnostic story

Before this commit: VM aborts → live ISO reboots itself → /tmp/
tmpfs gone → no logs → guess what failed. Three days, two and a
half false fixes.

After this commit: VM aborts → host has /home/admin/ai-lab/_github/veilor-os/test/anaconda-vm-*.log
with the actual scriptlet output, the actual exit codes, the
actual file-trigger failures. Future debug becomes evidence-based.

Files changed:
  kickstart/veilor-os.ks        — broad error suppression patch
  overlay/usr/local/bin/veilor-installer — --location=none + manual grub
  test/run-vm.sh                — virtio-serial chardev wiring

Verified: bash -n clean, ksvalidator clean.
2026-05-05 11:59:35 +01:00


#!/usr/bin/env bash
# Boot veilor-os ISO in KVM/QEMU under UEFI.
# Usage:
#   ./test/run-vm.sh                # boots latest ISO from build/out
#   ./test/run-vm.sh path/to.iso    # specific ISO
#   SECBOOT=1 ./test/run-vm.sh      # use OVMF Secure Boot firmware
#   FRESH=1 ./test/run-vm.sh        # wipe disk + nvram, re-install from scratch
#   NO_INJECT=1 ./test/run-vm.sh    # skip SSH-key auto-injection
#
# SSH-key auto-injection (chosen approach: dual — cloud-init NoCloud + QEMU
# monitor sendkey fallback)
# ------------------------------------------------------------------
# Goal: previously each test required logging in at the QEMU console and
# running `passwd -d liveuser`, editing sshd_config, etc. before
# `ssh -p 2222 liveuser@localhost` worked. This script eliminates that.
#
# Primary path (works for the *installed* system, not the live image):
#   * Detect host pubkey at ~/.ssh/id_ed25519.pub or ~/.ssh/id_rsa.pub
#   * Build a NoCloud cloud-init ISO (user-data + meta-data) via mkisofs/xorriso
#   * Mount it as a second virtual cdrom — Anaconda/cloud-init picks it up
#     automatically when installing because the seed has the magic
#     `cidata` volume label.
#
# Fallback path (works for the *live* image, which doesn't run cloud-init by
# default — dracut-live + livesys-scripts mount squashfs read-only and skip
# cloud-init.target):
#   * Open a QEMU monitor unix socket (-monitor unix:...).
#   * After ~90s (long enough for SDDM autologin → liveuser), background a
#     helper that pipes a sequence of `sendkey` events to the monitor:
#       Ctrl+Alt+F2 (drop to TTY)
#       "sudo passwd -d liveuser && sudo systemctl reload sshd\n"
#     This unblocks SSH on port 2222 without manual interaction.
#
# Both paths are best-effort; if the host has no pubkey, both are skipped
# and the script behaves exactly as before.
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
TEST_DIR="$REPO_ROOT/test"
DISK="$TEST_DIR/veilor-vm.qcow2"
NVRAM="$TEST_DIR/veilor-vm.nvram"
SEED_ISO="$TEST_DIR/cloud-init-seed.iso"
MONITOR_SOCK="$TEST_DIR/veilor-vm.monitor.sock"
ISO="${1:-$(ls -t "$REPO_ROOT"/build/out/*.iso 2>/dev/null | head -1)}"
[[ -n ${ISO:-} && -f $ISO ]] || { echo "[ERR] No ISO found. Build first: ./build/build-iso.sh"; exit 1; }
# OVMF firmware selection
if [[ "${SECBOOT:-0}" == "1" ]]; then
  OVMF_CODE=/usr/share/edk2/ovmf/OVMF_CODE.secboot.fd
  OVMF_VARS_SRC=/usr/share/edk2/ovmf/OVMF_VARS.secboot.fd
  NVRAM="$TEST_DIR/veilor-vm.nvram.secboot"
else
  OVMF_CODE=/usr/share/edk2/ovmf/OVMF_CODE.fd
  OVMF_VARS_SRC=/usr/share/edk2/ovmf/OVMF_VARS.fd
fi
# Reset on FRESH=1
if [[ "${FRESH:-0}" == "1" ]]; then
  rm -f "$DISK" "$NVRAM" "$SEED_ISO"
fi
# Provision disk + per-VM nvram once
[[ -f $DISK ]] || qemu-img create -f qcow2 "$DISK" 40G
[[ -f $NVRAM ]] || cp "$OVMF_VARS_SRC" "$NVRAM"
# ── Locate host SSH pubkey (ed25519 preferred, rsa fallback) ──
HOST_PUBKEY=""
if [[ "${NO_INJECT:-0}" != "1" ]]; then
  for cand in "$HOME/.ssh/id_ed25519.pub" "$HOME/.ssh/id_rsa.pub"; do
    if [[ -f $cand ]]; then
      HOST_PUBKEY="$(< "$cand")"
      echo "[INFO] using host pubkey: $cand"
      break
    fi
  done
fi
# ── Build cloud-init NoCloud seed ISO (primary path) ──
SEED_ARGS=()
if [[ -n $HOST_PUBKEY ]]; then
  SEED_DIR="$(mktemp -d)"
  trap 'rm -rf "$SEED_DIR"' EXIT
  cat > "$SEED_DIR/meta-data" <<EOF
instance-id: veilor-test-vm
local-hostname: veilor-test
EOF
  cat > "$SEED_DIR/user-data" <<EOF
#cloud-config
users:
  - name: liveuser
    ssh_authorized_keys:
      - $HOST_PUBKEY
  - name: admin
    ssh_authorized_keys:
      - $HOST_PUBKEY
    lock_passwd: false
    passwd:
ssh_pwauth: true
runcmd:
  - rm -f /etc/ssh/sshd_config.d/10-veilor-hardening.conf
  - systemctl reload sshd || systemctl restart sshd || true
EOF
  # Build NoCloud ISO. Volume label MUST be "cidata" (case-insensitive)
  # for cloud-init's NoCloud datasource to pick it up.
  if command -v mkisofs >/dev/null 2>&1; then
    mkisofs -quiet -output "$SEED_ISO" \
      -volid cidata -joliet -rock \
      "$SEED_DIR/user-data" "$SEED_DIR/meta-data"
  elif command -v xorriso >/dev/null 2>&1; then
    xorriso -as mkisofs -quiet -output "$SEED_ISO" \
      -volid cidata -joliet -rock \
      "$SEED_DIR/user-data" "$SEED_DIR/meta-data"
  elif command -v cloud-localds >/dev/null 2>&1; then
    cloud-localds "$SEED_ISO" "$SEED_DIR/user-data" "$SEED_DIR/meta-data"
  else
    echo "[WARN] no mkisofs/xorriso/cloud-localds — skipping cloud-init seed"
    SEED_ISO=""
  fi
  if [[ -n $SEED_ISO && -f $SEED_ISO ]]; then
    echo "[INFO] cloud-init seed ISO: $SEED_ISO"
    SEED_ARGS=(-drive "file=$SEED_ISO,media=cdrom,readonly=on")
  fi
fi
# ── QEMU monitor unix socket ──
# Always exposed so the host can drive the VM via `socat - UNIX-CONNECT:...`
# (sendkey, screendump, etc.) for debugging. Independent of pubkey injection.
rm -f "$MONITOR_SOCK"
MONITOR_ARGS=(-monitor "unix:$MONITOR_SOCK,server,nowait")
# ── Auto-inject helper (live ISO doesn't run cloud-init) ──
# Started in the background after a delay; sends keypresses through the
# QEMU monitor unix socket to drop to a TTY and unblock SSH for liveuser.
if [[ -n $HOST_PUBKEY ]]; then
  (
    # Wait for the VM to reach a usable login prompt (SDDM autologin →
    # liveuser session is the most realistic target). 90s is enough on
    # KVM/4 vCPUs; tune via VM_BOOT_DELAY if needed.
    sleep "${VM_BOOT_DELAY:-90}"
    [[ -S $MONITOR_SOCK ]] || exit 0
    # send_chord <key1> [key2 ...] — chord released between calls
    send_chord() {
      local IFS='-'
      local chord="$*"
      printf 'sendkey %s\n' "$chord"
    }
    # send_str <text> — only ASCII printable + space + return
    send_str() {
      local s="$1" ch
      local i=0
      while (( i < ${#s} )); do
        ch="${s:i:1}"
        case "$ch" in
          ' ') printf 'sendkey spc\n' ;;
          [a-z0-9]) printf 'sendkey %s\n' "$ch" ;;
          [A-Z]) printf 'sendkey shift-%s\n' "${ch,,}" ;;
          '-') printf 'sendkey minus\n' ;;
          '_') printf 'sendkey shift-minus\n' ;;
          '/') printf 'sendkey slash\n' ;;
          '.') printf 'sendkey dot\n' ;;
          '&') printf 'sendkey shift-7\n' ;;
        esac
        i=$((i+1))
      done
    }
    {
      send_chord ctrl alt f2
      sleep 1
      # Type: liveuser <enter> (no password by default on live)
      send_str "liveuser"
      printf 'sendkey ret\n'
      sleep 2
      send_str "sudo passwd -d liveuser"
      printf 'sendkey ret\n'
      sleep 1
      send_str "sudo systemctl reload sshd"
      printf 'sendkey ret\n'
    } | socat - "UNIX-CONNECT:$MONITOR_SOCK" 2>/dev/null || true
  ) &
  INJECT_PID=$!
  trap 'kill $INJECT_PID 2>/dev/null || true; rm -f "$MONITOR_SOCK"; rm -rf "${SEED_DIR:-}"' EXIT
fi
echo "════════════════════════════════════════════════════════"
echo " veilor-os :: VM test"
echo " ISO : $ISO"
echo " Disk : $DISK"
echo " NVRAM : $NVRAM"
echo " Seed : ${SEED_ISO:-<none>}"
# Anaconda virtio-serial log channel.
#
# Anaconda 43.x autodetects /dev/virtio-ports/org.fedoraproject.anaconda.log.0
# and streams program/packaging/storage/anaconda logs through it in real
# time, before any tmpfs / pivot, before networking. Survives kernel
# panic. The host gets a tail-able file. No anaconda CLI flag, no
# kickstart change, just the QEMU virtio-serial wiring.
#
# We've lost logs three times in a row to anaconda failures + tmpfs
# reboots. Wiring this up so future failures auto-capture.
ANACONDA_LOG="$TEST_DIR/anaconda-vm-$(date +%Y%m%d-%H%M%S).log"
ANACONDA_LOG_ARGS=(
  -chardev "file,id=anaclog,path=$ANACONDA_LOG"
  -device virtio-serial-pci,id=vs1
  -device "virtserialport,chardev=anaclog,bus=vs1.0,name=org.fedoraproject.anaconda.log.0"
)
echo " AnaLog: $ANACONDA_LOG"
echo " Mode  : $([[ "${SECBOOT:-0}" == "1" ]] && echo secboot || echo "stock UEFI")"
echo " Inject: $([[ -n $HOST_PUBKEY ]] && echo yes || echo "no (no host pubkey)")"
echo "════════════════════════════════════════════════════════"
exec qemu-system-x86_64 \
  -name veilor-os \
  -enable-kvm \
  -cpu host \
  -smp 4 \
  -m 4096 \
  -machine q35,smm=on \
  -global driver=cfi.pflash01,property=secure,value=on \
  -drive if=pflash,format=raw,readonly=on,file="$OVMF_CODE" \
  -drive if=pflash,format=raw,file="$NVRAM" \
  -drive file="$DISK",if=virtio,format=qcow2,cache=writeback \
  -drive file="$ISO",media=cdrom,readonly=on \
  "${SEED_ARGS[@]}" \
  "${MONITOR_ARGS[@]}" \
  "${ANACONDA_LOG_ARGS[@]}" \
  -boot menu=on,splash-time=2000 \
  -netdev user,id=net0,hostfwd=tcp::2222-:22 \
  -device virtio-net-pci,netdev=net0 \
  -device virtio-rng-pci \
  -vga virtio \
  -display gtk,gl=on \
  -audiodev pa,id=snd0 \
  -device intel-hda \
  -device hda-output,audiodev=snd0