veilor-os/test/run-vm.sh

246 lines
9.3 KiB
Bash
Raw Normal View History

#!/usr/bin/env bash
# Boot veilor-os ISO in KVM/QEMU under UEFI.
# Usage:
# ./test/run-vm.sh # boots latest ISO from build/out
# ./test/run-vm.sh path/to.iso # specific ISO
# SECBOOT=1 ./test/run-vm.sh # use OVMF Secure Boot firmware
# FRESH=1 ./test/run-vm.sh # wipe disk + nvram, re-install from scratch
# NO_INJECT=1 ./test/run-vm.sh # skip SSH-key auto-injection
#
# SSH-key auto-injection (chosen approach: dual — cloud-init NoCloud + QEMU
# monitor sendkey fallback)
# ------------------------------------------------------------------
# Goal: previously each test required logging in at the QEMU console and
# running `passwd -d liveuser`, editing sshd_config, etc. before
# `ssh -p 2222 liveuser@localhost` worked. This script eliminates that.
#
# Primary path (works for the *installed* system, not the live image):
# * Detect host pubkey at ~/.ssh/id_ed25519.pub or ~/.ssh/id_rsa.pub
# * Build a NoCloud cloud-init ISO (user-data + meta-data) via mkisofs/xorriso
# * Mount it as a second virtual cdrom — Anaconda/cloud-init picks it up
# automatically when installing because the seed has the magic
# `cidata` volume label.
#
# Fallback path (works for the *live* image, which doesn't run cloud-init by
# default — dracut-live + livesys-scripts mount squashfs read-only and skip
# cloud-init.target):
# * Open a QEMU monitor unix socket (-monitor unix:...).
# * After ~90s (long enough for SDDM autologin → liveuser), background a
# helper that pipes a sequence of `sendkey` events to the monitor:
# Ctrl+Alt+F2 (drop to TTY)
# "sudo passwd -d liveuser && sudo systemctl reload sshd\n"
# This unblocks SSH on port 2222 without manual interaction.
#
# Both paths are best-effort; if the host has no pubkey, both are skipped
# and the script behaves exactly as before.
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
TEST_DIR="$REPO_ROOT/test"
DISK="$TEST_DIR/veilor-vm.qcow2"
NVRAM="$TEST_DIR/veilor-vm.nvram"
SEED_ISO="$TEST_DIR/cloud-init-seed.iso"
MONITOR_SOCK="$TEST_DIR/veilor-vm.monitor.sock"
ISO="${1:-$(ls -t "$REPO_ROOT"/build/out/*.iso 2>/dev/null | head -1)}"
[[ -n ${ISO:-} && -f $ISO ]] || { echo "[ERR] No ISO found. Build first: ./build/build-iso.sh"; exit 1; }
# OVMF firmware selection
if [[ "${SECBOOT:-0}" == "1" ]]; then
OVMF_CODE=/usr/share/edk2/ovmf/OVMF_CODE.secboot.fd
OVMF_VARS_SRC=/usr/share/edk2/ovmf/OVMF_VARS.secboot.fd
NVRAM="$TEST_DIR/veilor-vm.nvram.secboot"
else
OVMF_CODE=/usr/share/edk2/ovmf/OVMF_CODE.fd
OVMF_VARS_SRC=/usr/share/edk2/ovmf/OVMF_VARS.fd
fi
# Reset on FRESH=1
if [[ "${FRESH:-0}" == "1" ]]; then
rm -f "$DISK" "$NVRAM" "$SEED_ISO"
fi
# Provision disk + per-VM nvram once
[[ -f $DISK ]] || qemu-img create -f qcow2 "$DISK" 40G
[[ -f $NVRAM ]] || cp "$OVMF_VARS_SRC" "$NVRAM"
# ── Locate host SSH pubkey (ed25519 preferred, rsa fallback) ──
HOST_PUBKEY=""
if [[ "${NO_INJECT:-0}" != "1" ]]; then
for cand in "$HOME/.ssh/id_ed25519.pub" "$HOME/.ssh/id_rsa.pub"; do
if [[ -f $cand ]]; then
HOST_PUBKEY="$(< "$cand")"
echo "[INFO] using host pubkey: $cand"
break
fi
done
fi
# ── Build cloud-init NoCloud seed ISO (primary path) ──
SEED_ARGS=()
if [[ -n $HOST_PUBKEY ]]; then
SEED_DIR="$(mktemp -d)"
trap 'rm -rf "$SEED_DIR"' EXIT
cat > "$SEED_DIR/meta-data" <<EOF
instance-id: veilor-test-vm
local-hostname: veilor-test
EOF
cat > "$SEED_DIR/user-data" <<EOF
#cloud-config
users:
- name: liveuser
ssh_authorized_keys:
- $HOST_PUBKEY
- name: admin
ssh_authorized_keys:
- $HOST_PUBKEY
lock_passwd: false
passwd:
ssh_pwauth: true
runcmd:
- rm -f /etc/ssh/sshd_config.d/10-veilor-hardening.conf
- systemctl reload sshd || systemctl restart sshd || true
EOF
# Build NoCloud ISO. Volume label MUST be "cidata" (case-insensitive)
# for cloud-init's NoCloud datasource to pick it up.
if command -v mkisofs >/dev/null 2>&1; then
mkisofs -quiet -output "$SEED_ISO" \
-volid cidata -joliet -rock \
"$SEED_DIR/user-data" "$SEED_DIR/meta-data"
elif command -v xorriso >/dev/null 2>&1; then
xorriso -as mkisofs -quiet -output "$SEED_ISO" \
-volid cidata -joliet -rock \
"$SEED_DIR/user-data" "$SEED_DIR/meta-data"
elif command -v cloud-localds >/dev/null 2>&1; then
cloud-localds "$SEED_ISO" "$SEED_DIR/user-data" "$SEED_DIR/meta-data"
else
echo "[WARN] no mkisofs/xorriso/cloud-localds — skipping cloud-init seed"
SEED_ISO=""
fi
if [[ -n $SEED_ISO && -f $SEED_ISO ]]; then
echo "[INFO] cloud-init seed ISO: $SEED_ISO"
SEED_ARGS=(-drive "file=$SEED_ISO,media=cdrom,readonly=on")
fi
fi
v0.5.27: rd.luks.uuid via grubby, GRUB rebrand, fbcon=nodefer, ASCII gum cursor Critical install bug fix + cosmetic round-up + first formal test procedure document. ## Critical: LUKS unlock on first boot Generated installer kickstart's %post was injecting `rd.luks.uuid=…` into `/etc/default/grub` only. Fedora 43 uses BLS (Boot Loader Specification) entries in `/boot/loader/entries/*.conf`; those are NOT regenerated by `grub2-mkconfig`. Result: the kernel boots without `rd.luks.uuid=`, dracut's cryptsetup-generator never spawns the unlock unit, plymouth has no password to ask for, and dracut-initqueue loops on dev-disk-by-uuid for ~3min before dropping to emergency shell. The fix layers both write paths: - `/etc/default/grub` — keeps the args around for future kernels (kernel-install reads this when adding new entries). - `grubby --update-kernel=ALL --args=...` — rewrites the `options` line of every existing BLS entry so the kernel that boots NEXT actually has the args. Verified by reading `/proc/cmdline` from the dracut emergency shell on a v0.5.26 install; old cmdline had only `root=UUID=… ro rootflags=subvol=root` and was missing the LUKS arg entirely. ## GRUB / branding - `/etc/default/grub` is sed'd to `GRUB_DISTRIBUTOR="veilor-os"` (was already there, kept). - BLS entries' `title` line is rewritten in-place to "veilor-os (<kver>)" for every kernel — `grub2-mkconfig` does not touch BLS titles, so this is the only path. - `/boot/loader/entries/*-0-rescue-*.conf` is removed: the auto-built rescue entry was leaking "Fedora Linux" into the GRUB menu and showing a second boot option that nobody asked for. The rescue kernel image itself is left in /boot. - Hostname defaults to `veilor` (was inheriting the `localhost-live` name anaconda writes when the kickstart's network directive is ignored under cmdline mode). - `/etc/machine-info` adds `PRETTY_HOSTNAME="veilor-os"` so `hostnamectl status` and any consumer reading machine-info see the brand. ## Boot UX - `fbcon=nodefer` added to live-ISO bootloader cmdline. On real laptops with a hardware GPU, the kernel modeset blanks the framebuffer console mid-boot; without `nodefer` the installer banner draws into a frozen framebuffer and the user sees a black screen with a blinking cursor for ~30s. virtio-vga in QEMU doesn't trigger this so it never reproduced in VM. Symptom report on v0.5.26 was the trigger to investigate. ## Installer cosmetics - `GUM_CHOOSE_CURSOR` and `GUM_INPUT_PROMPT` switched from `❯ ` to `> `. The unicode arrow falls back to a fixed-width block on the linux fbcon font and lipgloss then duplicates that block at col +23, producing the "Install Install" double-render and the stray-T artifact in password fields. Plain ASCII renders identically across fbcon, virtio-vga, and X/Wayland gum runs. - `VERSION_ID` bumped 0.5.8 → 0.5.27 in the os-release drop-in. The installer banner reads this at runtime, so the live ISO + installed system both now show "veilor-os 0.5.27". ## Test procedure - `test/TESTING.md` — first canonical test procedure document. Splits VM (cheap iteration, hybrid sendkey + human passwords) from real hardware (mandatory for tag). Documents the standard test passwords (`veilortest1` for both LUKS and admin), the kill-and-relaunch step to skip CD on second boot, and the per-step pass/fail contract. - `test/METHOD-CHANGELOG.md` — append-only audit trail for changes to the procedure. Future releases that alter the test method must add an entry here with the why. - `test/test-runs/_TEMPLATE.md` — per-run report template. Each tagged release should land a filled report alongside it. ## test/run-vm.sh Decoupled QEMU monitor sock setup from auto-inject. Previously `NO_INJECT=1` (used to suppress autotype noise into prompts) also killed the monitor sock, leaving the VM undriveable. Monitor sock is now always exposed; only the inject helper is gated on the pubkey detection.
2026-05-05 01:43:00 +01:00
# ── QEMU monitor unix socket ──
# Always exposed so the host can drive the VM via `socat - UNIX-CONNECT:...`
# (sendkey, screendump, etc.) for debugging. Independent of pubkey injection.
rm -f "$MONITOR_SOCK"
MONITOR_ARGS=(-monitor "unix:$MONITOR_SOCK,server,nowait")
# ── Auto-inject helper (live ISO doesn't run cloud-init) ──
# Started in the background after a delay; sends keypresses through the
# QEMU monitor unix socket to drop to a TTY and unblock SSH for liveuser.
if [[ -n $HOST_PUBKEY ]]; then
(
# Wait for the VM to reach a usable login prompt (SDDM autologin →
# liveuser session is the most realistic target). 90s is enough on
# KVM/4 vCPUs; tune via VM_BOOT_DELAY if needed.
sleep "${VM_BOOT_DELAY:-90}"
[[ -S $MONITOR_SOCK ]] || exit 0
# send_chord <key1> [key2 ...] — chord released between calls
send_chord() {
local IFS='-'
local chord="$*"
printf 'sendkey %s\n' "$chord"
}
# send_str <text> — only ASCII printable + space + return
send_str() {
local s="$1" ch
local i=0
while (( i < ${#s} )); do
ch="${s:i:1}"
case "$ch" in
' ') printf 'sendkey spc\n' ;;
[a-z0-9]) printf 'sendkey %s\n' "$ch" ;;
[A-Z]) printf 'sendkey shift-%s\n' "${ch,,}" ;;
'-') printf 'sendkey minus\n' ;;
'_') printf 'sendkey shift-minus\n' ;;
'/') printf 'sendkey slash\n' ;;
'.') printf 'sendkey dot\n' ;;
'&') printf 'sendkey shift-7\n' ;;
esac
i=$((i+1))
done
}
{
send_chord ctrl alt f2
sleep 1
# Type: liveuser <enter> (no password by default on live)
send_str "liveuser"
printf 'sendkey ret\n'
sleep 2
send_str "sudo passwd -d liveuser"
printf 'sendkey ret\n'
sleep 1
send_str "sudo systemctl reload sshd"
printf 'sendkey ret\n'
} | socat - "UNIX-CONNECT:$MONITOR_SOCK" 2>/dev/null || true
) &
INJECT_PID=$!
trap 'kill $INJECT_PID 2>/dev/null || true; rm -f "$MONITOR_SOCK"; rm -rf "${SEED_DIR:-}"' EXIT
fi
echo "════════════════════════════════════════════════════════"
echo " veilor-os :: VM test"
echo " ISO : $ISO"
echo " Disk : $DISK"
echo " NVRAM : $NVRAM"
echo " Seed : ${SEED_ISO:-<none>}"
v0.5.30: broad error suppression + manual bootloader + virtio log capture Three-layer fix for the persistent anaconda transaction failure that killed v0.5.28 (gen_grub_cfgstub) and v0.5.29 (aggregate dnf5 error). ## Layer 1: broad error suppression in transaction_progress.py dnf5 under RPM 6.0 + cmdline anaconda emits a final aggregate `error("transaction process has ended with errors..")` at end of transaction whenever its internal failure counter > 0, regardless of whether we suppressed individual script_error events. Reproduced twice. The narrow patch in v0.5.29 suppressed per-package errors but the aggregate still raised PayloadInstallationError and aborted the install before the bootloader phase ran. v0.5.30 patch turns the `elif token == 'error':` branch in process_transaction_progress into a log.warning. All four producers (cpio_error, script_error, unpack_error, generic error) now flow through to a warning + continue. Pattern matches both the original anaconda layout AND the v0.5.29 narrow-patched layout, so re-applying on top of either is a no-op. This brings us back to v0.5.28 broad-suppression behaviour. The side effect that bit us in v0.5.28 (silent grub2-efi-x64 scriptlet failure → empty /boot/efi/EFI/fedora/ → gen_grub_cfgstub fails) is addressed by Layer 2 below. ## Layer 2: bootloader install moved out of anaconda The generated install kickstart now has `bootloader --location=none`, which tells anaconda NOT to invoke its own bootloader install code path (and therefore NOT to call gen_grub_cfgstub). All grub work moves into the chroot %post block: 1. `dnf reinstall grub2-efi-x64 grub2-pc grub2-tools shim-x64 efibootmgr` — re-runs scriptlets in the chroot with full PID 1 systemd state, so the systemd-run-style triggers that anaconda's chroot truncates actually execute. 2. `grub2-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=fedora --no-nvram` — populates /boot/efi/EFI/fedora/ 3. `gen_grub_cfgstub /boot/grub2 /boot/efi/EFI/fedora` (or `grub2-mkconfig` fallback) — writes /boot/efi/EFI/fedora/grub.cfg. 4. `efibootmgr -c -d <disk> -p <part> -L "veilor-os" -l \EFI\fedora\shimx64.efi` — registers the NVRAM boot entry pointing at the signed shim. Each step logs to stdout and continues on failure (`set +e` block); diagnostics surface in the install log without aborting the whole %post. ## Layer 3: virtio-serial log capture in run-vm.sh Anaconda 43.x autodetects `/dev/virtio-ports/org.fedoraproject.anaconda.log.0` and streams program/packaging/storage/anaconda logs through it in real time, before any tmpfs / pivot, before networking, surviving kernel panic. Wiring it into run-vm.sh means the host gets a tail-able log file at `test/anaconda-vm-YYYYMMDD-HHMMSS.log` for every VM run. We've lost logs three times in a row to anaconda failures + tmpfs reboots. This breaks the loop. ## Diagnostic story Before this commit: VM aborts → live ISO reboots itself → /tmp/ tmpfs gone → no logs → guess what failed. Three days, two and a half false fixes. After this commit: VM aborts → host has /home/admin/ai-lab/_github/veilor-os/test/anaconda-vm-*.log with the actual scriptlet output, the actual exit codes, the actual file-trigger failures. Future debug becomes evidence-based. Files changed: kickstart/veilor-os.ks — broad error suppression patch overlay/usr/local/bin/veilor-installer — --location=none + manual grub test/run-vm.sh — virtio-serial chardev wiring Verified: bash -n clean, ksvalidator clean.
2026-05-05 11:59:35 +01:00
# Anaconda virtio-serial log channel.
#
# Anaconda 43.x autodetects /dev/virtio-ports/org.fedoraproject.anaconda.log.0
# and streams program/packaging/storage/anaconda logs through it in real
# time, before any tmpfs / pivot, before networking. Survives kernel
# panic. The host gets a tail-able file. No anaconda CLI flag, no
# kickstart change, just the QEMU virtio-serial wiring.
#
# We've lost logs three times in a row to anaconda failures + tmpfs
# reboots. Wiring this up so future failures auto-capture.
ANACONDA_LOG="$TEST_DIR/anaconda-vm-$(date +%Y%m%d-%H%M%S).log"
ANACONDA_LOG_ARGS=(
-chardev "file,id=anaclog,path=$ANACONDA_LOG"
-device virtio-serial-pci,id=vs1
-device "virtserialport,chardev=anaclog,bus=vs1.0,name=org.fedoraproject.anaconda.log.0"
)
echo " AnaLog: $ANACONDA_LOG"
echo " Mode : ${SECBOOT:+secboot}${SECBOOT:-stock UEFI}"
echo " Inject: ${HOST_PUBKEY:+yes}${HOST_PUBKEY:-no (no host pubkey)}"
echo "════════════════════════════════════════════════════════"
exec qemu-system-x86_64 \
-name veilor-os \
-enable-kvm \
-cpu host \
-smp 4 \
-m 4096 \
-machine q35,smm=on \
-global driver=cfi.pflash01,property=secure,value=on \
-drive if=pflash,format=raw,readonly=on,file="$OVMF_CODE" \
-drive if=pflash,format=raw,file="$NVRAM" \
-drive file="$DISK",if=virtio,format=qcow2,cache=writeback \
-drive file="$ISO",media=cdrom,readonly=on \
"${SEED_ARGS[@]}" \
"${MONITOR_ARGS[@]}" \
v0.5.30: broad error suppression + manual bootloader + virtio log capture Three-layer fix for the persistent anaconda transaction failure that killed v0.5.28 (gen_grub_cfgstub) and v0.5.29 (aggregate dnf5 error). ## Layer 1: broad error suppression in transaction_progress.py dnf5 under RPM 6.0 + cmdline anaconda emits a final aggregate `error("transaction process has ended with errors..")` at end of transaction whenever its internal failure counter > 0, regardless of whether we suppressed individual script_error events. Reproduced twice. The narrow patch in v0.5.29 suppressed per-package errors but the aggregate still raised PayloadInstallationError and aborted the install before the bootloader phase ran. v0.5.30 patch turns the `elif token == 'error':` branch in process_transaction_progress into a log.warning. All four producers (cpio_error, script_error, unpack_error, generic error) now flow through to a warning + continue. Pattern matches both the original anaconda layout AND the v0.5.29 narrow-patched layout, so re-applying on top of either is a no-op. This brings us back to v0.5.28 broad-suppression behaviour. The side effect that bit us in v0.5.28 (silent grub2-efi-x64 scriptlet failure → empty /boot/efi/EFI/fedora/ → gen_grub_cfgstub fails) is addressed by Layer 2 below. ## Layer 2: bootloader install moved out of anaconda The generated install kickstart now has `bootloader --location=none`, which tells anaconda NOT to invoke its own bootloader install code path (and therefore NOT to call gen_grub_cfgstub). All grub work moves into the chroot %post block: 1. `dnf reinstall grub2-efi-x64 grub2-pc grub2-tools shim-x64 efibootmgr` — re-runs scriptlets in the chroot with full PID 1 systemd state, so the systemd-run-style triggers that anaconda's chroot truncates actually execute. 2. `grub2-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=fedora --no-nvram` — populates /boot/efi/EFI/fedora/ 3. `gen_grub_cfgstub /boot/grub2 /boot/efi/EFI/fedora` (or `grub2-mkconfig` fallback) — writes /boot/efi/EFI/fedora/grub.cfg. 4. `efibootmgr -c -d <disk> -p <part> -L "veilor-os" -l \EFI\fedora\shimx64.efi` — registers the NVRAM boot entry pointing at the signed shim. Each step logs to stdout and continues on failure (`set +e` block); diagnostics surface in the install log without aborting the whole %post. ## Layer 3: virtio-serial log capture in run-vm.sh Anaconda 43.x autodetects `/dev/virtio-ports/org.fedoraproject.anaconda.log.0` and streams program/packaging/storage/anaconda logs through it in real time, before any tmpfs / pivot, before networking, surviving kernel panic. Wiring it into run-vm.sh means the host gets a tail-able log file at `test/anaconda-vm-YYYYMMDD-HHMMSS.log` for every VM run. We've lost logs three times in a row to anaconda failures + tmpfs reboots. This breaks the loop. ## Diagnostic story Before this commit: VM aborts → live ISO reboots itself → /tmp/ tmpfs gone → no logs → guess what failed. Three days, two and a half false fixes. After this commit: VM aborts → host has /home/admin/ai-lab/_github/veilor-os/test/anaconda-vm-*.log with the actual scriptlet output, the actual exit codes, the actual file-trigger failures. Future debug becomes evidence-based. Files changed: kickstart/veilor-os.ks — broad error suppression patch overlay/usr/local/bin/veilor-installer — --location=none + manual grub test/run-vm.sh — virtio-serial chardev wiring Verified: bash -n clean, ksvalidator clean.
2026-05-05 11:59:35 +01:00
"${ANACONDA_LOG_ARGS[@]}" \
-boot menu=on,splash-time=2000 \
-netdev user,id=net0,hostfwd=tcp::2222-:22 \
-device virtio-net-pci,netdev=net0 \
-device virtio-rng-pci \
-vga virtio \
-display gtk,gl=on \
-audiodev pa,id=snd0 \
-device intel-hda \
-device hda-output,audiodev=snd0