Agent Diagnostic
Skills Loaded
Per CONTRIBUTING.md, the agent cloned https://github.com/NVIDIA/OpenShell.git and loaded/used the relevant repo skills:
debug-openshell-cluster
openshell-cli
create-github-issue
create-spike
It also used a local project runbook skill from prior investigation:
openshell-gpu-vm
Environment Investigated
Target host: DGX Spark / GB10, aarch64
Observed:
hostname: spark-404e
OS: Ubuntu 24.04.3 LTS
kernel: 6.14.0-1013-nvidia
arch: aarch64
GPU: NVIDIA GB10
OpenShell: 0.0.56
GPU / IOMMU state:
GPU PCI BDF: 000f:01:00.0
IOMMU group: 20
IOMMU group type: DMA
GPU is alone in its IOMMU group
/dev/kvm exists
CPU OpenShell VM sandbox works
What The Agent Found In OpenShell Source
The VM GPU/QEMU path is currently x86-specific in crates/openshell-driver-vm/src/runtime.rs:
let mut qemu_cmd = StdCommand::new("qemu-system-x86_64");
qemu_cmd
.arg("-machine")
.arg("q35,accel=kvm")
The same path attaches the GPU as:
-device vfio-pci,host=<BDF>,bus=gpu_root
The VM driver GPU inventory logic worked correctly on this host:
GPU inventory initialized gpu_count=1
assigned GPU to sandbox bdf=000f:01:00.0 iommu_group=20
So the initial OpenShell-side discovery/assignment is working once the GPU is bound to vfio-pci.
What The Agent Tried
- Stopped the running GPU workloads, which were Docker containers using the GB10.
- Installed OpenShell 0.0.56 arm64 package.
- Verified CPU VM sandbox works.
- Bound 000f:01:00.0 to vfio-pci.
- Started root OpenShell VM GPU gateway.
- Confirmed gateway sees GPU:
  GPU inventory initialized gpu_count=1
- Tried openshell sandbox create --gpu. Initial failure:
  VM kernel not found: /root/.local/share/openshell/vm-runtime/0.0.56/vmlinux
- Staged a temporary QEMU runtime with host kernel:
  /root/.local/share/openshell/qemu-runtime-host-kernel/vmlinux
- Patched cached VM rootfs with matching host modules/user-space pieces, similar to previous x86_64 bring-up:
  - overlay.ko
  - veth.ko
  - NVIDIA kernel modules
  - NVIDIA firmware
  - libcuda.so
  - libnvidia-ml.so
  - nvidia-smi
  - policy/device-node fixes
- Cloned OpenShell source at v0.0.56.
- Patched VM runtime to use ARM QEMU on aarch64:
  qemu-system-aarch64
  -machine virt,accel=kvm,gic-version=3
- Built a custom openshell-driver-vm and configured the gateway to use it.
This got past the hardcoded x86 QEMU issue. QEMU started, but VFIO then failed.
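The local patch amounts to choosing the QEMU binary and machine type from the host architecture instead of hardcoding them. The following is a minimal sketch of that idea, not the actual OpenShell change: `QemuTarget`, `qemu_target_for`, and `base_qemu_args` are hypothetical names introduced for illustration, and only the binary names and `-machine` values are taken from this report.

```rust
/// QEMU binary and `-machine` argument for one host architecture.
/// Hypothetical type for illustration; not part of OpenShell.
#[derive(Debug, PartialEq, Eq)]
pub struct QemuTarget {
    pub binary: &'static str,
    pub machine: &'static str,
}

/// Maps an architecture name (same spelling as `std::env::consts::ARCH`)
/// to a QEMU target. Returns `None` for architectures not covered here.
pub fn qemu_target_for(arch: &str) -> Option<QemuTarget> {
    match arch {
        // Values currently hardcoded in the VM GPU path.
        "x86_64" => Some(QemuTarget {
            binary: "qemu-system-x86_64",
            machine: "q35,accel=kvm",
        }),
        // Values used by the local aarch64 patch described in this report.
        "aarch64" => Some(QemuTarget {
            binary: "qemu-system-aarch64",
            machine: "virt,accel=kvm,gic-version=3",
        }),
        _ => None,
    }
}

/// Builds the leading QEMU command-line arguments for the given architecture.
pub fn base_qemu_args(arch: &str) -> Option<Vec<String>> {
    let target = qemu_target_for(arch)?;
    Some(vec![
        target.binary.to_string(),
        "-machine".to_string(),
        target.machine.to_string(),
    ])
}
```

A caller would pass `std::env::consts::ARCH`. The sketch does not cover the device topology (the `gpu_root` bus used by the `vfio-pci` device), which would likely also need an ARM-compatible equivalent.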
Current Blocker
QEMU/VFIO fails opening the GB10 device:
qemu-system-aarch64: -device vfio-pci,host=000f:01:00.0,bus=gpu_root:
vfio 000f:01:00.0: error getting device from group 20: Invalid argument
Verify all devices in group 20 are bound to vfio-<bus> or pci-stub and not already in use
Kernel log shows the more specific platform/IOMMU reason:
vfio-pci 000f:01:00.0: Firmware has requested this device have a 1:1 IOMMU mapping,
rejecting configuring the device without a 1:1 mapping. Contact your platform vendor.
IOMMU group reserved regions include direct mappings:
/sys/kernel/iommu_groups/20/type: DMA
reserved_regions:
0x0000000008000000 0x00000000080fffff msi
0x00000000a1600000 0x00000000b97fffff direct
0x0000000200000000 0x0000000302ffffff direct
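A pre-flight check could read this file before QEMU is launched and warn when `direct` regions are present. The sketch below parses the `start end type` lines shown above. `ReservedRegion`, `parse_reserved_regions`, and `direct_regions` are hypothetical names, and treating `direct` entries as a sign that VFIO setup will be rejected is a heuristic based on what was observed on this host, not a documented guarantee.

```rust
/// One line of /sys/kernel/iommu_groups/<N>/reserved_regions.
/// Hypothetical type for illustration; not part of OpenShell.
#[derive(Debug, PartialEq, Eq)]
pub struct ReservedRegion {
    pub start: u64,
    pub end: u64,
    pub kind: String,
}

/// Parses a hexadecimal field with an optional `0x` prefix.
fn parse_hex(field: &str) -> Option<u64> {
    let digits = field.strip_prefix("0x").unwrap_or(field);
    u64::from_str_radix(digits, 16).ok()
}

/// Parses the file contents; lines that do not match
/// `<start> <end> <type>` are skipped.
pub fn parse_reserved_regions(contents: &str) -> Vec<ReservedRegion> {
    contents
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            let start = parse_hex(fields.next()?)?;
            let end = parse_hex(fields.next()?)?;
            let kind = fields.next()?.to_string();
            Some(ReservedRegion { start, end, kind })
        })
        .collect()
}

/// Returns the regions whose type is exactly `direct`.
pub fn direct_regions(regions: &[ReservedRegion]) -> Vec<&ReservedRegion> {
    regions.iter().filter(|r| r.kind == "direct").collect()
}
```

On this host the check would report two `direct` regions for group 20, which could be surfaced as a clear "passthrough likely unsupported on this platform" message instead of the generic QEMU error.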
The agent also tried to set the IOMMU group type to identity, but the kernel rejected it:
echo identity > /sys/kernel/iommu_groups/20/type
write error: Operation not permitted
The agent also tried an explicit QEMU iommufd object, but this distro QEMU build does not support it:
qemu-system-aarch64: -object iommufd,id=iommufd0:
Parameter 'qom-type' does not accept value 'iommufd'
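Whether a given QEMU build knows the `iommufd` object type can be probed before the VM is launched. The sketch below assumes that `qemu-system-aarch64 -object help` prints one object type per line; the helper names `lists_object_type` and `qemu_supports_object` are hypothetical. It only reports availability and says nothing about whether an `iommufd` backend would get past the 1:1 mapping requirement.

```rust
use std::io;
use std::process::Command;

/// Returns true if `object_help` (captured stdout of `<qemu> -object help`)
/// contains a line that, once trimmed, equals `type_name`.
pub fn lists_object_type(object_help: &str, type_name: &str) -> bool {
    object_help.lines().any(|line| line.trim() == type_name)
}

/// Runs `<qemu_binary> -object help` and checks for `type_name`.
/// Assumption: the binary prints its object list on stdout.
pub fn qemu_supports_object(qemu_binary: &str, type_name: &str) -> io::Result<bool> {
    let output = Command::new(qemu_binary)
        .args(["-object", "help"])
        .output()?;
    let stdout = String::from_utf8_lossy(&output.stdout);
    Ok(lists_object_type(&stdout, type_name))
}
```

For example, `qemu_supports_object("qemu-system-aarch64", "iommufd")` would be expected to return `Ok(false)` for the distro build used here, given the error above.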
Conclusion
The agent confirmed two separate issues:
- OpenShell VM GPU QEMU path is currently x86-specific and needs an aarch64 path:
  - qemu-system-aarch64
  - virt machine
  - ARM-compatible device topology
- After patching that locally, DGX Spark / GB10 passthrough is still blocked by platform firmware/IOMMU behavior:
  - firmware requires 1:1 IOMMU mapping
  - kernel refuses VFIO setup without that mapping
  - userspace/OpenShell cannot enable the required mapping at runtime
The host was cleaned up after testing:
- GPU restored to nvidia
- OpenShell test gateways stopped
- No GPU VM sandbox left running
Description
OpenShell VM GPU passthrough on DGX Spark / GB10 fails: after patching the VM driver to use ARM QEMU, QEMU reaches VFIO but cannot open the GPU.
The kernel rejects VFIO setup because firmware requires a 1:1 IOMMU mapping for the device. The kernel log reports: "rejecting configuring the device without a 1:1 mapping."
Expected: openshell sandbox create --gpu should boot an aarch64 QEMU MicroVM with the GB10 passed through, or clearly document that Spark/GB10 VFIO passthrough is unsupported.
Reproduction Steps
- Use a DGX Spark / GB10 aarch64 host where the GPU is visible and alone in its IOMMU group:
  - GPU BDF: 000f:01:00.0
  - IOMMU group: 20
  - /dev/kvm exists
- Install OpenShell 0.0.56 and verify a non-GPU VM sandbox works.
- Stop GPU workloads, then bind the GPU to vfio-pci.
- Patch/build openshell-driver-vm so the QEMU path uses:
  - qemu-system-aarch64
  - virt,accel=kvm,gic-version=3
- Start a root OpenShell VM GPU gateway with OPENSHELL_VM_GPU=true.
- Run:
  openshell sandbox create -g spark-vm-gpu --from base --gpu --no-keep --no-tty -- uname -a
- QEMU reaches VFIO but fails opening the device with the 1:1 IOMMU mapping error.
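The step that binds the GPU to vfio-pci is not spelled out in this report. One common way to do it is through the standard PCI sysfs files `driver/unbind`, `driver_override`, and `drivers_probe`. The sketch below only computes the writes rather than performing them. The helper name `vfio_bind_plan` is hypothetical, and it is an assumption that the agent used this exact sequence.

```rust
/// A single sysfs write as (path, value).
pub type SysfsWrite = (String, String);

/// Returns the sysfs writes that would rebind the PCI device `bdf`
/// (for example "000f:01:00.0") to the vfio-pci driver.
/// Assumes the device is currently bound to some driver and that the
/// vfio-pci module is already loaded.
pub fn vfio_bind_plan(bdf: &str) -> Vec<SysfsWrite> {
    let dev = format!("/sys/bus/pci/devices/{}", bdf);
    vec![
        // 1. Detach the device from its current driver.
        (format!("{}/driver/unbind", dev), bdf.to_string()),
        // 2. Tell the PCI core which driver should claim it next.
        (format!("{}/driver_override", dev), "vfio-pci".to_string()),
        // 3. Ask the PCI core to re-probe the device.
        ("/sys/bus/pci/drivers_probe".to_string(), bdf.to_string()),
    ]
}
```

Each pair would be applied as root with `std::fs::write(path, value)`, in order, after the GPU workloads have been stopped.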
Environment
OS: Ubuntu 24.04.3 LTS (noble), aarch64
Kernel: 6.14.0-1013-nvidia
Docker: installed and running on the host
OpenShell: 0.0.56
Hardware: DGX Spark / GB10
GPU: NVIDIA GB10, PCI BDF 000f:01:00.0
IOMMU: GPU is alone in IOMMU group 20; group type is DMA
KVM: /dev/kvm exists
NVIDIA driver: 580.95.05, open kernel module
QEMU: host is aarch64; default OpenShell VM GPU path expected x86 QEMU until locally patched to qemu-system-aarch64
Logs
Agent-First Checklist