"Populi GPU truth probe specification (NVML Layer A)"

This document specifies the probe slice of ADR 018 (Populi GPU truth layering): the Layer A fields on NodeRecord (crates/vox-populi/src/node_registry.rs), which are populated from the driver when NVML is available.

Build / runtime

| Surface | Behavior |
| --- | --- |
| Default builds | No NVML link. vox_repository::probe_nvidia_gpu_inventory_best_effort (crates/vox-repository/src/gpu_inventory.rs) returns None; join/heartbeat behave as before (env advertisement only). |
| vox-repository feature nvml-probe | Links nvml-wrapper. At runtime, Nvml::init() must succeed (NVIDIA driver + NVML present). |
| vox-populi feature nvml-gpu-probe | Enables vox-repository/nvml-probe. |
| vox-cli feature mesh-nvml-probe | Pulls vox-populi with the NVML probe for operators who want inventory on node_record_for_current_process. |
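
The feature gating above could look roughly like the following. This is a minimal sketch, not the code in gpu_inventory.rs: the GpuInventory struct and its fields are hypothetical, and the real return type of probe_nvidia_gpu_inventory_best_effort may differ. The NVML branch only compiles when the nvml-probe feature is enabled and nvml-wrapper is a dependency; otherwise the fallback returns None.

```rust
/// Hypothetical inventory shape for illustration; the real type may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct GpuInventory {
    /// Number of devices NVML enumerated.
    pub device_count: u32,
    /// Smallest total VRAM across devices, in MiB (None if no devices).
    pub min_total_vram_mb: Option<u64>,
}

/// NVML-backed probe, compiled only with the `nvml-probe` feature.
/// Any NVML error (no driver, no library, device query failure) yields None.
#[cfg(feature = "nvml-probe")]
pub fn probe_nvidia_gpu_inventory_best_effort() -> Option<GpuInventory> {
    use nvml_wrapper::Nvml;

    let nvml = Nvml::init().ok()?;
    let device_count = nvml.device_count().ok()?;
    let mut min_total_vram_mb: Option<u64> = None;
    for index in 0..device_count {
        let device = nvml.device_by_index(index).ok()?;
        // memory_info().total is reported in bytes; convert to MiB.
        let total_mb = device.memory_info().ok()?.total / (1024 * 1024);
        min_total_vram_mb = Some(min_total_vram_mb.map_or(total_mb, |m| m.min(total_mb)));
    }
    Some(GpuInventory { device_count, min_total_vram_mb })
}

/// Default build: no NVML link, so the probe always reports nothing.
#[cfg(not(feature = "nvml-probe"))]
pub fn probe_nvidia_gpu_inventory_best_effort() -> Option<GpuInventory> {
    None
}
```

Returning Option rather than Result keeps the call site identical in both builds: callers treat None as "no verified inventory" and fall back to env advertisement.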

Typical build:

cargo build -p vox-cli --features populi,mesh-nvml-probe

Fields populated

When the probe succeeds, node_record_for_current_process (crates/vox-populi/src/lib.rs) sets:

  • gpu_total_count, gpu_healthy_count, gpu_allocatable_count: taken from NVML device enumeration. In v1 the healthy and allocatable counts equal the number of enumerated devices; a later phase refines them with reservations.
  • gpu_inventory_source: "nvml".
  • gpu_truth_layer: "layer_a_verified".
  • capabilities.min_vram_mb: the minimum total VRAM in MiB across devices, set only if config has not already set it.
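
As an illustration of the field rules above, the sketch below applies a successful probe to a record. The field names come from this spec, but the field types, the ProbeResult struct, and the apply_layer_a helper are assumptions made for the example; the actual NodeRecord in node_registry.rs may be shaped differently.

```rust
/// Assumed shape: only the capability this spec mentions.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Capabilities {
    pub min_vram_mb: Option<u64>,
}

/// Assumed subset of NodeRecord; field types are guesses for illustration.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct NodeRecord {
    pub gpu_total_count: Option<u32>,
    pub gpu_healthy_count: Option<u32>,
    pub gpu_allocatable_count: Option<u32>,
    pub gpu_inventory_source: Option<String>,
    pub gpu_truth_layer: Option<String>,
    pub capabilities: Capabilities,
}

/// Hypothetical probe output: device count and smallest total VRAM in MiB.
pub struct ProbeResult {
    pub device_count: u32,
    pub min_total_vram_mb: Option<u64>,
}

/// Hypothetical helper that writes the Layer A fields from a probe result.
pub fn apply_layer_a(record: &mut NodeRecord, probe: &ProbeResult) {
    // v1: healthy and allocatable simply mirror the enumerated count.
    record.gpu_total_count = Some(probe.device_count);
    record.gpu_healthy_count = Some(probe.device_count);
    record.gpu_allocatable_count = Some(probe.device_count);
    record.gpu_inventory_source = Some("nvml".to_string());
    record.gpu_truth_layer = Some("layer_a_verified".to_string());
    // A value already set by config wins over the probed minimum.
    if record.capabilities.min_vram_mb.is_none() {
        record.capabilities.min_vram_mb = probe.min_total_vram_mb;
    }
}
```

With this shape, a record whose config sets min_vram_mb to 8192 keeps that value after the probe runs, while a record with no configured value picks up the probed minimum.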

Heartbeat reconciliation

Operators should send the same NodeRecord shape on join and heartbeat (the existing Populi HTTP contract). Rebuilding the record each tick via node_record_for_current_process (or an equivalent) refreshes Layer A after GPU hotplug, driver restart, or VM attach, subject to NVML visibility.
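
The rebuild-per-tick pattern can be sketched as follows. Everything here is illustrative: build_record stands in for node_record_for_current_process, the probe and send closures stand in for the NVML probe and the HTTP heartbeat call, and the reduced NodeRecord is an assumption. For brevity the sketch leaves the GPU fields unset when the probe fails, whereas the real path falls back to env advertisement.

```rust
/// Assumed, reduced record shape for this example.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NodeRecord {
    pub gpu_total_count: Option<u32>,
    pub gpu_truth_layer: Option<String>,
}

/// Stand-in for node_record_for_current_process: builds a fresh record
/// from whatever the probe reports at this moment.
pub fn build_record(probe: &mut impl FnMut() -> Option<u32>) -> NodeRecord {
    match probe() {
        Some(count) => NodeRecord {
            gpu_total_count: Some(count),
            gpu_truth_layer: Some("layer_a_verified".to_string()),
        },
        // Probe unavailable: no Layer A facts on this tick.
        None => NodeRecord {
            gpu_total_count: None,
            gpu_truth_layer: None,
        },
    }
}

/// Rebuilds and sends the record once per tick, so each heartbeat
/// carries the inventory visible at that tick rather than a cached one.
pub fn run_heartbeats(
    ticks: usize,
    mut probe: impl FnMut() -> Option<u32>,
    mut send: impl FnMut(NodeRecord),
) {
    for _ in 0..ticks {
        send(build_record(&mut probe));
    }
}
```

Because nothing is cached between ticks, a device that appears or disappears shows up in the next heartbeat without any explicit invalidation step.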

Layer B (allocatable after local reservations) and Layer C (labels/policy) remain separate. This spec does not merge operator claims with probe facts; ADR 018 precedence still applies when schedulers consume both.