"ADR 018: Populi GPU truth layering"

Status

Accepted (design intent). Defines how GPU-related fields on nodes and workers should be interpreted once a hardware-truth layer ships. Until then, mens continues to rely primarily on operator-set advertisement flags (for example VOX_MESH_ADVERTISE_GPU) as documented in mens SSOT and unified orchestration.

Context

Scheduling and routing need trustworthy signals: today, many GPU/NPU hints are declared by the operator or process environment, not verified as allocatable, healthy inventory. A GPU-mesh roadmap without a clear separation between facts, capacity, and policy invites silent mismatch (a node “advertises” CUDA while no device is usable).

Decision

Layer A — Verified hardware facts (probe-backed): driver-visible devices, stable device ids where available, health signals derived from probes (or trusted agents), and observed memory / compute attributes. This layer is best-effort per platform but is the preferred source of truth when present.
Layer B — Allocatable capacity: what the node offers to remote or local schedulers after reservations, MIG/partitioning, thermal throttling, or local workloads. May differ from raw Layer A totals.
Layer C — Operator policy labels: non-authoritative tags for affinity, pools, regions, compliance classes, and cost tiers. Schedulers must not treat these as hardware guarantees.
Precedence: for correctness-critical placement (for example authoritative lease acquisition for GPU tasks), Layer A/B outrank Layer C when in conflict. Layer C may restrict or prefer candidates but must not invent capacity.
Additive contracts: new optional NodeRecord (and related) fields should encode which layer populated them where ambiguity would otherwise confuse clients. Unknown fields remain ignorable per extension-first rules in mens SSOT.
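The layering and precedence rules above can be illustrated with a minimal sketch. Everything named here (`NodeGpuView`, `Sourced`, `gpu_capacity`, `eligible`, and the field names) is hypothetical and is not the real NodeRecord schema. The sketch also assumes that a node with no Layer A/B data is ineligible for correctness-critical placement, which the ADR does not state explicitly.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Layer(Enum):
    VERIFIED = "A"     # probe-backed hardware facts
    ALLOCATABLE = "B"  # capacity offered after reservations and partitioning
    POLICY = "C"       # operator labels, never a hardware guarantee


@dataclass(frozen=True)
class Sourced:
    """A value tagged with the layer that populated it (hypothetical shape)."""
    value: int
    layer: Layer


@dataclass
class NodeGpuView:
    """Illustrative stand-in, not the real NodeRecord schema."""
    verified_devices: Optional[int] = None     # Layer A; None if no probe data
    allocatable_devices: Optional[int] = None  # Layer B; None if unknown
    policy_labels: Dict[str, str] = field(default_factory=dict)  # Layer C


def gpu_capacity(node: NodeGpuView) -> Optional[Sourced]:
    """Return capacity from Layer B, else Layer A; never from Layer C."""
    if node.allocatable_devices is not None:
        return Sourced(node.allocatable_devices, Layer.ALLOCATABLE)
    if node.verified_devices is not None:
        return Sourced(node.verified_devices, Layer.VERIFIED)
    return None


def eligible(node: NodeGpuView, needed: int, required: Dict[str, str]) -> bool:
    """Correctness-critical check for a GPU task needing `needed` devices."""
    # Layer C may restrict candidates (every required label must match)...
    if any(node.policy_labels.get(k) != v for k, v in required.items()):
        return False
    # ...but it cannot invent capacity: only Layer A/B can satisfy the count.
    capacity = gpu_capacity(node)
    return capacity is not None and capacity.value >= needed


# A node that only carries a "gpu" pool label offers no capacity.
labelled_only = NodeGpuView(policy_labels={"pool": "gpu"})
# A probed node with 4 devices, 2 of which are currently allocatable.
probed = NodeGpuView(verified_devices=4, allocatable_devices=2,
                     policy_labels={"pool": "gpu"})
print(eligible(labelled_only, 1, {"pool": "gpu"}))  # False
print(eligible(probed, 2, {"pool": "gpu"}))         # True
print(eligible(probed, 3, {"pool": "gpu"}))         # False
```

Tagging each value with its `Layer` is one possible way to satisfy the "encode which layer populated them" requirement; a real contract might instead use separate optional fields per layer.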

Consequences

Documentation and OpenAPI evolve to distinguish verified vs advertised GPU fields without breaking existing clients.
Routing and federation hints consume health + capacity from Layer A/B when available, falling back to legacy advertisement only when necessary.
Telemetry should eventually attribute placement decisions to which layer supplied the decisive signal (see placement observability).
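As a rough sketch of the fallback and attribution consequences above, a routing hint could carry the name of the source that decided it. The function names, parameters, and the `decided_by` strings are assumptions made for illustration; the ADR does not define a telemetry schema.

```python
from typing import NamedTuple, Optional


class GpuHint(NamedTuple):
    gpu_available: bool
    decided_by: str  # "layer_b", "layer_a", or "legacy_advertisement"


def routing_gpu_hint(
    allocatable_devices: Optional[int],
    healthy_verified_devices: Optional[int],
    legacy_advertised: bool,
) -> GpuHint:
    """Pick the GPU hint for routing and record which source decided it."""
    if allocatable_devices is not None:
        return GpuHint(allocatable_devices > 0, "layer_b")
    if healthy_verified_devices is not None:
        return GpuHint(healthy_verified_devices > 0, "layer_a")
    # No Layer A/B data: fall back to the operator-set advertisement flag.
    return GpuHint(legacy_advertised, "legacy_advertisement")


def placement_event(node_id: str, hint: GpuHint) -> dict:
    """Build a telemetry record attributing the decision to its source."""
    return {
        "node": node_id,
        "gpu_available": hint.gpu_available,
        "decided_by": hint.decided_by,
    }


# Hypothetical node id; the node has no probe data, so the legacy flag decides.
event = placement_event("node-1", routing_gpu_hint(None, None, True))
print(event["decided_by"])  # legacy_advertisement
```

Recording `decided_by` on every decision makes it possible to measure how often placement still depends on legacy advertisement once probe-backed data starts to appear.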

Related documentation

ADR 017: lease-based remote execution — ownership model that should consume truthful capacity signals.
Work-type placement policy matrix.
Populi GPU truth probe specification (NVML Layer A) — shipped probe wiring and build features.
Populi GPU network research 2026 — evidence and gaps (research).
