# Populi node lifecycle, drain, and GPU hotplug
This document captures the lifecycle model implied by today’s control plane and the gaps for automatic add/remove of GPUs and workers. It aligns with ADR 017 (execution ownership) and ADR 018 (GPU truth).
## Current building blocks (shipped)
| Mechanism | Role |
|---|---|
| `NodeRecord.maintenance` | Operator hint: drain-oriented “no new work” on the node record (interpreted by policy / gates). |
| `NodeRecord.quarantined` | Server-side gate: rejects new A2A claims for that worker when set via the admin API. |
| `join` / `heartbeat` / `leave` | Membership freshness; heartbeat merges JSON fields into the registry. |
| Exec lease grant / renew | `require_claimer_worker_gate`: unknown node, quarantined, or maintenance → 403 (no new leases / no renew while draining). |
| Exec lease release | Holder must match the lease row and the node must still be registered; release is allowed under maintenance/quarantine so holders can clear `scope_key` during drain (see `crates/vox-populi/src/transport/handlers.rs`). |
| A2A inbox claim | Same maintenance/quarantine gates as the experimental routing path expects. |
| Stale filters | Client-side `filter_registry_by_max_stale_ms` on list responses; server-side prune knobs exist for operational tuning. |
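As a reading aid, the grant/renew gate in the table can be sketched as follows. `require_claimer_worker_gate` and the `maintenance` / `quarantined` fields are named in the source; the struct shape, registry type, and error enum here are illustrative assumptions, not the shipped code.

```rust
// Minimal sketch of the claimer-worker gate: unknown node, quarantined, or
// maintenance each map to a 403 on the wire. Struct and enum shapes are
// illustrative; only the field names and outcomes come from the document.
struct NodeRecord {
    maintenance: bool,
    quarantined: bool,
}

#[derive(Debug)]
enum GateError {
    UnknownNode, // node missing from the registry -> 403
    Quarantined, // hard stop for claim paths -> 403
    Maintenance, // draining: no new leases and no renew -> 403
}

fn require_claimer_worker_gate(
    registry: &std::collections::HashMap<String, NodeRecord>,
    node_id: &str,
) -> Result<(), GateError> {
    let rec = registry.get(node_id).ok_or(GateError::UnknownNode)?;
    if rec.quarantined {
        return Err(GateError::Quarantined);
    }
    if rec.maintenance {
        return Err(GateError::Maintenance);
    }
    Ok(())
}
```

Quarantine is checked before maintenance here; the source does not specify an ordering, only that either condition yields 403.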
## Target behavior (personal cluster / lab)
- Voluntary subtract (GPU or node)
  - Operator sets `maintenance=true` on the node (or uses a future CLI) before retiring it.
  - In-flight tasks: exec lease renew stops once maintenance is set (403); the holder should release to free the scope or let the lease expire. No new exec grants for that node while maintenance is on.
  - `leave` or a stopped heartbeat removes the node from the fresh view after the stale threshold.
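The holder's side of the drain sequence can be sketched as a small decision function. The outcome names and the retry policy are assumptions for illustration; the source only states that renew returns 403 under maintenance and that the holder should release or let the lease expire.

```rust
// Illustrative holder-side handling of a lease renew attempt during drain.
enum RenewOutcome {
    Renewed,        // server accepted the renew
    Forbidden,      // 403: node is under maintenance or quarantined
    TransportError, // network failure, no verdict from the server
}

fn on_renew_result(outcome: RenewOutcome, remaining_ms: u64) -> &'static str {
    match outcome {
        RenewOutcome::Renewed => "keep-running",
        // 403 while draining: stop renewing and release to clear the scope.
        RenewOutcome::Forbidden => "release-lease",
        // Transient error with time left on the lease: try again later.
        RenewOutcome::TransportError if remaining_ms > 0 => "retry-later",
        // Lease already ran out during the outage: treat it as expired.
        RenewOutcome::TransportError => "treat-as-expired",
    }
}
```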
- Involuntary subtract (crash, cable pull)
  - Heartbeat stops → node becomes stale in listings.
  - Orchestrator: lease renewal fails → local fallback and cancel relay (existing poller path).
  - Documented race: a remote worker may still run briefly after a partition; acceptable for the experimental tier, while fail-closed profiles need ADR 017 promotion.
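The staleness view that drives both subtract paths can be sketched as a client-side filter in the spirit of `filter_registry_by_max_stale_ms` (named in the table above); the `(node_id, last_heartbeat)` tuple shape is an illustrative assumption.

```rust
// Client-side staleness sketch: nodes whose last heartbeat is older than
// max_stale_ms drop out of the fresh view. saturating_sub tolerates minor
// clock skew (a heartbeat timestamp slightly ahead of "now").
fn filter_fresh<'a>(
    nodes: &'a [(String, u64)], // (node_id, last_heartbeat_unix_ms)
    now_unix_ms: u64,
    max_stale_ms: u64,
) -> Vec<&'a str> {
    nodes
        .iter()
        .filter(|(_, hb)| now_unix_ms.saturating_sub(*hb) <= max_stale_ms)
        .map(|(id, _)| id.as_str())
        .collect()
}
```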
- GPU hot-add / hot-remove
  - With the NVML probe enabled, rebuilding `NodeRecord` on heartbeat refreshes `gpu_*_count` and VRAM hints.
  - Schedulers must treat a drop in `gpu_allocatable_count` or healthy count as a signal to stop routing new GPU tasks to that node (future unified scheduler).
  - No automatic “rebalance running tasks” in v1; only new placement picks up new capacity.
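The hotplug signal above amounts to comparing successive heartbeat snapshots. A minimal sketch, assuming a `GpuHints` struct that carries the `gpu_*_count` fields mentioned above (the struct and function names are illustrative, not shipped code):

```rust
// Capacity-drop detection sketch for a future unified scheduler: each
// heartbeat replaces the stored hints and flags whether allocatable GPU
// count went down, which should stop new GPU placement on this node.
// Running tasks are untouched (no rebalance in v1).
#[derive(Clone, Copy)]
struct GpuHints {
    gpu_total_count: u32,
    gpu_allocatable_count: u32,
}

fn refresh_gpu_hints(prev: GpuHints, probed: GpuHints) -> (GpuHints, bool) {
    let capacity_dropped = probed.gpu_allocatable_count < prev.gpu_allocatable_count;
    (probed, capacity_dropped)
}
```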
- Drain vs quarantine
  - Maintenance: cooperative drain; the node stays visible; good-faith workers finish or cancel.
  - Quarantine: hard stop for claim paths; use when a node is untrusted or broken.
## Gaps (explicit backlog)
- CLI: Operator `vox populi admin maintenance|quarantine|exec-lease-revoke` is shipped (feature `populi`; `--control-url` / mesh control env; bearer via `PopuliHttpClient::with_env_token()` / Clavis mesh secrets). Timed drain uses optional `--until-unix-ms` / `--for-minutes` (maps to `maintenance_until_unix_ms` / `maintenance_for_ms` on `POST /v1/populi/admin/maintenance`). Policy- or placement-driven unattended lease cleanup (rebalance, gang jobs) remains future work; operators can `exec-lease-revoke` by id, or use the MCP opt-in below.
- Optional MCP reconciliation (`VOX_ORCHESTRATOR_MESH_EXEC_LEASE_RECONCILE`): after each node poll, `GET /v1/populi/exec/leases` plus a holder-vs-registry check; traces plus optional Codex `mesh_exec_lease_reconcile`. Opt-in `VOX_ORCHESTRATOR_MESH_EXEC_LEASE_AUTO_REVOKE` calls admin exec-lease revoke on each bad-holder row (aggressive; mesh/admin bearer). Covered by `vox-mcp` tests `populi_mcp_http_join_startup` (auto-revoke + reconcile-only negative case).
- Topology-aware gang scheduling and NCCL-style jobs (out of scope for the default WAN row in the placement matrix); granular tasks `p5-gang-nccl-pilot` / `p5-queued-capacity-rebalance` / `p5-placement-policy` in the GPU mesh implementation plan 2026.
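The holder-vs-registry check in the MCP reconciliation bullet can be sketched as a pure filter over the lease list: a lease whose holder is absent from the fresh registry view is a bad-holder row, and with auto-revoke enabled each such row would be revoked via the admin endpoint. The tuple and set types here are illustrative assumptions.

```rust
// Reconciliation sketch: return the lease ids whose holder node is no
// longer in the fresh registry view. With auto-revoke enabled, each
// returned id would get an admin exec-lease revoke call (aggressive).
fn bad_holder_leases<'a>(
    leases: &'a [(String, String)], // (lease_id, holder_node_id)
    fresh_nodes: &std::collections::HashSet<String>,
) -> Vec<&'a str> {
    leases
        .iter()
        .filter(|(_, holder)| !fresh_nodes.contains(holder))
        .map(|(id, _)| id.as_str())
        .collect()
}
```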