ADR 017: Populi lease-based authoritative remote execution

Status

Accepted (design intent). This ADR records the intended execution-ownership model for Populi remote work. Until implementation and contract updates land, shipped behavior remains local-first with experimental best-effort relay only (see ADR 008 addendum and mens SSOT).

Context

Populi already provides membership, HTTP control plane operations, and A2A inbox semantics including claimer leases for mesh-delivered rows (mens SSOT). The orchestrator can emit best-effort RemoteTaskEnvelope traffic when experimental flags are set, but local queues still own execution today.

The first-wave personal-cluster roadmap needs a clear upgrade path from relay-style fan-out to authoritative remote ownership so that:

  • at most one worker owns execution of a given leased task class at a time,
  • long-running GPU work can renew leases and handle cancellation predictably,
  • partition or expiry yields a defined local fallback (or explicit failure) rather than silent double execution.

Decision

  1. Authoritative remote execution v1 uses a single-owner lease recorded by the Populi control plane (or equivalent durable coordinator): exactly one remote worker holds the lease for a given task / correlation id until release, expiry, revocation, or verified handoff (if ever added later).
  2. Transport for handoff, renew, cancel, and result correlation remains A2A over the Populi HTTP control plane unless a future ADR replaces ADR 008 as the default control transport. Lease state may also be exposed via additive HTTP APIs as contracts evolve.
  3. No work-stealing in v1: the scheduler does not preempt an active lease holder in favor of another peer. Any preemption requires an explicit future design.
  4. Local fallback is required for the leased task class when lease acquisition fails, renewal fails, the worker is unhealthy, or the lease expires without completion, unless operator policy explicitly opts into fail-closed behavior for that profile (documented per deployment).
  5. Promotion trigger: shipping behavior where remote execution correctness or SLA depends on Populi (not merely “extra logging” or “hinting”) is a breaking adoption of this ADR and must be accompanied by contract tests, rollout docs, and updates to mens SSOT and unified orchestration.
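
The lease rules in points 1, 3, and 4 can be illustrated with a minimal sketch. It is not the Populi control plane and does not reflect any shipped API: `LeaseTable`, `Placement`, `place`, the in-memory dictionary, and the explicit `now` parameter are all hypothetical names chosen for illustration. A real coordinator would be durable and reached over the HTTP control plane.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


@dataclass
class Lease:
    correlation_id: str
    owner: str
    expires_at: float


class LeaseTable:
    """In-memory stand-in for a durable lease coordinator (hypothetical)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._leases: Dict[str, Lease] = {}

    def _active(self, correlation_id: str, now: float) -> Optional[Lease]:
        lease = self._leases.get(correlation_id)
        if lease is not None and lease.expires_at <= now:
            # Expiry: the lease lapses and the task becomes grantable again.
            del self._leases[correlation_id]
            return None
        return lease

    def grant(self, correlation_id: str, worker: str, now: float) -> bool:
        # Single owner, no work-stealing: an active lease blocks every new
        # grant, including one for the current holder (who must renew).
        if self._active(correlation_id, now) is not None:
            return False
        self._leases[correlation_id] = Lease(correlation_id, worker, now + self.ttl)
        return True

    def renew(self, correlation_id: str, worker: str, now: float) -> bool:
        # Only the current holder may extend an unexpired lease.
        lease = self._active(correlation_id, now)
        if lease is None or lease.owner != worker:
            return False
        lease.expires_at = now + self.ttl
        return True

    def release(self, correlation_id: str, worker: str, now: float) -> bool:
        # Only the current holder may release.
        lease = self._active(correlation_id, now)
        if lease is None or lease.owner != worker:
            return False
        del self._leases[correlation_id]
        return True

    def owner(self, correlation_id: str, now: float) -> Optional[str]:
        lease = self._active(correlation_id, now)
        return lease.owner if lease is not None else None


class Placement(Enum):
    REMOTE = "remote"
    LOCAL_FALLBACK = "local_fallback"
    FAILED = "failed"


def place(lease_ok: bool, worker_healthy: bool, fail_closed: bool = False) -> Placement:
    """Choose where the leased task class runs (sketch of decision point 4)."""
    if lease_ok and worker_healthy:
        return Placement.REMOTE
    # Default is local fallback; fail-closed is an explicit operator opt-in.
    return Placement.FAILED if fail_closed else Placement.LOCAL_FALLBACK


if __name__ == "__main__":
    table = LeaseTable(ttl_seconds=30.0)
    print(table.grant("task-1", "worker-a", now=0.0))   # first grant succeeds
    print(table.grant("task-1", "worker-b", now=10.0))  # rejected: lease is active
    print(place(lease_ok=False, worker_healthy=True))   # falls back to local
```

Passing `now` explicitly keeps the sketch deterministic and easy to test. A real coordinator would also need to decide whose clock is authoritative for expiry, which this ADR does not specify.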

Non-goals (this ADR)

  • Default WAN distributed training or collective-heavy schedules.
  • Hosted multi-tenant GPU donation networks (ADR 009 remains the future-scope boundary).
  • Merging remote_mesh durability semantics with local_durable queue ownership without a separate ADR.

Consequences

  • Experimental relay flags remain best-effort and non-authoritative until implementation aligns with this ADR.
  • New OpenAPI fields and orchestrator gating are expected to be additive and off by default during rollout.
  • Operators gain a stable vocabulary: lease grant / renew / release / expiry, correlation id, single owner, fallback.