Unified orchestration — SSOT
This document captures compatibility rules and opt-in migration toggles while MCP, CLI, and DeI share one orchestrator contract (vox-orchestrator).
Workspace journey store (Codex)
Repo-backed vox-mcp and vox-orchestrator-d open the primary VoxDb via connect_workspace_journey_optional (default .vox/store.db). Env: VOX_WORKSPACE_JOURNEY_STORE, VOX_WORKSPACE_JOURNEY_FALLBACK_CANONICAL (env SSOT). Daemon diagnostics: JSON-RPC method orch.workspace_journey (bind repository_id vs discovered repo).
Bridge / routing policy: Vox-first codegen remains the default MCP path (vox_generate_code, local inference server for vox generate); non-Vox edits stay bounded behind explicit tools and repository policy — see completion policy SSOT.
Journey envelope (v1): contracts/orchestration/journey-envelope.v1.schema.json is the machine SSOT for per-request metadata (journey_id, session_id, thread_id, trace/correlation ids, repository_id, origin_surface). MCP vox_chat_message embeds this shape in structured transcript payloads; CLI and daemon surfaces wire fields incrementally.
Canonical MENS dev journey (Codex): Tables developer_journey_definitions / developer_journey_steps (baseline fragment developer_journeys) seed canonical_journey.v1.greenfield_vox_mens_devloop. MCP vox_journey_canonical_steps returns ordered step_json rows when VoxDb is attached. Human-readable limitation ids for journey maturity live in contracts/journeys/limitations.v1.yaml.
DeI planning on the daemon: JSON-line DeI methods ai.plan.new, ai.plan.replan, ai.plan.status, and ai.plan.execute are handled on the vox-orchestrator-d stdio surface (orch_daemon::dei_dispatch); docs may still say vox-dei-d as the logical stdio peer. Persistent plan rows require the same Codex VoxDb handle the orchestrator was built with.
Ownership: who writes what
| Concern | Embedded MCP (vox-mcp) | vox-orchestrator-d (daemon) | VoxDb / Turso |
|---|---|---|---|
| Session chat transcript (RAM) | Orchestrator ContextStore in-process | Same process model per ADR 022 until RPC parity | — |
| Structured chat turns | chat_append_workspace_message + journey envelope v1 | Future orch.* parity for remote clients | conversation_messages, conversations |
Legacy chat_transcripts rows | MCP chat path (dual-write) | Not primary writer today | chat_transcripts |
| Workspace journey attach / diagnostics | connect_workspace_journey_optional, MCP tooling | JSON-RPC orch.workspace_journey | journey + repo bind rows |
Routing decisions (routing_decisions) | MCP chat / codegen tools; orchestrator AiTaskProcessor when DB attached | Same table when daemon shares DB | local-first SQLite |
| Unified routing experiment flag | — | — | VOX_UNIFIED_ROUTING (telemetry reason shape in vox-runtime::routing_telemetry) |
HITL Doubt Flow
The unified orchestrator integrates seamlessly with the vox-dei Human-In-The-Loop (HITL) crate. When agents detect ambiguity, they invoke the vox_doubt_task MCP tool. This transitions the task to TaskStatus::Doubted and emits a TaskDoubted event. The ResolutionAgent inside vox-dei then takes over to resolve the doubt with the user, submitting an audit report that hooks into the gamification system (vox-ludus). For structural details, see the canonical HITL Doubt Loop SSOT.
Contract surfaces
- Repo reconstruction campaigns: JSON Schema
contracts/orchestration/repo-reconstruction.schema.json; benchmark tiers and KPI guidance in repo reconstruction benchmark ladder. Remote task envelopes may include optionalexec_lease_idandcampaign_idfor mesh correlation (see ADR 017). - Types:
vox_orchestrator::contract—TaskCapabilityHints,SessionContractEnvelope,OrchestrationMigrationFlags(orchestration_v2_enabled,legacy_orchestration_fallback), MCP ↔ DeI plan tool alignment (MCP_PLAN_TOOL_NAMES,DEI_PLAN_METHODS_NEW_REPLAN_STATUS). - Runtime config:
vox_orchestrator::OrchestratorConfig— process-wide limits, Socrates gates, scaling knobs, and nestedorchestration_migration(OrchestrationMigrationFlags). Loaded fromVox.toml[orchestrator]andVOX_ORCHESTRATOR_*env overrides viaOrchestratorConfig::merge_env_overridesincrates/vox-orchestrator/src/config/.
Agent queue capabilities (TaskCapabilityHints)
On Orchestrator::spawn_agent, each new AgentQueue gets capabilities from merge_agent_capabilities (crates/vox-orchestrator/src/capability_probe.rs):
- Start from
default_agent_capabilitiesin config / TOML. - Overlay host probe via
probe_host_capabilities:cpu_cores(fromavailable_parallelism),arch(std::env::consts::ARCH),hostname(HOSTNAME/COMPUTERNAME, orsysinfowhen built withsystem-metrics). - Labels: config labels preserved first; probe-supplied labels appended without duplicates.
- GPU / NPU flags: operator config wins if already
true; otherwise probe may setgpu_cudawhenVOX_MESH_ADVERTISE_GPU=1|true(legacy workstation advertisement), orgpu_vulkan/gpu_webgpu/npufrom the matchingVOX_MESH_ADVERTISE_*vars (not driver probes). OptionalVOX_MESH_DEVICE_CLASSfillsdevice_class. See mobile / edge AI SSOT. min_vram_mb/min_cpu_cores: filled from probe only when unset in config.
Routing reads capability_requirements on tasks and applies GPU / VRAM / min_cpu_cores / prefer_gpu_compute soft penalties in crates/vox-orchestrator/src/services/routing.rs (mens / Mens-style training hints).
When MCP polls GET /v1/populi/nodes, each row becomes a RemotePopuliRoutingHint: if last_seen_unix_ms is older than orchestrator stale_threshold_ms at poll time, heartbeat_stale is set and experimental Populi routing signals skip that node (maintenance / quarantine were already excluded).
Optional VOX_ORCHESTRATOR_MESH_EXEC_LEASE_RECONCILE: same poll tick may call GET /v1/populi/exec/leases and compare each holder_node_id to the fresh node list (tracing target vox.mcp.populi_reconcile; Codex event mesh_exec_lease_reconcile when VOX_MESH_CODEX_TELEMETRY). Opt-in VOX_ORCHESTRATOR_MESH_EXEC_LEASE_AUTO_REVOKE performs POST /v1/populi/admin/exec-lease/revoke on mismatches (mesh/admin token; aggressive — see env SSOT).
See also mens SSOT for VOX_MESH_* and local registry.
Mesh distribution vs single-process embedding
- Embedding: Each
vox-mcp(orvox deiCLI) process constructs an in-memoryOrchestrator. That is “single-process gravity” for RAM-local queues and locks. - Distribution: With
VOX_MESH_ENABLED, durable coordination (locks, oplog mirror, A2A inboxes, heartbeats) is backed by Turso so another MCP or laptop can participate in the same logical mesh. Two nodes = two orchestrator instances sharing one cross-node SSOT via the DB and HTTP A2A relay — not one magic cluster master in RAM. - Bootstrap SSOT:
build_repo_scoped_orchestratorandbuild_repo_scoped_orchestrator_for_repositoryare the shared factory for MCP, CLI, and other embedders so repository id, affinity groups, and memory shard paths stay aligned.
For table-level detail and conflict rules, see Mens coordination.
A2A delivery planes
The orchestrator intentionally uses more than one delivery plane; these are not interchangeable transports with hidden semantics.
| Canonical plane | Current wire token(s) | Guarantees | Use for |
|---|---|---|---|
local_ephemeral | MCP route=local | in-process only, best-effort per-receiver FIFO, restart-volatile | low-latency same-node agent coordination |
local_durable | MCP route=db | durable row storage, explicit durable ack/poll semantics | cross-process local inboxes and persistence-friendly retries |
remote_mesh | MCP route=mesh, Populi HTTP A2A | HTTP relay with bearer/JWT auth, explicit inbox lease + ack, client-supplied idempotency | cross-node messaging and remote task envelopes |
broadcast | local bus broadcast, bulletin/event fanout | receiver-local ordering only, no shared durable semantics | fanout notifications |
stream | DeI JSON lines, vox-orchestrator-d orch.* JSON lines/TCP, MCP WS gateway, SSE, OpenClaw WS | ordered per connection/byte stream, reconnect semantics vary by transport | incremental output and live updates |
Machine-readable source of truth for these names lives in contracts/communication/protocol-catalog.yaml. MCP A2A responses surface the canonical plane names in addition to legacy wire tokens so callers can migrate without breaking compatibility.
Environment and config
OrchestratorConfig — VOX_ORCHESTRATOR_*
Boolean fields use Rust bool parsing (true / false only). Invalid values log a warning and leave the current setting unchanged.
| Variable | Maps to |
|---|---|
VOX_ORCHESTRATOR_ENABLED | enabled |
VOX_ORCHESTRATOR_MAX_AGENTS | max_agents |
VOX_ORCHESTRATOR_LOCK_TIMEOUT_MS | lock_timeout_ms |
VOX_ORCHESTRATOR_TOESTUB_GATE | toestub_gate |
VOX_ORCHESTRATOR_MAX_DEBUG_ITERATIONS | max_debug_iterations |
VOX_ORCHESTRATOR_SOCRATES_GATE_SHADOW | socrates_gate_shadow |
VOX_ORCHESTRATOR_SOCRATES_GATE_ENFORCE | socrates_gate_enforce |
VOX_ORCHESTRATOR_SOCRATES_REPUTATION_ROUTING | socrates_reputation_routing |
VOX_ORCHESTRATOR_SOCRATES_REPUTATION_WEIGHT | socrates_reputation_weight |
VOX_ORCHESTRATOR_TRUST_GATE_RELAX_ENABLED | trust_gate_relax_enabled — when true and Codex agent_reliability for the agent is ≥ trust_gate_relax_min_reliability, Socrates enforce, completion grounding enforce, and strict scope may skip completion requeue / enqueue denial (see PolicyTrustRelax). |
VOX_ORCHESTRATOR_TRUST_GATE_RELAX_MIN_RELIABILITY | trust_gate_relax_min_reliability — minimum reliability (default 0.85, aligned with trust auto-approve floor). |
VOX_ORCHESTRATOR_ATTENTION_ENABLED / VOX_ORCHESTRATOR_ATTENTION_BUDGET_MS / VOX_ORCHESTRATOR_ATTENTION_ALERT_THRESHOLD / VOX_ORCHESTRATOR_ATTENTION_INTERRUPT_COST_MS / VOX_ORCHESTRATOR_ATTENTION_TRUST_ROUTING_WEIGHT | Pilot attention budget + dynamic interruption gating (see information-theoretic-questioning.md, env-vars.md). Vox.toml also supports [orchestrator].interruption_calibration for per-channel gain offsets and backlog/trust calibration. |
VOX_ORCHESTRATOR_LOG_LEVEL | log_level (raw string) |
VOX_ORCHESTRATOR_FALLBACK_SINGLE | fallback_to_single_agent |
VOX_ORCHESTRATOR_MIN_AGENTS | min_agents |
VOX_ORCHESTRATOR_SCALING_THRESHOLD | scaling_threshold |
VOX_ORCHESTRATOR_IDLE_RETIREMENT_MS | idle_retirement_ms |
VOX_ORCHESTRATOR_SCALING_ENABLED | scaling_enabled |
VOX_ORCHESTRATOR_COST_PREFERENCE | cost_preference (performance | economy) |
VOX_ORCHESTRATOR_SCALING_LOOKBACK | scaling_lookback_ticks |
VOX_ORCHESTRATOR_RESOURCE_WEIGHT | resource_weight |
VOX_ORCHESTRATOR_RESOURCE_CPU_MULT | resource_cpu_multiplier |
VOX_ORCHESTRATOR_RESOURCE_MEM_MULT | resource_mem_multiplier |
VOX_ORCHESTRATOR_RESOURCE_EXPONENT | resource_exponent |
VOX_ORCHESTRATOR_SCALING_PROFILE | scaling_profile (conservative | balanced | aggressive) |
VOX_ORCHESTRATOR_MAX_SPAWN_PER_TICK | max_spawn_per_tick |
VOX_ORCHESTRATOR_SCALING_COOLDOWN_MS | scaling_cooldown_ms |
VOX_ORCHESTRATOR_URGENT_REBALANCE_THRESHOLD | urgent_rebalance_threshold |
VOX_ORCHESTRATOR_MIGRATION_V2_ENABLED | orchestration_migration.orchestration_v2_enabled |
VOX_ORCHESTRATOR_MIGRATION_LEGACY_FALLBACK | orchestration_migration.legacy_orchestration_fallback |
VOX_ORCHESTRATOR_MESH_CONTROL_URL | populi_control_url — HTTP base for GET /v1/populi/nodes (read-only); MCP vox_orchestrator_status includes mesh_snapshot JSON when set. Uses VOX_MESH_TOKEN on the client when present. Does not change task routing. |
VOX_ORCHESTRATOR_MESH_REMOTE_EXECUTE_EXPERIMENTAL | populi_remote_execute_experimental (TOML alias: mesh_remote_execute_experimental) — enables staged rollout for remote task-envelope dispatch over populi A2A relay (with local fallback). |
VOX_ORCHESTRATOR_MESH_REMOTE_LEASE_GATING_ENABLED | populi_remote_lease_gating_enabled (TOML: mesh_remote_lease_gating_enabled) — when true with matching roles, relay is awaited before local enqueue; success puts the task in remote-hold (single owner, no local dequeue). Relay failure deterministically falls back to local queue only (no fire-and-forget duplicate relay). |
VOX_ORCHESTRATOR_MESH_REMOTE_LEASE_GATED_ROLES | populi_remote_lease_gated_roles — comma-separated planner, builder, verifier, reproducer, researcher (case-insensitive). Empty list means no task matches gating. |
VOX_ORCHESTRATOR_MESH_REMOTE_RESULT_POLL_INTERVAL_SECS | populi_remote_result_poll_interval_secs (TOML alias: mesh_remote_result_poll_interval_secs) — remote_task_result inbox poll interval in seconds; 0 disables. Implemented in vox_orchestrator::a2a::spawn_populi_remote_result_poller (MCP and other embedders pass a join slot). |
VOX_ORCHESTRATOR_MESH_REMOTE_WORKER_POLL_INTERVAL_SECS | populi_remote_worker_poll_interval_secs (TOML alias: mesh_remote_worker_poll_interval_secs) — remote_task_envelope worker poll interval in seconds; 0 disables remote worker consumption while keeping result polling optional. Implemented in vox_orchestrator::a2a::spawn_populi_remote_worker_poller. |
VOX_ORCHESTRATOR_MESH_REMOTE_RESULT_MAX_MESSAGES_PER_POLL | populi_remote_result_max_messages_per_poll — per-page size when draining the parent mesh inbox for remote_task_result rows (minimum 1; default 64). The poller walks cursor pages (before_message_id, newest-first) up to a fixed cap so deep inboxes do not hide older results behind unrelated A2A mail. |
Populi client helpers now expose typed HTTP status errors (PopuliRegistryError::HttpStatus) and non-claimer inbox cursor paging (before_message_id, plus A2AInboxPager), so orchestrator fallback logic can branch on status codes (403/404/409) without brittle string matching.
Placement and lease observability (roadmap contract)
Phase 5 (scheduler unification) targets decision reason codes and structured fields so operators can audit why a task ran locally, on a lease-held remote worker, or on a cloud dispatch surface. Until code catches up, rely on the experimental toggles in the table above and on mens SSOT.
Documentation contract for eventual stable instrumentation (field names may differ slightly in Rust, but the concepts are stable):
| Field / concept | Purpose |
|---|---|
task_id | Correlate orchestrator task lifecycle across logs and traces. |
lease_id | Correlate remote execution with Populi lease records when ADR 017 semantics are implemented. |
placement_reason | Machine-readable code for the selected execution surface (local vs lease-remote vs cloud dispatch). |
populi_node_id / claimer_node_id | Mesh identity for inbox claims and execution attribution where applicable. |
Current stable placement_reason codes:
local_queue_defaultpopuli_remote_lease_holdlocal_queue_fallback_after_remote_relay_error
Rollout and kill switches: Populi remote execution rollout checklist. Work-type boundaries: placement policy matrix.
Other CLI / data plane
Canonical descriptions for VOX_BENCHMARK_TELEMETRY / VOX_SYNTAX_K_TELEMETRY (and related Codex row shapes) live in env-vars.md. Trust boundaries for optional telemetry: telemetry-trust-ssot.
| Variable | Purpose |
|---|---|
VOX_BENCHMARK_TELEMETRY | When 1 / true, CLI benchmark entry points append benchmark_event rows via VoxDb::record_benchmark_event. |
VOX_SYNTAX_K_TELEMETRY | When 1 / true, syntax-K benchmark classes append syntax_k_event rows via VoxDb::record_syntax_k_event (session syntaxk:<repository_id>). If unset, falls back to VOX_BENCHMARK_TELEMETRY. |
VOX_WORKFLOW_JOURNAL_CODEX_OFF | When 1 / true, skip Codex append for interpreted workflow journal rows. By default, when DB config resolves after vox workflow run / vox mens workflow run ( workflow-runtime ), Vox appends versioned workflow journal rows via VoxDb::record_workflow_journal_entry (session workflow:<repository_id>, metric workflow_journal_entry). Rows can include lifecycle events, retry events (ActivityAttemptRecovered, ActivityAttemptFailed, ActivityRetryScheduled), replay events, and per-step payloads (for example MeshActivity / MeshActivitySkipped) keyed by durable run_id + activity_id semantics described in durable execution. |
VOX_MESH_MAX_STALE_MS | Client-side filter for mens node lists in MCP snapshots (see mens SSOT). |
VOX_MESH_CODEX_TELEMETRY | When 1 / true, append populi_control_event rows via VoxDb::record_populi_control_event (session mens:<repository_id>): after vox run local registry publish when the CLI was built with populi (includes vox-populi), after vox-mcp startup publish when mens is enabled, and after MCP vox_orchestrator_status mens HTTP snapshot when Codex is connected. Implementation: vox_db::populi_registry_telemetry. Never stores VOX_MESH_TOKEN. |
VOX_MCP_LLM_COST_EVENTS | Optional override for MCP LLM CostIncurred bus events vs Codex-only accounting; see vox-mcp.md. |
VOX_REPOSITORY_ROOT | Optional directory for repository_id discovery in benchmark telemetry (and other CLI paths that adopt the same pattern); align with MCP’s discovered repo root when subprocess CWD differs. |
TOML: under [orchestrator], set orchestration_migration = { orchestration_v2_enabled = true, … } (field names match OrchestrationMigrationFlags in crates/vox-orchestrator/src/contract.rs). When v2 is enabled, MCP vox_submit_task success JSON may include orchestration_contract { "v2" as a client hint.
Optional [mens] in Vox.toml merges mens scope/URL/labels for CLI and MCP (see mens SSOT); env wins per field when set.
Effective Socrates thresholds still merge from vox-socrates-policy with optional overrides in OrchestratorConfig::socrates_policy — no literal drift outside the policy crate + merge logic.
Deprecation / compatibility matrix (current)
| Surface | Rule |
|---|---|
| MCP tool names | Add aliases before removing names; vox_plan, vox_replan, vox_plan_status stay stable. |
| DeI RPC ids | ai.plan.* method strings unchanged (vox_cli::dei_daemon::method). |
| Orchestrator daemon RPC ids | orch.* method strings are versioned in vox_protocol::orch_daemon_method; contract schema contracts/orchestration/orch-daemon-rpc-methods.schema.json. |
| File sessions + Codex | Both remain valid; MCP SessionManager uses with_db when Codex is attached. |
vox db | Remains implementation SSOT; vox scientia is a documented facade only. |
Related docs
- ADR 017: Populi lease-based remote execution — ownership model (design intent).
- ADR 018: Populi GPU truth layering — verified inventory vs labels.
- Populi work-type placement matrix — local / LAN / overlay policy.
external-repositories.md—repository_id, sessions, cache layout.socrates-protocol.md— Socrates telemetry and policy.mens-training.md— training backends and env.