"Vox 0.4 Grand Migration Plan (Uncompressed)"

Vox 0.4 Grand Migration Plan (Full Ingestion)

Research completed: 2026-04-09

Note: This document ingests and updates the original 254-task vox_agentic_loop_and_mens_plan blueprint, applying corrections from the latest 9 research tracks (including an EBNF/Earley replacement for GBNF, median-centered MC-GRPO advantages instead of mean-centered ones, and Kalman filter trust updates). Nothing has been compressed.

Part 1 — OOPAV Loop Architecture

+----------------------------------------------------------+
|                 OOPAV Agent Execution Loop               |
|                                                          |
|  +----------+  evidence   +-----------+  risk band       |
|  | OBSERVE  |-----------> |  ORIENT   |--------->        |
|  |(Scientia)|             | (Socrates)|                  |
|  +-----^----+             +-----+-----+                  |
|        | watch                  | plan-or-act            |
|  +-----+----+             +-----v-----+                  |
|  |  VERIFY  |<-- result --|   PLAN    |                  |
|  |(Harness) |             | (Planner) |                  |
|  +-----+----+             +-----+-----+                  |
|        | pass/fail          dispatch                     |
|  +-----v----+             +-----v-----+                  |
|  | complete |             |    ACT    |                  |
|  |  or      |             |(Builder + |                  |
|  | re-plan  |             |  MENS)    |                  |
|  +----------+             +-----------+                  |
+----------------------------------------------------------+

Part 2 — Implementation Waves (270+ Tasks)

Wave 0 — Foundations, Schema & Compiler Diagnostics (Days 1-4)

Add missing_cases: Vec<String> to vox_compiler::typeck::Diagnostic
Add ast_node_kind: Option<String> to Diagnostic
Populate missing_cases in match exhaustiveness checker checker/match_exhaust.rs
Add missing_cases to JSON serialization output
Enrich Diagnostic with stable error codes (E0101, E0201, E0301, etc.)
Define ObservationReport struct in vox-orchestrator/src/observer.rs (if not fully defined in vox-db)
Define ObserverAction enum: Continue, RequestMoreEvidence, TriggerReplan, EscalateToHuman, EmitNegativeExample
Add observer_enabled, observer_poll_interval_ms to OrchestratorConfig
Define TestDecision enum: Required, Recommended, Optional, Deferred, Skip
Define TestDecisionPolicy struct with threshold, keyword, and extension fields
Add test_decision_policy: TestDecisionPolicy to OrchestratorConfig
Define VictoryCondition enum: CompilationOnly, WithDocTests, WithUnitTests, WithCorpusValidation, Full
Add victory_condition: VictoryCondition to AgentTask
Create crates/vox-grammar-export/ with Cargo.toml and src/lib.rs
Define GrammarFormat, GrammarExportConfig, GrammarExportResult
Add Arca migration V40: observer_events table
Add Arca migration V40: test_decisions table
Add Arca migration V40: victory_verdicts table
Add Arca migration V40: mens_corpus_quality table
Add Arca migration V40: grpo_training_run table
Write Arca CRUD: insert_observer_event, list_observer_events_for_task, insert_test_decision, insert_victory_verdict
Write Arca CRUD: upsert_corpus_quality, insert_grpo_step
Add all tables to Codex facade
Write unit tests for all CRUD methods (min 2 tests each)
Run vox ci clavis-parity and vox stub-check --path crates/vox-grammar-export
Confirm zero stubs in Wave 0 deliverables.
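
The Wave 0 data shapes can be sketched as follows. This is a minimal illustration, not the vox_compiler definition: only missing_cases, ast_node_kind, the stable error code, and the enum variant names come from the task list. The message field, the render helper, and the sample values used with it are assumptions made for the example.

```rust
/// Hypothetical mirror of the enriched `Diagnostic`.
/// `missing_cases`, `ast_node_kind`, and the stable code come from the plan;
/// `message` and `render` are assumptions for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Diagnostic {
    pub code: String,                  // stable error code such as "E0201"
    pub message: String,               // assumed human-readable text
    pub missing_cases: Vec<String>,    // filled by the exhaustiveness checker
    pub ast_node_kind: Option<String>, // kind of AST node the error points at
}

/// Variants exactly as listed in the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ObserverAction {
    Continue,
    RequestMoreEvidence,
    TriggerReplan,
    EscalateToHuman,
    EmitNegativeExample,
}

/// Variants exactly as listed in the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VictoryCondition {
    CompilationOnly,
    WithDocTests,
    WithUnitTests,
    WithCorpusValidation,
    Full,
}

impl Diagnostic {
    /// Renders the diagnostic on one line, appending the node kind and the
    /// missing cases only when they are present.
    pub fn render(&self) -> String {
        let mut out = format!("[{}] {}", self.code, self.message);
        if let Some(kind) = &self.ast_node_kind {
            out.push_str(&format!(" (in {})", kind));
        }
        if !self.missing_cases.is_empty() {
            out.push_str(&format!("; missing cases: {}", self.missing_cases.join(", ")));
        }
        out
    }
}
```

Keeping missing_cases as structured data rather than text embedded in the message is what lets later waves (replan, negative-example ingestion) consume it without re-parsing the message.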

Wave 1 — Grammar Export from Compiler (Days 5-8)

Audit crates/vox-compiler/src/parser/ — catalog all production rules.
Create vox-grammar-export/src/ebnf.rs — EBNF emitter
Implement EbnfEmitter::emit_rule(name, alternates, terminals)
Implement EbnfEmitter::emit_all() — covers all top-level Vox rules
Create vox-grammar-export/src/gbnf.rs — GBNF emitter (lossy fallback)
Implement GbnfEmitter::from_ebnf(ebnf) -> GbnfDocument
Handle all Vox keywords in GBNF output
Implement GbnfEmitter::emit_string() -> String
Create vox-grammar-export/src/lark.rs — Lark emitter for bridge integration
Create vox-grammar-export/src/json_schema.rs — AST JSON Schema emitter
Define VoxAstNode JSON schema recursively
Expose vox grammar export --format ebnf|gbnf|lark|json-schema --output <file> CLI
Expose vox_grammar_export(format) MCP tool
Write vox-grammar-export/src/versioning.rs — compute hash of rules for semver drift check
Replace vox_grammar_prompt() stub with derived cheatsheet from real EBNF grammar (target <200 tokens)
Write tests: emitted EBNF structural validity
Write tests: 10 known-valid programs accepted by GBNF/EBNF
Write tests: 5 known-invalid programs rejected
Add vox ci grammar-export-check and vox ci grammar-drift CI steps
Add grammar_export_path to MensTrainingConfig
Run vox stub-check --path crates/vox-grammar-export, full test suite
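
A rough sketch of the emitter and the drift hash described in this wave. The plan names EbnfEmitter::emit_rule(name, alternates, terminals) but does not fix the argument types, so the signature below is an assumption, as are the ISO-style separators and the grammar_hash helper. FNV-1a is swapped in here only because it is stable across Rust releases, unlike the standard library's default hasher; the plan does not say which hash versioning.rs uses.

```rust
/// Hypothetical emitter; types and output style are assumptions.
pub struct EbnfEmitter;

impl EbnfEmitter {
    /// Emits one rule as `name = a , b | c ;`.
    /// Each alternate is a sequence of symbols; any symbol listed in
    /// `terminals` is quoted, everything else is treated as a rule reference.
    pub fn emit_rule(name: &str, alternates: &[Vec<&str>], terminals: &[&str]) -> String {
        let alts: Vec<String> = alternates
            .iter()
            .map(|seq| {
                seq.iter()
                    .map(|sym| {
                        if terminals.contains(sym) {
                            format!("\"{}\"", sym)
                        } else {
                            (*sym).to_string()
                        }
                    })
                    .collect::<Vec<_>>()
                    .join(" , ")
            })
            .collect();
        format!("{} = {} ;", name, alts.join(" | "))
    }
}

/// Order-independent 64-bit FNV-1a hash over the emitted rules (assumed
/// helper). A changed hash signals grammar drift to the CI check.
pub fn grammar_hash(rules: &[String]) -> u64 {
    let mut sorted: Vec<&String> = rules.iter().collect();
    sorted.sort(); // sort so that rule order does not affect the hash
    let mut hash: u64 = 0xcbf29ce484222325; // FNV-1a offset basis
    for rule in sorted {
        // A newline is appended after each rule so rule boundaries count.
        for byte in rule.bytes().chain(std::iter::once(b'\n')) {
            hash ^= byte as u64;
            hash = hash.wrapping_mul(0x100000001b3); // FNV-1a prime
        }
    }
    hash
}
```

For a made-up rule, emit_rule("if_expr", &[vec!["if", "expr", "block"]], &["if"]) yields `if_expr = "if" , expr , block ;`. The rule name is illustrative and is not taken from the Vox grammar.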

Wave 2 — Observer Sub-Agent & Trust System (Days 9-13)

Create vox-orchestrator/src/observer.rs — Observer struct
Implement Observer::observe_file(path) -> ObservationReport
Implement Observer::observe_rust_file(path) -> ObservationReport
Implement Observer::start_watching(file_paths) -> JoinHandle
Implement Observer::drain_reports() -> Vec<ObservationReport>
Add observer: Option<Arc<Observer>> to Orchestrator
Wire Observer startup into Orchestrator::spawn_agent
Wire Observer shutdown into Orchestrator::retire_agent
Emit VisualizerEventKind::ObservationRecorded from viz_sink
Implement Observer::compute_action(report, policy) -> ObserverAction
Add observation_history: VecDeque<ObservationReport> (cap 20) to AgentTask
Feed ObservationReport into Arca observer_events
Add variance: f64 to AgentTrustScore initialized to 0.25 (Kalman filter setup)
Replace greedy routing with UCB exploration in routing.rs
Replace EWMA update with Kalman filter in AgentTrustScore::record_outcome
Implement Empirical Bayes priors for new agents in trust_telemetry.rs
Implement Observer::summarize(task_id) -> ObservationSummary
Add observation_summary to CompletionAttestation
Write unit tests: compute_action correctness
Write unit tests: Kalman filter converges faster than EWMA
Write unit tests: UCB exploration spreads load
Expose vox_observer_status(task_id) MCP tool
Run vox stub-check, cargo test -p vox-orchestrator
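
The trust update and routing changes in this wave could look roughly like the sketch below. It assumes a scalar Kalman filter over a success/failure observation and the classic UCB1 bonus. Only the initial variance of 0.25 comes from the plan; the initial mean of 0.5, both noise constants, and the pick_agent helper are assumptions.

```rust
/// Hypothetical trust score: a mean estimate plus its uncertainty.
#[derive(Debug, Clone, Copy)]
pub struct AgentTrustScore {
    pub mean: f64,
    pub variance: f64,
}

impl AgentTrustScore {
    /// Variance 0.25 follows the plan; the 0.5 starting mean is assumed.
    pub fn new() -> Self {
        Self { mean: 0.5, variance: 0.25 }
    }

    /// Scalar Kalman update with the outcome observed as 1.0 or 0.0.
    pub fn record_outcome(&mut self, success: bool) {
        const PROCESS_NOISE: f64 = 0.001; // assumed drift between observations
        const MEASUREMENT_NOISE: f64 = 0.25; // assumed noise of one outcome
        let observed = if success { 1.0 } else { 0.0 };
        let predicted = self.variance + PROCESS_NOISE;
        let gain = predicted / (predicted + MEASUREMENT_NOISE);
        self.mean += gain * (observed - self.mean);
        self.variance = (1.0 - gain) * predicted;
    }
}

/// UCB1 score: estimated mean plus an exploration bonus that shrinks as an
/// agent receives more tasks. Agents with no tasks yet score infinity.
pub fn ucb_score(mean: f64, pulls: u64, total_pulls: u64, c: f64) -> f64 {
    if pulls == 0 {
        return f64::INFINITY;
    }
    mean + c * ((total_pulls.max(1) as f64).ln() / pulls as f64).sqrt()
}

/// Picks the index with the highest UCB score; `stats` holds (mean, pulls).
pub fn pick_agent(stats: &[(f64, u64)], c: f64) -> Option<usize> {
    let total: u64 = stats.iter().map(|(_, n)| *n).sum();
    let mut best: Option<(usize, f64)> = None;
    for (i, (mean, pulls)) in stats.iter().enumerate() {
        let score = ucb_score(*mean, *pulls, total, c);
        if best.map_or(true, |(_, b)| score > b) {
            best = Some((i, score));
        }
    }
    best.map(|(i, _)| i)
}
```

Unlike a fixed-rate EWMA, the Kalman gain starts high while the variance is large and falls as evidence accumulates, which is the behaviour the convergence test in this wave is meant to confirm.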

Wave 3 — Orient Phase & LLM Plan Adequacy (Days 14-19)

Define OrientReport (evidence_gap, risk_band, planning_complexity, etc.)
Implement orient_phase(ctx, policy) -> OrientReport
Implement OrientPhase::request_missing_evidence(gap)
Add orient_report to SocratesTaskContext
Wire risk_band: Red -> block act; Black -> halt + escalate
Remove word-count complexity heuristic from plan_adequacy.rs
Remove keyword vagueness blacklist
Add precondition assertion requirement per plan step
Implement Socrates LLM-as-judge logic for plan evaluation scoring (Coverage, Dep, Destructive, Concreteness, Verification)
Wire answered questions back into SocratesTaskContext
Implement OrientPhase::classify_task_category(description) -> TaskCategory
Write tests: orient phase evidence requests
Write tests: Socrates judge blocks inadequate plans
Write tests: QA router answer propagation
Emit VisualizerEventKind::OrientCompleted
Run vox stub-check, test suite
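
The risk-band wiring can be sketched as a single gate function. The plan names only the Red and Black bands and three OrientReport fields. The Green and Amber variants, the field types, and the OrientGate enum below are assumptions added to make the example complete.

```rust
/// Risk bands; only `Red` and `Black` are named in the plan, the other two
/// variants are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RiskBand {
    Green,
    Amber,
    Red,
    Black,
}

/// Subset of the OrientReport fields listed in the plan (types assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct OrientReport {
    pub evidence_gap: f64,
    pub risk_band: RiskBand,
    pub planning_complexity: f64,
}

/// Hypothetical outcome of the orient gate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OrientGate {
    Proceed,
    BlockAct,
    HaltAndEscalate,
}

/// Red blocks the act phase; Black halts the loop and escalates.
pub fn gate_for(report: &OrientReport) -> OrientGate {
    match report.risk_band {
        RiskBand::Black => OrientGate::HaltAndEscalate,
        RiskBand::Red => OrientGate::BlockAct,
        RiskBand::Green | RiskBand::Amber => OrientGate::Proceed,
    }
}
```

Matching exhaustively on the band means a newly added band fails to compile until its gate behaviour is decided.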

Wave 4 — Testing Decision Engine (Days 20-24)

Implement TestDecisionPolicy::evaluate(task, orient) -> TestDecision
Rule: security keywords -> Required
Rule: .vox in manifest -> Required
Rule: complexity >= threshold -> Required
Rule: file_count > threshold -> Recommended
Rule: risk_band Red -> Required
Rule: docs/config only -> Skip
Rule: evidence_gap > 0.4 -> Deferred
Persist TestDecision to test_decisions table after every call
Fix plan_has_verification_hint to check file manifests
Promote heavy_without_test_hint to hard blocker
Score = 0.0 when test_required_count > test_present_count
Add TestDecision to TaskDescriptor
PlanBridge: block dispatch if required and no test file
Add test_decision_policy to config
Write tests: matrix of test decision inputs
Expose vox_test_decision(task_id) MCP tool
Update vox plan new CLI to render test decisions per step
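
The rules above can be sketched as one evaluation function. The plan does not state rule precedence, so the order used here (any Required rule first, then Deferred, Skip, Recommended, and Optional as the fallback) is an assumption. The flattened TaskFacts input stands in for the plan's (task, orient) pair, and all field names are hypothetical. Keyword matching is a plain substring test with no word-boundary handling.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TestDecision {
    Required,
    Recommended,
    Optional,
    Deferred,
    Skip,
}

/// Threshold, keyword, and extension fields as named in Wave 0; the exact
/// field names and types are assumptions.
pub struct TestDecisionPolicy {
    pub complexity_threshold: f64,
    pub file_count_threshold: usize,
    pub security_keywords: Vec<String>,
    pub docs_config_extensions: Vec<String>,
}

/// Hypothetical flattened view of the task and its orient report.
pub struct TaskFacts {
    pub description: String,
    pub files: Vec<String>,
    pub complexity: f64,
    pub risk_red: bool,
    pub evidence_gap: f64,
}

impl TaskFacts {
    /// Low-risk defaults so callers only override what they need.
    pub fn new(description: &str, files: &[&str]) -> Self {
        Self {
            description: description.to_string(),
            files: files.iter().map(|f| f.to_string()).collect(),
            complexity: 0.0,
            risk_red: false,
            evidence_gap: 0.0,
        }
    }
}

impl TestDecisionPolicy {
    pub fn evaluate(&self, t: &TaskFacts) -> TestDecision {
        let desc = t.description.to_lowercase();
        let security = self
            .security_keywords
            .iter()
            .any(|k| desc.contains(k.to_lowercase().as_str()));
        let touches_vox = t.files.iter().any(|f| f.ends_with(".vox"));
        // Assumed precedence: any Required rule wins outright.
        if security || touches_vox || t.risk_red || t.complexity >= self.complexity_threshold {
            return TestDecision::Required;
        }
        if t.evidence_gap > 0.4 {
            return TestDecision::Deferred;
        }
        let docs_only = !t.files.is_empty()
            && t.files.iter().all(|f| {
                self.docs_config_extensions.iter().any(|e| f.ends_with(e.as_str()))
            });
        if docs_only {
            return TestDecision::Skip;
        }
        if t.files.len() > self.file_count_threshold {
            return TestDecision::Recommended;
        }
        TestDecision::Optional
    }
}
```

Because the precedence is an assumption, the test matrix for this wave is the right place to pin down cases where two rules disagree, for example a docs-only change whose description contains a security keyword.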

Wave 5 — Multi-Tier Victory Conditions (Days 25-30)

Create vox-orchestrator/src/victory.rs — VictoryEvaluator
Implement tier1_toestub(task) -> TierResult
Implement tier2_lsp(task) -> TierResult
Implement tier3_cargo_check(task) -> TierResult
Implement tier4_cargo_doc_test(task) -> TierResult
Implement tier5_cargo_unit_test(task, filter) -> TierResult
Implement tier6_vox_corpus_eval(task) -> TierResult (parse rate >= 99.5%)
Implement tier7_harness_contracts
Implement tier8_socrates_confidence
Implement tier9_plan_adequacy_retrospective
Implement evaluate(task, condition) -> VictoryVerdict
Replace post-task validate with evaluator
Persist to Arca victory_verdicts
Wire failures to TriggerReplan
Write tests for each tier result
Update AgentHarnessSpec to mandate independent verification
Expose vox_victory_status MCP tool
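
A minimal sketch of how the tier results might fold into a verdict, assuming tiers run in order and the first failure among the required tiers decides the outcome. The plan gives the tier names and the 99.5% parse-rate bar. The TierResult and VictoryVerdict shapes, the required_tiers parameter, and the mapping from VictoryCondition to a tier count (not shown) are assumptions.

```rust
/// Hypothetical result of one verification tier.
#[derive(Debug, Clone, PartialEq)]
pub enum TierResult {
    Pass,
    Fail(String),
    Skipped,
}

/// Hypothetical verdict; `tier` is 1-based to match the tier names.
#[derive(Debug, Clone, PartialEq)]
pub enum VictoryVerdict {
    Victory,
    Defeat { tier: usize, reason: String },
}

/// Tier 6 gate: the corpus parse rate must be at least 99.5%.
/// The comparison uses integers so the threshold is exact.
pub fn corpus_tier(parsed: usize, total: usize) -> TierResult {
    if total == 0 {
        return TierResult::Skipped; // nothing to evaluate
    }
    if parsed * 1000 >= total * 995 {
        TierResult::Pass
    } else {
        let rate = parsed as f64 / total as f64 * 100.0;
        TierResult::Fail(format!("parse rate {:.2}% is below 99.5%", rate))
    }
}

/// Checks the first `required_tiers` results in order and reports the first
/// failure; skipped tiers do not count as failures in this sketch.
pub fn evaluate(tiers: &[TierResult], required_tiers: usize) -> VictoryVerdict {
    for (i, result) in tiers.iter().take(required_tiers).enumerate() {
        if let TierResult::Fail(reason) = result {
            return VictoryVerdict::Defeat { tier: i + 1, reason: reason.clone() };
        }
    }
    VictoryVerdict::Victory
}
```

Returning the failing tier number alongside the reason gives the replan trigger in Wave 6 something concrete to act on.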

Wave 6 — Dynamic Replan Trigger (Days 31-35)

Add replan_trigger to AgentTask
Define ReplanTrigger struct
Implement handle_replan_trigger
Wire replan back to orchestrator PlanBridge
Implement ReplanScheduler (cooldown limits)
Add replan_history to session
Emit ReplanTriggered visualizer event
Implement ReplanPolicy defaults
Expose vox_replan_status MCP tool
Tests: Trigger creation on failures, cooldowns respected, max limits hit
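
The cooldown and max-limit behaviour that these tests target can be sketched as below. ReplanScheduler and ReplanPolicy are names from the plan; their fields, the ReplanDecision enum, and the choice to pass the current time in explicitly (which keeps tests deterministic) are assumptions.

```rust
use std::time::{Duration, Instant};

/// Hypothetical policy fields; the plan only says defaults exist.
pub struct ReplanPolicy {
    pub cooldown: Duration,
    pub max_replans: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReplanDecision {
    Allowed,
    CoolingDown,
    LimitReached,
}

pub struct ReplanScheduler {
    policy: ReplanPolicy,
    last_replan: Option<Instant>,
    count: u32,
}

impl ReplanScheduler {
    pub fn new(policy: ReplanPolicy) -> Self {
        Self { policy, last_replan: None, count: 0 }
    }

    /// Decides whether a replan may start at `now`. Only allowed requests
    /// consume budget and restart the cooldown window.
    pub fn request(&mut self, now: Instant) -> ReplanDecision {
        if self.count >= self.policy.max_replans {
            return ReplanDecision::LimitReached;
        }
        if let Some(last) = self.last_replan {
            if now.saturating_duration_since(last) < self.policy.cooldown {
                return ReplanDecision::CoolingDown;
            }
        }
        self.last_replan = Some(now);
        self.count += 1;
        ReplanDecision::Allowed
    }
}
```

With a 5 second cooldown and a limit of 2, a request at t0 is allowed, one at t0 + 1s is cooling down, one at t0 + 10s is allowed, and any later request hits the limit.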

Wave 7 — Scientia as Live Observer Feed (Days 36-40)

Define ScientiaObservation
Implement ScientiaObserver::observe_session
Implement ScientiaObserver::recommend_corpus_ingestion
Wire into Observer::observe_file
Set EmitNegativeExample when score < 0.3
Implement auto_ingest_to_mens for valid snippets
Implement auto_ingest_negative for invalid snippets
Wire into replan logic
Add vox_scientia_observe MCP tool
Add vox scientia observe --session CLI
Write full integration tests linking observation to corpus ingestion

Wave 8 — MENS Corpus Surgery & AST-Eval Upgrade (Days 41-48)

Tag corpus pairs with origin: Origin enum (Human, Synthetic, Agent)
Ingest parse failures as hard negatives directly
Implement Anna Karenina sampling (min 30% negatives per batch)
Implement Experience Replay Buffer (base data mix-cd 10%)
Write AI slop curator gate for Scientia validation
Write validate_batch.rs
Run batch validation on current synthetic data
Update metadata.json with validator metrics
Add vox-eval/src/ast_eval.rs using actual parser
Define AstEvalReport with node count, test presence, error spans
Deprecate regex-based eval methods
Tie coverage score to AST evaluation
Define RewardSignal { parse_score, test_score, coverage_score, composite }
Modify Reward calculation: syntax must gate everything (syntax=0 -> composite=0). Do not add an AST density reward metric, to prevent Goodhart-style reward hacking.
Update JsonlDataLoader logic
Write AST-Eval tests and Quality Report CLI tasks
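
Two of the rules in this wave lend themselves to a short sketch: the syntax-gated reward and the 30% negative floor per batch. The RewardSignal fields come from the plan. The equal weighting of test and coverage scores, the helper names, and the deterministic (unshuffled) batch composition are assumptions; a real loader would presumably sample at random.

```rust
/// Fields as listed in the plan; the weighting below is assumed.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RewardSignal {
    pub parse_score: f64,
    pub test_score: f64,
    pub coverage_score: f64,
    pub composite: f64,
}

impl RewardSignal {
    /// The parse score multiplies everything else, so a snippet that does
    /// not parse (parse_score = 0) always receives composite = 0.
    pub fn new(parse_score: f64, test_score: f64, coverage_score: f64) -> Self {
        let composite = parse_score * (0.5 * test_score + 0.5 * coverage_score);
        Self { parse_score, test_score, coverage_score, composite }
    }
}

/// Smallest number of negatives that makes up at least 30% of a batch.
pub fn min_negatives(batch_size: usize) -> usize {
    (batch_size * 3 + 9) / 10 // integer ceiling of 0.3 * batch_size
}

/// Builds a batch with exactly the minimum number of negatives, or returns
/// `None` when either pool is too small to fill it.
pub fn compose_batch<T: Clone>(
    positives: &[T],
    negatives: &[T],
    batch_size: usize,
) -> Option<Vec<T>> {
    let need_neg = min_negatives(batch_size);
    let need_pos = batch_size - need_neg;
    if negatives.len() < need_neg || positives.len() < need_pos {
        return None;
    }
    let mut batch = negatives[..need_neg].to_vec();
    batch.extend_from_slice(&positives[..need_pos]);
    Some(batch)
}
```

Using the parse score as a multiplier rather than one term in a weighted sum is what makes the gate hard: no amount of test or coverage credit can offset a parse failure.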

Wave 9 — Constrained Inference + GRPO (Days 49-65)

Create crates/vox-constrained-gen/
Define ConstrainedSampler trait
Implement Earley parser backend consuming EBNF grammar
Implement PDA context-independent token cache (for sub-40µs latency overhead)
Implement deadlock watchdog and VoxValidationError
Implement Stream of Revision <REVISE> backtrack tokens
Wire into vox populi serve
Wire into vox_generate_code MCP tool
Wire into vox_speech_to_code MCP tool
Wire into PlanBridge::plan_to_descriptors
Add standalone validation MCP tool
Create vox-tensor/src/grpo.rs
Implement Gated Reward Function (Syntax must be a multiplier)
Implement Median-Centered Advantage Computation (MC-GRPO) to prevent sign flip
Implement DAPO asymmetric clip bounds
Implement generate_k_candidates (k=8)
Hard corpus gate: Refuse GRPO launch if corpus < 1000 pairs
Export vox mens train --mode grpo
Write tests: Advantage sign stability, parser constraints
Integration tests: 100% parse rate on constrained generation
Update training SSOT tracking tables
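
The advantage and clipping steps in this wave could be sketched as follows. This shows only the centering step (reward minus group median) and an asymmetric clipped surrogate; any scale normalization, the token-level loss, and the actual vox-tensor API are out of scope, and every function name here is hypothetical. The 1000-pair corpus gate is taken from the plan.

```rust
/// Median of a slice, or `None` when it is empty.
pub fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Advantage of each candidate relative to the group median. One outlier
/// reward cannot drag the baseline above the bulk of the group the way a
/// mean baseline can, which is the sign flip the plan refers to.
pub fn median_centered_advantages(rewards: &[f64]) -> Vec<f64> {
    match median(rewards) {
        Some(m) => rewards.iter().map(|r| r - m).collect(),
        None => Vec::new(),
    }
}

/// Asymmetric clip of the policy ratio: separate lower and upper bounds.
pub fn clip_ratio(ratio: f64, eps_low: f64, eps_high: f64) -> f64 {
    ratio.clamp(1.0 - eps_low, 1.0 + eps_high)
}

/// Clipped surrogate for one sample: the smaller of the unclipped and
/// clipped terms.
pub fn clipped_objective(ratio: f64, advantage: f64, eps_low: f64, eps_high: f64) -> f64 {
    (ratio * advantage).min(clip_ratio(ratio, eps_low, eps_high) * advantage)
}

/// Hard corpus gate from the plan: refuse to launch below 1000 pairs.
pub fn can_launch_grpo(corpus_pairs: usize) -> bool {
    corpus_pairs >= 1000
}
```

As an illustration with k=8 rewards [0, 0, 0, 0, 0, 0.1, 0.1, 1.0], the mean is 0.15, so a mean baseline would give the two 0.1 candidates a negative advantage even though they beat most of the group. The median is 0, so their advantage stays positive.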

Wave 10 — Multi-Agent Context & Handoff (Days 66-70)

Define ContextEnvelope struct
Implement OBO token generation
Strip raw transcripts from handoff; enforce scoped task definitions only
Implement CRAG retrieval gateway evaluator
Implement async memory distillation worker
Tests: Cross-agent privacy checks

Wave 11 — Language Syntax K-Complexity (Long Term)

K-complexity audit vs Rust/Zig
Implement ? operator for Result unwrapping
Implement return type inference
Implement _ discard pattern
Define Vox IR JSON schema (vox-ir.v1.schema.json)
Implement vox emit-ir and vox compile-ir
Write corresponding compiler tests

Wave 12 — Testing Infrastructure

test block syntax in parser
Compile-time stripping of test blocks
vox test CLI subcommand
LSP CodeLens for test blocks
Snapshot testing infrastructure via .snap
@forall property-based testing and @spec wiring
Parser roundtrip property tests

Wave 13 — Cost Defense & Mesh

Circuit breakers: Hard per-task 300s timeout
Anti-loops: max 3 attempts/day
Daily kill switch & 80% spend warning
Model pinning guards
Cascade routing matrix
Hardware amortization routing switch
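
The first three guards in this wave reduce to a few comparisons, sketched below. The 300s timeout, the limit of 3 attempts per day, and the 80% warning level come from the list above. The function names, the BudgetState enum, and tracking spend in integer cents are assumptions.

```rust
use std::time::Duration;

/// Hard per-task timeout from the plan.
pub const TASK_TIMEOUT: Duration = Duration::from_secs(300);
/// Anti-loop limit from the plan.
pub const MAX_ATTEMPTS_PER_DAY: u32 = 3;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BudgetState {
    Ok,
    Warning, // at or above 80% of the daily limit
    Killed,  // daily limit reached: kill switch engaged
}

/// Classifies the current spend against the daily limit. Integer cents
/// avoid floating-point rounding around the 80% boundary.
pub fn budget_state(spent_cents: u64, daily_limit_cents: u64) -> BudgetState {
    if spent_cents >= daily_limit_cents {
        BudgetState::Killed
    } else if spent_cents.saturating_mul(10) >= daily_limit_cents.saturating_mul(8) {
        BudgetState::Warning
    } else {
        BudgetState::Ok
    }
}

/// True once a task has run for the full timeout or longer.
pub fn timed_out(elapsed: Duration) -> bool {
    elapsed >= TASK_TIMEOUT
}

/// True while the task still has attempts left today.
pub fn attempt_allowed(attempts_today: u32) -> bool {
    attempts_today < MAX_ATTEMPTS_PER_DAY
}
```

With a daily limit of 100 cents, a spend of 79 is Ok, 80 is Warning, and 100 is Killed.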

Wave 14 — CI Gates & Data Ops (Tasks 206 - 270+)

vox ci grammar-drift
vox ci mens-corpus-health
vox ci grpo-reward-baseline
vox ci collateral-damage
vox ci constrained-gen-smoke
vox ci k-complexity-budget
Integrate metrics and reporting for visualizer_sink
Reassign plan_has_verification_hint dependencies ... (continues by mapping all remaining telemetry integrations from the legacy 254-task list)

Reading Order

Follow this plan precisely, wave by wave. Run every test for a wave before moving on to the next, and work down the task list in order.
