"Vox Agentic Loop Overhaul + MENS Syntax-Intelligence Blueprint"

Research completed: 2026-04-05

Two interlocked workstreams:

  1. Agentic Loop — Observe → Orient → Plan → Act → Verify (OOPAV)
  2. MENS Syntax Intelligence — Grammar-aware training, constrained inference, MCP pre-emit validation

Part 0 — Gap & Limitation Audit (20 Gaps)

| #    | Gap | Evidence location |
|------|-----|-------------------|
| G-01 | No Observer role: nothing watches the environment between steps | orchestrator/agent_lifecycle.rs, planning/mod.rs |
| G-02 | Completeness declared too early: cargo check only, no cargo test or Vox parse-rate gate | validation.rs:161-183 |
| G-03 | Testing decision hard-wired: heavy_without_test_hint is a soft penalty, never blocks | plan_adequacy.rs:321 |
| G-04 | Plan complexity is a word-count heuristic: caps at 9, under-detects complex refactors | plan_adequacy.rs:48-58 |
| G-05 | Socrates gate is post-hoc: scoring happens after the LLM commits, not before | socrates.rs |
| G-06 | HarnessGate.independent_verification is always false | harness.rs:244-250 |
| G-07 | QARouter::answer() discards the answer (_answer: &str is unused) | qa.rs:55 |
| G-08 | No autonomic replan trigger: only user-driven via vox_replan | planning/replan.rs |
| G-09 | Scaling ignores observer load / evidence quality | orchestrator/scaling.rs |
| G-10 | Scientia is a publication layer, not a live observation source | vox-scientia-core/src/lib.rs |
| G-11 | MENS corpus has only 340 pairs, 39 negatives | mens/data/metadata.json |
| G-12 | vox_grammar_prompt() is a 27-line hand-written stub | compiler/src/llm_prompt.rs |
| G-13 | golden_validated.jsonl is 60 bytes (empty) | mens/data/golden_validated.jsonl |
| G-14 | No grammar-constrained decoding at inference | inference_and_serving.md |
| G-15 | vox-eval uses regex, not the real parser | vox_eval_crate.md |
| G-16 | No GRPO/RLVR training loop: SFT only | training_orchestration.md |
| G-17 | MCP code emit has no pre-validation before file write | vox-mcp/ |
| G-18 | vox_schola_submit failures not converted to negative examples | MCP tool vox_schola_submit |
| G-19 | plan_has_verification_hint ignores file manifests | plan_adequacy.rs:259-271 |
| G-20 | fatigue_active penalty never propagated to planner thresholds | socrates.rs:271-276 |

Part 1 — OOPAV Loop Architecture

+----------------------------------------------------------+
|                 OOPAV Agent Execution Loop               |
|                                                          |
|  +----------+  evidence   +-----------+  risk band       |
|  | OBSERVE  |-----------> |  ORIENT   |--------->        |
|  |(Scientia)|             | (Socrates)|                  |
|  +-----^----+             +-----+-----+                  |
|        | watch                  | plan-or-act            |
|  +-----+----+             +-----v-----+                  |
|  |  VERIFY  |<-- result --|   PLAN    |                  |
|  |(Harness) |             | (Planner) |                  |
|  +-----+----+             +-----+-----+                  |
|        | pass/fail          dispatch                     |
|  +-----v----+             +-----v-----+                  |
|  | complete |             |    ACT    |                  |
|  |  or      |             |(Builder + |                  |
|  | re-plan  |             |  MENS)    |                  |
|  +----------+             +-----------+                  |
+----------------------------------------------------------+

Testing Decision Policy

Required    -> security/auth/schema keywords in description
Required    -> .vox file in manifest
Required    -> complexity >= 7 AND file_count > 2
Required    -> orient.risk_band == Red
Recommended -> new fn/type, >20 LOC estimate
Skip        -> docs-only or config-only manifest
Deferred    -> evidence_gap > 0.4
Optional    -> everything else
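
The policy above could be encoded roughly as follows. This is a minimal sketch, not the planned TestDecisionPolicy implementation: the TaskFacts struct, its field names, the Green band, the extension list for "docs-only or config-only", the OR reading of the Recommended rule, and the rule precedence (Required, then Skip, Deferred, Recommended, Optional) are all assumptions made for illustration. The source lists the rules but does not fix their order.

```rust
/// Test decision labels, as named in the policy above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TestDecision {
    Required,
    Recommended,
    Optional,
    Deferred,
    Skip,
}

/// Risk bands; this document names Red and Black, Green is assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RiskBand {
    Green,
    Red,
    Black,
}

/// Hypothetical, flattened view of a task (all field names are assumptions).
#[derive(Clone, Copy)]
pub struct TaskFacts<'a> {
    pub description: &'a str,
    pub manifest: &'a [&'a str],
    pub complexity: u8,
    pub estimated_loc: u32,
    pub adds_fn_or_type: bool,
    pub risk_band: RiskBand,
    pub evidence_gap: f32,
}

/// Applies the rules with an assumed precedence:
/// Required > Skip > Deferred > Recommended > Optional.
/// Black-band tasks are assumed to be halted before this point.
pub fn evaluate(t: &TaskFacts) -> TestDecision {
    let desc = t.description.to_lowercase();
    let sensitive = ["security", "auth", "schema"]
        .iter()
        .any(|kw| desc.contains(*kw));
    let has_vox = t.manifest.iter().any(|p| p.ends_with(".vox"));
    let heavy = t.complexity >= 7 && t.manifest.len() > 2;
    if sensitive || has_vox || heavy || t.risk_band == RiskBand::Red {
        return TestDecision::Required;
    }
    // Assumed extension list for "docs-only or config-only".
    let passive = [".md", ".toml", ".json", ".yaml"];
    let docs_or_config_only = !t.manifest.is_empty()
        && t.manifest
            .iter()
            .all(|p| passive.iter().any(|ext| p.ends_with(*ext)));
    if docs_or_config_only {
        return TestDecision::Skip;
    }
    if t.evidence_gap > 0.4 {
        return TestDecision::Deferred;
    }
    if t.adds_fn_or_type || t.estimated_loc > 20 {
        return TestDecision::Recommended;
    }
    TestDecision::Optional
}
```

Under these assumptions, a task whose description mentions an auth migration evaluates to Required regardless of its manifest, while a manifest containing only README.md evaluates to Skip.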

9-Tier Victory Conditions

| Tier | Check | When |
|------|-------|------|
| 1 | TOESTUB: zero stubs | Always |
| 2 | LSP zero errors on .vox write files | Always |
| 3 | cargo check --workspace | Always |
| 4 | cargo test --doc --workspace | WithDocTests or Full |
| 5 | cargo test <filter> | TestDecision::Required |
| 6 | vox corpus eval parse_rate >= 99.5% | Any .vox in manifest |
| 7 | Harness contract satisfaction | Always |
| 8 | Socrates confidence >= answer_threshold | Always |
| 9 | Plan adequacy retrospective >= 0.75 | Full |
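
One way to read the "When" column is as a tier-selection function followed by a run loop that stops at the first failing tier. The sketch below assumes exactly that. TaskContext and its fields are hypothetical, the verdict omits the report field, and stopping at the first failure is an assumption suggested by the first_failure field rather than something the source states.

```rust
/// Victory conditions as enumerated in Wave 0.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VictoryCondition {
    CompilationOnly,
    WithDocTests,
    WithUnitTests,
    WithCorpusValidation,
    Full,
}

/// Hypothetical per-task facts needed to select tiers.
pub struct TaskContext {
    pub has_vox_files: bool,
    pub tests_required: bool,
}

/// Simplified verdict; the `report` field from the plan is omitted here.
#[derive(Debug, PartialEq, Eq)]
pub struct VictoryVerdict {
    pub passed: bool,
    pub tiers_run: Vec<u8>,
    pub first_failure: Option<u8>,
}

/// Selects tier numbers (1..=9) following the "When" column.
pub fn applicable_tiers(cond: VictoryCondition, ctx: &TaskContext) -> Vec<u8> {
    let mut tiers = vec![1, 2, 3]; // tiers 1-3 always run
    if matches!(cond, VictoryCondition::WithDocTests | VictoryCondition::Full) {
        tiers.push(4);
    }
    if ctx.tests_required {
        tiers.push(5);
    }
    if ctx.has_vox_files {
        tiers.push(6);
    }
    tiers.push(7); // harness contracts: always
    tiers.push(8); // Socrates confidence: always
    if cond == VictoryCondition::Full {
        tiers.push(9);
    }
    tiers
}

/// Runs tiers in order and stops at the first failure (assumed behaviour).
/// `run_tier` stands in for the real tier checks.
pub fn evaluate<F: FnMut(u8) -> bool>(tiers: &[u8], mut run_tier: F) -> VictoryVerdict {
    let mut tiers_run = Vec::new();
    for &tier in tiers {
        tiers_run.push(tier);
        if !run_tier(tier) {
            return VictoryVerdict { passed: false, tiers_run, first_failure: Some(tier) };
        }
    }
    VictoryVerdict { passed: true, tiers_run, first_failure: None }
}
```

Passing the tier runner as a closure keeps the sequencing logic testable without invoking cargo or the LSP.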

Part 2 — MENS Syntax Intelligence

Grammar Export Pipeline

vox-compiler/src/parser/
    |  VoxGrammarExporter
    |-> EBNF text       -> docs/grammar/vox.ebnf
    |-> GBNF file       -> llama.cpp --grammar-file
    |-> JSON Schema     -> vox populi serve (constrained JSON mode)
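
To illustrate how one in-memory rule could feed both text targets, here is a small sketch. The Rule and Symbol types, the function names, and the sample rule used in the usage note are hypothetical. The sketch emits ISO-style EBNF ("=", "," for sequence, ";" terminator) and llama.cpp-style GBNF ("::=", whitespace for sequence). It assumes terminals contain no quotes or backslashes, so no escaping is done.

```rust
/// One symbol on the right-hand side of a production (hypothetical model).
pub enum Symbol {
    Terminal(String),
    NonTerminal(String),
}

/// A production: `name` expands to any one of `alternates`.
pub struct Rule {
    pub name: String,
    pub alternates: Vec<Vec<Symbol>>,
}

/// Shared renderer; only the punctuation differs between the two formats.
/// Assumes terminals need no escaping.
fn render(rule: &Rule, assign: &str, seq_sep: &str, end: &str) -> String {
    let alts: Vec<String> = rule
        .alternates
        .iter()
        .map(|alt| {
            alt.iter()
                .map(|s| match s {
                    Symbol::Terminal(t) => format!("\"{}\"", t),
                    Symbol::NonTerminal(n) => n.clone(),
                })
                .collect::<Vec<_>>()
                .join(seq_sep)
        })
        .collect();
    format!("{} {} {}{}", rule.name, assign, alts.join(" | "), end)
}

/// ISO-style EBNF: `name = "a" , b | c ;`
pub fn to_ebnf(rule: &Rule) -> String {
    render(rule, "=", " , ", " ;")
}

/// GBNF as consumed by llama.cpp: `name ::= "a" b | c`
pub fn to_gbnf(rule: &Rule) -> String {
    render(rule, "::=", " ", "")
}
```

For a made-up rule fn-decl with alternates ["fn" ident] and [extern-decl], to_gbnf yields `fn-decl ::= "fn" ident | extern-decl`. A complete GBNF file also needs a rule named root as its entry point.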

Corpus Verification Pipeline

synthetic.jsonl (3.2 MB, unverified)
    |  vox corpus validate-batch
    |-> synthetic_valid.jsonl   -> split=training
    |-> synthetic_invalid.jsonl -> split=negative + correction signal

golden_extracted.jsonl (16 KB)
    |  vox corpus validate-batch
    |-> golden_validated.jsonl  <- currently 60 bytes / EMPTY -> must reach >=500 pairs
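
The split step in both pipelines amounts to partitioning snippets by parser outcome and keeping the error text as the correction signal. A minimal sketch follows, with the parser passed in as a closure because the real Vox parser API is not shown in this document. BatchSplit and the function signature are assumptions, and JSONL decoding is left out.

```rust
/// Result of a validation pass (hypothetical shape).
pub struct BatchSplit {
    /// Snippets that parsed; candidates for split=training.
    pub valid: Vec<String>,
    /// (snippet, parser error) pairs; candidates for split=negative.
    pub invalid: Vec<(String, String)>,
}

impl BatchSplit {
    /// Fraction of snippets that parsed; 0.0 for an empty batch.
    pub fn parse_rate(&self) -> f64 {
        let total = self.valid.len() + self.invalid.len();
        if total == 0 {
            0.0
        } else {
            self.valid.len() as f64 / total as f64
        }
    }
}

/// Partitions snippets by the outcome of `parse`.
/// `parse` stands in for the real Vox parser.
pub fn validate_batch<P>(snippets: &[&str], parse: P) -> BatchSplit
where
    P: Fn(&str) -> Result<(), String>,
{
    let mut split = BatchSplit { valid: Vec::new(), invalid: Vec::new() };
    for &snippet in snippets {
        match parse(snippet) {
            Ok(()) => split.valid.push(snippet.to_string()),
            Err(e) => split.invalid.push((snippet.to_string(), e)),
        }
    }
    split
}
```

The resulting parse_rate is the kind of figure that could populate the parse_rate field planned for mens/data/metadata.json.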

GRPO/RLVR Training Loop

for each prompt in training_set:
  candidates = generate_k(prompt, k=8, temperature=0.8)
  rewards = []
  for each candidate in candidates:
    r_syntax   = vox_parser(candidate)         -> 0/1
    r_test     = run @test blocks              -> pass_rate
    r_coverage = ast_eval(candidate).score
    rewards.append(0.6*r_syntax + 0.3*r_test + 0.1*r_coverage)
  advantages = [r - mean(rewards) for r in rewards]   # GRPO group mean baseline
  grpo_update(policy, advantages)
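
The reward and advantage arithmetic in the loop above can be written out directly. The sketch below uses the RewardWeights field names and defaults from Wave 9. The function signatures are assumptions, and the policy update itself is out of scope here.

```rust
/// Reward weights with the defaults given in the plan (0.6, 0.3, 0.1).
pub struct RewardWeights {
    pub parse_weight: f32,
    pub test_weight: f32,
    pub coverage_weight: f32,
}

impl Default for RewardWeights {
    fn default() -> Self {
        Self { parse_weight: 0.6, test_weight: 0.3, coverage_weight: 0.1 }
    }
}

/// Composite reward: weighted sum of syntax (0/1), test pass rate, and coverage.
pub fn composite_reward(w: &RewardWeights, parsed: bool, test_pass_rate: f32, coverage: f32) -> f32 {
    let r_syntax = if parsed { 1.0 } else { 0.0 };
    w.parse_weight * r_syntax + w.test_weight * test_pass_rate + w.coverage_weight * coverage
}

/// Group-relative advantages: each reward minus the mean of its group.
/// Returns an empty vector for an empty group.
pub fn compute_advantages(rewards: &[f32]) -> Vec<f32> {
    if rewards.is_empty() {
        return Vec::new();
    }
    let mean = rewards.iter().sum::<f32>() / rewards.len() as f32;
    rewards.iter().map(|r| r - mean).collect()
}
```

For a group with rewards [1.0, 0.0, 0.5, 0.5] the mean is 0.5, so the advantages are [0.5, -0.5, 0.0, 0.0]. Many GRPO formulations additionally divide by the group standard deviation. The plan here specifies only the mean baseline, so that normalisation is left out.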

MCP Pre-Emit Validation

vox_generate_code   -> mcp_pre_emit_validate("vox")
vox_speech_to_code  -> mcp_pre_emit_validate("vox")
PlanBridge step     -> mcp_pre_emit_validate("vox")
                             |
             parse OK?  -> write file
             parse ERR? -> VoxValidationError -> LLM retries
                        -> invalid snippet -> auto_ingest_negative(corpus)
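
The control flow above (validate, then either write or reject and record a negative example) can be sketched as below. This is illustrative only. A brace-balance check stands in for the real Vox parser, VoxValidationError is reduced to two of its planned fields, and the pre_emit function with its two callbacks is a hypothetical shape rather than the planned mcp_pre_emit_validate signature.

```rust
/// Reduced form of the planned error type (span and suggested_correction omitted).
#[derive(Debug, Clone, PartialEq)]
pub struct VoxValidationError {
    pub code: String,
    pub message: String,
}

/// Stand-in for the real parser: only checks that braces balance.
fn check_braces(code: &str) -> Result<(), VoxValidationError> {
    let mut depth: i32 = 0;
    for ch in code.chars() {
        match ch {
            '{' => depth += 1,
            '}' => depth -= 1,
            _ => {}
        }
        if depth < 0 {
            break; // closing brace without a matching opener
        }
    }
    if depth == 0 {
        Ok(())
    } else {
        Err(VoxValidationError {
            code: "E_BRACE".to_string(), // hypothetical error code
            message: "unbalanced braces".to_string(),
        })
    }
}

/// Validates before emitting: writes only on success; on failure the snippet
/// is handed to the negative-example sink and the error goes back to the caller.
pub fn pre_emit<W, N>(code: &str, mut write_file: W, mut ingest_negative: N) -> Result<(), VoxValidationError>
where
    W: FnMut(&str),
    N: FnMut(&str, &VoxValidationError),
{
    match check_braces(code) {
        Ok(()) => {
            write_file(code);
            Ok(())
        }
        Err(e) => {
            ingest_negative(code, &e);
            Err(e)
        }
    }
}
```

Returning the error to the caller is what lets the LLM retry, while the ingest callback is where a corpus writer such as auto_ingest_negative would plug in.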

Part 3 — Implementation Waves (254 Tasks)


Wave 0 — Foundations & Schema (Days 1-3)

  1. Define ObservationReport struct in vox-orchestrator/src/observer.rs
  2. Define ObserverAction enum: Continue, RequestMoreEvidence, TriggerReplan, EscalateToHuman, EmitNegativeExample
  3. Add observer_enabled, observer_poll_interval_ms to OrchestratorConfig
  4. Define TestDecision enum: Required, Recommended, Optional, Deferred, Skip
  5. Define TestDecisionPolicy struct with threshold, keyword, and extension fields
  6. Add test_decision_policy: TestDecisionPolicy to OrchestratorConfig
  7. Define VictoryCondition enum: CompilationOnly, WithDocTests, WithUnitTests, WithCorpusValidation, Full
  8. Add victory_condition: VictoryCondition to AgentTask
  9. Create crates/vox-grammar-export/ with Cargo.toml and src/lib.rs
  10. Define GrammarFormat, GrammarExportConfig, GrammarExportResult
  11. Add Arca migration V38: observer_events table
  12. Add Arca migration V38: test_decisions table
  13. Add Arca migration V38: victory_verdicts table
  14. Add Arca migration V38: mens_corpus_quality table
  15. Add Arca migration V38: grpo_training_run table
  16. Write Arca CRUD: insert_observer_event, list_observer_events_for_task, insert_test_decision, insert_victory_verdict, upsert_corpus_quality, insert_grpo_step
  17. Add all five tables to Codex facade
  18. Write unit tests for all CRUD methods (min 2 tests each)
  19. Run vox ci clavis-parity and vox stub-check --path crates/vox-grammar-export
  20. Confirm zero stubs in Wave 0 deliverables

Wave 1 — Grammar Export from Compiler (Days 4-7)

  1. Audit crates/vox-compiler/src/parser/ — catalog all production rules; write docs/src/architecture/vox-grammar-production-rules.md
  2. Create vox-grammar-export/src/ebnf.rs — EBNF emitter
  3. Implement EbnfEmitter::emit_rule(name, alternates, terminals)
  4. Implement EbnfEmitter::emit_all() — covers all top-level Vox rules
  5. Create vox-grammar-export/src/gbnf.rs — GBNF emitter for llama.cpp
  6. Implement GbnfEmitter::from_ebnf(ebnf) -> GbnfDocument
  7. Handle all Vox keywords in GBNF output
  8. Implement GbnfEmitter::emit_string() -> String
  9. Create vox-grammar-export/src/json_schema.rs — AST JSON Schema emitter
  10. Define VoxAstNode JSON schema recursively
  11. Expose vox grammar export --format ebnf|gbnf|json-schema --output <file> CLI
  12. Expose vox_grammar_export(format) MCP tool
  13. Write vox-grammar-export/src/versioning.rs — semver embedding + drift check
  14. Replace vox_grammar_prompt() stub with derived cheatsheet from real grammar
  15. Write tests: emitted EBNF structural validity
  16. Write tests: 10 known-valid programs accepted by the GBNF
  17. Write tests: 5 known-invalid programs rejected by the GBNF
  18. Add vox ci grammar-export-check CI step
  19. Add grammar_export_path to MensTrainingConfig
  20. Run vox stub-check --path crates/vox-grammar-export; full test suite

Wave 2 — Observer Sub-Agent (Days 8-12)

  1. Create vox-orchestrator/src/observer.rs: Observer struct
  2. Implement Observer::observe_file(path) -> ObservationReport
  3. Implement Observer::observe_rust_file(path) -> ObservationReport
  4. Implement Observer::start_watching(file_paths) -> JoinHandle
  5. Implement Observer::drain_reports() -> Vec<ObservationReport>
  6. Add observer: Option<Arc<Observer>> to Orchestrator
  7. Wire Observer startup into Orchestrator::spawn_agent
  8. Wire Observer shutdown into Orchestrator::retire_agent
  9. Emit VisualizerEventKind::ObservationRecorded from viz_sink
  10. Implement Observer::compute_action(report, policy) -> ObserverAction
  11. Add observation_history: VecDeque<ObservationReport> (cap 20) to AgentTask
  12. Feed ObservationReport into Arca observer_events
  13. Implement Observer::summarize(task_id) -> ObservationSummary
  14. Add observation_summary: Option<ObservationSummary> to CompletionAttestation
  15. Write unit tests: compute_action correctness
  16. Write integration test: Observer on known-bad .vox -> errors within 2 polls
  17. Write integration test: Observer on .rs with todo!() -> EmitNegativeExample
  18. Write tests: summarize computes parse_rate trend from 3 sequential reports
  19. Expose vox_observer_status(task_id) MCP tool
  20. Run vox stub-check, cargo test -p vox-orchestrator
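
A sketch of what compute_action and the capped observation history might look like follows. The ObserverAction variants come from Wave 0. The ObservationReport fields, the ObserverPolicy thresholds, and the rule order are assumptions chosen to match the tests listed above (a todo!() stub yields EmitNegativeExample).

```rust
use std::collections::VecDeque;

/// Actions as enumerated in Wave 0.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ObserverAction {
    Continue,
    RequestMoreEvidence,
    TriggerReplan,
    EscalateToHuman,
    EmitNegativeExample,
}

/// Hypothetical minimal report; the real struct will carry more fields.
pub struct ObservationReport {
    pub parse_errors: u32,
    pub stub_markers: u32, // e.g. todo!() occurrences
    pub evidence_gap: f32,
}

/// Hypothetical thresholds.
pub struct ObserverPolicy {
    pub max_parse_errors: u32,
    pub evidence_gap_threshold: f32,
}

/// Maps a report to an action; the rule order is an assumption.
/// EscalateToHuman is not produced by this simplified rule set.
pub fn compute_action(r: &ObservationReport, p: &ObserverPolicy) -> ObserverAction {
    if r.stub_markers > 0 {
        return ObserverAction::EmitNegativeExample;
    }
    if r.parse_errors > p.max_parse_errors {
        return ObserverAction::TriggerReplan;
    }
    if r.evidence_gap > p.evidence_gap_threshold {
        return ObserverAction::RequestMoreEvidence;
    }
    ObserverAction::Continue
}

/// Appends a report, evicting the oldest entries so at most `cap` remain.
pub fn push_bounded(history: &mut VecDeque<ObservationReport>, report: ObservationReport, cap: usize) {
    if cap == 0 {
        return;
    }
    while history.len() >= cap {
        history.pop_front();
    }
    history.push_back(report);
}
```

With cap 20, pushing 25 reports leaves the most recent 20 in the history, which is the window a summarize call could use for its parse_rate trend.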

Wave 3 — Orient Phase & Enhanced Socrates (Days 13-17)

  1. Define OrientReport { evidence_gap, missing_namespaces, recommended_retrieval, risk_band, planning_complexity_multiplier }
  2. Implement orient_phase(ctx, policy) -> OrientReport
  3. Add evidence_gap_threshold to ConfidencePolicy
  4. Implement OrientPhase::request_missing_evidence(gap) -> Vec<SearchResult>
  5. Add orient_report: Option<OrientReport> to SocratesTaskContext
  6. Integrate orient_phase() into runtime.rs before each LLM inference request
  7. Wire risk_band: Red -> block act; Black -> halt + escalate
  8. Wire planning_complexity_multiplier into PlannerConfig
  9. Implement OrientPhase::propagate_fatigue(fatigue_active, config)
  10. Implement OrientPhase::auto_dispatch_socratic_question(gap) -> CorrelationId
  11. Fix QARouter::answer() — store answer; add get_answer(corr_id) -> Option<String>
  12. Wire answered questions back into SocratesTaskContext
  13. Implement OrientPhase::classify_task_category(description) -> TaskCategory
  14. Write tests: orient_phase with zero evidence -> RequestMoreEvidence
  15. Write tests: propagate_fatigue(true) raises thresholds by >= 2
  16. Write tests: classify_task_category returns Security for auth keywords
  17. Write tests: auto_dispatch_socratic_question creates QARouter entry
  18. Write tests: get_answer() returns stored string
  19. Emit VisualizerEventKind::OrientCompleted { risk_band, evidence_gap }
  20. Run vox stub-check, cargo test -p vox-orchestrator
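
Two of the items above are small enough to sketch: the risk-band gate (item 7) and the QARouter fix that stores answers instead of discarding them (item 11). The types below are hypothetical stand-ins. In particular, CorrelationId is assumed to be a plain integer, the ask method is invented so the example is self-contained, and Green is an assumed band.

```rust
use std::collections::HashMap;

/// Risk bands; this document names Red and Black, Green is assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RiskBand {
    Green,
    Red,
    Black,
}

/// What the runtime does next (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Gate {
    Act,
    BlockUntilEvidence,
    HaltAndEscalate,
}

/// Red blocks Act; Black halts and escalates (per item 7 above).
pub fn gate_for(band: RiskBand) -> Gate {
    match band {
        RiskBand::Green => Gate::Act,
        RiskBand::Red => Gate::BlockUntilEvidence,
        RiskBand::Black => Gate::HaltAndEscalate,
    }
}

/// Assumed representation of a correlation id.
pub type CorrelationId = u64;

/// Minimal router that retains answers keyed by correlation id.
#[derive(Default)]
pub struct QARouter {
    answers: HashMap<CorrelationId, String>,
    next_id: CorrelationId,
}

impl QARouter {
    /// Registers a question and returns its id (hypothetical helper).
    pub fn ask(&mut self) -> CorrelationId {
        let id = self.next_id;
        self.next_id += 1;
        id
    }

    /// Stores the answer rather than dropping it.
    pub fn answer(&mut self, corr_id: CorrelationId, answer: &str) {
        self.answers.insert(corr_id, answer.to_string());
    }

    /// Returns the stored answer, if any.
    pub fn get_answer(&self, corr_id: CorrelationId) -> Option<String> {
        self.answers.get(&corr_id).cloned()
    }
}
```

Once answers are retrievable by id, wiring them back into SocratesTaskContext (item 12) becomes a lookup rather than a redesign.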

Wave 4 — Testing Decision Engine (Days 18-22)

  1. Implement TestDecisionPolicy::evaluate(task, orient) -> TestDecision
  2. Rule: security keywords -> Required
  3. Rule: .vox in manifest -> Required
  4. Rule: complexity >= threshold -> Required
  5. Rule: file_count > threshold -> Recommended
  6. Rule: risk_band Red -> Required
  7. Rule: docs/config only -> Skip
  8. Rule: evidence_gap > 0.4 -> Deferred
  9. Rule: default -> Optional
  10. Persist TestDecision to test_decisions table after every call
  11. Fix plan_has_verification_hint to check file manifests
  12. Promote heavy_without_test_hint to hard blocker test_required_missing
  13. Add test_required_count, test_present_count to PlanAdequacySummary
  14. Score = 0.0 when test_required_count > test_present_count for coding goals
  15. Add TestDecision to TaskDescriptor
  16. PlanBridge: block dispatch if Required and no test file in manifest
  17. Add test_decision_policy to OrchestratorConfig with sane defaults
  18. Write tests: auth migration -> Required
  19. Write tests: markdown-only manifest -> Skip
  20. Write tests: complexity-8 .vox with no test step -> is_too_thin=true, test_required_missing
  21. Write tests: test file in manifest -> plan_has_verification_hint=true
  22. Write tests: PlanBridge blocks Required task with no test file
  23. Expose vox_test_decision(task_id) MCP tool
  24. Update vox plan new CLI to render test decisions per step
  25. Run vox stub-check, full test suite

Wave 5 — Multi-Tier Victory Conditions (Days 23-28)

  1. Create vox-orchestrator/src/victory.rs: VictoryEvaluator
  2. Implement tier1_toestub(task) -> TierResult
  3. Implement tier2_lsp(task) -> TierResult
  4. Implement tier3_cargo_check(task) -> TierResult
  5. Implement tier4_cargo_doc_test(task) -> TierResult (120s timeout)
  6. Implement tier5_cargo_unit_test(task, filter) -> TierResult
  7. Implement tier6_vox_corpus_eval(task) -> TierResult (parse_rate >= 99.5%)
  8. Implement tier7_harness_contracts(task, harness) -> TierResult
  9. Implement tier8_socrates_confidence(task, ctx, policy) -> TierResult
  10. Implement tier9_plan_adequacy_retrospective(task) -> TierResult
  11. Implement VictoryEvaluator::evaluate(task, condition) -> VictoryVerdict
  12. Define VictoryVerdict { passed, tiers_run, first_failure, report }
  13. Replace post_task_validate with VictoryEvaluator::evaluate
  14. Persist every VictoryVerdict to Arca victory_verdicts
  15. Wire passed=false -> TriggerReplan via Observer
  16. Add max_victory_attempts: u32 to AgentTask (default 3)
  17. Emit VisualizerEventKind::VictoryEvaluated
  18. Update AgentHarnessSpec::minimal_contract_first: set independent_verification: true for code tasks
  19. Write tests: tier3 fails on bad Rust
  20. Write tests: tier6 fails on invalid Vox
  21. Write tests: Full passes for clean files + high confidence
  22. Write tests: stub code -> first_failure = TierResult::Toestub
  23. Write tests: max_victory_attempts guard
  24. Expose vox_victory_status(task_id) MCP tool
  25. Run vox stub-check, full test suite

Wave 6 — Dynamic Replan Trigger (Days 29-33)

  1. Add replan_trigger: Option<ReplanTrigger> to AgentTask
  2. Define ReplanTrigger { reason, failed_tier, observer_action, evidence_gaps }
  3. Implement runtime.rs::handle_replan_trigger(task, trigger)
  4. Wire replan result back into orchestrator via PlanBridge
  5. Add replan_count: u32 to AgentTask; fail permanently after max
  6. Implement ReplanScheduler — max 1 replan per 30s per session
  7. Implement ReplanScheduler::should_replan(task) -> bool
  8. Add replan_history: Vec<ReplanRecord> to PlanSession
  9. Define ReplanRecord { version, trigger_reason, previous_score, new_score, created_at }
  10. Emit VisualizerEventKind::ReplanTriggered
  11. Implement ReplanPolicy in planning/policy.rs
  12. Add replan_policy: ReplanPolicy to OrchestratorConfig
  13. Expose vox_replan_status(session_id) MCP tool
  14. Write tests: failed tier3 -> ReplanTrigger created -> replan called
  15. Write tests: ReplanScheduler returns false within cooldown
  16. Write tests: permanent failure after max replans
  17. Write tests: replan_history persisted and retrievable
  18. Write tests: MCP returns correct count and reason
  19. Update vox plan replan CLI
  20. Run full test suite, vox stub-check
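
The cooldown and retry-cap rules (items 5 to 7) could be combined in one scheduler check. The sketch below is an assumption about shape, not the planned API: should_replan here takes a session id, the task's current replan count, and an explicit timestamp so the cooldown can be tested deterministically.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Rate-limits replans per session and enforces a maximum replan count.
pub struct ReplanScheduler {
    cooldown: Duration,
    max_replans: u32,
    last_replan: HashMap<String, Instant>,
}

impl ReplanScheduler {
    pub fn new(cooldown: Duration, max_replans: u32) -> Self {
        Self { cooldown, max_replans, last_replan: HashMap::new() }
    }

    /// Returns true and records `now` if a replan is allowed.
    /// Returns false when the cap is reached or the cooldown has not elapsed.
    pub fn should_replan(&mut self, session_id: &str, replan_count: u32, now: Instant) -> bool {
        if replan_count >= self.max_replans {
            return false; // caller should fail the task permanently
        }
        if let Some(prev) = self.last_replan.get(session_id) {
            if now.duration_since(*prev) < self.cooldown {
                return false; // still inside the per-session cooldown
            }
        }
        self.last_replan.insert(session_id.to_string(), now);
        true
    }
}
```

With a 30-second cooldown, a second request 10 seconds after an accepted replan is refused, and one at 31 seconds is accepted. Injecting the clock this way avoids sleeping in the cooldown test from item 15.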

Wave 7 — Scientia as Live Observer Feed (Days 34-38)

  1. Audit vox-scientia-* crates; write docs/src/architecture/scientia-surface-audit.md
  2. Define ScientiaObservation { session_id, source_path, worthiness_score, construct_coverage, citation_count, recommended_for_corpus, reason }
  3. Implement ScientiaObserver::observe_session(session_id) -> ScientiaObservation
  4. Implement ScientiaObserver::recommend_corpus_ingestion(obs) -> bool
  5. Wire into Observer::observe_file for .vox files
  6. Set EmitNegativeExample when worthiness_score < 0.3
  7. Implement ScientiaObserver::auto_ingest_to_mens(obs, codex) -> split=training row
  8. Implement ScientiaObserver::auto_ingest_negative(path, error, codex) -> split=negative row
  9. Wire into handle_replan_trigger — replans >= max/2 emit negatives
  10. Add scientia_observation: Option<ScientiaObservation> to ObservationReport
  11. Expose vox_scientia_observe(session_id) MCP tool
  12. Add vox scientia observe --session <id> CLI subcommand
  13. Write tests: recommend_corpus_ingestion true for valid snippet with 3 constructs
  14. Write tests: auto_ingest_to_mens inserts training row
  15. Write tests: auto_ingest_negative inserts negative row
  16. Write tests: full pipeline — Observer -> Scientia -> corpus row
  17. Emit VisualizerEventKind::ScientiaObserved
  18. Expose in VS Code extension telemetry push
  19. Update governance.md
  20. Run full test suite, vox stub-check

Wave 8 — MENS Corpus Surgery & AST-Eval Upgrade (Days 39-46)

  1. Write vox-corpus/src/validate_batch.rs — batch parse validation
  2. Run validate-batch on synthetic.jsonl -> synthetic_valid.jsonl + synthetic_invalid.jsonl
  3. Run validate-batch on golden_extracted.jsonl -> populate golden_validated.jsonl
  4. Update mens/data/metadata.json with parse_rate, last_validated_at, validator_version
  5. Implement vox-eval/src/ast_eval.rs: ast_eval(code) -> AstEvalReport using real parser
  6. Define AstEvalReport { parse_success, node_count, max_depth, construct_histogram, type_annotation_rate, has_tests, error_span }
  7. Implement AstEvalReport::coverage_score() — weighted composite
  8. Update vox-eval/src/lib.rs — re-export ast_eval; #[deprecated] on detect_constructs
  9. Update construct_coverage_score(code) to delegate to AST eval
  10. Update vox eval --mode ast CI integration
  11. Upgrade vox corpus eval to AST engine
  12. Define RewardSignal { parse_score, test_score, coverage_score, composite } in vox-tensor/src/data.rs
  13. Implement reward_signal_for_pair(pair) -> RewardSignal
  14. Add reward_signal: Option<RewardSignal> to TrainingPair
  15. Update JsonlDataLoader to compute RewardSignal during loading
  16. Add avg_reward_signal per split to metadata.json
  17. Add vox corpus quality-report CLI command
  18. Add mens/schemas/corpus_quality_record.schema.json
  19. MILESTONE GATE: golden_validated.jsonl >= 500 pairs required before Wave 9
  20. Write tests: ast_eval on valid Vox function -> parse_success=true
  21. Write tests: ast_eval on invalid snippet -> parse_success=false, non-None error_span
  22. Write tests: reward_signal_for_pair -> composite >= 0.8 for well-formed pair with tests
  23. Write tests: validate_batch correctly separates mixed JSONL
  24. Run vox stub-check --path crates/vox-eval, cargo test -p vox-eval
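
Item 7 calls coverage_score a weighted composite without fixing the weights. The sketch below uses the AstEvalReport fields from item 6 and picks purely illustrative weights (0.4 construct variety, 0.3 type annotations, 0.3 test presence) and an illustrative saturation point of five distinct constructs. None of these numbers come from the source.

```rust
use std::collections::HashMap;

/// Report fields as listed in item 6; the span representation is assumed.
pub struct AstEvalReport {
    pub parse_success: bool,
    pub node_count: usize,
    pub max_depth: usize,
    pub construct_histogram: HashMap<String, usize>,
    pub type_annotation_rate: f32,
    pub has_tests: bool,
    pub error_span: Option<(usize, usize)>,
}

impl AstEvalReport {
    /// Weighted composite in [0, 1]. Weights and the saturation point of
    /// five distinct constructs are illustrative assumptions.
    pub fn coverage_score(&self) -> f32 {
        if !self.parse_success {
            return 0.0; // unparseable code earns no coverage credit
        }
        let variety = (self.construct_histogram.len() as f32 / 5.0).min(1.0);
        let annotations = self.type_annotation_rate.clamp(0.0, 1.0);
        let tests = if self.has_tests { 1.0 } else { 0.0 };
        0.4 * variety + 0.3 * annotations + 0.3 * tests
    }
}
```

Gating the whole score on parse_success keeps the coverage term from rewarding code that the syntax term already scores as zero.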

Wave 9 — Constrained Inference + GRPO Loop + MCP Pre-Emit (Days 47-60)

  1. Create crates/vox-constrained-gen/ — grammar-constrained token sampling
  2. Implement ConstrainedSampler::from_gbnf(gbnf_text) -> ConstrainedSampler (FSA from Wave 1 GBNF)
  3. Implement ConstrainedSampler::mask_logits(logits, state) -> FsaState
  4. Integrate into vox populi serve via ?grammar=vox or X-Vox-Grammar: true
  5. Add constrained_generation: bool to MensServeConfig
  6. Implement fallback: grammar deadlock -> VoxValidationError, request retry
  7. Create vox-constrained-gen/src/llguidance_bridge.rs (optional feature-gated)
  8. Define VoxValidationError { code, span, message, suggested_correction } in vox-compiler/src/error.rs
  9. Implement mcp_pre_emit_validate(code, format) -> Result<(), VoxValidationError> in vox-mcp/src/code_validator.rs
  10. Wire into vox_generate_code MCP tool
  11. Wire into vox_speech_to_code MCP tool
  12. Wire into PlanBridge::plan_to_descriptors for .vox steps
  13. Implement Rust pre-emit: parse-only rustc subprocess on temp file (stable rustc has no --parse-only flag; parse-only mode is an unstable -Z option)
  14. Add vox_validate_code(code, language) -> { valid, errors } standalone MCP tool
  15. Implement MensGrpoTrainer::train_grpo(config, data) -> GrpoTrainingResult in vox-tensor/src/grpo.rs
  16. Define GrpoConfig { k_samples, temperature, reward_weights, policy_lr, clip_epsilon, max_steps }
  17. Define RewardWeights { parse_weight, test_weight, coverage_weight } defaults (0.6, 0.3, 0.1)
  18. Implement generate_k_candidates(prompt, model, k) -> Vec<String>
  19. Implement score_candidate(candidate) -> RewardSignal
  20. Implement compute_advantages(rewards) -> Vec<f32> (group mean baseline)
  21. Implement policy_gradient_update(model, candidates, advantages) (PPO-clip style)
  22. Expose vox mens train --mode grpo CLI flag
  23. Expose --k 8 --reward parse:0.6,test:0.3,coverage:0.1 arguments
  24. Add GRPO telemetry: group_rewards, mean_reward, policy_loss, clip_fraction per step
  25. Persist to Arca grpo_training_run table
  26. Define GrpoTrainingResult { steps_completed, final_mean_reward, parse_rate, checkpoint_path }
  27. Fix G-18: vox_schola_submit failures -> auto_ingest_negative
  28. Add vox mens eval --mode grpo-reward (dry-run)
  29. Add mens/config/grpo_default.toml (k=8, temp=0.8, max_steps=500)
  30. Write tests: compute_advantages correctness
  31. Write tests: constrained sampler produces only grammar-accepted tokens
  32. Write tests: mcp_pre_emit_validate -> error for missing closing }
  33. Write tests: mcp_pre_emit_validate -> Ok(()) for valid function
  34. Write tests: vox_validate_code -> errors for invalid Rust
  35. Write tests: GRPO loop completes 10 steps without panic on RTX 4080 SUPER
  36. Write tests: train --mode grpo -> checkpoint with final_mean_reward > 0.5
  37. Integration test: constrained generation -> 100% parse rate on 50 generations
  38. Integration test: invalid snippet via MCP -> VoxValidationError, no file written
  39. Integration test: GRPO model vs SFT baseline -> >= 5pp parse rate improvement
  40. Run vox stub-check --path crates/vox-constrained-gen crates/vox-mcp, cargo test --workspace
  41. Update docs/src/architecture/mens-training-ssot.md
  42. Update examples/STYLE.md
  43. Add vox ci grammar-constrained-gen-smoke-test
  44. Add vox ci mens-corpus-health
  45. Add vox ci grpo-reward-baseline
  46. Persist all CI results to Arca for trend analysis
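
The core of grammar-constrained sampling (items 2, 3 and 6) is masking every logit whose token the grammar does not allow in the current state, and treating an empty allowed set as a deadlock. The sketch below simplifies the planned mask_logits(logits, state) signature: it takes the allowed token ids directly instead of an FSA state, and GrammarDeadlock and argmax are hypothetical helpers.

```rust
/// Raised when the grammar allows no token in the current state.
#[derive(Debug, PartialEq, Eq)]
pub struct GrammarDeadlock;

/// Sets every disallowed logit to negative infinity so it cannot be sampled.
/// Leaves `logits` untouched and returns an error if nothing is allowed.
pub fn mask_logits(logits: &mut [f32], allowed: &[usize]) -> Result<(), GrammarDeadlock> {
    let mut keep = vec![false; logits.len()];
    for &id in allowed {
        if id < keep.len() {
            keep[id] = true; // ignore ids outside the vocabulary
        }
    }
    if !keep.iter().any(|&k| k) {
        return Err(GrammarDeadlock);
    }
    for (logit, k) in logits.iter_mut().zip(keep) {
        if !k {
            *logit = f32::NEG_INFINITY;
        }
    }
    Ok(())
}

/// Greedy pick over the unmasked logits; None if every entry is masked.
pub fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v == f32::NEG_INFINITY {
            continue;
        }
        if best.map_or(true, |(_, b)| v > b) {
            best = Some((i, v));
        }
    }
    best.map(|(i, _)| i)
}
```

With logits [2.0, 5.0, 1.0] and allowed ids {0, 2}, token 1 is masked even though it has the highest raw logit, so the greedy pick becomes token 0. The deadlock error is the point where item 6's fallback to VoxValidationError and a retry would hook in.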

Part 4 — Observability & Telemetry (241-245)

  1. Add ObservationReport to VS Code extension push-telemetry stream
  2. Color-code agent viz nodes by OrientReport.risk_band
  3. Add VictoryVerdict tier summary panel to workflow visualizer
  4. Add TestDecision badge to each task card
  5. Add RewardSignal.composite sparkline to MENS training progress panel

Part 5 — Documentation (246-254)

  1. Write docs/src/architecture/oopav-loop.md
  2. Write docs/src/architecture/observer-design.md
  3. Write docs/src/architecture/victory-conditions.md
  4. Write docs/src/architecture/test-decision-policy.md
  5. Write docs/src/architecture/mens-grammar-intelligence.md
  6. Update docs/src/architecture/mens-training-ssot.md
  7. Update docs/src/contributors/contributor-hub.md
  8. Update AGENTS.md
  9. Update docs/agents/governance.md

Milestone Gates

| After Wave | Gate |
|------------|------|
| 0 | All V38 Arca migrations applied; vox stub-check clean across all new crates |
| 1 | vox grammar export --format gbnf accepted by llama.cpp --grammar-file |
| 2 | Observer: live LSP error detection on modified .vox file integration test passes |
| 3 | Orient phase blocks Red band task from acting without evidence hydration |
| 4 | Complexity-8 .vox task with no test step rejected by PlanBridge |
| 5 | Full VictoryCondition::Full pass on a clean newly-generated Vox crate |
| 6 | Autonomic replan triggered and completed on a simulated tier-3 failure |
| 7 | mens_corpus_quality has >= 500 split=training rows from Scientia auto-ingestion |
| 8 | golden_validated.jsonl >= 500 pairs; AST eval parse_rate >= 99.5% |
| 9 | 100 consecutive constrained-inference generations parse_rate = 100%; GRPO dry-run mean_reward > 0.4 |

Key Design Rationale

GBNF over Outlines/llguidance first: GBNF integrates natively with llama.cpp (already powering the local Populi server). llguidance added as optional bridge for dynamic grammars. Minimizes new dependencies.

AST eval over regex: A bare parse check is binary per sample. AstEvalReport provides a graded signal (construct density, type annotation rate, test presence), enabling richer GRPO reward shaping.

GRPO over PPO: Eliminates the value network (critic), reducing memory ~40%. Critical under the 16 GB VRAM constraint on RTX 4080 SUPER. Group-relative baselines suit code generation's high candidate variance.

Observer separate from Verifier: Verifier is synchronous and post-hoc. Observer is asynchronous and continuous — allows Act to proceed without blocking while still delivering mid-flight course-corrections via TriggerReplan.

MCP pre-emit failures as negative examples: Each failure is high-signal teaching data. Invalid LLM-generated code becomes a structured negative pair (error = correction signal), closing the training loop organically without human annotation.