"Vox corpus lab: mass examples, metrics, and eval harness (research 2026)"

Vox corpus lab: mass examples, metrics, and eval harness (research 2026)

Executive summary

The corpus lab is an evidence pipeline, not a single script:

  • Tier A — Checked-in examples/golden/**/*.vox: CI gate all_golden_vox_examples_parse_and_lower (parse, HIR, WebIR validate, Syntax-K, runtime projection). See Golden examples corpus and examples README.
  • Tier B — Ephemeral, gitignored mass corpus under operator control: seeds, mutations, LLM outputs after validate_generated_vox / full frontend; must not be mdBook-included until promoted to Tier A (AGENTS.md documentation hygiene).
  • Tier C — examples/parser-inventory/: negative fixtures; never mixed into Mens goldens.
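To keep the tier boundaries mechanical rather than conventional, a batch driver could route each fixture by path before doing any work. A minimal sketch, assuming the path prefixes implied by the tiers above (the Tier B fallback for everything operator-local is an assumption):

```python
from enum import Enum
from pathlib import PurePosixPath

class Tier(Enum):
    A = "golden"     # checked-in, CI-gated, the only mdBook-includable tier
    B = "ephemeral"  # gitignored mass corpus under operator control
    C = "negative"   # parser-inventory fixtures, never mixed into goldens

def classify(path: str) -> Tier:
    """Route a .vox fixture to its corpus tier by repository path convention."""
    parts = PurePosixPath(path).parts
    if parts[:2] == ("examples", "golden"):
        return Tier.A
    if parts[:2] == ("examples", "parser-inventory"):
        return Tier.C
    return Tier.B  # everything else is operator-local / gitignored

assert classify("examples/golden/ui/button.vox") is Tier.A
```

Making this a pure function of the path means the mdBook-inclusion rule ("Tier A only") can be enforced by the same check in CI and in local tooling.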

Lanes: Any batch tool should expose at least diagnostics_only (cheap, parse/typecheck payloads) and golden_compatible (matches golden test expectations including WebIR validate). Optional: emit_ir, vox build matrix, screenshot + vision rubric research.
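The lane contract above can be made explicit as a per-lane stage allowlist. A minimal sketch, with illustrative stage names (not the compiler's actual pass names):

```python
# Stages each lane must clear; golden_compatible must match golden test
# expectations, which includes WebIR validate.
REQUIRED_STAGES = {
    "diagnostics_only": ("parse", "typecheck"),
    "golden_compatible": ("parse", "typecheck", "hir_validate", "webir_validate"),
}

def lane_passes(lane: str, stage_results: dict) -> bool:
    """A fixture passes a lane iff every stage that lane requires passed."""
    return all(stage_results.get(stage, False) for stage in REQUIRED_STAGES[lane])

# A fixture that typechecks but fails WebIR validate clears only the cheap lane:
results = {"parse": True, "typecheck": True,
           "hir_validate": True, "webir_validate": False}
assert lane_passes("diagnostics_only", results)
assert not lane_passes("golden_compatible", results)
```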

Strategic pillars (tie-back)

| Pillar | Corpus lab contribution |
| --- | --- |
| Language evidence | Token histograms, diagnostic taxonomies, WebIR lowering summaries, legacy_ast_nodes rate (must stay zero on success path). |
| Behavioral evidence | Optional Vite build, Playwright, screenshot digest + rubric JSON. |
| Model evidence | Same JSONL slice: compiler pass + Mens-served model quality (Mens training reference, Schola serve SSOT). |
| Operational evidence | Cost, wall time, artifact size; align with telemetry trust if persisted. |
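The legacy_ast_nodes requirement from the language-evidence row is easy to check over a batch of per-fixture records. A hypothetical sketch (the field names are assumptions, not a committed schema):

```python
def legacy_ast_rate(rows) -> float:
    """Fraction of compiler-passing fixtures that still emitted legacy AST
    nodes during lowering; the language-evidence pillar requires 0.0 here."""
    passing = [r for r in rows if r.get("compiler_pass")]
    if not passing:
        return 0.0
    dirty = sum(1 for r in passing if r.get("legacy_ast_nodes", 0) > 0)
    return dirty / len(passing)

rows = [
    {"compiler_pass": True, "legacy_ast_nodes": 0},
    {"compiler_pass": True, "legacy_ast_nodes": 2},   # a regression
    {"compiler_pass": False, "legacy_ast_nodes": 5},  # failures don't count
]
assert legacy_ast_rate(rows) == 0.5  # anything above 0.0 should fail the gate
```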

Existing machinery (do not duplicate silently)

| Capability | Pointer |
| --- | --- |
| Full frontend | vox-compiler pipeline.rs — lex, parse, lower, typecheck, HIR validate. |
| MCP check | vox-mcp code_validator check_file diagnostics JSON. |
| Golden gate | vox-compiler tests/golden_vox_examples.rs. |
| IR emission | IR emission SSOT — vox check --emit-ir vs vox build --emit-ir shapes differ. |
| Mens batch gate | Mens training data contract — validate-batch, quarantine. |
| WebIR backlog | Internal Web IR implementation blueprint. |

Generation strategies (research priorities)

  1. Template expansion from Tier A seeds — lowest garbage rate for WebIR stress.
  2. AST-aware mutation after successful parse — use canonicalize_vox for stable diffs.
  3. Parser no-panic corpus expansion — parser_corpus_no_panic.rs style strings; separate metrics bucket from “valid Vox”.
  4. Synthetic JSONL — vox-corpus synthetic_gen; optional emission of .vox files for compiler stats, not only Mens rows.
  5. LLM round-trip — normalize fences (generated_vox.rs), then compiler gate; failures feed trajectory repair lanes when enabled.
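Strategy 1 can stay deliberately simple. A hypothetical sketch of hole-based template expansion over a Tier A seed — the ${name} hole syntax and the seed snippet are inventions for illustration, not Vox syntax:

```python
import itertools
import re

HOLE = re.compile(r"\$\{(\w+)\}")

def expand(seed: str, fills: dict):
    """Yield one variant per combination of hole substitutions.
    Every variant is a candidate for the full frontend gate."""
    names = sorted(set(HOLE.findall(seed)))
    for combo in itertools.product(*(fills[n] for n in names)):
        sub = dict(zip(names, combo))
        yield HOLE.sub(lambda m: sub[m.group(1)], seed)

seed = 'button { label: "${text}", width: ${w} }'
variants = list(expand(seed, {"text": ["Ok", "Cancel"], "w": ["10", "20"]}))
assert len(variants) == 4  # 2 x 2 combinations
```

Because every variant differs from the seed only at typed holes, the garbage rate stays low and failures localize to a specific substitution, which is what makes this the preferred lane for WebIR stress.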

Eval harness (corpus × model)

Sketch for a future eval_report.json (schema to be versioned under contracts/eval/ when implemented):

  • Inputs: corpus_manifest.json (fixture ids, generator, compiler git SHA), optional screenshot_sha256, optional vision_rubric.json.
  • Compiler metrics: pass/fail per lane, WebIR hash, Syntax-K event id or digest if emitted.
  • Model metrics: same prompts run against baseline remote model and Mens-served adapter; record edit distance to canonical surface, parse pass after model edit (oracle loop), token cost if available.
  • Regression: compare Qwen2-loaded vs Qwen3.5-loaded adapters on identical slice (Qwen family research).
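Until the schema is versioned under contracts/eval/, one row of the report might be built like the following. Every field name below is a proposal, not a committed contract:

```python
import hashlib
import json

def eval_row(fixture_id: str, source: str, lane: str,
             compiler_pass: bool, model_metrics: dict = None) -> str:
    """Serialize one per-fixture line of a JSONL-shaped eval report."""
    row = {
        "fixture_id": fixture_id,
        "lane": lane,
        "compiler_pass": compiler_pass,
        # content-address the surface so slices diff by hash, not by bytes
        "source_sha256": hashlib.sha256(source.encode()).hexdigest(),
    }
    if model_metrics:  # e.g. {"edit_distance": 3, "parse_pass_after_edit": True}
        row["model"] = model_metrics
    return json.dumps(row, sort_keys=True)

line = eval_row("golden/button", "button {}", "golden_compatible", True)
assert json.loads(line)["compiler_pass"] is True
```

Keeping the compiler fields and the model fields in the same row is what lets one slice serve both the compiler regression and the Qwen-family adapter comparison.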

Artifact layout (proposal)

Operator-local, gitignored root e.g. .vox/corpus-lab/ (exact name subject to vox ci artifact-audit alignment):

  • runs/<run_id>/manifest.json
  • runs/<run_id>/per-fixture/<id>.diagnostics.json
  • runs/<run_id>/per-fixture/<id>.web_ir.sha256 (full JSON optional)
  • runs/<run_id>/vision/<id>.rubric.json (optional)
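A thin writer can enforce this layout so artifacts never land outside the run root. A sketch under the proposed (not settled) directory names, with flat fixture ids assumed:

```python
import json
from pathlib import Path

def write_run(root, run_id: str, manifest: dict, diagnostics: dict) -> Path:
    """Materialize one run under the proposed gitignored corpus-lab layout."""
    run = Path(root) / "runs" / run_id
    (run / "per-fixture").mkdir(parents=True, exist_ok=True)
    (run / "manifest.json").write_text(json.dumps(manifest, indent=2))
    for fixture_id, diag in diagnostics.items():
        out = run / "per-fixture" / f"{fixture_id}.diagnostics.json"
        out.write_text(json.dumps(diag))
    return run
```

Returning the run directory (rather than mutating globals) keeps the writer usable from both a CLI driver and tests against a temporary root.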

CI posture

  • Default CI: keep golden Tier A; optional nightly Tier B sampling without network.
  • Browser / vision jobs: [self-hosted, linux, x64, browser] per runner contract; behind env flags; no raw image bytes in uploaded CI artifacts without redaction policy.

Open questions

  1. Single CLI owner (vox ci corpus-lab vs vox mens corpus extension) to avoid duplicate batch drivers.
  2. Whether to reuse syntax_k_event schema only or define corpus_lab_event sibling in contracts/eval/.
  3. Windows target/ lock contention policy for parallel batch runs (build environment guidance).