Vox corpus lab: mass examples, metrics, and eval harness (research 2026)
Executive summary
The corpus lab is an evidence pipeline, not a single script:
- Tier A: checked-in `examples/golden/**/*.vox`; CI gate `all_golden_vox_examples_parse_and_lower` (parse, HIR, WebIR validate, Syntax-K, runtime projection). See Golden examples corpus and examples README.
- Tier B: ephemeral, gitignored mass corpus under operator control (seeds, mutations, LLM outputs after `validate_generated_vox` / full frontend); must not be mdBook-included until promoted to Tier A (AGENTS.md documentation hygiene).
- Tier C: `examples/parser-inventory/`, negative fixtures; never mixed into Mens goldens.
Lanes: Any batch tool should expose at least diagnostics_only (cheap, parse/typecheck payloads) and golden_compatible (matches golden test expectations including WebIR validate). Optional: emit_ir, vox build matrix, screenshot + vision rubric research.
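As a minimal sketch of the two required lanes, the snippet below models a batch driver that runs one lane over a set of fixtures. Everything except the lane names is an assumption made for illustration: `LaneResult`, `run_batch`, the `Checker` signature, and the `fake_check` stand-in are hypothetical and do not describe an existing Vox API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Lane(Enum):
    DIAGNOSTICS_ONLY = "diagnostics_only"    # cheap: parse/typecheck payloads only
    GOLDEN_COMPATIBLE = "golden_compatible"  # golden expectations incl. WebIR validate


@dataclass
class LaneResult:
    fixture_id: str
    lane: Lane
    passed: bool
    diagnostics: list[str]


# Hypothetical checker signature: (source, lane) -> list of diagnostic strings.
Checker = Callable[[str, Lane], list[str]]


def run_batch(fixtures: Iterable[tuple[str, str]], lane: Lane, check: Checker) -> list[LaneResult]:
    """Run one lane over (fixture_id, source) pairs; zero diagnostics means pass."""
    results = []
    for fixture_id, source in fixtures:
        diags = check(source, lane)
        results.append(LaneResult(fixture_id, lane, not diags, diags))
    return results


def fake_check(source: str, lane: Lane) -> list[str]:
    """Stand-in for the real compiler call (assumption); flags empty sources only."""
    return ["E0001: empty source"] if not source.strip() else []
```

Passing the checker in as a callable keeps the driver independent of how the frontend is invoked (in-process, CLI, or MCP), which is one way to avoid a second batch driver per entry point.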
Strategic pillars (tie-back)
| Pillar | Corpus lab contribution |
|---|---|
| Language evidence | Token histograms, diagnostic taxonomies, WebIR lowering summaries, legacy_ast_nodes rate (must stay zero on success path). |
| Behavioral evidence | Optional Vite build, Playwright, screenshot digest + rubric JSON. |
| Model evidence | Same JSONL slice: compiler pass + Mens-served model quality (Mens training reference, Schola serve SSOT). |
| Operational evidence | Cost, wall time, artifact size; align with telemetry trust if persisted. |
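The language-evidence row can be illustrated with a small aggregator over per-fixture records. This is a sketch under assumptions: the record shape (`id`, `passed`, `diagnostic_codes`, `legacy_ast_nodes`) and the function name `summarize` are hypothetical, chosen only to show how a diagnostic taxonomy and the zero-`legacy_ast_nodes` invariant might be computed together.

```python
from collections import Counter


def summarize(records: list[dict]) -> dict:
    """Aggregate per-fixture records into a diagnostic taxonomy and an invariant check.

    Assumed record shape (hypothetical):
    {"id": str, "passed": bool, "diagnostic_codes": [str], "legacy_ast_nodes": int}
    """
    taxonomy = Counter()
    violations = []
    for rec in records:
        taxonomy.update(rec.get("diagnostic_codes", []))
        # Invariant from the table: legacy_ast_nodes must stay zero on the success path.
        if rec["passed"] and rec.get("legacy_ast_nodes", 0) != 0:
            violations.append(rec["id"])
    passed = sum(1 for rec in records if rec["passed"])
    return {
        "fixtures": len(records),
        "pass_rate": passed / len(records) if records else 0.0,
        "diagnostic_taxonomy": dict(taxonomy),
        "legacy_ast_violations": violations,
    }
```

Failing fixtures are deliberately excluded from the invariant check, since the table only constrains the success path.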
Existing machinery (do not duplicate silently)
| Capability | Pointer |
|---|---|
| Full frontend | vox-compiler pipeline.rs — lex, parse, lower, typecheck, HIR validate. |
| MCP check | vox-mcp code_validator — check_file diagnostics JSON. |
| Golden gate | vox-compiler tests/golden_vox_examples.rs. |
| IR emission | IR emission SSOT — vox check --emit-ir vs vox build --emit-ir shapes differ. |
| Mens batch gate | Mens training data contract — validate-batch, quarantine. |
| WebIR backlog | Internal Web IR implementation blueprint. |
Generation strategies (research priorities)
- Template expansion from Tier A seeds: lowest garbage rate for WebIR stress.
- AST-aware mutation after successful parse: use `canonicalize_vox` for stable diffs.
- Parser no-panic corpus expansion: `parser_corpus_no_panic.rs`-style strings; separate metrics bucket from “valid Vox”.
- Synthetic JSONL: `vox-corpus` `synthetic_gen`; optional emission of `.vox` files for compiler stats, not only Mens rows.
- LLM round-trip: normalize fences (`generated_vox.rs`), then compiler gate; failures feed trajectory repair lanes when enabled.
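The LLM round-trip strategy above (normalize fences, then gate) might look roughly like the following. This is an illustrative Python sketch, not a port of `generated_vox.rs`: the extraction rule (take the first fenced block, otherwise the whole text) and the names `normalize_fences`, `round_trip`, and `check` are assumptions.

```python
import re

FENCE = "`" * 3  # built indirectly so this example can itself sit inside a fence

# Opening fence with optional info string, lazily captured body, closing fence.
_BLOCK = re.compile(re.escape(FENCE) + r"[^\n]*\n(.*?)" + re.escape(FENCE), re.DOTALL)


def normalize_fences(llm_output: str) -> str:
    """Return the first fenced block's body, or the stripped text if no fence exists."""
    match = _BLOCK.search(llm_output)
    body = match.group(1) if match else llm_output
    return body.strip() + "\n"


def round_trip(llm_output: str, check):
    """Normalize, then run a compiler gate.

    `check` is a hypothetical callable (source -> list of diagnostics) standing in
    for the real frontend; an empty list means the candidate passed the gate.
    """
    source = normalize_fences(llm_output)
    diagnostics = check(source)
    return source, diagnostics
```

Returning the diagnostics alongside the normalized source keeps failures available for a repair lane instead of discarding them at the gate.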
Eval harness (corpus × model)
Sketch for a future eval_report.json (schema to be versioned under contracts/eval/ when implemented):
- Inputs: `corpus_manifest.json` (fixture ids, generator, compiler git SHA), optional `screenshot_sha256`, optional `vision_rubric.json`.
- Compiler metrics: pass/fail per lane, WebIR hash, Syntax-K event id or digest if emitted.
- Model metrics: same prompts run against baseline remote model and Mens-served adapter; record edit distance to canonical surface, parse pass after model edit (oracle loop), token cost if available.
- Regression: compare Qwen2-loaded vs Qwen3.5-loaded adapters on identical slice (Qwen family research).
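To make the model-metrics bullet concrete, here is a hedged sketch that scores each model's output against the canonical surface and serializes a report. The field names, `schema_version` value, and function names are placeholders invented for this example, since the real schema is not yet defined; `parses` is a hypothetical stand-in for the compiler gate.

```python
import json


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling one-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def model_metrics(canonical: str, outputs: dict[str, str], parses) -> dict:
    """Score each model's output for one prompt against the canonical surface.

    `outputs` maps a model label to its text; `parses` is a hypothetical
    callable (source -> bool) representing the oracle-loop parse check.
    """
    return {
        label: {
            "edit_distance": edit_distance(text, canonical),
            "parse_pass": bool(parses(text)),
        }
        for label, text in outputs.items()
    }


def eval_report(run_id: str, compiler_sha: str, per_fixture: dict) -> str:
    """Serialize an illustrative report; keys are placeholders, not a schema."""
    report = {
        "schema_version": 0,
        "run_id": run_id,
        "compiler_git_sha": compiler_sha,
        "fixtures": per_fixture,
    }
    return json.dumps(report, indent=2, sort_keys=True)
```

Because `outputs` is keyed by an arbitrary label, the same function can cover the baseline-versus-adapter comparison and the Qwen2-versus-Qwen3.5 regression on an identical slice.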
Artifact layout (proposal)
Operator-local, gitignored root, e.g. `.vox/corpus-lab/` (exact name subject to `vox ci artifact-audit` alignment):
- `runs/<run_id>/manifest.json`
- `runs/<run_id>/per-fixture/<id>.diagnostics.json`
- `runs/<run_id>/per-fixture/<id>.web_ir.sha256` (full JSON optional)
- `runs/<run_id>/vision/<id>.rubric.json` (optional)
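A writer for the per-fixture part of this layout could be sketched as follows. The paths mirror the proposal; the function name, its return shape, and the canonical JSON form used before hashing (sorted keys, compact separators) are assumptions of this example rather than a decided contract.

```python
import hashlib
import json
from pathlib import Path


def write_fixture_artifacts(root: Path, run_id: str, fixture_id: str,
                            diagnostics: list, web_ir: dict) -> dict:
    """Write per-fixture diagnostics JSON and a WebIR digest under runs/<run_id>/."""
    fixture_dir = root / "runs" / run_id / "per-fixture"
    fixture_dir.mkdir(parents=True, exist_ok=True)

    diag_path = fixture_dir / f"{fixture_id}.diagnostics.json"
    diag_path.write_text(json.dumps(diagnostics, indent=2), encoding="utf-8")

    # Assumed canonical form: sorted keys and fixed separators, so that equal
    # WebIR values always hash to the same digest regardless of key order.
    canonical = json.dumps(web_ir, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    sha_path = fixture_dir / f"{fixture_id}.web_ir.sha256"
    sha_path.write_text(digest + "\n", encoding="utf-8")

    return {"diagnostics": diag_path, "web_ir_sha256": sha_path, "digest": digest}
```

Storing only the digest by default keeps run directories small, while the full WebIR JSON can still be written alongside it when a diff is needed.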
CI posture
- Default CI: keep golden Tier A; optional nightly Tier B sampling without network.
- Browser / vision jobs: `[self-hosted, linux, x64, browser]` per runner contract; behind env flags; no raw image bytes in uploaded CI artifacts without redaction policy.
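For the optional nightly Tier B sampling, one offline-friendly approach is to rank fixtures by a seeded hash so that a given seed always selects the same slice. This is only a sketch: `sample_fixtures` and the idea of using a date string as the seed are assumptions, not an existing CI feature.

```python
import hashlib


def sample_fixtures(fixture_ids: list[str], seed: str, k: int) -> list[str]:
    """Pick up to k fixture ids deterministically by ranking on sha256(seed:id)."""
    def rank(fixture_id: str) -> str:
        return hashlib.sha256(f"{seed}:{fixture_id}".encode("utf-8")).hexdigest()

    # Sorting by the digest is independent of input order, needs no network,
    # and uses no global random state, so reruns select the same fixtures.
    return sorted(fixture_ids, key=rank)[:k]
```

Changing the seed each night rotates coverage across the corpus, while recording the seed in the run manifest would make any failing sample reproducible.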
See also
- GUI, v0/islands, vision, and Mens Qwen — virtuous-cycle implementation plan (2026)
- Mens vision and multimodal inputs (research 2026)
- Mens Qwen family migration (research 2026)
- Compiler IR pipeline
- Vox source → Mens pipeline SSOT
Open questions
- Single CLI owner (`vox ci corpus-lab` vs `vox mens corpus` extension) to avoid duplicate batch drivers.
- Whether to reuse `syntax_k_event` schema only or define `corpus_lab_event` sibling in `contracts/eval/`.
- Windows `target/` lock contention policy for parallel batch runs (build environment guidance).