"Native ML Training Pipeline"

Native ML Training Pipeline

Vox "dogfoods" itself: the language, compiler, and documentation all feed a native machine learning loop that trains the Mens code assistant model.

End-to-end map from .vox sources through goldens and corpus extraction to model inputs: Vox source → Mens pipeline SSOT. Training pair contract: Mens training data contract.

Canonical operator fine-tuning: vox mens train with Candle + qlora-rs on Hugging Face weights. --backend qlora and --tokenizer hf are the defaults; no Python training loop. SSOT: Mens native training. PopuliTrainBackend::BurnLora is rejected at runtime in this dispatch — the supported trainer is CandleQlora.

Legacy / side paths: A Burn + wgpu scratch LoRA stack still lives in vox-tensor (vox training native, small VoxTokenizer model) — no Python, optional CUDA only if you build GPU features for other subsystems. Use it for experimentation, not as a substitute for Mens HF QLoRA. Burn also matters for vox mens merge-weights and vox mens serve on merged .bin checkpoints. Objectives and artifacts differ from Candle QLoRA — see Burn vs QLoRA.

GPUs: For QLoRA on an NVIDIA workstation, build mens-candle-cuda and use vox mens train --device cuda. For Burn scratch training, wgpu (Vulkan / DX12 / Metal) is the default GPU path. Use CPU when drivers or CI forbid GPU.


Architecture

┌─────────────────────────────────────────────────────────────┐
│  DATA SOURCES                                               │
│  golden/**/*.vox + examples.ssot.v1.yaml ──┐                │
│  docs … golden .vox ───┤──► vox mens corpus extract         │
│    (+ prose per mix policy)│         │                      │
│  vox-cli generate-data ───┘         │                       │
└─────────────────────────────────────│───────────────────────┘
                                      ▼
┌─────────────────────────────────────────────────────────────┐
│  CORPUS PIPELINE                                            │
│  mens/data/validated.jsonl   (raw Vox → instruction pairs)│
│        │                                                    │
│        ▼                                                    │
│  vox mens corpus validate    (filter malformed pairs)     │
│        │                                                    │
│        ▼                                                    │
│  mens/data/train.jsonl       (rated + filtered pairs)     │
└─────────────────────────────────────│───────────────────────┘
                                      ▼
┌─────────────────────────────────────────────────────────────┐
│  TRAINING (Mens — canonical)                                │
│                                                             │
│  **`vox mens train`** — Candle + **qlora-rs** QLoRA (default) │
│  `--backend qlora` + `--tokenizer hf` + HF safetensors      │
│  Optional **CUDA** (`mens-candle-cuda`) / **Metal**          │
│  SSOT: `reference/mens-training.md`                         │
│                                                             │
│  Legacy / other: `vox training native` — Burn scratch LoRA  │
│  (`VoxTokenizer` JSONL, wgpu/CPU). Not `vox mens` dispatch.   │
│  `vox train` (mens-dei): local bails → `vox mens train …`   │
└─────────────────────────────────────────────────────────────┘
                                      ▼
┌─────────────────────────────────────────────────────────────┐
│  EVAL + BENCHMARK GATES                                     │
│  vox mens corpus eval … → eval_results.json               │
│  VOX_BENCHMARK=1 → spawns vox mens eval-local (held-out)  │
│  Targets: vox_parse_rate ≥70%, coverage ≥50% (CI); VOX_EVAL_STRICT=1 fails promotion │
│  Held-out: VOX_BENCHMARK=1, VOX_BENCHMARK_MIN_PASS_RATE (default 0) │
└─────────────────────────────────────────────────────────────┘

Data Schema

All training pairs follow this JSONL schema (must match across all tools):

{
  "prompt": "Write a minimal Vox program that prints hello",
  "response": "fn main() {\n    print(\"hello\")\n}\n",
  "category": "function",
  "rating": 5,
  "schema_version": "vox_dogfood_v1"
}
| Field | Type | Required | Description |
|---|---|---|---|
| `prompt` | string | yes | The instruction/question (serde also accepts `instruction`) |
| `response` | string | yes | Valid Vox code (serde also accepts `output`) |
| `category` | string | recommended | Construct type (`function`, `actor`, etc.) |
| `rating` | u8 1-5 | recommended | Quality rating; 5 = ground-truth docs |
| `schema_version` | string | optional | Version for migration tracking |
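The contract above can be sketched as a std-only validation check. This is a hypothetical helper for illustration (`has_required_fields` is not a real pipeline function; the real `vox mens corpus validate` deserializes with serde, which is what accepts the `instruction`/`output` aliases):

```rust
// Hypothetical sketch: a JSONL line satisfies the pair contract when it
// carries a prompt (or its serde alias) and a response (or its alias).
fn has_required_fields(jsonl_line: &str) -> bool {
    let prompt = jsonl_line.contains("\"prompt\"") || jsonl_line.contains("\"instruction\"");
    let response = jsonl_line.contains("\"response\"") || jsonl_line.contains("\"output\"");
    prompt && response
}

fn main() {
    // Alias form is accepted just like the canonical field names.
    let ok = r#"{"instruction":"Write hello","output":"fn main() {}","rating":5}"#;
    let bad = r#"{"prompt":"Write hello","rating":5}"#;
    assert!(has_required_fields(ok));
    assert!(!has_required_fields(bad));
    println!("contract check passed");
}
```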

Tokenizer (training vs compile)

Compile path: source text is lexed by vox-compiler (logos Token enum)—this is unrelated to Mens model vocabulary. See Vox source → Mens pipeline SSOT.

Mens QLoRA path (default): supervised strings are tokenized with the Hugging Face tokenizer for the chosen --model (tens of thousands of BPE tokens). See Mens native training § Tokenization SSOT.

Lab / Burn scratch: vox-tensor exposes a deterministic small VoxTokenizer (not a mirror of the Vox lexer keyword set):

  • 95 printable ASCII characters (IDs 3-97)
  • 35 Vox compound tokens (workflow, actor, fn, @island, etc.)
  • 3 control tokens: [PAD]=0, [UNK]=1, [EOS]=2
  • Total vocab: 133 tokens
// vox:skip
// Vox example — tokenized natively using VoxTokenizer
fn greet(name: str) -> str {
    return "Hello, " + name
}

Encoding uses greedy longest-match on compound tokens before falling back to single chars.
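The greedy longest-match scheme can be sketched in a few lines of std-only Rust. This is an illustrative reimplementation, not the actual vox-tensor code; the compound list here is a subset, and `[UNK]`/ID mapping are omitted:

```rust
// Sketch of VoxTokenizer-style encoding: prefer the longest compound token
// that prefixes the remaining input, otherwise emit a single character.
fn encode(src: &str, compounds: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    let mut rest = src;
    while !rest.is_empty() {
        // Greedy longest-match over the compound vocabulary.
        let hit = compounds
            .iter()
            .filter(|c| rest.starts_with(**c))
            .max_by_key(|c| c.len());
        match hit {
            Some(c) => {
                out.push((*c).to_string());
                rest = &rest[c.len()..];
            }
            None => {
                // Single-character fallback ([UNK] handling omitted).
                let ch = rest.chars().next().unwrap();
                out.push(ch.to_string());
                rest = &rest[ch.len_utf8()..];
            }
        }
    }
    out
}

fn main() {
    let compounds = ["workflow", "actor", "fn", "->", "@island"];
    let toks = encode("fn f() -> str", &compounds);
    // "fn" and "->" match as compound tokens; everything else falls back to chars.
    println!("{toks:?}");
}
```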


VoxTransformer Architecture (Burn scratch path)

The Burn-backed scratch transformer (crates/vox-tensor/src/vox_nn.rs, gpu feature) is trained on VoxTokenizer JSONL and is distinct from the HF QLoRA weights:

| Parameter | Value | Notes |
|---|---|---|
| Layers | 12 | Transformer encoder blocks |
| Attention heads | 8 | Multi-head self-attention |
| Model dimension | 512 | Embedding size |
| FFN dimension | 2048 | Feed-forward inner size |
| Dropout | 0.1 | Applied in attention + FFN |
| Max sequence length | 512 | Tokens per training example |
| Vocab size | 133 | VoxTokenizer vocabulary |
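For a rough sense of scale, these hyperparameters imply a model in the tens of millions of parameters. A back-of-envelope estimate, assuming a standard encoder block with biases, learned positional embeddings, and an untied output head (the actual vox_nn.rs layout may differ):

```rust
// Parameter estimate from the table above; each term corresponds to one
// component of an assumed standard transformer encoder.
fn vox_transformer_param_estimate() -> u64 {
    let (vocab, d, layers, ffn, max_len): (u64, u64, u64, u64, u64) = (133, 512, 12, 2048, 512);
    let embeddings = vocab * d + max_len * d;       // token + positional tables
    let attention = 4 * (d * d + d);                // Q, K, V, O projections + bias
    let feed_forward = d * ffn + ffn + ffn * d + d; // two FFN linears + bias
    let layer_norms = 2 * 2 * d;                    // two LayerNorms (scale, shift)
    let per_layer = attention + feed_forward + layer_norms;
    let head = d * vocab + vocab;                   // output projection to vocab
    embeddings + layers * per_layer + head
}

fn main() {
    // Roughly 38M parameters under these assumptions.
    println!("~{:.1}M parameters", vox_transformer_param_estimate() as f64 / 1e6);
}
```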

Running the Pipeline

1. Generate synthetic training data

vox generate-data --limit 500 --output mens/data/train.jsonl

2. Extract corpus from real Vox files (canonical flow, PowerShell)

.\target\release\vox.exe mens corpus extract examples/golden/ -o mens/data/validated.jsonl
.\target\release\vox.exe mens corpus extract docs/ -o mens/data/validated.jsonl 2>$null
.\target\release\vox.exe mens corpus validate mens/data/validated.jsonl --no-recheck -o mens/data/validated.jsonl
.\target\release\vox.exe mens corpus pairs mens/data/validated.jsonl -o target/dogfood/train.jsonl --docs docs/src/ --docs docs/src/research/ --docs docs/src/adr/
# Rustdoc merge skipped: response is Rust prose, not Vox code

3. Start Mens fine-tuning (canonical — Candle QLoRA, native Rust)

# Build with CUDA for RTX-class GPUs (see mens-training SSOT / AGENTS.md)
# Then minimal path:
.\target\release\vox.exe mens train --device cuda --data-dir target/dogfood --output-dir target/dogfood/run

Legacy Burn scratch (small VoxTokenizer model, wgpu — not HF QLoRA):

$env:VOX_BACKEND="cpu"; .\target\release\vox.exe train --data-dir target/dogfood --output-dir mens/runs/v1
# GPU: omit VOX_BACKEND=cpu when wgpu is available

4. Check eval gate

.\target\release\vox.exe mens corpus eval target/dogfood/train.jsonl -o mens/runs/v1/eval_results.json

Documentation → Training Pair Loop

Every documentation page with training_eligible: true in its frontmatter and a ```vox code block automatically contributes training pairs via vox mens corpus pairs --docs docs/src/.

This creates a closed feedback loop: better docs → more training data → better model → better completions → easier to write docs.

Frontmatter format for training-eligible docs:

---
title: "My Guide"
category: how-to
constructs: [function, workflow]
training_eligible: true
difficulty: intermediate
---
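The eligibility rule above can be sketched as a two-part predicate. This is a hypothetical helper (`contributes_pairs` is illustrative; the real check lives in `vox mens corpus pairs`):

```rust
// Sketch: a docs page contributes training pairs only when frontmatter sets
// training_eligible: true AND the body contains a fenced vox code block.
fn contributes_pairs(page: &str) -> bool {
    // Build the fence marker without embedding literal triple backticks.
    let vox_fence = format!("{}vox", "`".repeat(3));
    page.contains("training_eligible: true") && page.contains(&vox_fence)
}

fn main() {
    let fence = "`".repeat(3);
    let eligible = format!(
        "---\ntraining_eligible: true\n---\n{f}vox\nfn main() {{}}\n{f}\n",
        f = fence
    );
    let prose_only = "---\ntraining_eligible: true\n---\nNo code here.";
    assert!(contributes_pairs(&eligible));
    assert!(!contributes_pairs(prose_only));
    println!("docs gate ok");
}
```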

CI Integration

The ML pipeline runs automatically via .github/workflows/ml_data_extraction.yml:

  • Nightly: Full corpus re-extraction at 4 AM UTC
  • On push: Triggered when *.vox, compiler crates, or docs/src/** change
  • Manual: workflow_dispatch with force_train or native_train option
  • Grammar drift: Fingerprint check forces full re-extraction when syntax changes

CI training job (GPU runner)

The train job runs on a self-hosted GPU runner when corpus changes or when manually triggered:

  • Native path (default): Prefer vox mens train with VOX_BACKEND=cpu for CI compatibility. Older workflows may still invoke vox train; --provider local now bails with the canonical Candle QLoRA command (no Python train_qlora script).
  • workflow_dispatch with native_train: false: if the job is still wired to vox train --provider local, expect the bail message directing operators to vox mens train --backend qlora. Updated automation should call vox mens train directly.
  • Eval strict mode: VOX_EVAL_STRICT=1 — training fails when eval gate thresholds are not met.
  • Benchmark gate: VOX_BENCHMARK=1 — runs held-out benchmark from mens/data/heldout_bench/; VOX_BENCHMARK_MIN_PASS_RATE (e.g. 0.80) fails promotion when pass rate is below threshold.
  • Artifact retention: LoRA adapter target/dogfood/run/ uploaded as lora-adapter-$VCS_SHA, retained 90 days. Eval results eval_results.json / eval_gate_failed.json retained 30 days.
  • Logging: Training pair count and eval gate result (parse rate, coverage) are printed; eval gate failure writes eval_gate_failed.json and emits a warning.

Runbook: Native training in CI

# CI uses VOX_BACKEND=cpu by default (no GPU drivers required)
VOX_BACKEND=cpu vox mens train --data-dir target/dogfood --output-dir target/dogfood/run

Runbook: Evol-Instruct (optional, gated)

Not wired on the current slim vox binary. Use external tooling or scripts until a corpus evol subcommand lands.

# Intended future shape (not implemented):
# EVOL_GATE=1 vox mens corpus evol …

Runbook: Optional extra corpus merge

Use vox mens corpus mix with mens/config/mix.yaml, or merge JSONL with your own tooling. There is no vox corpus merge subcommand today.

Train matrix (canonical)

| Mode | Command | When to use |
|---|---|---|
| Mens Candle QLoRA (primary) | `vox mens train --device cuda` (defaults: `--backend qlora`, `--tokenizer hf`; optional `--model <hf_repo>`) | Native qlora-rs + HF weights; CUDA/Metal feature builds; see `mens-training.md` |
| Qwen3.5-4B (4080 16GB) | `cargo build -p vox-cli --release --features gpu,mens-candle-cuda`, then `vox mens train --preset qwen_4080_16g --device cuda …` | Preset path; full proxy stack defaults on CUDA unless `--qlora-allow-partial-proxy-stack` |
| Burn scratch LoRA | `vox train --data-dir …` / `VOX_BACKEND=cpu` | Not `vox mens` QLoRA; small VoxTokenizer model + wgpu/CPU in vox-tensor |
| `vox mens train --backend lora` | Rejected at runtime | Use `--backend qlora` for Mens dispatch (SSOT) |
| Legacy `vox train` (mens-dei) | `vox train …` | `--provider local` → bail message → `vox mens train --backend qlora`; Together remote; `--native` Burn-only scratch |
| CI strict | `VOX_EVAL_STRICT=1` | Fail promotion on eval gate failure |
| CI benchmark | `VOX_BENCHMARK=1` | Run held-out benchmark before promotion |

Artifact layout: target/dogfood/train.jsonl (canonical input), target/dogfood/run/ (output). Version naming: lora-adapter-$VCS_SHA, eval-gate-$VCS_SHA.


Next Steps