# How To: Train Mens on RTX 4080 Super

Canonical contracts, backends, and regression commands live in the Mens native training SSOT; do not duplicate SSOT tables here. This page is a step-by-step runbook for the RTX 4080 Super.
This runbook covers two native paths:

- Production Qwen 3.5 (recommended for Qwen3.5-4B-Instruct) — Candle QLoRA (`--backend qlora`, NF4 frozen bases via qlora-rs). Build with `mens-candle-cuda` on Windows/Linux when you have an NVIDIA GPU and a CUDA toolkit available for `candle-core`.
- Burn LoRA (GPT-2-shaped HF or Vox tokenizer) — default `vox mens train` without `--backend qlora`; uses wgpu (Vulkan/DX12) on Windows.
## Recommended Path (Qwen3.5-4B, RTX 4080-class 16GB)
- Build (CUDA): from repo root, `cargo vox-cuda-release` (alias in `.cargo/config.toml` — same as `cargo build -p vox-cli --release --features gpu,mens-candle-cuda`).

  > [!WARNING]
  > On Windows, you MUST use an interactive VS Developer Command Prompt or a PowerShell shell explicitly bootstrapped with `vcvars64.bat`. Passing `vcvars64.bat` via nested subshells (e.g. `cmd.exe /c "vcvars64.bat && cargo..."`) drops the PATH configuration and prevents `nvcc` from correctly invoking `cl.exe`.

- Data: `target/dogfood/train.jsonl` (from corpus pairs/mix); optional `record_format: tool_trace` in the mix for command/tool supervision rows (category `tool_trace`). See `mens/schemas/tool_trace_record.schema.json` and `mens/data/tool_traces.example.jsonl`.
- Train:

  ```powershell
  .\target\release\vox.exe mens train `
    --backend qlora --tokenizer hf `
    --preset qwen_4080_16g `
    --model Qwen/Qwen3.5-4B `
    --data-dir target/dogfood `
    --output-dir mens/runs/qwen35_qlora `
    --device cuda `
    --qlora-require-full-proxy-stack
  ```

  `--qlora-require-full-proxy-stack` is recommended for strict shard completeness on native qwen3_5 runs. LM-head-only mode is currently deferred / not implemented in the native trainer.
- Artifacts: `candle_qlora_adapter.safetensors`, `candle_qlora_adapter_meta.json`, `populi_adapter_manifest_v3.json`, `training_manifest.json`, `telemetry.jsonl`.
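A quick way to sanity-check a finished run is to confirm every listed artifact landed in the output directory. This is a minimal sketch, not part of the vox CLI; the helper name is hypothetical, while the artifact filenames are the ones listed above.

```python
import os

# Artifact names from the runbook's "Artifacts" list.
EXPECTED_ARTIFACTS = [
    "candle_qlora_adapter.safetensors",
    "candle_qlora_adapter_meta.json",
    "populi_adapter_manifest_v3.json",
    "training_manifest.json",
    "telemetry.jsonl",
]

def missing_artifacts(output_dir: str) -> list[str]:
    """Return the expected artifact names absent from output_dir."""
    return [name for name in EXPECTED_ARTIFACTS
            if not os.path.isfile(os.path.join(output_dir, name))]
```

An empty return value means the run produced the full artifact set; anything else names exactly what is missing.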
## Go-live checklist (local CUDA dogfood)
- Shell: VS Developer / MSVC environment so `cargo vox-cuda-release` (or `cargo check -p vox-cli --features gpu,mens-candle-cuda`) succeeds.
- CLI: `vox mens train --help` lists `--qlora-*` flags including `--qlora-ce-last-k`.
- Corpus: refresh `train.jsonl`, or set `VOX_TRAIN_SKIP_CORPUS_MIX=1` when the mix step is unnecessary.
- Run: the canonical QLoRA command from above with `--log-dir mens/runs/logs` (or your path); tail the log.
- Acceptance: the first log lines show finite loss; optional `--qlora-ce-last-k 4` for a stronger suffix LM signal (see SSOT).
- Thin wrapper (optional): `scripts/populi/dogfood_qlora_cuda.ps1`.
- Merge (Candle): in-tree `vox mens merge-qlora` (alias `merge-adapter`) or `vox schola merge-qlora` — same merge surface; produces f32 safetensors subsets, not Burn `*.bin`. See the SSOT train → merge → serve table in `mens-training.md`. `vox mens serve` (Burn) loads LoRA or merged Burn checkpoints; it does not load Candle merge-qlora safetensors. To query merged QLoRA weights, use an external stack (e.g. export to HF/Ollama) or keep the adapter in a format your inference tool supports.
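The acceptance step in the checklist above ("first log lines show finite loss") can be automated with a few lines. This is a sketch under an assumption: that `telemetry.jsonl` rows are JSON objects carrying a numeric `loss` field — the real telemetry schema may differ.

```python
import json
import math

def first_losses_finite(telemetry_lines, n=5):
    """Return True if the first n reported losses are all finite.

    Assumption: each telemetry line is a JSON object; rows with a "loss"
    key carry the training loss. Rows without it (e.g. run metadata) are
    skipped.
    """
    losses = []
    for line in telemetry_lines:
        row = json.loads(line)
        if "loss" in row:
            losses.append(float(row["loss"]))
        if len(losses) >= n:
            break
    return bool(losses) and all(math.isfinite(x) for x in losses)
```

A NaN or infinite loss in the opening steps is the usual signal to abort and retry with a more conservative preset rather than letting the run continue.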
## Burn LoRA path (non-Qwen or GPT-2-shaped HF)
- Default: `vox mens train --data-dir target/dogfood --output-dir mens/runs/v1`
- Input contract: `target/dogfood/train.jsonl`
- Backend: `wgpu` on Windows (Vulkan or DX12); no CUDA required for Burn
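For orientation, here is one way a supervision row in `train.jsonl` might look. The field names (`prompt`, `response`, `category`) are illustrative assumptions — the authoritative record shapes live under `mens/schemas/`; the only firm contract here is one JSON object per line.

```python
import json

# Hypothetical train.jsonl row; field names are assumptions, not the schema.
record = {
    "prompt": "Explain what `vox mens probe` reports.",
    "response": "It probes the GPU and recommends a training profile.",
    "category": "docs",
}

line = json.dumps(record)  # JSONL: one serialized object per line, no commas
assert json.loads(line)["category"] == "docs"
```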
## Prerequisites
- Build Vox CLI (release binary):

  ```powershell
  & "$env:USERPROFILE\.cargo\bin\cargo.exe" build -p vox-cli --release
  ```

- Generate canonical corpus input:

  ```powershell
  New-Item -ItemType Directory -Force -Path mens/data,target/dogfood | Out-Null
  .\target\release\vox.exe mens corpus extract examples/ -o mens/data/validated.jsonl
  .\target\release\vox.exe mens corpus extract docs/ -o mens/data/validated.jsonl 2>$null
  .\target\release\vox.exe mens corpus validate mens/data/validated.jsonl --no-recheck -o mens/data/validated.jsonl
  .\target\release\vox.exe mens corpus pairs mens/data/validated.jsonl -o target/dogfood/train.jsonl --docs docs/src/ --docs docs/src/research/ --docs docs/src/adr/
  # Rustdoc merge skipped: response is Rust prose, not Vox code
  ```

- Optional Burn GPU backend selection (passed to `vox mens train --device`; `best` is the default):

  ```powershell
  # Prefer flags on the train command, not legacy env, for `vox mens train`:
  # --device best | vulkan | dx12 | cpu
  ```

- Optional training profile (RTX 4080 Super 16GB VRAM): the device probe auto-detects 16GB and recommends batch 4, seq 512, rank 16. Use `vox mens probe` to verify.

  ```powershell
  $env:VOX_TRAIN_PROFILE = "safe"         # Conservative: batch 2, seq 256 (shared GPU, avoids OOM)
  # $env:VOX_TRAIN_PROFILE = "balanced"   # Default for 16GB: batch 4, seq 512, rank 16
  # $env:VOX_TRAIN_PROFILE = "throughput" # Aggressive: batch 6 (may OOM if OS uses GPU)
  ```
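The profile numbers above can be summarized as a small table. This is only a mirror of the documented values plus an assumed selection heuristic — the real logic is internal to `vox mens probe`.

```python
# Documented profile values from the comments above; "throughput" only
# documents its batch size, so other fields are omitted.
PROFILES = {
    "safe":       {"batch": 2, "seq": 256},
    "balanced":   {"batch": 4, "seq": 512, "rank": 16},
    "throughput": {"batch": 6},
}

def recommend_profile(vram_gb: float, shared_gpu: bool = False) -> str:
    """Assumed heuristic (not the probe's actual code): 16GB cards default
    to 'balanced'; a shared GPU (OS/compositor holding VRAM) drops to 'safe'."""
    if shared_gpu or vram_gb < 16:
        return "safe"
    return "balanced"
```

The point of "safe" on a shared GPU is that the OS and compositor already hold a slice of the 16GB, so the headroom assumed by "balanced" is not actually available.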
## Full mixed corpus → entire LoRA run (4080 preset)

Use this when you want all sources from `mens/config/mix.yaml` (not a tiny dogfood slice).
- Build the release CLI with `--features gpu` (the default is `mens-base` only; native train / QLoRA need the GPU feature stack). Add `--features mens-dei` only if you need legacy `vox train` (Together / `--native` Burn scratch; `--provider local` bails to `vox mens train`) or Mens DeI surfaces (`generate`, `review`, …):

  ```powershell
  & "$env:USERPROFILE\.cargo\bin\cargo.exe" build -p vox-cli --release --features gpu
  ```

  If this fails, fix `vox-cli` compile errors before training.

- Mix into the default mix output path (strict: all non-optional sources must exist and contribute rows):

  ```powershell
  .\target\release\vox.exe mens corpus mix --config mens/config/mix.yaml
  ```

  Writes `target/dogfood/train_mixed.jsonl` per the mix config plus `target/dogfood/train_mixed.mix_report.json`. If your tree is missing generated files, use `--allow-missing-sources` once (same as the legacy warn-only mix) or run the corpus pipeline stages first.

- Point training at that file as `train.jsonl` (preflight requires this exact name inside `--data-dir`):

  ```powershell
  New-Item -ItemType Directory -Force -Path target/dogfood | Out-Null
  Copy-Item -Force target/dogfood/train_mixed.jsonl target/dogfood/train.jsonl
  ```

- Train (Qwen + Candle QLoRA) with the `qwen_4080_16g` preset (16GB-oriented; see SSOT `mens-training.md`):

  ```powershell
  .\target\release\vox.exe mens train `
    --backend qlora --tokenizer hf `
    --preset qwen_4080_16g `
    --model Qwen/Qwen3.5-4B `
    --data-dir target/dogfood `
    --output-dir mens/runs/rtx4080_full `
    --device cuda `
    --background
  ```

  `--background` alone attaches logs under `mens/runs/logs` (repo root when detected) and returns immediately; equivalent to `--log-dir mens/runs/logs`. On Windows the child process is spawned with breakaway-from-job flags to reduce the chance of IDE teardown killing the trainer. Tail: `Get-Content mens/runs/logs/train_*.log -Wait -Tail 25`. Alternatives: `vox mens train … --background`, or `pwsh scripts/populi/release_training_gate.ps1` only for CI gates (not full training).

  On OOM, use `--preset safe` / `4080_safe`, lower `--seq-len`, raise `--grad-accum`, lower `--rank`, or set `VOX_CANDLE_DEVICE=cpu` (slow).
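The `--grad-accum` mitigation works because of standard gradient-accumulation arithmetic (not anything vox-specific): shrinking the per-step micro-batch cuts activation memory, while raising the accumulation count keeps the effective batch — and therefore the optimization behaviour — roughly unchanged.

```python
# Standard gradient-accumulation identity: gradients from `grad_accum`
# micro-batches are summed before one optimizer step, so the optimizer
# effectively sees micro_batch * grad_accum samples per update.
def effective_batch(micro_batch: int, grad_accum: int) -> int:
    return micro_batch * grad_accum

# e.g. dropping micro-batch 4 -> 2 while doubling accumulation 1 -> 2
# preserves an effective batch of 4 at roughly half the activation memory.
assert effective_batch(4, 1) == effective_batch(2, 2) == 4
```

The trade-off is wall-clock time: the same effective batch now takes twice as many forward/backward passes per optimizer step.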
## First Training Run (Native)
```powershell
.\target\release\vox.exe mens train --data-dir target/dogfood --output-dir mens/runs/v1
```
Or run the end-to-end automation script:
```powershell
.\scripts\run_mens_pipeline.ps1 -DataDir target/dogfood -OutputDir mens/runs/v1 -Backend vulkan
```
Expected outputs:

- `mens/runs/v1/model_final.bin`
- `mens/runs/v1/checkpoint_epoch_*.bin`
- `mens/runs/v1/eval_results.json`
- `mens/runs/v1/benchmark_results.json` (if the benchmark gate is enabled)
## Quality Gates
- Eval thresholds: `VOX_EVAL_MIN_PARSE_RATE` (default `0.80`), `VOX_EVAL_MIN_COVERAGE` (default `0.60`)
- Strict enforcement: `VOX_EVAL_STRICT=1` to fail the run on a threshold miss
- Optional held-out benchmark (build with `--features mens-dei`; paths via env):
  - `VOX_BENCHMARK=1` — after training, spawns `vox mens eval-local`
  - `VOX_BENCHMARK_MODEL` — checkpoint path (else auto-detect under the output dir)
  - `VOX_BENCHMARK_DIR` — held-out bench directory (default `mens/data/heldout_bench`)
```powershell
.\target\release\vox.exe mens corpus eval target/dogfood/train.jsonl -o mens/runs/v1/eval_results.json
```
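The gate behaviour described above can be sketched in a few lines. Two assumptions to flag: the key names `parse_rate` and `coverage` in `eval_results.json` are guesses matching the env-var names, and the exact failure mode under `VOX_EVAL_STRICT=1` is illustrated here as a non-zero exit.

```python
def eval_gate(results: dict, env: dict) -> bool:
    """Sketch of the eval threshold gate.

    Assumptions: `results` is the parsed eval_results.json with numeric
    "parse_rate" and "coverage" fields (real key names may differ); `env`
    stands in for os.environ.
    """
    min_parse = float(env.get("VOX_EVAL_MIN_PARSE_RATE", "0.80"))
    min_cov = float(env.get("VOX_EVAL_MIN_COVERAGE", "0.60"))
    ok = results["parse_rate"] >= min_parse and results["coverage"] >= min_cov
    if not ok and env.get("VOX_EVAL_STRICT") == "1":
        # Strict mode: a threshold miss fails the run outright.
        raise SystemExit("eval thresholds not met")
    return ok
```

Without strict mode a miss is only reported; with `VOX_EVAL_STRICT=1` it aborts the run, which is the setting you want in an unattended dogfood gate.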
## Runtime Profiles
- Fast dogfood: 1 epoch, smaller dataset while iterating on pipeline code/docs
- Full run: full corpus + rustdoc merge, with the benchmark gate enabled
## Model Card
After training, the model card is rendered from `mens/model_card/`:

```powershell
uv run --project scripts render-model-card --run-dir mens/runs/v1
```
## Dogfood operator checklist (real corpus, 4080 QLoRA)
Use this before claiming a full dogfood run is complete (CI cannot substitute for your GPU box).
Cursor / agents: a full `vox ci mens-gate` can exceed tool timeouts — use `pwsh scripts/populi/release_training_gate.ps1 -Detach` and tail `target/mens-gate-logs/` (see `mens-training.md`).
- Corpus: `mens corpus mix --config mens/config/mix.yaml` → copy/rename to `target/dogfood/train.jsonl` (preflight requires that filename in `--data-dir`).
- Build: `cargo vox-cuda-release` natively from a `vcvars64.bat`-loaded interactive terminal (`nvcc` relies on absolute discovery and crashes in subshells).
- Train: `vox mens train --backend qlora --tokenizer hf --preset qwen_4080_16g` (or `--preset 4080`, same profile) + `--model`, `--data-dir`, `--output-dir`, `--device cuda`; keep `--qlora-require-full-proxy-stack` on for strict native shard completeness.
- Artifacts: confirm `candle_qlora_adapter.safetensors`, `candle_qlora_adapter_meta.json`, `populi_adapter_manifest_v3.json`, `training_manifest.json`, `telemetry.jsonl` under the output dir.
- Merge / serve: Candle merge is `vox schola merge-qlora` (f32 shard subsets); `vox mens serve` stays Burn-only — see SSOT Merge / export.
- Optional automation: `scripts/populi/dogfood_qlora_cuda.ps1` builds (CUDA by default) and launches the canonical CLI in the background; see `scripts/README.md`.
## See Also
- Native ML Training Pipeline
- How To: Publish Mens to Hugging Face
- `scripts/README.md` — thin delegates + optional RTX 4080 QLoRA helper script