How To: Train Mens on RTX 4080 Super

Canonical contracts, backends, and regression commands: Mens native training SSOT. This page is a step-by-step runbook for RTX 4080 Super; do not duplicate SSOT tables here.

This runbook covers two native paths:

  1. Production Qwen 3.5 (recommended for Qwen3.5-4B-Instruct) — Candle QLoRA (--backend qlora, NF4 frozen bases via qlora-rs). Build with mens-candle-cuda on Windows/Linux when an NVIDIA GPU and CUDA toolkit are available for candle-core.
  2. Burn LoRA (GPT-2-shaped HF or Vox tokenizer) — default vox mens train without --backend qlora; uses wgpu (Vulkan/DX12) on Windows.
  • Build (CUDA): from repo root, cargo vox-cuda-release (alias in .cargo/config.toml — same as cargo build -p vox-cli --release --features gpu,mens-candle-cuda).

    [!WARNING] On Windows, you MUST use an interactive VS Developer Command Prompt or a PowerShell shell explicitly bootstrapped with vcvars64.bat. Passing vcvars64.bat through nested subshells (e.g. cmd.exe /c "vcvars64.bat && cargo...") drops the PATH configuration, preventing nvcc from locating and executing cl.exe.

  • Data: target/dogfood/train.jsonl (from corpus pairs/mix); optional record_format: tool_trace in mix for command/tool supervision rows (category tool_trace). See mens/schemas/tool_trace_record.schema.json and mens/data/tool_traces.example.jsonl.
  • Train:
    .\target\release\vox.exe mens train `
      --backend qlora --tokenizer hf `
      --preset qwen_4080_16g `
      --model Qwen/Qwen3.5-4B `
      --data-dir target/dogfood `
      --output-dir mens/runs/qwen35_qlora `
      --device cuda `
      --qlora-require-full-proxy-stack
    
    --qlora-require-full-proxy-stack is recommended for strict shard completeness on native qwen3_5 runs. LM-head-only mode is currently deferred/not implemented in the native trainer.
  • Artifacts: candle_qlora_adapter.safetensors, candle_qlora_adapter_meta.json, populi_adapter_manifest_v3.json, training_manifest.json, telemetry.jsonl.
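A quick way to confirm a QLoRA run produced everything is to check for the artifact names listed above. This is a minimal sketch (the file names come from this runbook; `missing_artifacts` is a hypothetical helper, not part of the vox CLI):

```python
from pathlib import Path

# Expected QLoRA artifacts, as listed in this runbook.
EXPECTED = [
    "candle_qlora_adapter.safetensors",
    "candle_qlora_adapter_meta.json",
    "populi_adapter_manifest_v3.json",
    "training_manifest.json",
    "telemetry.jsonl",
]

def missing_artifacts(output_dir: str) -> list[str]:
    """Return the expected artifact names not present under output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED if not (root / name).exists()]
```

An empty return value means the output directory is complete; anything else names the files still missing.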

Go-live checklist (local CUDA dogfood)

  1. Shell: VS Developer / MSVC environment so cargo vox-cuda-release (or cargo check -p vox-cli --features gpu,mens-candle-cuda) succeeds.
  2. CLI: vox mens train --help lists --qlora-* flags including --qlora-ce-last-k.
  3. Corpus: refresh train.jsonl or set VOX_TRAIN_SKIP_CORPUS_MIX=1 when the mix step is unnecessary.
  4. Run: canonical QLoRA command from above with --log-dir mens/runs/logs (or your path); tail the log.
  5. Acceptance: first log lines show finite loss; optional --qlora-ce-last-k 4 for a stronger suffix LM signal (see SSOT).
  6. Thin wrapper (optional): scripts/populi/dogfood_qlora_cuda.ps1.
  • Merge (Candle): in-tree vox mens merge-qlora (alias merge-adapter) or vox schola merge-qlora — the same merge surface. Both produce f32 safetensors subsets, not Burn *.bin; see the SSOT train → merge → serve table in mens-training.md. vox mens serve (Burn) loads LoRA or merged Burn checkpoints; it does not load Candle merge-qlora safetensors. To query merged QLoRA weights, use an external stack (e.g. export to HF/Ollama) or keep to an adapter path your inference tool supports.
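The acceptance step above asks for finite loss in the first log lines. A sketch of that check, assuming trainer log lines carry a `loss=<float>` token (the exact log/telemetry schema is project-defined; adjust the pattern to match):

```python
import math
import re

# Assumed log format: lines containing a "loss=<value>" token.
LOSS_RE = re.compile(r"loss=(\S+)")

def first_losses_finite(lines, n=5):
    """True if the first n parsed loss values are all finite (no NaN/inf)."""
    losses = []
    for line in lines:
        m = LOSS_RE.search(line)
        if m:
            try:
                losses.append(float(m.group(1)))
            except ValueError:
                continue  # token wasn't a bare float; skip it
        if len(losses) >= n:
            break
    return bool(losses) and all(math.isfinite(x) for x in losses)
```

Feed it the first lines of the tailed training log; a False result (NaN/inf loss, or no loss lines at all) means the run should be investigated before letting it continue.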

Burn LoRA path (non-Qwen or GPT-2-shaped HF)

  • Default: vox mens train --data-dir target/dogfood --output-dir mens/runs/v1
  • Input contract: target/dogfood/train.jsonl
  • Backend: wgpu on Windows (Vulkan or DX12); no CUDA required for Burn
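Both paths share the same input contract: a train.jsonl of one JSON object per line. A minimal shape check is sketched below; the exact keys are defined by the corpus pairs stage, so this only validates that each row parses as a non-empty object (`check_train_jsonl` is an illustrative helper, not a vox command):

```python
import json
from pathlib import Path

def check_train_jsonl(path: str) -> int:
    """Verify every non-blank line is a non-empty JSON object.
    Returns the row count; raises ValueError on the first malformed line."""
    count = 0
    for i, line in enumerate(Path(path).read_text(encoding="utf-8").splitlines(), 1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        if not isinstance(row, dict) or not row:
            raise ValueError(f"line {i}: expected a non-empty JSON object")
        count += 1
    return count
```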

Prerequisites

  1. Build Vox CLI (release binary):
    & "$env:USERPROFILE\.cargo\bin\cargo.exe" build -p vox-cli --release
    
  2. Generate canonical corpus input:
    New-Item -ItemType Directory -Force -Path mens/data,target/dogfood | Out-Null
    .\target\release\vox.exe mens corpus extract examples/ -o mens/data/validated.jsonl
    .\target\release\vox.exe mens corpus extract docs/ -o mens/data/validated.jsonl 2>$null
    .\target\release\vox.exe mens corpus validate mens/data/validated.jsonl --no-recheck -o mens/data/validated.jsonl
    .\target\release\vox.exe mens corpus pairs mens/data/validated.jsonl -o target/dogfood/train.jsonl --docs docs/src/ --docs docs/src/research/ --docs docs/src/adr/
    # Rustdoc merge skipped: response is Rust prose, not Vox code
    
  3. Optional Burn GPU backend selection (passed to vox mens train --device; best is default):
    # Prefer flags on the train command, not legacy env, for `vox mens train`:
    # --device best | vulkan | dx12 | cpu
    
  4. Optional training profile (RTX 4080 Super 16GB VRAM):
    $env:VOX_TRAIN_PROFILE = "safe"   # Conservative: batch 2, seq 256 (shared GPU, avoids OOM)
    # $env:VOX_TRAIN_PROFILE = "balanced"  # Default for 16GB: batch 4, seq 512, rank 16
    # $env:VOX_TRAIN_PROFILE = "throughput" # Aggressive: batch 6 (may OOM if OS uses GPU)
    
    Device probe auto-detects 16GB and recommends batch 4, seq 512, rank 16. Use vox mens probe to verify.
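The profiles above trade batch size against sequence length; the tokens processed per optimizer step is just their product. A sketch of that arithmetic (batch/seq values come from the profile comments above; the throughput profile's seq length is not stated there and is assumed to match balanced):

```python
# Tokens per optimizer step for each VOX_TRAIN_PROFILE value.
PROFILES = {
    "safe":       {"batch": 2, "seq": 256},
    "balanced":   {"batch": 4, "seq": 512},
    "throughput": {"batch": 6, "seq": 512},  # seq assumed equal to balanced
}

def tokens_per_step(profile: str) -> int:
    p = PROFILES[profile]
    return p["batch"] * p["seq"]
```

So safe processes 512 tokens per step versus 2048 for balanced — a 4x throughput difference, which is why safe is only worth it on a shared GPU.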

Full mixed corpus → entire LoRA run (4080 preset)

Use this when you want all sources from mens/config/mix.yaml (not a tiny dogfood slice).

  1. Build the release CLI with --features gpu (the default feature set is mens-base only; native train / QLoRA need the GPU feature stack). Add --features mens-dei only if you need legacy vox train (Together, or --native Burn scratch; --provider local bails to vox mens train) or the Mens DeI surfaces (generate, review, …):

    & "$env:USERPROFILE\.cargo\bin\cargo.exe" build -p vox-cli --release --features gpu
    

    If this fails, fix vox-cli compile errors before training.

  2. Mix into the default mix output path (strict: all non-optional sources must exist and contribute rows):

    .\target\release\vox.exe mens corpus mix --config mens/config/mix.yaml
    

    Writes target/dogfood/train_mixed.jsonl per mix config plus target/dogfood/train_mixed.mix_report.json. If your tree is missing generated files, use --allow-missing-sources once (same as legacy warn-only mix) or run the corpus pipeline stages first.

  3. Point training at that file as train.jsonl (preflight requires this exact name inside --data-dir):

    New-Item -ItemType Directory -Force -Path target/dogfood | Out-Null
    Copy-Item -Force target/dogfood/train_mixed.jsonl target/dogfood/train.jsonl
    
  4. Train (Qwen + Candle QLoRA) with the qwen_4080_16g preset (16GB-oriented; see SSOT mens-training.md):

    .\target\release\vox.exe mens train `
      --backend qlora --tokenizer hf `
      --preset qwen_4080_16g `
      --model Qwen/Qwen3.5-4B `
      --data-dir target/dogfood `
      --output-dir mens/runs/rtx4080_full `
      --device cuda `
      --background
    

    --background alone attaches logs under mens/runs/logs (at the repo root when one is detected) and returns immediately; it is equivalent to --log-dir mens/runs/logs. On Windows the child process is spawned with breakaway-from-job flags so IDE teardown is less likely to kill the trainer. Tail with: Get-Content mens/runs/logs/train_*.log -Wait -Tail 25. Alternative: pwsh scripts/populi/release_training_gate.ps1, but only for CI gates (not full training).

    On OOM, use --preset safe / 4080_safe, lower --seq-len, raise --grad-accum, lower --rank, or set VOX_CANDLE_DEVICE=cpu (slow).
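The --grad-accum mitigation works because gradients are accumulated over several micro-batches before each optimizer step, so the effective batch is their product. A sketch of the tradeoff (illustrative arithmetic only; flag names are from this runbook):

```python
def effective_batch(micro_batch: int, grad_accum: int) -> int:
    """Effective batch = per-step micro-batch x accumulation steps."""
    return micro_batch * grad_accum

# Halving the micro-batch while doubling --grad-accum keeps the effective
# batch (and thus optimization behavior) roughly unchanged, while cutting
# peak activation memory — the usual first move on OOM.
```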

First Training Run (Native)

.\target\release\vox.exe mens train --data-dir target/dogfood --output-dir mens/runs/v1

Or run the end-to-end automation script:

.\scripts\run_mens_pipeline.ps1 -DataDir target/dogfood -OutputDir mens/runs/v1 -Backend vulkan

Expected outputs:

  • mens/runs/v1/model_final.bin
  • mens/runs/v1/checkpoint_epoch_*.bin
  • mens/runs/v1/eval_results.json
  • mens/runs/v1/benchmark_results.json (if benchmark gate enabled)
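The expected outputs above can be verified with a short glob check — a sketch, treating benchmark_results.json as optional since it only appears when the benchmark gate is enabled (`burn_run_complete` is an illustrative helper, not a vox command):

```python
import glob
import os

def burn_run_complete(run_dir: str) -> bool:
    """Check the Burn run outputs listed above; benchmark_results.json is
    optional, so it is deliberately not required here."""
    required = ["model_final.bin", "eval_results.json"]
    have_required = all(os.path.exists(os.path.join(run_dir, f)) for f in required)
    # checkpoint_epoch_*.bin is a per-epoch series; at least one must exist.
    have_checkpoint = bool(glob.glob(os.path.join(run_dir, "checkpoint_epoch_*.bin")))
    return have_required and have_checkpoint
```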

Quality Gates

  • Eval thresholds:
    • VOX_EVAL_MIN_PARSE_RATE (default 0.80)
    • VOX_EVAL_MIN_COVERAGE (default 0.60)
  • Strict enforcement:
    • VOX_EVAL_STRICT=1 to fail run on threshold miss
  • Optional held-out benchmark (build with --features mens-dei; paths via env):
    • VOX_BENCHMARK=1 — after training, spawns vox mens eval-local
    • VOX_BENCHMARK_MODEL — checkpoint path (else auto-detect under output dir)
    • VOX_BENCHMARK_DIR — held-out bench directory (default mens/data/heldout_bench)

Run the corpus eval manually with:

.\target\release\vox.exe mens corpus eval target/dogfood/train.jsonl -o mens/runs/v1/eval_results.json
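The threshold logic above can be sketched as follows. This assumes eval_results.json exposes parse_rate and coverage as top-level floats (key names are inferred from the env var names, not confirmed by the schema):

```python
import json
import os

def eval_gate(results_path: str) -> list[str]:
    """Compare eval results against the threshold env vars, with the
    documented defaults. Returns a list of failure messages; with
    VOX_EVAL_STRICT=1 any failure should abort the run."""
    min_parse = float(os.environ.get("VOX_EVAL_MIN_PARSE_RATE", "0.80"))
    min_cov = float(os.environ.get("VOX_EVAL_MIN_COVERAGE", "0.60"))
    with open(results_path, encoding="utf-8") as f:
        results = json.load(f)
    failures = []
    if results.get("parse_rate", 0.0) < min_parse:
        failures.append(f"parse_rate {results.get('parse_rate')} < {min_parse}")
    if results.get("coverage", 0.0) < min_cov:
        failures.append(f"coverage {results.get('coverage')} < {min_cov}")
    return failures
```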

Runtime Profiles

  • Fast dogfood:
    • 1 epoch, smaller dataset while iterating on pipeline code/docs
  • Full run:
    • Full corpus + rustdoc merge and benchmark gate enabled

Model Card

After training, the model card is rendered from mens/model_card/:

uv run --project scripts render-model-card --run-dir mens/runs/v1

Dogfood operator checklist (real corpus, 4080 QLoRA)

Use this before claiming a full dogfood run is complete (CI cannot substitute for your GPU box).

Cursor / agents: full vox ci mens-gate can exceed tool timeouts — use pwsh scripts/populi/release_training_gate.ps1 -Detach and tail target/mens-gate-logs/ (see mens-training.md).

  1. Corpus: mens corpus mix --config mens/config/mix.yaml → copy/rename to target/dogfood/train.jsonl (preflight requires that filename in --data-dir).
  2. Build: cargo vox-cuda-release natively from an interactive terminal loaded with vcvars64.bat (nvcc needs the MSVC environment on PATH and fails to find cl.exe when invoked through nested subshells).
  3. Train: vox mens train --backend qlora --tokenizer hf --preset qwen_4080_16g (or --preset 4080, same profile) + --model, --data-dir, --output-dir, --device cuda; keep --qlora-require-full-proxy-stack on for strict native shard completeness.
  4. Artifacts: Confirm candle_qlora_adapter.safetensors, candle_qlora_adapter_meta.json, populi_adapter_manifest_v3.json, training_manifest.json, telemetry.jsonl under the output dir.
  5. Merge / serve: Candle merge is vox schola merge-qlora (f32 shard subsets); vox mens serve stays Burn-only — see SSOT Merge / export.
  6. Optional automation: scripts/populi/dogfood_qlora_cuda.ps1 builds (CUDA by default) and launches the canonical CLI in the background; see scripts/README.md.

See Also