# Mens native training SSOT (Candle QLoRA–first)

## VoxMens quick start
With `train.jsonl` under the default training data directory (see `vox_corpus` / mix SSOT), the minimal operator path is:

```
vox mens train --device cuda
```

`--backend qlora` and `--tokenizer hf` are already the CLI defaults. When `--model` is omitted on the Candle QLoRA path, the base model defaults to the SSOT id `Qwen/Qwen3.5-4B` (`vox_populi::mens::DEFAULT_MODEL_ID`, mirrored in `contracts/mens/training-presets.v1.yaml` as `default_base_model`). Add `--output-dir <dir>` to place run artifacts. On CUDA, the full QLoRA proxy stack is required by default; use `--qlora-allow-partial-proxy-stack` only when you accept partial-stack semantics. For multi-model fine-tuning, pass an explicit `--model <hf/repo>`.
## Tokenization SSOT

- Candle QLoRA (`vox mens train --backend qlora`, default): supervision strings are encoded with the Hugging Face tokenizer shipped for `--model` (see `vox_populi::mens::tensor::training_text::hf_tokenize_chatml_supervised` and `ChatmlConfig` `im_start`/`im_end` aliases). That vocabulary is model-defined (tens of thousands of BPE tokens), not the small constant in `vox-tensor`.
- `vox_tensor::data::VoxTokenizer`: a deterministic lab / legacy-Burn harness: printable ASCII byte ids plus a minimal compound tier for ChatML delimiters and markdown code fences. It does not track the Vox lexer keyword set and must not be treated as a language mirror.
- Dogfood tiny transformers (`VOCAB_SIZE` in manifests): use this lab vocab size only for in-repo scratch models — not for Qwen-class fine-tunes.
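The ChatML-supervised shape described above can be sketched minimally. This is an illustrative stand-in, not the actual `hf_tokenize_chatml_supervised` implementation: the function name `build_chatml` and the character-span stand-in for token masking are assumptions for the example.

```python
# Illustrative sketch only: shows the shape of ChatML supervised encoding.
# The real trainer works on HF token ids; here a character span stands in
# for the supervised-position mask.

def build_chatml(prompt: str, response: str,
                 im_start: str = "<|im_start|>", im_end: str = "<|im_end|>"):
    """Assemble a ChatML transcript and report which span is supervised
    (the assistant response only)."""
    prefix = (f"{im_start}user\n{prompt}{im_end}\n"
              f"{im_start}assistant\n")
    text = prefix + response + im_end
    supervised = (len(prefix), len(prefix) + len(response))
    return text, supervised

text, (lo, hi) = build_chatml("fn add?", "fn add(a: i32, b: i32) -> i32 { a + b }")
assert text[lo:hi].startswith("fn add(")
```

The point the example makes: only the assistant span carries loss; everything before it is context.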
Generated defaults snapshot: Mens train defaults (generated).
Code SSOT:

- `vox mens train` dispatches through `vox_populi::mens::tensor::run_mens_training` (`lora_train.rs`). `PopuliTrainBackend::BurnLora` is rejected at runtime with an explicit error; the supported native trainer is `CandleQlora` (`--backend qlora`, `--tokenizer hf` for HF-shaped models).
- `vox mens serve` (local, `cloud=local`) delegates to `vox-schola serve` for QLoRA run directories — not the Burn `execution-api` binary. Treat Burn `merge-weights` + `execution-api` serve as a separate, legacy in-tree lane. See Mens local serving SSOT (Schola + orchestrator).
## Truth tables (train → merge → serve)

| Path | Train (CLI) | Merge | Serve in-tree |
|---|---|---|---|
| Candle QLoRA | `vox mens train --backend qlora --tokenizer hf …` | `vox mens merge-qlora` / `vox schola merge-qlora` (alias `merge-adapter`) → f32 subset shards (optional external vLLM/Ollama/HF) | Yes (local) — `vox-schola serve --model <run_dir>` or `vox mens serve --model <run_dir>` (OpenAI + Ollama-shaped HTTP, including `/api/generate` for Ludus/orchestrator). Merged safetensors subset is not loaded by Schola. |
| Burn LoRA | Not via schola train dispatch (use historical/legacy flows if you still maintain Burn checkpoints) | `vox mens merge-weights` → `model_merged.bin` | Yes — `vox mens serve` with `execution-api` + `gpu`: Burn checkpoints (`*.bin` / merged). This is not the QLoRA `vox-schola` path above. |
## External serving is a supported lane
For Candle QLoRA merged artifacts and multi-node deploys, external runtimes remain first-class.
- Treat vLLM, Ollama.app, HF Transformers, and OpenAI-compatible gateways as deployment targets for merged QLoRA outputs and for teams that do not run Schola.
- Training and merge write `external_serving_handoff_v1.json` (schema: `contracts/eval/external-serving-handoff.schema.json`) next to artifacts for automation.
- Local dev default: Schola on a chosen port + `POPULI_URL`/`POPULI_MODEL` — Mens local serving SSOT.
## Why

- One canonical CLI for in-repo native fine-tuning: `vox mens train`.
- Contract-first control plane (in `vox-populi::mens::tensor`): `FineTuneContract` + `ExecutionPlanner` + `preflight_train` gate impossible combos before kernels run (`finetune_contract.rs`, `execution_planner.rs`, `preflight_train.rs`). Preflight output schema (F04, extend alongside code): `contracts/mens/training-preflight.schema.json`. After a successful `preflight_for_contract` inside `run_mens_training`, the trainer writes `training-preflight.json` next to run artifacts when an output directory is set (fields: `schema_version`, `contract_digest`, `execution_kernel`, optional `notes`). Capability table: hf-finetune-capability-matrix.md. Gap labels: hf-finetune-gap-matrix.md.
- Honest execution-kernel split:
  - Burn + wgpu LoRA (`--backend lora`): default `VoxTokenizer` JSONL; optional `--tokenizer hf` for GPT-2-shaped HF configs + ChatML-supervised HF tokenization + optional embed warm-start (`burn_hf_load.rs`). Not NF4 QLoRA.
  - Candle + qlora-rs (`--backend qlora`, `--tokenizer hf`): NF4-quantized full-graph training over loaded decoder blocks with trainable LoRA adapters. Current trainer path is full graph only (LM-head-only / partial-depth flags are parsed for contract compatibility but rejected at runtime). Context embeddings stay mmap `f32` (`index_select`). Same `--device` story: CUDA / Metal with `mens-candle-cuda` / `mens-candle-metal`, else CPU; `VOX_CANDLE_DEVICE=cpu` forces CPU. Telemetry includes `execution_kernel`, `telemetry_schema`, and `candle_compat_mode` for cutover observability.
- Remaining gaps (explicit): full causal NF4 blocks in Candle (see candle-full-graph-feasibility.md); Burn `LoraAttention::merge` requires `use_rope == false` (GPT-2-style); RoPE stacks must stay unmerged or use native LoRA modules at serve time. Double quant: `QLoraConfig.quantization.double_quant` defaults on; CLI `--qlora-no-double-quant` disables for ablation. See ADR 006 (full-graph) and ADR 007 (API gate).
- GPU visibility (Burn): stderr + `burn_wgpu_device` under `vox_mens_gpu`.
- CI / CUDA: when `nvcc` is on `PATH`, CI runs `scripts/check_cuda_feature_builds.sh`. See `ci/runner-contract.md`.
## Provenance and trajectory metadata (2026 update)

MENS run artifacts now treat lineage and trajectory policy as explicit metadata:

- Provenance fields (contract + manifest):
  - upstream family id,
  - upstream model id,
  - license class,
  - attribution-required flag.
- Trajectory-weighting fields (config + telemetry semantics):
  - optional weighting toggle for tool-trace style rows,
  - optional boost for failure/error categories,
  - optional quality floor and quality boost.
- Experimental optimizer lane:
  - `optimizer_experiment_mode` defaults to `off`,
  - non-default modes require `VOX_MENS_EXPERIMENTAL_OPTIMIZER=1`.
These defaults remain conservative and do not change baseline behavior unless enabled.
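A minimal sketch of how such trajectory-weighting knobs could compose, assuming hypothetical field names (`category`, `quality`) and default factors; this is not the `LoraTrainingConfig` schema, only an illustration of the toggle / boost / floor semantics listed above:

```python
# Hypothetical sketch: per-row training weight from the trajectory knobs.
# All field names and default factors are illustrative assumptions.

def row_weight(row: dict, enabled: bool,
               failure_boost: float = 1.5,
               quality_floor: float = 0.2,
               quality_boost: float = 1.2) -> float:
    if not enabled:
        return 1.0                      # toggle off: baseline behavior unchanged
    w = 1.0
    if row.get("category") in ("failure", "error"):
        w *= failure_boost              # up-weight failure/error recovery rows
    q = row.get("quality", 1.0)
    if q < quality_floor:
        return 0.0                      # drop rows below the quality floor
    if q >= 0.9:
        w *= quality_boost              # reward high-quality traces
    return w

assert row_weight({"category": "failure", "quality": 0.95}, enabled=False) == 1.0
assert row_weight({"category": "failure", "quality": 0.5}, enabled=True) == 1.5
assert row_weight({"quality": 0.1}, enabled=True) == 0.0
```

Note the first assertion: with the toggle off, every row weighs 1.0, which is the conservative-default property the section claims.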
Context and source-strength notes for Composer/Kimi findings are documented in
../architecture/mens-composer-kimi-findings-2026.md.
## `finetune_contract_digest` scope

`finetune_contract_digest` is a reproducibility fingerprint for planner-relevant training semantics. Current scope includes:

- model/config/tokenizer file identity used by the contract,
- quantization and adapter method knobs,
- tokenizer mode and selected QLoRA behavior gates,
- provenance metadata fields (`base_family`, `upstream_model_id`, `license_class`, `attribution_required`).
It intentionally excludes runtime-only telemetry counters and post-hoc eval outcomes.
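The inclusion/exclusion property can be sketched with a generic content hash. This is not the Rust digest implementation; the key names and exclusion list are assumptions chosen to mirror the scope statement above:

```python
# Illustrative digest: a stable hash over planner-relevant fields only,
# excluding runtime-only telemetry. Not the finetune_contract.rs algorithm.
import hashlib
import json

def contract_digest(contract: dict) -> str:
    in_scope = {k: contract[k] for k in sorted(contract)
                if k not in ("telemetry", "eval_results")}  # excluded by design
    blob = json.dumps(in_scope, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

a = {"model": "base", "license_class": "apache-2.0", "telemetry": {"steps": 10}}
b = {"model": "base", "license_class": "apache-2.0", "telemetry": {"steps": 99}}
assert contract_digest(a) == contract_digest(b)   # runtime counters do not perturb it
assert contract_digest({**a, "license_class": "mit"}) != contract_digest(a)
```

The two assertions are the contract: telemetry churn leaves the fingerprint stable, while any planner-relevant field (here a provenance field) changes it.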
## What (surfaces)

| Piece | Role |
|---|---|
| `vox-cli` `vox mens train` | Compile: `cargo build -p vox-cli --features gpu` (default features are `mens-base` only). Operational default: `--backend qlora --tokenizer hf` (Candle QLoRA). Legacy `--backend lora` is deprecated and retained only for compatibility context. Mobile edge export: `--deployment-target mobile_edge` or `--preset mobile_edge` → planner gates + `--device cpu` required; see mobile-edge-ai.md. |
| `vox-cli` `vox mens serve` | `cloud=local`: delegates to `vox-schola serve` (QLoRA run directory; `gpu`). Burn HTTP for `*.bin` / `merge-weights` is the separate `execution-api` Axum server when that feature is enabled. SSOT: mens-serving-ssot.md. |
| `vox-populi` `PopuliTrainBackend` | Enum + `FromStr` / serde in `crates/vox-populi/src/mens/tensor/train_backend.rs`. |
| `vox-populi` `TrainingBackend` | Trait in `tensor/backend.rs`; Candle implementation in `tensor/backend_candle_qlora.rs` + `tensor/candle_qlora_train` modules. |
| `vox-populi` `run_mens_training` | Dispatch in `tensor/lora_train.rs` with contract/planner/preflight gates. |
| `vox-populi` `LoraTrainingConfig` | `tensor/training_config.rs` (`MensTokenizerMode`, provenance/trajectory knobs). |
| `vox train` | Legacy: `--provider local` spawns `vox mens train` with `--data-mode strict` (stale fingerprint → blocking refresh, then train) and a default 4080-class QLoRA recipe (see `crates/vox-cli/src/commands/ai/train.rs`). `--native` uses the old Burn scratch trainer when built with `mens-dei`. Together remote unchanged. |
| `vox mens train-uv` | Retired — bails; use `vox mens train --backend qlora`. |
| `vox-schola train` | When `vox` is discoverable (`VOX_EXE`, sibling of `vox-schola`, or `PATH`), `train` forwards to `vox mens train` with the same QLoRA flags (set `VOX_SCHOLA_FORWARD=never` to run the standalone schola trainer; `VOX_SCHOLA_FORWARD=always` requires `vox`). |
## Training data mode (`--data-mode`)

- `strict`: if the corpus fingerprint is stale, `train_arm` runs the same refresh as `auto-refresh` (synthetic regen, `vox mens pipeline` with train skipped, mix copy) before training; any refresh step failure aborts. Use for CI, release gates, and reproducible local runs.
- `auto-refresh` (default): when stale, run that refresh path but log warnings for non-fatal failures and may still proceed to training (still respects `VOX_TRAIN_SKIP_CORPUS_MIX`).

Preset id SSOT (parity-tested vs Rust `KNOWN_PRESETS`): `contracts/mens/training-presets.v1.yaml`.
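The strict vs auto-refresh decision above reduces to a small state machine. A sketch, assuming hypothetical names (`handle_stale_corpus` and its outcome strings); the real logic lives in `train_arm`:

```python
# Illustrative decision table for --data-mode on a stale corpus fingerprint.
# Outcome strings and the function itself are assumptions for this sketch.

def handle_stale_corpus(mode: str, refresh_ok: bool, skip_mix: bool) -> str:
    if skip_mix:
        return "train"                    # VOX_TRAIN_SKIP_CORPUS_MIX honored
    if refresh_ok:
        return "train"                    # refresh succeeded in either mode
    if mode == "strict":
        return "abort"                    # strict: any refresh failure aborts
    return "warn-then-train"              # auto-refresh tolerates failures

assert handle_stale_corpus("strict", refresh_ok=False, skip_mix=False) == "abort"
assert handle_stale_corpus("auto-refresh", refresh_ok=False, skip_mix=False) == "warn-then-train"
```

The only behavioral difference between the modes is the failure branch; the refresh path itself is shared.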
## Data prep orchestration (SSOT)

- Mix + train input: `vox_corpus::training::mix_prepare` — refresh `mens/config/mix.yaml`, optional sync of `data_dir/train.jsonl` into the mix primary source path (workspace-relative), resolve mixed output relative to workspace root (not mutable CWD). Used by `vox mens train` (`schola/train/gpu.rs`), `vox-schola train` (or forwarded `vox mens train`), and the Mix stage of `vox mens pipeline`.
- Pipeline / stale-regen: after a stale fingerprint is detected (both modes, unless `VOX_TRAIN_SKIP_CORPUS_MIX` / skip env applies), `train_arm` runs pipeline + `copy_mix_output_to_train_jsonl` and may set `VOX_TRAIN_SKIP_CORPUS_MIX=1`. `strict` requires the refresh path to succeed; `auto-refresh` tolerates some failures with stderr warnings.
- Hugging Face base weights: `vox_populi::mens::hub::download_model_blocking` — shared blocking download used by CLI GPU train and `vox-schola train` (same behavior as the previous per-call-site `Runtime::block_on` threads).
- Normative CLI for operators: `vox mens train`; `vox-schola` defaults to forwarding into `vox` when present (see table above).
## Documentation corpus lane

Documentation extraction exists, but keep the current boundaries explicit:

- `vox mens pipeline` extracts `docs/src` into `mens/data/mix_sources/docs.jsonl`. `crates/vox-corpus/src/corpus/extract_docs.rs` can emit both code-oriented rows and prose Q&A rows.
- The default production mix in `mens/config/mix.yaml` remains `vox_codegen`-only.
- That means VoxMens is still primarily a code-oriented training path today, not a general architecture-question answering system.
- Documentation metadata and traceability are being carried forward so later opt-in docs-QA or retrieval paths can cite exact source pages and headings without changing the default production lane.
Research (corpus lab, vision, Qwen family): Vox corpus lab (research 2026), Mens vision and multimodal inputs (research 2026), Mens Qwen family migration (research 2026).
## Who / when

- Implementers: `vox-populi` (`mens::tensor`, `mens::hub`), `vox-cli` (`commands/schola/train/*`, `commands/mens/populi/*`, `commands/mens/pipeline.rs`), `vox-schola` (`src/train.rs`), corpus preflight + mix (`vox-corpus::training`, `vox-corpus::training::mix_prepare`).
- When to touch: training knobs, telemetry keys, CLI flags, qlora-rs / Candle versions, merge/export behavior, or corpus/mix/train-input resolution.
## Where (files)

- `crates/vox-populi/src/mens/tensor/train_backend.rs` — CLI/backend enum (`PopuliTrainBackend`) + execution kernel
- `crates/vox-populi/src/mens/tensor/finetune_contract.rs` — `FineTuneContract`, provenance, digest
- `crates/vox-populi/src/mens/tensor/execution_planner.rs` — planner + hard gates
- `crates/vox-populi/src/mens/tensor/preflight_train.rs` — shared preflight entry
- `crates/vox-populi/src/mens/tensor/hf_keymap.rs` — shared HF weight key maps
- `crates/vox-populi/src/mens/tensor/training_text.rs` — prompt / ChatML text policy
- `crates/vox-populi/src/mens/tensor/telemetry_schema.rs` — stable telemetry keys
- `crates/vox-populi/src/mens/tensor/adapter_schema_v3.rs` — adapter manifest v3 + merge bridge
- `crates/vox-populi/src/mens/tensor/training_config.rs` — `LoraTrainingConfig`
- `crates/vox-populi/src/mens/tensor/backend.rs` — `TrainingBackend` trait
- `crates/vox-populi/src/mens/tensor/backend_candle_qlora.rs` — Candle qlora-rs entry
- `crates/vox-populi/src/mens/tensor/candle_qlora_train/*` — trainer graph, loop, checkpoints
- `crates/vox-populi/src/mens/tensor/train_log.rs` — `[mens-train]` stderr + fallback notes
- `crates/vox-populi/src/mens/tensor/qlora_preflight.rs` — HF safetensors + tokenizer checks
- `crates/vox-populi/src/mens/tensor/operator_messages.rs` — shared operator error strings
- `crates/vox-populi/src/mens/tensor/lora_train.rs` — `run_mens_training`
- `crates/vox-cli/src/commands/mens/mod.rs` — `--backend` CLI mapping; also `train-uv` retired (inline bail; use `vox mens train --backend qlora`)
- `crates/vox-cli/src/commands/schola/train.rs` — `run_train` → `run_mens_training`
- `crates/vox-schola/src/train.rs` — standalone `vox-schola train` QLoRA path
- `crates/vox-corpus/src/training/mix_prepare.rs` — Mens mix + primary-source sync + copy helpers (workspace-root SSOT)
- `crates/vox-populi/src/mens/hub.rs` — `download_model_blocking` (HF snapshot for training)
- `AGENTS.md` § 2.2.3, `docs/src/reference/cli.md` (Mens), `docs/src/expl-ml-pipeline.md` (train matrix)
- Plans: `.cursor/plans/native_qlora_ssot_dea968e4.plan.md`, `.cursor/plans/qlora_ssot_grounded_plan_cc5501f2.plan.md`
## Full-graph QLoRA design (Phase 2c)

Architecture gate (2026-03): ADR 007 records the qlora-rs API surface audit used by the native trainer. Keep this ADR in sync with any future trainer graph changes.

HF layout: `vox_mens::tensor::hf_load::HfTransformerLayout` parses `config.json` (`model_type`, `architectures`, `hidden_size`, `num_attention_heads`, `num_hidden_layers`, `vocab_size`) for Llama/Mistral/Qwen-style and GPT-2-shaped configs. `qlora_preflight` checks that `hidden_size` matches the embedding tensor width discovered in safetensors.
## How (contracts)

- Build: `cargo check -p vox-populi --features mens-train` (pulls qlora-rs + candle trainer path). Optional CUDA lane: `--features mens-train,mens-candle-qlora-cuda`.

> [!IMPORTANT]
> Windows MSVC/NVCC constraint: building the CUDA `candle-kernels` fails outright when run through a nested subshell (e.g. `cmd.exe /c "vcvars64.bat && cargo build"`). The inner `bindgen_cuda` executable loses the nested environment state, producing an immediate `'cl.exe' is not recognized` failure. Open the VS Developer Command Prompt interactively, or run `vcvars64.bat` in your persistent PowerShell window, before issuing cargo commands for CUDA.

- Workspace deps: the root `[workspace.dependencies]` `qlora-rs` pin must stay aligned with `vox-populi` optional deps. Keep notes in `VOX_PATCH.md` synchronized with whichever qlora-rs patches are active for trainer stability.
- Input: `train.jsonl` (and `mens/config/training_contract.yaml` / preflight overrides).
- Telemetry: `train_start` includes `train_backend: "burn_lora"` or `"candle_qlora"`. Candle QLoRA `train_start` also records `epochs`, `planned_steps_per_epoch`, `planned_steps_total` (upper bound if no vocab/hidden skips). Progress logs (~5s): `ETA_smoothed≈…` from an interval throughput EMA (after step 24), plus step/s and % of planned — no duplicate `step 20/40/…` log lines (those are `telemetry.jsonl` only). `step` rows add `steps_per_sec_ema`, `eta_seconds_remaining` (EMA-based), `progress_fraction`. `train_complete`: `wall_seconds`, `mean_steps_per_sec`. See `telemetry_schema` keys. VoxDB persistence uses `VoxDb::connect_default` with `DbConfig::resolve_canonical`; a legacy primary yields `LegacySchemaChain` until migration — see how-to-voxdb-canonical-store.
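The EMA-smoothed throughput and ETA keys above can be sketched as follows. The smoothing constant and function name are assumptions for the example, not the trainer's actual values:

```python
# Illustrative EMA throughput / ETA computation matching the telemetry keys
# steps_per_sec_ema, eta_seconds_remaining, progress_fraction.
# alpha=0.3 is an assumed smoothing constant, not the trainer's.

def ema_eta(step_times: list, planned_steps_total: int, alpha: float = 0.3):
    sps = None                                 # steps-per-second EMA
    for dt in step_times:
        inst = 1.0 / dt                        # instantaneous throughput
        sps = inst if sps is None else alpha * inst + (1 - alpha) * sps
    done = len(step_times)
    return {
        "steps_per_sec_ema": sps,
        "eta_seconds_remaining": (planned_steps_total - done) / sps,
        "progress_fraction": done / planned_steps_total,
    }

t = ema_eta([2.0, 2.0, 2.0], planned_steps_total=10)
assert abs(t["steps_per_sec_ema"] - 0.5) < 1e-9       # 1 step / 2 s
assert abs(t["eta_seconds_remaining"] - 14.0) < 1e-9  # 7 steps left / 0.5 sps
assert abs(t["progress_fraction"] - 0.3) < 1e-9
```

An EMA is used so a single slow step (checkpoint write, page fault) does not whipsaw the displayed ETA the way a raw last-interval rate would.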
## Training objective mismatch (Burn vs Candle)

- Burn (`--backend lora`): full-graph f32 causal LM on wgpu (or NdArray in tests). Objective = standard next-token CE over the whole decoder graph you enabled.
- Candle (`--backend qlora`): NF4 frozen bases via qlora-rs with a full-forward training graph over loaded decoder blocks; loss is masked next-token CE on supervised suffix positions (`--qlora-ce-last-k`).
- Operator impact: do not expect loss / perplexity curves to match Burn. Use `training_manifest.json` `candle_qlora_graph_id`, `candle_qlora_ce_last_k`, `training_objective_note`, telemetry, and tiered parity tests (`candle_burn_*`) for shared f32 primitives only — not end-to-end NF4-vs-Burn LM identity.
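The `--qlora-ce-last-k` masking rule described above reduces to a suffix selection over the supervised positions. A sketch with an assumed helper name; the real mask lives inside the Candle trainer loop:

```python
# Illustrative --qlora-ce-last-k selection: K=0 keeps every supervised
# assistant position, K>0 keeps only the last K from the trimmed sequence.

def ce_positions(supervised_positions: list, last_k: int) -> list:
    if last_k <= 0:
        return supervised_positions           # K=0: all supervised positions
    return supervised_positions[-last_k:]     # K>0: suffix only

sup = [10, 11, 12, 13, 14, 15]
assert ce_positions(sup, 0) == sup
assert ce_positions(sup, 4) == [12, 13, 14, 15]
assert ce_positions(sup, 100) == sup          # K larger than the span keeps all
```

This is why Candle loss values are not comparable to Burn's: Burn averages CE over the whole graph's next-token positions, while Candle averages over (at most) this suffix.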
## Burn LoRA vs Candle QLoRA — which path, when (4080 Super and beyond)

### Burn R&D charter (bounded)

Burn remains an explicit R&D lane, not production train dispatch. Keep experiments bounded and comparable:
- strict code-only adapter behavior experiment,
- tokenizer/format sensitivity experiment,
- merge-and-serve operational comparison.
All Burn experiments must emit the same mens-scorecard summary/event artifacts with explicit backend tag `burn` so decisions stay evidence-based across lanes.
### Is QLoRA “better” than Burn LoRA?
Not universally. They solve different problems:
| Goal | Prefer |
|---|---|
| Train a real Hugging Face base (e.g. Qwen3.5-4B-Instruct) on 16G VRAM with industry-style NF4 + LoRA | Candle QLoRA (`--backend qlora`, `--tokenizer hf`, `--model …`, CUDA build) |
| Full in-tree f32 causal LM on `VoxTokenizer` JSONL (docs/examples → pairs), merge → `vox mens serve` without an external runtime | Burn LoRA (`--backend lora`, legacy path) |
| Apples-to-apples loss with “full decoder” next-token CE on the same architecture | Burn is still the easiest controlled parity lane for the in-tree small model; Candle QLoRA is optimized for real HF checkpoints |
So: QLoRA is “better” for large-model, VRAM-efficient fine-tuning on shipped HF weights. Burn LoRA is “better” for the closed Vox corpus loop and first-class serve/merge in this repo. You may run both in a serious program: Burn for syntax/docs/tooling-shaped adapters on the native head; QLoRA for Qwen-class behavior on HF bases.
### Should a 4080 Super workstation use Candle CUDA QLoRA?

Yes, when the target is a real Qwen (or similar) checkpoint and you have built `vox-cli` with `gpu,mens-candle-cuda`. That is the documented 16G-class path (preset `qwen_4080_16g` / `--preset 4080`). Your Vulkan/wgpu logs still mean Burn is correctly using the GPU; that is not a substitute for Candle CUDA — different stacks.
## Strengths and weaknesses (persistent reference)

### Burn + wgpu LoRA (`PopuliTrainBackend::BurnLora`)
| Strengths | Weaknesses |
|---|---|
| End-to-end Vox story: corpus JSONL → train → `merge-weights` → `vox mens serve` (HTTP) on `*.bin` / `model_merged.bin`. | Does not load arbitrary multi-billion-parameter HF transformers in f32 on a 16G card; use QLoRA for that. |
| Full-graph f32 objective on the in-repo `LoraVoxTransformer` (honest CE over the graph you compiled). | `LoraAttention::merge` path requires `use_rope == false` (GPT-2-style); RoPE stacks stay unmerged or need native LoRA at serve time (see top-of-file gaps). |
| Cross-platform GPU via wgpu (Vulkan / DX12 / Metal); no NVIDIA CUDA toolchain required. | Different model than production Qwen: eval numbers vs HF chat models are not directly comparable. |
| Fewer external artifacts: no mandatory `tokenizer.json` + safetensors for the default `--tokenizer vox` path. | Optional `--tokenizer hf` is GPT-2-shaped configs + embed warm-start — still not arbitrary Llama/Qwen full-weight training in Burn. |
### Candle + qlora-rs QLoRA (`PopuliTrainBackend::CandleQlora`)
| Strengths | Weaknesses |
|---|---|
| NF4 base + trainable LoRA on real HF shards; VRAM-efficient vs full fine-tune; matches operator expectations for “train Qwen locally”. | Native `qwen3_5` hybrid path is now enforced in Candle; keep `eval-local` quality checks in your promotion gate for each model tier. |
| NVIDIA CUDA (and Metal) first-class when built with `mens-candle-cuda` / `mens-candle-metal`. | `vox-schola serve` loads the training run dir (adapter + tokenizer), not standalone `merge-qlora` merged shards; use vLLM / Ollama.app / HF for those f32 subset exports. |
| Strong preflight (`qlora_preflight`) catches tokenizer / embedding width / shard key issues before long runs. | `--qlora-require-full-proxy-stack` is intentionally strict and can hard-fail when shard coverage is incomplete. |
| Preset family (`qwen_4080_16g`, `4080`, etc.) tuned for 16G cards. | Patch + contract coupling: in-tree qlora-rs patch for stable deep stacks; upgrade pins need care (`VOX_PATCH.md`). |
## Last-minute flight check (before a “real” training push)

Use this as an ordered gate; skip steps that do not apply to your target backend.

- Compile: `cargo check -p vox-cli --features gpu` (Burn + CPU QLoRA baseline). For CUDA QLoRA on 4080: `cargo check -p vox-cli --features gpu,mens-candle-cuda` (release build: ensure `vox.exe` is not locked by another process on Windows).
- CLI/registry drift: `vox ci command-compliance` (or `cargo run -p vox-cli --features gpu -- ci command-compliance`).
- Training acceptance profile: `cargo run -p vox-cli -- ci mesh-gate --profile training` (alias: `mens-gate`; see mens-finetune-acceptance-runbook.md).
- Language/tooling confidence (orthogonal to trainer): `cargo check --workspace`, `cargo test` for areas you touched; MCP `vox-mcp` and orchestrator paths assume a healthy `vox` binary and repo root — see AGENTS.md § orchestration / capability registry.
- Data: canonical `train.jsonl` under `--data-dir` (often `target/dogfood` after corpus mix). Operator mix (`vox mens corpus mix --config mens/config/mix.yaml`) is strict by default: every non-optional `mens/config/mix.yaml` source must exist and emit at least one row. Use `--allow-missing-sources` for the old warn-only behavior (automation / first-time trees). A JSON report is written next to the mix output (`*.mix_report.json`, same stem as the mixed JSONL) with per-source weights, line counts, and output share. Optional: `VOX_TRAIN_SKIP_CORPUS_MIX=1` when the JSONL is already final.
- Choose artifact + inference: Burn → `merge-weights` → `vox mens serve` (`execution-api`); QLoRA → `vox-schola serve` / `vox mens serve --model <run_dir>` (local), or `merge-qlora` → external vLLM / Ollama / HF for merged shards.
- Long runs (detached): `--log-dir` always re-invokes the current binary with logs redirected and the parent exiting immediately. `--background` alone does the same using the default log directory (`<repo>/mens/runs/logs` when the workspace root is known, else `mens/runs/logs` relative to the process cwd). On Windows, spawns use `CREATE_BREAKAWAY_FROM_JOB` so IDE/agent job objects are less likely to tear down the trainer when the parent exits. `vox mens train` behaves the same (`--background` defaults logs to `mens/runs/logs`). Monitor with `Get-Content …\train_*.log -Wait -Tail 25` or `tail -f`. Gate wrappers: `scripts/populi/release_training_gate.ps1` (training profile), `scripts/mens_release_gate.ps1` (m1m4) — isolated `target` + temp `vox.exe` copy to avoid Windows file locks during nested `cargo`.
“Full model build” in practice means: (a) data corpus at quality gate, (b) trainer chosen and manifest recorded, (c) merge/export aligned with where inference will run (Vox HTTP vs external LLM), (d) eval (vox mens corpus eval / eval-local where applicable) before promoting artifacts.
## RTX 4080-class CUDA (16G) — canonical QLoRA (copy-paste)

- Preset: `qwen_4080_16g` (rank 16, seq 384, batch 1, grad_accum 8). CLI `--preset 4080` is an alias of the same profile (default `DEFAULT_PRESET` is `4080`).
- Compile check (CUDA Candle stack): `cargo check -p vox-cli --features gpu,mens-candle-cuda` (or `cargo vox-cuda-release`).
- Train (Qwen3.5-4B example): `vox mens train --backend qlora --tokenizer hf --preset qwen_4080_16g --model Qwen/Qwen3.5-4B --data-dir target/dogfood --output-dir mens/runs/qwen35_qlora --device cuda --qlora-require-full-proxy-stack`
- Qwen3.5 ladder guidance (text native phase):
  - `Qwen/Qwen3.5-0.8B`: use `--preset qwen_4080_16g` (or `--preset auto`), allow longer seq where VRAM permits.
  - `Qwen/Qwen3.5-2B`: same preset family; keep moderate sequence lengths for throughput.
  - `Qwen/Qwen3.5-4B`: canonical 4080 dogfood baseline in this repo.
  - `Qwen/Qwen3.5-9B`: use tighter sequence and higher grad accumulation on 16G; promote on 24G+ tiers.
  - Multimodal training/inference is an explicit next phase and is not included in current native text acceptance.
- `--device cuda` without `mens-candle-cuda` fails fast at CLI with rebuild instructions.
- Local-first safety knobs: `--require-gpu` fails if runtime resolves to CPU; `--allow-cpu-fallback=false` disables automatic fallback for `--device best`.
- CPU smoke: `VOX_CANDLE_DEVICE=cpu` forces Candle on CPU for debugging.
- IDE / Cursor timeouts (long builds + train + gates): hosted agent tools often cap wall time (~tens of seconds to a few minutes). Prefer detach + log instead of blocking a single tool invocation on `mesh-gate` (alias: `mens-gate`; training profile commonly 5–40+ minutes depending on cold compile and disk):
  - Mens gate: from repo root, `pwsh scripts/populi/release_training_gate.ps1 -Detach` or `pwsh scripts/populi/release_ci_full_gate.ps1 -Detach` — returns immediately; watch `target/mens-gate-logs/`. Same pattern as `mens_gate_safe.ps1`. For quick local signal without the full gate, run a single targeted test (examples in Regression tests below).
  - Train: `vox mens train … --background` or `vox mens train … --log-dir mens/runs/logs` — parent exits immediately; monitor with `Get-Content mens/runs/logs/train_*.log -Wait -Tail 25` (or `tail -f`).
  - CUDA `cargo` build: normal terminal or `Tee-Object`; detached build: `scripts/populi/cursor_background_cuda_build_detached.ps1` (and `scripts/mens/…` copies if present). Example train launcher: `scripts/populi/cursor_background_train_example.ps1`.
  - Skip corpus mix (optional): `VOX_TRAIN_SKIP_CORPUS_MIX=1` skips the pre-train `mix` refresh when you already have the desired `train.jsonl` or need a shorter path under automation.
- Benchmark telemetry (Codex): set `VOX_BENCHMARK_TELEMETRY=1` so select CLI paths append unified `benchmark_event` rows (`VoxDb::record_benchmark_event`, session `bench:<repository_id>`): `vox mens bench-completion`, `vox mens eval-local` only when `vox-cli` is built with feature `gpu` (CPU-only eval skips telemetry rows), `vox ci build-timings`, optional train gate (`VOX_BENCHMARK` eval-local subprocess), and the ignored `run_benchmark` integration test warm pass. Set `VOX_REPOSITORY_ROOT` so subprocess `repository_id` matches MCP when CWD differs. Query via MCP `vox_benchmark_list` when Codex is attached. Syntax-K runs can be routed independently with `VOX_SYNTAX_K_TELEMETRY=1` (`metric_type = syntax_k_event`, session `syntaxk:<repository_id>`), with fallback to `VOX_BENCHMARK_TELEMETRY` when unset. Variable SSOT: env-vars; trust framing: telemetry-trust-ssot.
- JSONL rows: `vox_tensor::data::TrainingPair` accepts `instruction` as alias for `prompt` and `output` for `response` so corpus rows are not silently dropped. See mens-training-data-contract.md; set `VOX_MENS_TRAIN_JSONL_STRICT=1` to fail on malformed non-empty lines instead of skipping them.
- Full-graph forward (current implementation): one forward pass per row/micro-batch item over loaded decoder layers, then masked CE on supervised suffix positions.
- Suffix CE (`--qlora-ce-last-k K`): default `64`. `K=0` uses all supervised assistant positions; `K>0` uses only the last `K` supervised positions from the trimmed sequence.
- Depth ablation (CLI + digest): `--qlora-proxy-max-layers N` and `--qlora-lm-head-only` still feed contract digest / planner / preflight (`candle_qlora_proxy_stack_complete`, graph id). Candle training rejects LM-head-only, `proxy_max_layers=0`, and any cap below model depth; run without those flags (or set the cap ≥ `num_hidden_layers`) so the trainer runs the full proxy graph and the manifest matches execution.
- Debug: `VOX_QLORA_DEBUG_NORMS=1` prints mean-|activation| after each middle block (stderr; local ablation only).
- Deferred flags: `--qlora-lm-head-only` and partial-depth `--qlora-proxy-max-layers` are intentionally not implemented in the current full-graph trainer; keep them for contract/rollout compatibility only.
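The JSONL alias acceptance and strict-mode behavior above can be sketched as a small parser. This is a stand-in for the `TrainingPair` deserialization, not the Rust code; the function name and return shapes are assumptions:

```python
# Illustrative TrainingPair row parsing: instruction->prompt and
# output->response aliases; strict mode (VOX_MENS_TRAIN_JSONL_STRICT=1
# analog) raises on malformed rows instead of skipping them.
import json

def parse_training_pair(line: str, strict: bool = False):
    try:
        row = json.loads(line)
        prompt = row.get("prompt", row.get("instruction"))
        response = row.get("response", row.get("output"))
        if prompt is None or response is None:
            raise ValueError("missing prompt/response")
        return {"prompt": prompt, "response": response}
    except (ValueError, json.JSONDecodeError):
        if strict:
            raise                        # surface data issues early
        return None                      # default: skip the malformed row

assert parse_training_pair('{"instruction": "q", "output": "a"}') == \
    {"prompt": "q", "response": "a"}
assert parse_training_pair("not json") is None
```

The alias rule is what keeps corpus rows written in instruction/output shape from being silently dropped by a prompt/response-only reader.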
## Pre-push release gate (acceptance matrix)

- Canonical (cross-platform): `cargo run -p vox-cli -- ci mesh-gate --profile training` (add `--profile ci_full` for the wider matrix; alias: `mens-gate`). Steps live in `scripts/populi/gates.yaml` (legacy fallback `scripts/mens/gates.yaml`). Nested `cargo` steps use OS temp `…/vox-targets/<repo-hash>/nested-ci` as `CARGO_TARGET_DIR` (not under repo `target/`).
- Thin shims: `pwsh scripts/populi/release_training_gate.ps1`, `pwsh scripts/populi/release_ci_full_gate.ps1`, `pwsh scripts/mens_release_gate.ps1` (m1m4) — all forward to `scripts/populi/mens_gate_safe.ps1`. Cursor / agent wall-clock limits: run `pwsh scripts/populi/release_training_gate.ps1 -Detach` (or `release_ci_full_gate.ps1 -Detach`) so a new PowerShell process owns the multi-minute nested `cargo test` work; tail `target/mens-gate-logs/mens_gate_*.log`. Optional `-LogFile C:\path\to\gate.log` pins the tee path. Bash peers remain where present — mirrors mens-finetune-acceptance-runbook.md rows 1–10 (planner, keymap, strict preflight, Burn smoke, parity tests, merge, `merge_v2`).
## Regression tests

- Execution planner + hard gates: `cargo test -p vox-populi execution_planner`
- QLoRA strict proxy stack (missing middle keys): `cargo test -p vox-populi --features mens-train preflight_strict_rejects_missing_o_proj`
- Fine-tune digest (`qlora_proxy_max_layers`): `cargo test -p vox-populi --features mens-train finetune_contract_digest_changes_with_proxy_max_layers`
- Fine-tune digest (`qlora_ce_last_k`): `cargo test -p vox-populi --features mens-train finetune_contract_digest_changes_with_ce_last_k`
- Candle qlora trainer unit tests: `cargo test -p vox-populi --features mens-train`
- Burn LoRA checkpoint parity tests: use `vox-tensor` crate unit tests where applicable.
- Legacy Burn merge parity tests: kept for historical compatibility only.
- Burn linear LR warmup (Burn `LinearLrScheduler`): `cargo test -p vox-tensor --features gpu --lib linear_warmup_sequence_matches`
- Candle vs Burn f32 parity touchpoints: `cargo test -p vox-populi --features mens-train --test <parity_test_name>`
- Tier B NF4 dequant reference parity: `cargo test -p vox-populi --features mens-train --test candle_burn_nf4_dequant_lm_reference_parity`
- Candle vs Burn cross-entropy parity: `cargo test -p vox-populi --features mens-train --test candle_burn_cross_entropy_parity`
- `merge-qlora` rejects Burn `*.bin`: `cargo test -p vox-cli merge_qlora_rejects_burn_bin_adapter`
- `merge-weights` rejects `candle_qlora_adapter.safetensors` (Burn path only) and points to `merge-qlora`: `cargo test -p vox-cli merge_weights_rejects_candle_qlora_adapter_file`
- `merge-qlora` CLI synthetic roundtrip: `cargo test -p vox-cli merge_qlora_cli_roundtrip_lm_head_subset`
- Adapter v2 merge math: `cargo test -p vox-populi --features mens-train merge_v2_applies_lm_head_delta`
## Evaluation protocol (trajectory and cost)

Use a small, repeatable local harness before promoting new training knobs:

- Build a mixed eval set with:
  - baseline code-completion prompts,
  - tool/terminal trajectory prompts,
  - explicit success and failure recovery prompts.
- Run two adjacent configurations:
  - control (`trajectory_weighting_enabled=false`),
  - candidate (trajectory weighting and/or provenance metadata enabled).
- Compare:
  - trajectory pass rate,
  - failure-recovery success rate,
  - mean tokens and wall-clock per successful solve (cost-per-success proxy).
Promotion criteria should require non-regressing baseline quality while improving trajectory metrics.
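The cost-per-success proxy above can be sketched directly. Field names (`success`, `tokens`, `wall_s`) are illustrative, not a schema defined by this repo:

```python
# Illustrative cost-per-success metrics: failures contribute to pass rate
# but are excluded from the per-success cost means.

def cost_per_success(runs: list) -> dict:
    wins = [r for r in runs if r["success"]]
    n = len(wins)
    return {
        "pass_rate": n / len(runs),
        "mean_tokens_per_success": sum(r["tokens"] for r in wins) / n,
        "mean_wall_seconds_per_success": sum(r["wall_s"] for r in wins) / n,
    }

runs = [
    {"success": True, "tokens": 100, "wall_s": 4.0},
    {"success": True, "tokens": 300, "wall_s": 6.0},
    {"success": False, "tokens": 900, "wall_s": 30.0},  # excluded from cost means
]
m = cost_per_success(runs)
assert abs(m["pass_rate"] - 2 / 3) < 1e-9
assert m["mean_tokens_per_success"] == 200.0
assert m["mean_wall_seconds_per_success"] == 5.0
```

Comparing control and candidate on these three numbers makes "improved trajectory metrics at non-regressing cost" a concrete check rather than a judgment call.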
## Rollout gates and env toggles

- `VOX_QWEN35_NATIVE_CUTOVER`:
  - `shadow`: allow qwen2 with warning, qwen3_5 preferred.
  - `default` (default): qwen3_5 preferred; qwen2 requires `VOX_ALLOW_QWEN2_NATIVE=1`.
  - `enforced`: reject qwen2 native training.
- `VOX_ORCHESTRATOR_MESH_TRAINING_ROUTING_EXPERIMENTAL`: enables training-task-specific route scoring (still local execution only).
- `VOX_ORCHESTRATOR_MESH_TRAINING_BUDGET_PRESSURE`: soft scalar (0.0–1.0) that penalizes expensive training placements under budget pressure.
- `VOX_ORCHESTRATOR_MESH_ROUTING_EXPERIMENTAL`: existing federation visibility signal; combine with the training routing toggle for staged rollout.

Recommended rollout order: shadow (`routing_experimental`), then training scoring (`training_routing_experimental`), then budget pressure tuning.
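A minimal sketch of resolving these toggles from an environment map; clamping the budget-pressure scalar to [0.0, 1.0] is an assumption about sane parsing, and the function is illustrative rather than the orchestrator's code:

```python
# Illustrative env-toggle resolution for the mesh training rollout knobs.

def mesh_training_flags(env: dict) -> dict:
    pressure = float(env.get("VOX_ORCHESTRATOR_MESH_TRAINING_BUDGET_PRESSURE", "0"))
    return {
        "routing": env.get("VOX_ORCHESTRATOR_MESH_ROUTING_EXPERIMENTAL") == "1",
        "training_routing": env.get(
            "VOX_ORCHESTRATOR_MESH_TRAINING_ROUTING_EXPERIMENTAL") == "1",
        # clamp the soft scalar into its documented 0.0-1.0 range (assumption)
        "budget_pressure": min(1.0, max(0.0, pressure)),
    }

f = mesh_training_flags({"VOX_ORCHESTRATOR_MESH_ROUTING_EXPERIMENTAL": "1",
                         "VOX_ORCHESTRATOR_MESH_TRAINING_BUDGET_PRESSURE": "1.7"})
assert f["routing"] and not f["training_routing"]
assert f["budget_pressure"] == 1.0
```

With no variables set, everything resolves to off / 0.0, which matches the conservative-default posture of the section.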
## Acceptance criteria and rollout protocol

- A/B baseline: run control (`trajectory_weighting_enabled=false`) and candidate with the same data + seed envelope.
- 4080-first gate: a local RTX 4080-class run must remain non-regressed before enabling any distributed/cloud knobs.
- Staged toggles: enable `VOX_ORCHESTRATOR_MESH_ROUTING_EXPERIMENTAL` first, then `VOX_ORCHESTRATOR_MESH_TRAINING_ROUTING_EXPERIMENTAL`, then set `VOX_ORCHESTRATOR_MESH_TRAINING_BUDGET_PRESSURE`.
- Promotion gate: require non-regressing baseline quality plus improved trajectory/failure-recovery metrics.
- Cost guardrail: compare mean wall-seconds and tokens per successful trajectory solve (cost-per-success proxy) against baseline.
## Merge / export / inference

| Command / artifact | Status |
|---|---|
| `vox mens merge-weights` | Merges Burn LoRA checkpoints (`*.bin` from `--backend lora`) into `model_merged.bin`. Requires `gpu`. |
| `candle_qlora_adapter.safetensors` | LoRA A/B per logical layer (`mid0…lm_head`); sidecar `candle_qlora_adapter_meta.json` format `vox_mens_qlora_lora_only_v2` (`QloraAdapterMetaV2`). |
| `vox schola merge-qlora` (alias `merge-adapter`) | Candle QLoRA path only: merges v2 or v3 adapter meta + LoRA tensors into f32 base shards for keys in `base_key_map` (subset output safetensors). Distinct from `merge-weights` and from Burn `*.bin` checkpoints. There is no supported conversion from Burn `*.bin` LoRA checkpoints into Candle adapter safetensors for this command — use `merge-weights` for Burn → `model_merged.bin`. |
| `vox mens serve` (cloud=local) | Spawns `vox-schola serve`: QLoRA run directory (adapter + tokenizer). |
| `vox mens serve` (Burn, `execution-api`) | Loads Burn checkpoints: LoRA `*.bin` or merged `model_merged.bin` from `merge-weights`. Does not apply to Candle `merge-qlora` output safetensors. |
| `populi_adapter_manifest_v3.json` | Unified adapter manifest (method + quant + layer order + `base_key_map`); written beside v2 meta on Candle runs. |
| Full causal NF4 + PEFT parity | Open work — deeper block coverage beyond `o_proj` proxy stack. |
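The merge arithmetic behind these commands follows the standard LoRA formulation, W' = W + (alpha / r) * (B @ A). A dependency-free sketch on nested lists; the real merge operates on dequantized f32 base shards, and the function name here is illustrative:

```python
# Illustrative LoRA merge: W' = W + (alpha / r) * (B @ A), on plain nested
# lists so the example stays dependency-free.

def merge_lora(W, A, B, alpha: float, r: int):
    rows, cols = len(W), len(W[0])
    out = [row[:] for row in W]               # copy the base weight
    for i in range(rows):
        for j in range(cols):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            out[i][j] += (alpha / r) * delta  # scaled low-rank update
    return out

W = [[1.0, 0.0], [0.0, 1.0]]   # out x in base weight
A = [[1.0, 2.0]]               # r x in   (r = 1)
B = [[0.5], [0.0]]             # out x r
merged = merge_lora(W, A, B, alpha=2.0, r=1)
assert merged == [[2.0, 2.0], [0.0, 1.0]]
```

This is also why a merged artifact stops needing the adapter at serve time: the low-rank delta has been folded into the dense weight.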
## Troubleshooting (Candle QLoRA)

- Non-finite loss at the first micro-step: the trainer runs a masked CE numeric preflight after checkpoint resume (warm-started LoRA weights included) and before the epoch loop. If this fails, fix the reported cause (vocab vs tokenizer, logits NaNs, CUDA numerics) instead of only lowering the learning rate.
- Token ids ≥ `vocab_size`: HF tokenizers can emit ids outside the base model’s embedding table after added-token / checkpoint skew or bad JSONL. The loop skips such rows (counter + one warning with `max_id`/`vocab_size`/`pair_real_idx`). Preflight errors if the first eligible encoded batch is out of range.
- Stricter JSONL validation: set `VOX_MENS_TRAIN_JSONL_STRICT=1` to surface data issues earlier in the pipeline where supported.
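The row-skip guard for out-of-range token ids can be sketched as follows. The warning fields mirror the documented log keys, but the function itself is illustrative:

```python
# Illustrative guard: skip any batch whose max token id falls outside the
# embedding table, counting skips and warning once with debug context.

def filter_batches(batches: list, vocab_size: int):
    kept, skipped, warning = [], 0, None
    for idx, ids in enumerate(batches):
        mx = max(ids)
        if mx >= vocab_size:
            skipped += 1
            if warning is None:       # warn once, with context for debugging
                warning = {"max_id": mx, "vocab_size": vocab_size,
                           "pair_real_idx": idx}
            continue
        kept.append(ids)
    return kept, skipped, warning

kept, skipped, warn = filter_batches([[1, 5], [7, 99]], vocab_size=50)
assert kept == [[1, 5]] and skipped == 1
assert warn == {"max_id": 99, "vocab_size": 50, "pair_real_idx": 1}
```

Skipping (rather than clamping) keeps a skewed tokenizer from silently training the wrong embeddings; the counter makes the data problem visible.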
## Related

- LLM / agent PR hygiene: mens-llm-pr-checklist.md — LoRA duplication, layouts, merge, CI test names, parity tiers.
- LoRA ownership boundary: mens-lora-ownership.md
- Speech / ASR (Oratio): oratio-speech.md — orthogonal to training; use top-level `vox oratio` / `vox speech`. CLI STT commands need `vox-cli` feature `oratio` (not default `mens-base`).