
ADR 003 — Native Rust Training Over Python

Status: Accepted; amended 2026-04-06
Date: 2026-03-02 (original decision)
Author: Bert Brainerd

Current product path: Large-model QLoRA fine-tuning runs entirely in Rust: Candle, qlora-rs, and vox mens train (--backend qlora, --tokenizer hf by default). The Python / Unsloth pipeline described below is historical context only, not an operator requirement.


Historical context (why we left Python)

The original Mens training pipeline used mens/training/train.py (Python, Unsloth, QLoRA). That caused:

  1. Environment friction: Python version conflicts, uv/pip pinning, CUDA version mismatches
  2. Slow iteration: Python-based tokenizer was ~10× slower than native Rust for our dogfood path
  3. Philosophical mismatch: Vox could not dogfood training if the loop lived in another language
  4. CI complexity: Separate Python setup and heavy deps on every CI run

Original decision (March 2026): Move the bulk of the pipeline to native Rust (Burn 0.19 for scratch LoRA / experimentation), while initially assuming Python might remain for some large-model QLoRA work.

Amendment: Native Candle + qlora-rs now covers HF-weight QLoRA in-tree. See ADR 006 — Mens full-graph Candle QLoRA with qlora-rs, ADR 007 — qlora-rs multi-layer training API, and the SSOT Mens native training.


Current architecture (summary)

| Concern | Historical (pre–native QLoRA) | Current |
| --- | --- | --- |
| Tokenizer (dogfood / VoxTokenizer JSONL) | Python | Rust (VoxTokenizer in vox-tensor) |
| Data loading (JSONL) | Python loop | Rust JsonlDataLoader |
| Synthetic / CLI data generation | scripts/datagen.py | vox generate-data (Rust) |
| Scratch / Burn LoRA (small model, wgpu) | Python training loop | vox training native / Burn paths in vox-tensor (legacy vs vox mens train dispatch — see SSOT) |
| HF QLoRA (large models) | Python (Unsloth) | Rust: vox mens train → CandleQlora + qlora-rs; weights via Rust hf-hub |
| Corpus extraction | Python | vox mens corpus extract (Rust) |
| Training validation | Python | vox mens corpus eval (Rust via vox-eval) |
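To make the data-loading row concrete, here is a minimal, self-contained sketch of what a JSONL loader in the spirit of JsonlDataLoader does. The TrainingPair field names (prompt, completion) and the naive field scan are assumptions for illustration; this is not the real vox-tensor implementation, which is more robust.

```rust
// Minimal stand-in for the JSONL loading step. Field names and the naive
// string scan are illustrative assumptions, not the vox-tensor code.
#[derive(Debug)]
struct TrainingPair {
    prompt: String,
    completion: String,
}

/// Extract a double-quoted string value for `key` from one flat JSON object
/// line. Assumes no escaped quotes inside values (illustration only).
fn field(line: &str, key: &str) -> Option<String> {
    let needle = format!("\"{}\":", key);
    let start = line.find(&needle)? + needle.len();
    let rest = line[start..].trim_start().strip_prefix('"')?;
    let end = rest.find('"')?;
    Some(rest[..end].to_string())
}

/// One training pair per non-empty JSONL line; malformed lines are skipped.
fn load_jsonl(text: &str) -> Vec<TrainingPair> {
    text.lines()
        .filter(|l| !l.trim().is_empty())
        .filter_map(|l| {
            Some(TrainingPair {
                prompt: field(l, "prompt")?,
                completion: field(l, "completion")?,
            })
        })
        .collect()
}
```

A production loader would use a streaming JSON parser rather than this field scan; the point here is only the JSONL-to-TrainingPair shape of the pipeline.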

Dispatch note: vox mens train is the canonical operator CLI. PopuliTrainBackend::BurnLora is rejected at runtime; the supported in-dispatch trainer for Mens fine-tuning is CandleQlora. Burn remains relevant for legacy checkpoints, vox mens merge-weights, and vox mens serve on merged .bin — not as the primary QLoRA path. Details: mens-training.md.
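The dispatch rule above can be sketched as a match over the backend enum. The variant names PopuliTrainBackend::BurnLora and CandleQlora come from this ADR, but the function signature and messages below are assumptions, not the vox-populi source:

```rust
// Hedged sketch of the runtime dispatch described above; signature and
// messages are assumptions, only the variant names come from this ADR.
#[derive(Debug)]
enum PopuliTrainBackend {
    CandleQlora,
    BurnLora,
}

fn dispatch(backend: PopuliTrainBackend) -> Result<&'static str, String> {
    match backend {
        // CandleQlora is the supported in-dispatch trainer for Mens fine-tuning.
        PopuliTrainBackend::CandleQlora => Ok("running Candle QLoRA trainer"),
        // BurnLora is rejected at runtime; Burn stays on the legacy
        // checkpoint / merge-weights / serve lanes.
        PopuliTrainBackend::BurnLora => Err(
            "BurnLora is not dispatched by `vox mens train`; \
             use the legacy Burn paths for old checkpoints"
                .to_string(),
        ),
    }
}
```

Rejecting the variant at runtime (rather than removing it) keeps old configs parseable while steering operators to the supported lane.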


Implementation pointers

  • Candle QLoRA / contract / preflight: crates/vox-populi/src/mens/tensor/ (run_mens_training, lora_train.rs, finetune_contract.rs, preflight_train.rs)
  • Tokenizer + JSONL loader: crates/vox-tensor/src/data.rs
  • Burn model / optim (feature-gated): crates/vox-tensor/src/vox_nn.rs, optim.rs, train.rs
  • CLI: crates/vox-cli (vox mens train, corpus and eval subcommands); training/native.rs, training/datagen.rs where applicable
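As a rough illustration of the contract/preflight idea behind finetune_contract.rs and preflight_train.rs, the check below rejects malformed training sets before any expensive work starts. The struct and function names are assumptions modeled on this ADR, not the actual code:

```rust
// Hypothetical sketch of schema enforcement before training; names are
// assumptions modeled on finetune_contract.rs / preflight_train.rs.
struct TrainingPair {
    prompt: String,
    completion: String,
}

/// Fail fast on obviously malformed training sets, returning the number of
/// validated pairs on success.
fn preflight_train(pairs: &[TrainingPair]) -> Result<usize, String> {
    if pairs.is_empty() {
        return Err("preflight: training set is empty".into());
    }
    for (i, p) in pairs.iter().enumerate() {
        if p.prompt.trim().is_empty() || p.completion.trim().is_empty() {
            return Err(format!("preflight: pair {i} has an empty field"));
        }
    }
    Ok(pairs.len())
}
```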

Consequences

Positive

  • No Python required for HF QLoRA fine-tuning in the default product path.
  • Native tokenizer remains fast for VoxTokenizer-shaped JSONL.
  • Single vox binary for data gen, corpus, eval, and Mens train.
  • Stronger Windows story than a Python+CUDA training stack.
  • Training data schema enforced in Rust (TrainingPair, contracts, preflight).

Negative / limits (see SSOT, not “use Python”)

  • Execution kernel gaps: Full causal NF4 blocks and other limits are documented in candle-full-graph-feasibility.md and mens-training.md.
  • Serving: Merged QLoRA artifacts are aimed at external runtimes (vLLM, Ollama, HF, OpenAI-compatible); vox mens serve today targets the Burn merged-weights lane.
  • Burn ecosystem (where still used): fewer optimizers than PyTorch; cold wgpu builds can be heavy — mitigated by feature flags.
  • Optional legacy: Old Python scripts may still exist in trees or forks for one-off experiments; they are not the documented or dispatched path for Mens QLoRA.

References