# ADR 006: Mens full-graph Candle QLoRA with qlora-rs
## Status

Accepted (2026-03-21)
## Context

Mens ships a native `--backend qlora` path using qlora-rs 1.0.5 and Candle: a frozen, mmap'd f32 embedding table (`wte` / `model.embed_tokens.weight`) provides context, and one or more NF4 `QuantizedLinear` modules are trained via `QLoraTrainer::training_step_lm` (a sequential stack when the HF shards include every expected block output projection; otherwise the LM head only).

Product goals (Phase 2c) require deeper use of the base weights: per-layer attention output projections (and eventually broader coverage), multi-tensor adapter export, an optional merge into base-shaped f32 shards, and clarity on double quantization.
## Decision

- **Training API (Approach A — in-tree, public qlora-rs only)**
  qlora-rs `training_step_lm` accepts `layers: &[&QuantizedLinear]` and applies them sequentially (`for layer in layers { logits = layer.forward(&logits)? }`). The optimizer is initialized from the trainer's single `VarMap`, so multiple `QuantizedLinear` layers created with distinct `VarBuilder` prefixes are supported without forking qlora-rs.
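  The sequential application can be sketched as below. This is a toy f32 stand-in, not the real qlora-rs `QuantizedLinear` (which stores NF4 blocks and operates on Candle tensors); only the control flow of folding the activations through `layers` mirrors the trainer.

  ```rust
  // Stand-in for qlora-rs's QuantizedLinear: a plain row-major f32 matrix.
  struct QuantizedLinear {
      weight: Vec<f32>, // [rows, cols]; the real type stores NF4 blocks
      rows: usize,
      cols: usize,
  }

  impl QuantizedLinear {
      fn forward(&self, x: &[f32]) -> Result<Vec<f32>, String> {
          if x.len() != self.cols {
              return Err(format!("expected input len {}, got {}", self.cols, x.len()));
          }
          Ok((0..self.rows)
              .map(|r| (0..self.cols).map(|c| self.weight[r * self.cols + c] * x[c]).sum())
              .collect())
      }
  }

  // Mirrors the ADR's loop: `for layer in layers { logits = layer.forward(&logits)? }`.
  fn apply_stack(layers: &[&QuantizedLinear], input: &[f32]) -> Result<Vec<f32>, String> {
      let mut logits = input.to_vec();
      for layer in layers {
          logits = layer.forward(&logits)?;
      }
      Ok(logits)
  }

  fn main() {
      // Two 2x2 identity projections leave the input unchanged.
      let id = QuantizedLinear { weight: vec![1.0, 0.0, 0.0, 1.0], rows: 2, cols: 2 };
      let out = apply_stack(&[&id, &id], &[3.0, 4.0]).unwrap();
      println!("{:?}", out); // [3.0, 4.0]
  }
  ```

  Because every layer's LoRA parameters register into the one `VarMap`, a single optimizer step updates the whole stack; no per-layer optimizer plumbing is needed.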
- **Full-graph scope (incremental)**
  We expand the trainer by stacking optional middle blocks loaded from HF safetensors when present:
  - GPT-2: `h.{i}.attn.c_proj.weight` — shape `[d_model, d_model]`.
  - Qwen2 / LLaMA-style (`model_type` / `architectures` containing `Llama`, `Qwen`, `Mistral`, etc.): `model.layers.{i}.self_attn.o_proj.weight` — shape `[d_model, d_model]`.

  If no per-layer weights are found, behavior falls back to the LM-only path (backward compatible). This is not a full causal transformer forward (no MHA/FFN blocks yet); it is the supported bounded proxy v1 (`candle_qlora_proxy_v1` in manifests / `training_objective_note`), including an optional suffix LM objective via `--qlora-ce-last-k` (see mens-training.md). Naming in telemetry: `trainable_projection_stack` / `candle_qlora_graph_id`.
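  The key-discovery logic above can be sketched as follows. The key patterns are the ones this ADR names; `present` stands in for the safetensors header key set that the real loader reads from the HF shards, and the all-or-nothing check reflects the rule that the stack is built only when every expected block output projection is available.

  ```rust
  // Per-layer weight discovery with the LM-only fallback.
  fn projection_keys(model_type: &str, n_layers: usize, present: &[&str]) -> Vec<String> {
      let key = |i: usize| {
          if model_type == "gpt2" {
              format!("h.{i}.attn.c_proj.weight")
          } else {
              // Qwen2 / LLaMA / Mistral-style checkpoints.
              format!("model.layers.{i}.self_attn.o_proj.weight")
          }
      };
      let expected: Vec<String> = (0..n_layers).map(key).collect();
      if expected.iter().all(|k| present.contains(&k.as_str())) {
          expected // every block output projection found: build the full stack
      } else {
          Vec::new() // otherwise fall back to the LM-head-only path
      }
  }

  fn main() {
      let shard = ["h.0.attn.c_proj.weight", "h.1.attn.c_proj.weight"];
      println!("{:?}", projection_keys("gpt2", 2, &shard)); // full stack
      println!("{:?}", projection_keys("gpt2", 3, &shard)); // [] => LM-only
  }
  ```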
- **Double quantization**
  `QLoraConfig` embeds a `QuantizationConfig` with `double_quant: bool`. Presets (`preset_qv_bf16`, etc.) default to `double_quant: true`. Mens exposes a CLI flag to disable double quant for debugging; the default remains on (paper-style).
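  For orientation, a simplified numeric sketch of what double quantization buys: each first-level NF4 block carries an f32 absmax scale, and double quant re-quantizes a group of those scales to 8 bits plus one shared second-level f32. This sketch uses plain linear 8-bit quantization; the QLoRA paper's scheme (8-bit floats with a per-group mean offset, block size 256) differs in detail but trades memory against precision the same way.

  ```rust
  // Re-quantize a group of f32 absmax scales to u8 + one second-level f32.
  fn double_quantize(scales: &[f32]) -> (Vec<u8>, f32) {
      let max = scales.iter().cloned().fold(0.0f32, f32::max);
      // Guard against an all-zero group to avoid dividing by zero.
      let second_level = (max / 255.0).max(f32::MIN_POSITIVE);
      let q = scales
          .iter()
          .map(|s| (s / second_level).round().clamp(0.0, 255.0) as u8)
          .collect();
      (q, second_level)
  }

  fn dequantize(q: &[u8], second_level: f32) -> Vec<f32> {
      q.iter().map(|&b| f32::from(b) * second_level).collect()
  }

  fn main() {
      let scales = [0.5f32, 1.0, 2.0, 4.0]; // absmax values are nonnegative
      let (q, s2) = double_quantize(&scales);
      let back = dequantize(&q, s2);
      // Storage drops from 4 bytes to ~1 byte per scale at small relative error.
      println!("{:?} -> {:?} -> {:?}", scales, q, back);
  }
  ```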
- **Burn LoRA + HF tokenizer**
  Burn training consumes VoxTokenizer JSONL via `vox_tensor::data::load_all`. Wiring Hugging Face tokenization into the Burn path would require a parallel data pipeline and is deferred. The CLI continues to reject `--backend lora` + `--tokenizer hf` with a message pointing to `--backend qlora`.
- **Adapter format v2 + merge**
  Adapters export LoRA matrices per logical layer (`mid0`, …, `lm_head`) with a sidecar JSON mapping adapter prefixes → base safetensors keys. `vox schola merge-qlora` merges the LoRA deltas into f32 base tensors for those keys (reloading for inference is outside this ADR).
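  The core update behind such a merge can be sketched as below: fold each adapter's low-rank delta into its mapped f32 base tensor, `W' = W + (alpha / r) * B A`. The sidecar JSON supplies which base key `W` is; the `(alpha / r)` scaling is the standard LoRA convention and is assumed here, not confirmed by this ADR.

  ```rust
  // Merge one LoRA delta into a base weight in place.
  // Shapes (row-major Vec<f32>): w [out, inp], b [out, r], a [r, inp].
  fn merge_lora(w: &mut [f32], b: &[f32], a: &[f32], out: usize, inp: usize, r: usize, alpha: f32) {
      let scale = alpha / r as f32;
      for o in 0..out {
          for i in 0..inp {
              let mut delta = 0.0f32;
              for k in 0..r {
                  delta += b[o * r + k] * a[k * inp + i];
              }
              w[o * inp + i] += scale * delta;
          }
      }
  }

  fn main() {
      // Rank-1 update on a 2x2 zero base: B = [1, 0]^T, A = [0, 1].
      let mut w = vec![0.0f32; 4];
      merge_lora(&mut w, &[1.0, 0.0], &[0.0, 1.0], 2, 2, 1, 1.0);
      println!("{:?}", w); // delta lands in W[0][1]
  }
  ```

  After the merge the tensor keeps its base shape and dtype, which is what allows writing it back into base-shaped f32 shards.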
## Consequences

- The root `Cargo.toml` must keep the `qlora-rs` workspace pin aligned with the `vox-populi` optional deps (`mens-candle-qlora`).
- SSOT: mens-training.md and ref-cli.md must list `merge-qlora` and `--qlora-no-double-quant`.
- CI: `cargo test -p vox-populi --features mens-train` and targeted `vox-cli` tests cover the export/merge smoke paths.
## References

- qlora-rs 1.0.5 — `src/training.rs`, `src/qlora.rs` (local registry copy)
- QLoRA paper: https://arxiv.org/abs/2305.14314