QLoRA Fine-tuning Data Strategy & SSoT
last_updated: 2026-03-22
[!IMPORTANT] This document is the Single Source of Truth for Vox Mens's QLoRA data scaling requirements and continuous assimilation pipeline. DO NOT attempt to "pad" the pipeline with a stale `examples/` directory.
1. Minimal Data Size Requirements
Research on code-style adaptation in Large Language Models via QLoRA concludes that data quality trumps raw quantity, but a strict minimum threshold exists to prevent catastrophic overfitting:
- General Style Changes / Simple Tasks: 400 to 1,000 high-quality examples at minimum.
- Complex Domain Inference (Vox Native Rules): 1,000 to 5,000 examples.
- Anti-pattern to avoid: Fine-tuning with extremely small sets (< 120 samples) practically guarantees catastrophic overfitting; the model memorizes the set as if it were a few-shot prompt instead of learning the target style.
Historically, Vox accumulated ~19 files in an `examples/` directory. This was far too small for QLoRA and led to severe model degradation and overfitting. A pre-flight guard in the spirit of the thresholds above is sketched below.
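As a minimal sketch of how these thresholds could be enforced before launching a run, the guard below counts JSONL rows and refuses to proceed under the stated minimums. The constants, the `count_jsonl_rows` helper, and the file path are hypothetical, not part of the Vox codebase:

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::Path;

// Hypothetical thresholds mirroring the figures above.
const MIN_STYLE_SAMPLES: usize = 400;    // general style changes / simple tasks
const MIN_DOMAIN_SAMPLES: usize = 1_000; // complex domain inference (Vox native rules)

/// Counts non-empty lines in a JSONL corpus (one training pair per line).
fn count_jsonl_rows(path: &Path) -> std::io::Result<usize> {
    let reader = BufReader::new(File::open(path)?);
    Ok(reader
        .lines()
        .filter(|l| matches!(l, Ok(s) if !s.trim().is_empty()))
        .count())
}

/// Refuses to launch a QLoRA run below the minimum sample count.
fn assert_corpus_large_enough(path: &Path, domain_task: bool) -> std::io::Result<()> {
    let n = count_jsonl_rows(path)?;
    let min = if domain_task { MIN_DOMAIN_SAMPLES } else { MIN_STYLE_SAMPLES };
    assert!(
        n >= min,
        "corpus has {n} rows; fine-tuning below {min} risks catastrophic overfitting"
    );
    Ok(())
}

fn main() -> std::io::Result<()> {
    // "corpus.jsonl" is a placeholder path for this sketch.
    assert_corpus_large_enough(Path::new("corpus.jsonl"), true)
}
```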
2. Continuous Ingestion Pipeline
To satisfy the >1,000-sample requirement without maintaining a stale, monolithic examples folder, Vox's native `vox mens corpus data` pipeline implements a continuous ingestion strategy. It generates instruction pairs from live code, so the corpus cannot drift from the architecture (a sketch of the emitted record shape follows the list below):
- Rust Crate Source (`crates/**/*.rs`): extracts live function definitions, docstrings, and signatures that map to Vox internal patterns. Yields ~3,000+ samples naturally.
- Markdown Documentation (`docs/src/**/*.md`): parses the actual documentation site, building Q&A instruction pairs dynamically from `vox` code blocks. Yields ~1,500+ samples.
- Synthetic Generation (`crates/vox-cli/src/training/datagen.rs`): template-based dynamic code expansion to cover complex component and workflow structures. Yields ~2,000+ samples.
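To make the shape of these pairs concrete, here is a minimal sketch of one possible record type plus a docs-lane miner that pairs each `vox` code fence with the prose line preceding it. It assumes `serde` (with the derive feature) and `serde_json` as dependencies; `TrainingPair` and `mine_vox_blocks` are illustrative names, not the pipeline's real API:

```rust
use serde::Serialize;

#[derive(Serialize)]
struct TrainingPair {
    instruction: String, // question derived from the prose preceding a block
    output: String,      // body of the vox code block that answers it
    lane: &'static str,  // see section 3: vox_codegen, vox_docs_qa, ...
}

/// Pairs each vox code fence in a markdown document with the last prose
/// line seen before it, yielding one Q&A instruction pair per block.
fn mine_vox_blocks(markdown: &str) -> Vec<TrainingPair> {
    let mut pairs = Vec::new();
    let mut context = String::new();
    let mut block: Option<String> = None;
    for line in markdown.lines() {
        let trimmed = line.trim_start();
        if block.is_some() {
            if trimmed == "```" {
                // Closing fence: emit one pair for the finished block.
                pairs.push(TrainingPair {
                    instruction: format!("How is this done in Vox? Context: {context}"),
                    output: block.take().unwrap(),
                    lane: "vox_docs_qa",
                });
            } else {
                let body = block.as_mut().unwrap();
                body.push_str(line);
                body.push('\n');
            }
        } else if trimmed.starts_with("```vox") {
            block = Some(String::new()); // opening fence of a vox example
        } else if !trimmed.is_empty() {
            context = trimmed.to_string(); // last prose line before a block
        }
    }
    pairs
}

fn main() -> serde_json::Result<()> {
    let doc = "Define a button:\n```vox\nwidget Button {}\n```\n";
    for pair in mine_vox_blocks(doc) {
        println!("{}", serde_json::to_string(&pair)?); // one JSONL row per pair
    }
    Ok(())
}
```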
Together these sources produce a training corpus of >10,000 pairs, and because the pairs are regenerated from the live codebase, Mens models stay aligned as the Vox compiler's real logic changes land.
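The synthetic lane's template expansion can be pictured with the sketch below: one template is expanded over every combination of slot fills, one sample per combination. The template syntax, slot names, and `expand` helper are invented for illustration and do not reflect the real `datagen.rs`:

```rust
/// Expands a template over every combination of slot fills,
/// producing one synthetic sample per combination.
fn expand(template: &str, slots: &[(&str, Vec<&str>)]) -> Vec<String> {
    let mut out = vec![template.to_string()];
    for (name, values) in slots {
        let placeholder = format!("{{{name}}}"); // e.g. "{kind}"
        let mut expanded = Vec::with_capacity(out.len() * values.len());
        for t in &out {
            for v in values {
                expanded.push(t.replace(placeholder.as_str(), v));
            }
        }
        out = expanded;
    }
    out
}

fn main() {
    // Invented template: real Vox syntax may differ.
    let template = "component {kind} handles {event}";
    let samples = expand(
        template,
        &[
            ("kind", vec!["Button", "Slider", "Toggle"]),
            ("event", vec!["press", "release"]),
        ],
    );
    for s in &samples {
        println!("{s}"); // 3 kinds x 2 events = 6 synthetic samples
    }
}
```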
3. Lane Segmentation Policy (Code-First Default)
The corpus now carries explicit metadata per row:
- `lane`: `vox_codegen`, `vox_docs_qa`, `vox_tooling`, or `vox_speech`
- `response_mode`: `code_only` or `prose_only`
- `task_family`: granular task tag for sampling and analysis
The operational default for production training is the `vox_codegen` lane only, so prose supervision does not leak into code-only generation behavior.
Documentation Q&A remains available as a separate lane for future multi-lane runs.
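A minimal sketch of that production filter, assuming the corpus lives in a JSONL file whose rows carry the `lane` and `response_mode` fields described above (the file name and the `Row` struct are illustrative):

```rust
use serde::Deserialize;
use std::fs::File;
use std::io::{BufRead, BufReader};

#[derive(Deserialize)]
struct Row {
    lane: String,          // vox_codegen | vox_docs_qa | vox_tooling | vox_speech
    response_mode: String, // code_only | prose_only
    // remaining fields (instruction/output payload) are ignored here
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "corpus.jsonl" is a placeholder path for this sketch.
    let reader = BufReader::new(File::open("corpus.jsonl")?);
    let mut kept = 0usize;
    for line in reader.lines() {
        let line = line?;
        let row: Row = serde_json::from_str(&line)?;
        // Production default: the code-generation lane only. The
        // response_mode check is an extra guard so prose supervision
        // cannot leak into code-only generation behavior.
        if row.lane == "vox_codegen" && row.response_mode == "code_only" {
            println!("{line}"); // pass the row through to the training set
            kept += 1;
        }
    }
    eprintln!("kept {kept} rows for the vox_codegen run");
    Ok(())
}
```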