Vox Speech-to-Code Architecture Research — April 2026
Purpose
This document synthesizes 25+ targeted web searches conducted in April 2026 to determine the optimal, highest-accuracy architecture for feeding spoken audio into Vox's MENS model pipeline. It considers three strategic pillars:
- Best off-the-shelf ASR — transcribe speech at the lowest WER and feed text straight into MENS.
- Code-domain–adapted ASR — fine-tune an existing model (LoRA/QLoRA) for Rust/TypeScript vocabulary.
- Custom speech-to-code — train or integrate a model purpose-built for dictating identifiers, symbols, and code structure.
The RTX 4080 Super (16 GB VRAM) is the target inference GPU. The Rust/Candle + ONNX/sherpa-onnx ecosystem is the preferred deployment surface, consistent with Vox's existing Burn-based MENS pipeline. Python is acceptable for the training phase only.
1. Baseline WER Landscape (April 2026)
All WER numbers are on standard English benchmark suites (LibriSpeech test-clean / test-other / OpenASR leaderboard composite). Code-domain WER will be higher; see Section 4 for the delta.
| Model | Params | WER (En avg) | RTFx (A100) | VRAM | Streaming | Notes |
|---|---|---|---|---|---|---|
| Cohere Transcribe | — | 5.42% | 524× | API-only | No | Top API, closed |
| Canary-Qwen 2.5B (NVIDIA) | 2.5 B | 5.63% | ~418× | ~10 GB | No (batch) | SALM; FastConformer + Qwen decoder |
| Qwen3-ASR-1.7B (Alibaba) | 1.7 B | ~5.7% | RTF 0.015–0.13 | ~8 GB | Yes (unified) | AuT encoder + Qwen3 decoder |
| IBM Granite Speech 3.3 8B | 8 B | 5.85% | — | ~16 GB | No | Just fits the 4080S; enterprise focus |
| Deepgram Nova-3 | — | 5.26% | — | API-only | Yes | Best API; domain variants |
| Whisper Large-v3 | 1.54 B | 6.8% | ~180× | ~10 GB | No | 99+ languages; batch |
| Whisper Large-v3-Turbo | ~809 M | ~7.0–7.2% | ~6× large-v3 | ~6 GB | No | 4-decoder-layer distillation |
| Distil-Whisper large-v3 | ~756 M | ~7.1–7.5% | ~6× base | ~5 GB | No | 2-decoder-layer distillation |
| Faster-Whisper (CTranslate2) | same | same | 2–4× over OpenAI | −40% VRAM | No | Inference engine, not model |
| NVIDIA Parakeet-TDT 1.1B | 1.1 B | ~5.8% | >2 000× | ~6 GB | Yes (native) | FastConformer + TDT decoder |
| Moonshine Medium | ~330 M | ~7–8% | 40×+ vs Lv3 | ~2 GB | Yes (native) | RoPE; TTFT <150 ms |
| Vosk | ~50 MB | ~12–18% | fastest CPU | <1 GB | Yes | Extreme edge; low accuracy |
Key insight: Parakeet-TDT offers near–Canary accuracy at >2 000× RTFx in a fully streaming mode. Canary-Qwen and Qwen3-ASR-1.7B are the top-tier LLM-decoder hybrids for max accuracy but require batch or chunked inference rather than true sub-utterance streaming.
2. Architecture Concepts for Quality Maximization
2.1 Why Decoder Architecture Determines Code WER
| Decoder | Context | Why matters for code |
|---|---|---|
| CTC | None (label independence assumed) | Collapses repeated frames but cannot correct which token is most likely given adjacent tokens — identifier homonyms explode WER. |
| Transducer (RNN-T / TDT) | Prediction network ≈ internal LM | Can model `getItem` vs `get_item` if the vocabulary is seeded correctly. Native streaming. |
| Attention Encoder-Decoder (AED) | Global (full utterance) | Best correction but requires full audio. Whisper and Canary-Qwen use this. |
| SALM (AED + LLM decoder) | Full audio + LLM world knowledge | LLM decoder already knows Rust/TS syntax; can produce `unwrap_or_else` naturally. Best for code. |
2.2 The Preprocessing Stack (and What to Skip)
Research confirms a counter-intuitive finding: aggressive conventional noise filtering hurts modern neural ASR because it removes formant transitions used by the encoder. The optimal input pipeline is:
[Mic / WAV]
→ Resample to 16 kHz mono
→ RMS loudness normalization (target ~−18 dBFS)
→ Silero-VAD (ONNX; 512-sample = 32 ms chunks @ 16 kHz)
↳ discard silence → prevents Whisper hallucinations
→ Buffer speech segments
→ Log-Mel spectrogram (80 or 128 channels, 25 ms window, 10 ms stride)
→ Feed to ASR model
Do NOT apply: Wiener filtering, spectral subtraction, or heavy noise gate before the ASR encoder. Use a noise-trained model instead (Canary, Qwen3-ASR, etc.).
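As a concrete sketch of the loudness-normalization stage above, assuming in-place gain on f32 PCM and the −18 dBFS target (`normalize_rms` is an illustrative name, not an existing Vox function):

```rust
/// Scale a PCM buffer so its RMS level hits the target dBFS.
/// A minimal sketch of the loudness-normalization stage; near-silent
/// buffers are passed through untouched to avoid blowing up the gain.
pub fn normalize_rms(pcm: &mut [f32], target_dbfs: f32) {
    let n = pcm.len();
    if n == 0 {
        return;
    }
    let rms = (pcm.iter().map(|s| s * s).sum::<f32>() / n as f32).sqrt();
    if rms < 1e-6 {
        return; // effectively silent; leave it for the VAD to discard
    }
    // dBFS relative to full scale 1.0: db = 20 * log10(rms)
    let target_rms = 10f32.powf(target_dbfs / 20.0);
    let gain = target_rms / rms;
    for s in pcm.iter_mut() {
        *s = (*s * gain).clamp(-1.0, 1.0);
    }
}
```

The clamp keeps a loud near-full-scale segment from clipping past ±1.0 after boosting.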
2.3 Chunk Sizing and Latency Budget
For a code dictation scenario the latency budget is generous (developer is speaking intent, not reacting to sound). Recommended:
| Stage | Chunk size | Expected latency |
|---|---|---|
| VAD (Silero) | 32 ms | <1 ms per chunk on CPU |
| Streaming fast-path (Moonshine/Parakeet) | 160–320 ms | TTFT ~150–300 ms |
| Accuracy batch pass (Canary/Qwen3-ASR) | Full utterance (on silence/endpointing) | 200–800 ms |
| LLM post-correction (Qwen3-0.6B) | Per sentence | ~100–250 ms on 4080S |
Two-pass streaming: deliver a Parakeet-TDT or Moonshine transcript immediately for typing echo, then replace with Canary/Qwen3-ASR output once silence is detected. The MENS model always receives the high-accuracy batch-pass output.
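A minimal sketch of that handoff, assuming a small holder type (`TwoPassTranscript` is hypothetical, not an existing Vox API):

```rust
/// Holds the fast-path echo text and upgrades it to the batch-pass
/// result once endpointing fires. MENS only ever sees finalized text.
#[derive(Debug, Default)]
pub struct TwoPassTranscript {
    provisional: String,       // Parakeet/Moonshine streaming echo
    finalized: Option<String>, // Canary/Qwen3-ASR batch result
}

impl TwoPassTranscript {
    /// Streaming pass: update the UI echo text.
    pub fn push_provisional(&mut self, text: &str) {
        self.provisional = text.to_string();
    }
    /// Batch pass on silence: lock in the high-accuracy result.
    pub fn finalize(&mut self, text: &str) {
        self.finalized = Some(text.to_string());
    }
    /// What the editor shows right now.
    pub fn display(&self) -> &str {
        self.finalized.as_deref().unwrap_or(&self.provisional)
    }
    /// What MENS receives: only the finalized batch-pass output.
    pub fn mens_input(&self) -> Option<&str> {
        self.finalized.as_deref()
    }
}
```

`mens_input` returning `None` until finalization is what enforces the "MENS always receives the batch-pass output" rule.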
3. Recommended Rust Architecture
3.1 Crates and Runtime Boundaries
audio input (cpal or rodio)
│
▼
vox-voice ─── owns all ASR logic
├── silero_vad_rs (stateful VAD per stream, ONNX/ort)
├── asr_backend (trait: transcribe_segment(audio) → TranscriptResult)
│ ├── WhisperBackend (candle-based; fastest to ship)
│ ├── CanaryBackend (sherpa-onnx or ort; ONNX export from NeMo)
│ └── Qwen3AsrBackend (sherpa-onnx; official ONNX release)
├── post_processor::CodeCorrector (Qwen3-0.6B ONNX / ort)
├── context_biaser (prefix tree / TCPGen hotword injection)
└── transcript_sink → MENS input channel (async tokio mpsc)
Trait design (SSOT for all backends):
```rust
/// vox-voice/src/asr_backend.rs
#[async_trait::async_trait]
pub trait AsrBackend: Send + Sync {
    async fn transcribe(&self, pcm: &[f32]) -> anyhow::Result<TranscriptResult>;
    fn name(&self) -> &'static str;
    fn supports_streaming(&self) -> bool {
        false
    }
}

pub struct TranscriptResult {
    pub text: String,
    pub confidence: f32,                          // 0.0–1.0; from log-prob
    pub n_best: Vec<String>,                      // top-K hypotheses for LLM rescoring
    pub word_timestamps: Vec<(String, f32, f32)>, // (word, start_s, end_s)
}
```
This pattern means adding Canary is simply implementing AsrBackend on a new struct that wraps the sherpa-onnx or ort session. No changes to the MENS pipeline.
3.2 ONNX vs Candle: When to Use Each
| Criterion | Candle | ONNX Runtime (ort) |
|---|---|---|
| Pure-Rust, no native libs | ✅ | ❌ (needs shared .dll/.so) |
| TensorRT execution provider | ❌ | ✅ |
| FastConformer (Canary encoder) | Needs hand-implementation | ✅ via NeMo ONNX export |
| Whisper | ✅ (existing impl) | ✅ via faster-whisper export |
| INT8 / FP16 quantization | Partial | ✅ full support |
| Streaming-stateful (RNN-T) | Hard | ✅ via sherpa-onnx |
Practical decision tree:
- Ship Whisper immediately via Candle (already supported in the Vox ML ecosystem; aligns with `vox-tensor`/Burn patterns).
- Integrate Canary / Qwen3-ASR via `sherpa-rs` + ONNX Runtime. NeMo supports `model.export("model.onnx")` natively.
- Use the TensorRT EP on the RTX 4080 Super for production throughput; FP16 by default, INT8 only if profiling shows VRAM pressure.
3.3 Silero-VAD in Rust (Concrete)
```toml
# Cargo.toml
[dependencies]
silero-vad-rs = "0.3"
ort = { version = "1.17", features = ["cuda"] }
```

```rust
// Usage
let model = SileroVAD::new("models/silero_vad.onnx")?;
let mut vad = VADIterator::new(model, 0.5, 16_000, 100, 30);

// In the audio capture loop:
loop {
    let chunk: Vec<f32> = mic.read_512_samples()?; // 32 ms @ 16 kHz
    if let Some(speech_event) = vad.process_chunk(&chunk)? {
        // queue the chunk into speech_buffer
    }
}
```
Cost: <1 ms per 32 ms chunk on CPU. Zero GPU required for VAD stage.
4. Code-Domain WER: Baseline vs. Adapted
This is the critical question. Synthesized estimates from 2025 domain adaptation studies:
| Scenario | Est. WER (English prose) | Est. WER (Rust code identifiers) | Notes |
|---|---|---|---|
| Whisper Large-v3 (raw) | 6.8% | 25–40% | Catastrophic on snake_case, macros |
| Whisper-Turbo (raw) | 7.2% | 28–42% | Similar; slightly worse |
| Canary-Qwen (raw) | 5.6% | 18–28% | LLM decoder helps significantly |
| Qwen3-ASR-1.7B (raw) | ~5.7% | 15–25% | Qwen3 base knows code |
| Whisper Large-v3 + LoRA (code corpus) | ~7% | 8–14% | LoRA on decoder only; 10–20% relative gain |
| Canary-Qwen + code hotword biasing | ~5.6% | 10–18% | Hotword prefix tree biasing |
| Qwen3-ASR-1.7B fully adapted | — | 6–10% (estimated) | Best realistic target |
| + MENS Qwen3-0.6B post-correction | — | 4–8% (estimated) | LLM corrector uses surrounding code context |
Estimated achievable WER for Vox speech-to-code (~4–8%): This assumes (a) Qwen3-ASR-1.7B as the backbone, (b) runtime hotword biasing injecting identifiers declared in the current open file, and (c) a Qwen3-0.6B post-correction pass fine-tuned on (ASR-output, corrected-code) pairs from the Vox corpus.
Why WER on code is so high without adaptation:
- `unwrap_or_else` sounds like "unwrap or else" → 3 words vs 1
- `snake_case` is destroyed by default case-folding
- Library names (`tokio`, `anyhow`, `serde`) lack pronunciation priors
- Punctuation (`::`, `->`, `?`) is completely ignored by standard ASR
- Rust keywords (`impl`, `pub(crate)`, `dyn`) have rare phonetic patterns
5. Fine-Tuning / Training Pathway
5.1 LoRA Adapter on Whisper or Qwen3-ASR
Language: Python (training); Rust (deployment inference only).
1. Generate synthetic audio corpus (Piper TTS, local + free):
- Read Vox codebase Rust files as "spoken text"
- Normalize: "pub fn" → "pub fn" (preserve case for decoder)
- Add speed perturbation ±10%, room-impulse-response augmentation
- Target: ~50–100 h synthetic + any real developer voice recordings
2. HuggingFace PEFT LoRA config:

```python
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
lora_config = LoraConfig(r=32, lora_alpha=64,
                         target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05)
model = get_peft_model(model, lora_config)
# Train decoder-only; freeze the encoder entirely
```
3. Evaluate on holdout Vox dictation sessions:
- Metric: per-identifier WER (strict, no normalization of case)
- Also: syntactic validity rate (does rustfmt accept the output?)
4. Export: merge LoRA weights → .safetensors → convert to ONNX/CTranslate2
5.2 Domain Adapter for Qwen3-ASR (Preferred Path)
Qwen3-ASR-1.7B has a dual-module architecture: AuT audio encoder (~300 M params) + Qwen3-1.7B LLM decoder. The LLM decoder already understands Rust syntax from pretraining. This makes the adaptation much cheaper:
- Fine-tune only the LLM decoder with LoRA using text-only code correction data (ASR output → correct code) — no audio needed.
- Train on a corpus of (Whisper-misrecognition, correct Vox code) pairs.
- RTX 4080 Super (16 GB) can comfortably run 4-bit QLoRA on 1.7B decoder.
5.3 Integration with MENS Training Pipeline
Since Vox already uses Burn + QLoRA for MENS domain adapters:
MENS Training Pipeline (existing)
└── Corpus: Rust source, Markdown, Synthetic
└── Domain adapters: vox-lang, rust-expert, agents
NEW: asr-voice-adapter domain
└── Corpus: (spoken-command-audio, code-text) pairs
├── Source A: Piper-synthesized Vox files
├── Source B: Developer session recordings (opt-in telemetry)
└── Source C: Zero-shot Qwen3 text correction pairs
└── Model: Qwen3-ASR-1.7B decoder LoRA (merged at inference)
└── Evaluation: dictation WER on Vox codebase holdout
The ASR domain adapter lives in `crates/vox-populi/src/domains/asr_voice/` and is selected by `vox populi train --domain asr-voice`.
6. Hotword / Context Biasing at Runtime
The single biggest practical gain in code-domain ASR is injecting context from the open file at inference time. Two techniques:
6.1 Shallow Fusion (n-gram)
Build a unigram/bigram language model from the symbols declared in the current open file (variables, function names, types). Merge its log-probability scores with the ASR beam search at decoding time.
- Works with Whisper via `faster-whisper`'s `initial_prompt`, or via a custom CTC/beam-search hook.
- Symbols are trivially extractable from the `rust-analyzer` LSP symbol table.
- Cost: negligible.
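A minimal shallow-fusion sketch, assuming a flat per-symbol log-prob bonus applied during n-best rescoring rather than inside the beam itself (`build_boosts` and `rescore` are illustrative names):

```rust
use std::collections::HashMap;

/// Build a unigram boost table from the open file's symbols.
/// Each declared identifier gets a flat log-prob bonus when it
/// appears in a hypothesis (shallow fusion at its simplest).
pub fn build_boosts(symbols: &[&str], bonus: f32) -> HashMap<String, f32> {
    symbols.iter().map(|s| (s.to_string(), bonus)).collect()
}

/// Rescore n-best hypotheses: base ASR score plus symbol bonuses.
/// Returns the index of the best hypothesis after fusion.
pub fn rescore(n_best: &[(&str, f32)], boosts: &HashMap<String, f32>) -> usize {
    let mut best = 0;
    let mut best_score = f32::NEG_INFINITY;
    for (i, (text, asr_score)) in n_best.iter().enumerate() {
        let bonus: f32 = text
            .split_whitespace()
            .filter_map(|tok| boosts.get(tok))
            .sum();
        let fused = asr_score + bonus;
        if fused > best_score {
            best_score = fused;
            best = i;
        }
    }
    best
}
```

The effect: a hypothesis containing a declared symbol can overtake a higher-scoring prose reading of the same audio.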
6.2 Tree-Constrained Pointer Generator (TCPGen)
An auxiliary neural module that maintains a prefix tree of the hotword list and dynamically adjusts token probabilities during attention-based decoding. Reported 15–30% relative WER improvement on rare-term benchmarks.
- Requires mild model surgery; more applicable to Canary than Whisper.
- Can be implemented as a second inference head; ONNX-exportable.
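The prefix-tree half of TCPGen can be sketched in plain Rust; the neural pointer head is omitted, and the tree here is over characters rather than subword tokens for clarity:

```rust
use std::collections::HashMap;

/// Minimal prefix tree over hotword characters. TCPGen proper keeps a
/// tree over subword tokens and feeds a pointer distribution into the
/// decoder; this sketch shows only the tree-lookup half.
#[derive(Default)]
pub struct PrefixTree {
    children: HashMap<char, PrefixTree>,
    is_word: bool,
}

impl PrefixTree {
    pub fn insert(&mut self, word: &str) {
        let mut node = self;
        for c in word.chars() {
            node = node.children.entry(c).or_default();
        }
        node.is_word = true;
    }

    /// Characters the decoder should boost after the given prefix.
    pub fn continuations(&self, prefix: &str) -> Vec<char> {
        let mut node = self;
        for c in prefix.chars() {
            match node.children.get(&c) {
                Some(next) => node = next,
                None => return Vec::new(), // prefix left the hotword tree
            }
        }
        let mut cs: Vec<char> = node.children.keys().copied().collect();
        cs.sort();
        cs
    }
}
```

An empty continuation set tells the biasing head to stop boosting; a non-empty one narrows the boosted vocabulary to exactly the hotword branches still alive.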
Recommended practical approach for Vox v1:
```rust
// vox-voice/src/context_biaser.rs
pub struct ContextBiaser {
    /// Symbols from the rust-analyzer LSP hover/symbols response
    symbols: Vec<String>,
    boost_score: f32, // typically a 1.5–2.5 log-prob bonus
}

impl ContextBiaser {
    pub fn build_initial_prompt(&self) -> String {
        // For Whisper: prepend the symbol list as a text prompt.
        // This guides decoder attention toward known identifiers.
        self.symbols.join(" ")
    }
}
```
7. Post-Processing Stack (LLM Correction)
7.1 Pipeline
ASR Raw Output (Qwen3-ASR or Whisper)
│
▼
[1] Punctuation & Capitalization Restorer
→ Qwen3-0.6B LoRA fine-tuned on code-ASR pairs
→ Adds :: . () {} ; ? at correct positions
│
▼
[2] Identifier Normalizer
→ Regex + LSP cross-reference: "get item" → getItem / get_item
→ Heuristic: if camelCase match exists in symbol table → prefer
│
▼
[3] Code Validator (optional)
→ rustfmt --check / tsc --noEmit on buffer substring
→ Flag low-confidence segments if invalid parse
│
▼
[4] MENS Input Channel
→ Passes structured TranscriptResult to MENS orchestrator
→ Includes n_best list, word timestamps, confidence score
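Stage [2] above can be sketched as follows, assuming a plain string heuristic in place of a real LSP cross-reference (`normalize_identifier` is an illustrative name):

```rust
/// Resolve a spoken phrase like "get item" against the active symbol
/// table, preferring whichever casing convention actually exists.
/// Real resolution would go through rust-analyzer, not a string slice.
pub fn normalize_identifier(spoken: &str, symbols: &[&str]) -> Option<String> {
    let words: Vec<&str> = spoken.split_whitespace().collect();
    if words.is_empty() {
        return None;
    }
    let snake = words.join("_");
    let camel = words
        .iter()
        .enumerate()
        .map(|(i, w)| {
            if i == 0 {
                w.to_string()
            } else {
                let mut c = w.chars();
                match c.next() {
                    Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
                    None => String::new(),
                }
            }
        })
        .collect::<String>();
    // Heuristic from the pipeline: prefer a camelCase hit, else snake_case.
    for candidate in [camel, snake] {
        if symbols.contains(&candidate.as_str()) {
            return Some(candidate);
        }
    }
    None
}
```

Returning `None` when neither casing exists lets the caller keep the raw ASR text and flag the segment as low-confidence instead of guessing.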
Hallucination guard: The Qwen3-0.6B corrector must only modify tokens from the ASR n-best hypotheses list. If it tries to generate tokens not in any hypothesis, revert to the top-1 ASR output. This prevents over-correction.
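The guard can be sketched as a token-set check, assuming whitespace tokenization (a real implementation would compare at the BPE level):

```rust
use std::collections::HashSet;

/// Enforce the guard: the corrector may only emit tokens that appear in
/// at least one ASR hypothesis. Otherwise fall back to the top-1 output.
pub fn guard_correction<'a>(corrected: &'a str, n_best: &[&'a str]) -> &'a str {
    let allowed: HashSet<&str> = n_best
        .iter()
        .flat_map(|h| h.split_whitespace())
        .collect();
    let in_bounds = corrected
        .split_whitespace()
        .all(|tok| allowed.contains(tok));
    if in_bounds {
        corrected
    } else {
        n_best.first().copied().unwrap_or(corrected)
    }
}
```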
7.2 Metrics Beyond WER
For code dictation, WER is insufficient. Track:
| Metric | Definition | Target |
|---|---|---|
| Identifier Accuracy Rate (IAR) | % identifiers transcribed exactly correct | >85% |
| Syntactic Validity Rate (SVR) | % utterances that `rustfmt` parses cleanly | >70% |
| Symbol Match Rate (SMR) | % output tokens that match active LSP symbol table | >78% |
| TTFT (streaming) | Time to first readable token | <300 ms |
| End-of-Utterance Latency (EUL) | Total latency to final corrected text | <1 500 ms |
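The IAR metric from the table can be sketched as follows, assuming whitespace tokenization and exact case-sensitive matching (`identifier_accuracy_rate` is an illustrative name):

```rust
/// Identifier Accuracy Rate: fraction of reference identifiers that
/// appear, exactly and case-sensitively, in the hypothesis text.
/// Strictest metric in the table; no normalization is applied.
pub fn identifier_accuracy_rate(reference_ids: &[&str], hypothesis: &str) -> f32 {
    if reference_ids.is_empty() {
        return 1.0;
    }
    let hyp_tokens: Vec<&str> = hypothesis.split_whitespace().collect();
    let hits = reference_ids
        .iter()
        .filter(|id| hyp_tokens.contains(*id))
        .count();
    hits as f32 / reference_ids.len() as f32
}
```

Scoring per reference identifier (rather than per word) is what makes a single mangled `snake_case` name count fully against the transcript.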
8. Strategic Options Summary
Three viable architectures, ordered by investment:
Option A — Whisper + Candle + QLoRA Adapter (Lowest Effort)
WER estimate: 8–14% on code identifiers
- Use the existing `candle-whisper` bindings in the Vox ML ecosystem.
- Add the Silero-VAD crate for speech segmentation.
- Train a QLoRA adapter on Piper-synthesized Vox codebase audio.
- Add `initial_prompt` context biasing from open-file symbols.
- Pass output to MENS with a lightweight Qwen3-0.6B text correction.
- All Rust at inference time (Candle + ort).
Time to ship: 2–4 weeks
Option B — Qwen3-ASR-1.7B + sherpa-rs/ONNX + Full Stack (Recommended)
WER estimate: 4–8% on code identifiers
- Export Qwen3-ASR-1.7B to ONNX via official Qwen toolchains.
- Integrate via the `sherpa-rs` crate with the CUDA EP on the RTX 4080 Super.
- Fine-tune the LLM decoder via text-only LoRA (no audio needed for adaptation).
- Deploy two-pass streaming: Parakeet-TDT for UI echo (2 000× RTF), Qwen3-ASR for final MENS input.
- Full post-processing stack (Section 7).
Time to ship: 4–8 weeks
Option C — Custom Speech-to-Code Model (Highest Accuracy, Highest Effort)
WER estimate: 2–5% on code identifiers (theoretically)
- Train a purpose-built model: FastConformer encoder + code LLM decoder (e.g., Qwen3-Coder).
- Train with NeMo on a dataset of developer sessions (real audio) + Piper synthetic.
- Requires 200–500 h of GPU training time on the RTX 4080 Super or a rented cloud GPU (Vast.ai A100).
- Enables Vox-MENS to receive ASR embeddings directly rather than text, bypassing the text bottleneck.
- Eventually: a single model that accepts audio → produces Vox language AST directly.
Time to ship: 3–6 months
9. Integration Points with Existing Vox Codebase
| Where | What changes |
|---|---|
| `crates/vox-populi/src/domains/` | Add `asr_voice` domain with QLoRA recipe |
| `crates/vox-voice/` | New crate; owns VAD, ASR backends, post-processor |
| `crates/vox-cli/src/commands/` | Add `vox voice start` / `vox voice calibrate` / `vox voice status` |
| `crates/vox-clavis/src/lib.rs` | No new secrets if fully local; add `VOX_DEEPGRAM_API_KEY` only for optional cloud fallback |
| `contracts/operations/` | Add `voice-retention.v1.yaml` for the audio session retention policy |
| `docs/src/reference/cli.md` | Document the `vox voice` subsystem |
| `crates/vox-db/` | Schema addition: `voice_sessions` table (audio hash, WER estimate, correction log) |
10. Recommended Immediate Action
Based on all research, the recommended path for 2026 is:
- Ship Option A (Whisper/Candle) as v0 — to get something working and build the evaluation harness.
- Collect real dictation data — developer voice sessions with opt-in recording, stored per `workspace-artifact-retention.v1.yaml`.
- Fine-tune Qwen3-ASR-1.7B on the code corpus (Option B decoder LoRA) — takes ~1–2 GPU-days on the 4080 Super.
- Instrument WER tracking in `vox-db` — every dictation session logs its estimated identifier error rate.
- Plan Option C as a 2026 H2 stretch goal once Option B ships and data volume justifies custom training.
Sources: Hugging Face Open ASR Leaderboard (April 2026), NVIDIA NeMo docs, Qwen3-ASR tech report (arXiv:2601.21337), sherpa-onnx / sherpa-rs crates.io, silero-vad-rs docs.rs, WER domain-adaptation studies (INTERSPEECH 2024–2025), and 25 targeted web searches conducted April 2026.