Oratio & speech SSOT (Candle Whisper, no whisper.cpp)
Why
- STT without clang/native C++ toolchains: inference is Hugging Face Candle (Rust), not whisper.cpp bindings.
- One refined transcript path: consumers use the display/refined text, where Oratio applies `light_trim` after decode.
What (artifacts)
| Piece | Role |
|---|---|
| `vox-oratio` | Candle Whisper, symphonia decode, `transcribe_path`, eval (WER/CER), env `VOX_ORATIO_*`. |
| `vox-cli` `vox oratio` | CLI transcription + status + sessionized listen flow (Enter-or-timeout, correction profile, route mode). |
| `vox-mcp` | `vox_oratio_transcribe` (thin STT + refine), `vox_oratio_listen` (session + route + optional LLM polish), `vox_oratio_status` (+ JSON schemas in tool registry). |
| `vox-vscode` | `onCommand` for contributed `vox.*` commands + `onView` sidebar + `*.vox`; Oratio palette + Explorer (audio, case-insensitive ext); relative MCP path or `.vox/tmp/` copy; voice → WAV. See speech capture architecture. |
| `vox-db` + HTTP/OpenAPI | Codex/audio routes per `codex-api.openapi.yaml` — no `vox-codex-api` package (see Codex HTTP API). |
| Typeck / codegen | Builtin `Speech`, `Speech.transcribe(path)` → `Result[str]` → `vox_oratio::transcribe_path` + refined text. |
| Corpus mix | `record_format: asr_refine` + schema `mens/schemas/asr_refine_pairs.schema.json`. |
| LSP | Hover for `Speech`; transcribe only when the line looks like `Speech.transcribe` (`builtin_hover_markdown_in_line`). |
| TS codegen | `Speech.transcribe` → `throw` (points at `examples/oratio/codexAudioTranscribe.ts` + `@server` / HTTP). |
| TS example | `examples/oratio/codexAudioTranscribe.ts` — `fetch` for `/api/audio/status` and `/api/audio/transcribe`. |
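As a companion to the TS example row above, here is a minimal client sketch for the documented `/api/audio/transcribe` route. This is not the shipped `examples/oratio/codexAudioTranscribe.ts`; the request body matches the HTTP contract in How (contracts), while the response shape is left opaque because the canonical schemas live in `contracts/codex-api.openapi.yaml`.

```typescript
// Hedged sketch of a Codex audio-route client (Node 18+ global fetch).
// Only the request body shape is taken from this doc; response fields are
// deliberately untyped — consult contracts/codex-api.openapi.yaml for them.

function transcribeBody(path: string, languageHint: string | null = null): string {
  // Matches the documented body: {"path":"relative-or-absolute","language_hint":null}
  return JSON.stringify({ path, language_hint: languageHint });
}

async function transcribe(baseUrl: string, audioPath: string): Promise<unknown> {
  const res = await fetch(`${baseUrl}/api/audio/transcribe`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: transcribeBody(audioPath),
  });
  if (!res.ok) throw new Error(`transcribe failed: HTTP ${res.status}`);
  return res.json();
}
```

As the shipped example does, a real client would typically probe `/api/audio/status` first before posting audio paths.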
Who / when
- Implementers: `vox-compiler` (typeck, codegen), `vox-lsp`, `vox-cli`, `vox-mcp`, `vox-vscode`, `vox-db`, `vox-corpus`.
- When to touch: any change to Oratio env vars, transcript shape, HTTP contract, or the builtin `Speech` API.
Where (files)
- `crates/vox-oratio/` — STT + eval, traits, refine, `backends/*`
- `crates/vox-cli/src/commands/oratio_cmd.rs`
- `crates/vox-orchestrator/src/mcp_tools/tools/oratio_tools.rs`, `mod.rs` (registry + schemas)
- `vox-vscode/src/speech/registerOratioSpeechCommands.ts`, `src/core/VoxMcpClient.ts` (Oratio MCP wrappers)
- `crates/vox-capability-registry/`, `crates/vox-tools/` (`mens_chat` + `DirectToolExecutor`; Mens chat ∩ executor)
- `crates/vox-db/src/` — Codex store + readiness helpers consumed by HTTP surfaces
- `crates/vox-compiler/src/typeck/` — `Speech` / builtins
- `crates/vox-compiler/src/codegen_rust/` — `Cargo.toml` template + `MethodCall` for `Speech`
- `crates/vox-compiler/src/codegen_ts/` — `Speech.transcribe` stub
- `crates/vox-lsp/src/lib.rs` — `word_at_position`, `line_has_speech_transcribe`, `builtin_hover_markdown_in_line`; `main.rs` — hover
- `examples/oratio/codexAudioTranscribe.ts`, `examples/oratio/README.md`
- `crates/vox-corpus/src/corpus/mix.rs` — `record_format`, `normalize_training_jsonl_line`
- `mens/schemas/asr_refine_pairs.schema.json`, `mens/config/mix.example.yaml`
- `AGENTS.md`, `docs/src/reference/cli.md`, `mens-training.md`, this file
How (contracts)
- Build check: `cargo check -p vox-oratio --features stt-candle`; for the `vox` CLI Oratio commands, `cargo check -p vox-cli --features oratio` (Oratio is not in the default `mens-base`).
- Env:
  - Core: `VOX_ORATIO_MODEL`, `VOX_ORATIO_REVISION`, `VOX_ORATIO_LANGUAGE`, `VOX_ORATIO_CUDA` (feature-gated), `VOX_ORATIO_WORKSPACE` (HTTP path resolution), `VOX_DASH_HOST`/`VOX_DASH_PORT` (dashboard bind).
  - Lexicon: `VOX_ORATIO_SPEECH_LEXICON_PATH` (optional JSON lexicon per `contracts/speech-to-code/lexicon.schema.json`, applied after refine; merged with `$VOX_REPOSITORY_ROOT/.vox/speech_lexicon.json` or `$VOX_REPO_ROOT/.vox/speech_lexicon.json` when those roots are set — the explicit lexicon file wins on conflicting alias keys).
  - Contextual bias / rerank: `VOX_ORATIO_CONTEXTUAL_BIAS` (`0`/`false` to disable), `VOX_ORATIO_SESSION_HOTWORDS` (comma-separated boosts), `VOX_ORATIO_MAX_BIAS_PHRASES` (cap).
  - Decoder-time constrained decode: `VOX_ORATIO_LOGIT_BIAS_STRENGTH`, `VOX_ORATIO_LOGIT_BIAS_MAX_TOKENS`, `VOX_ORATIO_LOGIT_FORBID_TOKENS`, `VOX_ORATIO_CONSTRAINED_TRIE`, `VOX_ORATIO_CONSTRAINED_PHRASES`, `VOX_ORATIO_TRIE_STUCK_STEPS`.
  - Acoustic preprocess (Whisper path): `VOX_ORATIO_ACOUSTIC_PREPROCESS` (`none`|`peak_normalize`), `VOX_ORATIO_ACOUSTIC_PREPROCESS_BUDGET_MS` (default ~25 ms wall budget; returns original PCM if exceeded).
  - Streaming stubs (for live clients): `VOX_ORATIO_STREAM_PARTIAL_QUIET_MS`, `VOX_ORATIO_STREAM_MAX_WAIT_MS` — see `vox_oratio::StreamingStabilizationConfig`.
  - Long-file chunking (Candle encoder window; optional): `VOX_ORATIO_CHUNK_SEC` (e.g. `20`–`28`; clamped to `5`–`28`), `VOX_ORATIO_CHUNK_OVERLAP_SEC` (default `0.5`), optional `VOX_ORATIO_EMIT_PARTIAL_PATH` (append JSONL per chunk), `VOX_ORATIO_STREAM_TOKENS` (token-level event emission in the streaming decoder loop).
  - Optional runtime TOML: set `VOX_ORATIO_CONFIG` to a file with flat keys (`capture_timeout_ms`, `max_duration_ms`, `inference_deadline_ms`, `heartbeat_ms`, refine/routing/HF/LLM tunables plus `logit_*` keys — see `crates/vox-oratio/src/runtime_config.rs`). Env overrides the file (precedence: CLI args → env → file → defaults for programmatic surfaces; CLI flags win on `vox oratio listen`).
- With the `cuda` feature, default inference is CPU until `VOX_ORATIO_CUDA=1`; status JSON includes `cuda_feature_enabled`, `cuda_requested_via_env`, `inference_note`. `RUST_LOG=vox_oratio_gpu=info` emits `oratio_inference_cpu_default` vs `oratio_inference_gpu` on first session load.
- Session payloads (CLI `listen`, MCP `vox_oratio_transcribe`/`vox_oratio_listen`, `vox-tools` direct executor) support: `timeout_ms` (UX / capture contract), `max_duration_ms` (session wall cap), optional `inference_deadline_ms` (transcribe + refine post-hoc cap), `heartbeat_ms`, `language_hint`, `profile` (`conservative`|`balanced`|`aggressive`), `route_mode` (`none`|`tool`|`chat`|`orchestrator`), `debug_parser_payload`. Responses may include `language_diagnostics`, `deadline_diagnostics`, and MCP `runtime_config` when debugging.
- n-best transcripts: MCP `vox_oratio_transcribe` and `vox_oratio_listen` expose an optional `n_best` (best-first `string[]`) when contextual reranking yields multiple candidates; the listen response also includes the same list on the nested `session` object. Omitted when only one hypothesis survives rerank.
- Routing session memory (tool/chat/orchestrator classifier state): bounded with TTL + max session keys — override with `VOX_ORATIO_ROUTING_SESSION_CAP` (default 4096, floor 64) and `VOX_ORATIO_ROUTING_SESSION_TTL_SECS` (default 86400 s, floor 60 s).
- HTTP transcribe body: `{"path":"relative-or-absolute","language_hint":null}`; multipart upload: `POST /api/audio/transcribe/upload` with field `audio` or `file` (see `vox-audio-ingress`, `contracts/codex-api.openapi.yaml`).
- HTTP streaming WS: `GET /api/audio/transcribe/stream` (WebSocket). Binary messages are PCM `s16le` mono @ 16 kHz chunks; text control messages are JSON (`{"op":"set_language","language_hint":"en"}`, `{"op":"commit"}`, `{"op":"cancel"}`). Server emits JSON text events `ready`, `partial`, `final`, `error`.
- Mix YAML: optional per-source `record_format: asr_refine`.
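The WS streaming contract above reduces to two client-side building blocks: JSON control frames and `s16le` binary frames. A hedged sketch follows — the frame encodings come from the contract as documented here, not from the server implementation:

```typescript
// Control frames for GET /api/audio/transcribe/stream, per the documented ops.
type ControlOp =
  | { op: "set_language"; language_hint: string }
  | { op: "commit" }
  | { op: "cancel" };

function controlFrame(op: ControlOp): string {
  return JSON.stringify(op);
}

// Binary frames are PCM s16le mono @ 16 kHz: convert float samples in [-1, 1]
// to 16-bit signed little-endian integers.
function toS16le(samples: number[]): Uint8Array {
  const out = new Uint8Array(samples.length * 2);
  const view = new DataView(out.buffer);
  samples.forEach((s, i) => {
    const clamped = Math.max(-1, Math.min(1, s));
    view.setInt16(i * 2, Math.round(clamped * 32767), true); // true = little-endian
  });
  return out;
}
```

A client would send `toS16le(chunk)` frames as binary WebSocket messages, then `controlFrame({ op: "commit" })`, and watch the incoming JSON text frames for the `ready`/`partial`/`final`/`error` events.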
Related
- Speech-to-code pipeline (MCP validation parity, corpus `speech_to_code`, KPI contracts): `speech-to-code-pipeline.md`.
- Native fine-tuning (Burn LoRA / `vox mens train`): `mens-training.md`.
- Mens chat tool allowlist: `vox-tools` module `mens_chat` (`chat_tool_definitions` / `execute_tool_calls`), intersecting `vox-capability-registry` with `DirectToolExecutor` — same MCP names as `vox-mcp`. Callers (CLI, daemons, tests) import `vox_tools::mens_chat` when they need OpenAI-style tool JSON or in-process execution.
Out of scope / deprecated
- whisper.cpp / ggml / clang STT: not supported in-tree; old plans under `.cursor/plans/` that cite `whispercpp.rs` are historical — the canonical STT is Candle in `vox-oratio`.