# Speech capture architecture

## Principle
- Edge / client: microphone, file drops, browser `MediaRecorder`, mobile native capture.
- Backend: STT, refinement, routing, codegen, and HIR validation run where `vox-oratio`, `vox-mcp`, and `vox-lsp` validation can execute (developer machine, CI agent host, or container without requiring a container-attached mic). Containers should not assume direct microphone device access; bind-mount a workspace directory or use HTTP upload instead.
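Under this split, a containerized or remote client ships audio over HTTP rather than opening a device. A minimal Python sketch of a multipart upload to the ingress route, assuming a local base URL and a `file` form-field name (both are illustrative assumptions, not the published contract):

```python
import io
import mimetypes
import urllib.request
import uuid

# Assumed bind address; the real one comes from VOX_DASH_HOST / VOX_DASH_PORT.
INGRESS_URL = "http://127.0.0.1:8080/api/audio/transcribe/upload"

def build_multipart(field, filename, payload):
    """Build a multipart/form-data body for a single audio file.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    body = io.BytesIO()
    body.write(f"--{boundary}\r\n".encode())
    body.write(
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'.encode()
    )
    body.write(f"Content-Type: {ctype}\r\n\r\n".encode())
    body.write(payload)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return body.getvalue(), f"multipart/form-data; boundary={boundary}"

def upload(path):
    """POST an audio file to the (assumed) upload endpoint; needs a running ingress."""
    with open(path, "rb") as fh:
        data, content_type = build_multipart("file", path.rsplit("/", 1)[-1], fh.read())
    req = urllib.request.Request(
        INGRESS_URL, data=data, headers={"Content-Type": content_type}
    )
    with urllib.request.urlopen(req) as resp:  # network call
        return resp.read()
```

Keeping the encoding on the client and the pipeline behind HTTP is what lets the same backend serve mic-less containers and browser uploads alike.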
## Surfaces (canonical)
| Surface | Role | Notes |
|---|---|---|
| `vox-audio-ingress` binary | HTTP `/api/audio/status`, `/api/audio/transcribe`, `/api/audio/transcribe/upload` | Bind via `VOX_DASH_HOST` / `VOX_DASH_PORT`; workspace root from `VOX_ORATIO_WORKSPACE` or CWD. |
| MCP `vox_oratio_transcribe`, `vox_oratio_listen` | File-path STT inside the MCP workspace | Compatibility path for agents; same Oratio pipeline as the CLI. |
| MCP `vox_speech_to_code` | Orchestration: path or text → `vox_generate_code` (+ optional `emit_trace_path` JSONL) | Shares `session_id` / repair KPI metadata with codegen. |
| CLI `vox oratio transcribe` / `listen` | File + UX gates | Feature `oratio`. |
| CLI `vox oratio record-transcribe` | Default mic → temp WAV → transcribe | Feature `oratio-mic` (`cpal` + `hound`). |

OpenAPI mirror (Codex HTTP catalog): `contracts/codex-api.openapi.yaml` under `/api/audio/*`.
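As a sketch of how a client might resolve the ingress bind address and call the JSON transcribe route: the environment-variable names come from the table above, but the default host/port and the JSON field names below are assumptions, not the published contract (check the OpenAPI mirror for the real shapes).

```python
import json
import os
import urllib.request

def ingress_base(env=None):
    """Resolve the vox-audio-ingress base URL from VOX_DASH_HOST / VOX_DASH_PORT.

    The defaults here are placeholders; only the variable names are documented.
    """
    env = os.environ if env is None else env
    host = env.get("VOX_DASH_HOST", "127.0.0.1")
    port = env.get("VOX_DASH_PORT", "8080")
    return f"http://{host}:{port}"

def transcribe_path(audio_path, session_id=None):
    """POST a workspace file path to /api/audio/transcribe (field names assumed)."""
    payload = {"path": audio_path}
    if session_id:
        payload["session_id"] = session_id
    req = urllib.request.Request(
        ingress_base() + "/api/audio/transcribe",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires a running ingress
        return resp.read()
```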
## Platform clients (same contracts)
- VS Code / Cursor (`vox-vscode`): Command Palette `Vox: Oratio — …` (`vox.oratio.transcribeFile`, `vox.oratio.speechToCodeFile`, `vox.oratio.voiceCaptureTranscribe`, `vox.oratio.voiceCaptureSpeechToCode`), Explorer context menu on audio files (case-insensitive extension match), plus `onView:vox-sidebar.chat` and `onCommand` entries for contributed `vox.*` commands (including Oratio and inline-edit keybindings) so MCP + speech work without `*.vox` in the workspace. Files already under the workspace use a relative MCP `path`; picks from outside the workspace are copied to `.vox/tmp/`. Voice capture encodes mono 16-bit PCM WAV in the webview before making the same MCP calls. Alternatively, POST audio to `vox-audio-ingress` when a shared HTTP endpoint is configured.
- Browser / web: `MediaRecorder` (or file upload) → `POST /api/audio/transcribe/upload` (or finalize to disk and JSON transcribe in trusted environments).
- Mobile: native capture → same upload contract; do not require the monorepo Docker image on-device (see `mobile-edge-ai.md` for inference ownership).
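The mono 16-bit PCM WAV shape that the webview (and any other client) hands to the pipeline can be sketched with the Python standard library; the 16 kHz sample rate below is an illustrative choice, not a documented requirement:

```python
import io
import math
import struct
import wave

def encode_wav_mono16(samples, sample_rate=16_000):
    """Encode float samples in [-1, 1] as a mono 16-bit PCM WAV byte string."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit PCM
        wav.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", max(-32768, min(32767, int(s * 32767))))
            for s in samples
        )
        wav.writeframes(frames)
    return buf.getvalue()

# Example: 100 ms of a 440 Hz tone at 16 kHz, ready to upload or pass by path.
tone = [math.sin(2 * math.pi * 440 * t / 16_000) for t in range(1_600)]
wav_bytes = encode_wav_mono16(tone)
```

Normalizing to one container format on the client keeps every surface (webview capture, browser upload, mobile) on the identical transcribe contract.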
## Trace and correlation
- Generate correlation IDs with `vox_oratio::trace::new_correlation_id()` and pass `session_id` through MCP for chat/model affinity.
- Optional `emit_trace_path` on `vox_speech_to_code` appends one JSON object per call; fields align with `contracts/speech-to-code/speech_trace.schema.json` (plus `codegen_meta` for tooling).
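A rough Python analogue of the correlation-ID plus JSONL trace flow. Hedged: the UUID format is a stand-in for whatever `vox_oratio::trace::new_correlation_id()` actually emits, and every record field other than `session_id` is illustrative; the authoritative shape is `speech_trace.schema.json`.

```python
import json
import tempfile
import uuid
from pathlib import Path

def new_correlation_id():
    """Stand-in for vox_oratio::trace::new_correlation_id(); format assumed."""
    return uuid.uuid4().hex

def append_trace(trace_path, record):
    """Append one JSON object per call, JSONL-style, like emit_trace_path."""
    with open(trace_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

trace_file = Path(tempfile.mkdtemp()) / "speech_trace.jsonl"
for _ in range(2):  # two speech-to-code calls sharing one session
    append_trace(trace_file, {
        "correlation_id": new_correlation_id(),
        "session_id": "demo-session",  # carried through MCP for model affinity
        "transcript": "add a retry loop here",  # illustrative field
    })

lines = trace_file.read_text(encoding="utf-8").splitlines()
```

Append-only JSONL means downstream KPI tooling can tail the file while calls are still in flight, joining records to codegen output on `session_id`.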