Speech capture architecture (edge vs backend)

Principle

  • Edge / client: microphone, file drops, browser MediaRecorder, mobile native capture.
  • Backend: STT, refinement, routing, codegen, and HIR validation run wherever vox-oratio, vox-mcp, and vox-lsp can execute (developer machine, CI agent host, or container), without requiring a container-attached mic.

Containers should not assume direct microphone device access; bind-mount a workspace directory or use HTTP upload instead.
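Under this principle, a container-side client ships audio over HTTP rather than reading a device. A minimal sketch of building such an upload with only the standard library is below; the /api/audio/transcribe/upload path comes from this doc, but the base URL and the "file" form-field name are illustrative assumptions, not the contract.

```python
# Sketch: prepare a multipart/form-data POST to the vox-audio-ingress
# upload endpoint instead of attaching a microphone to the container.
# Assumptions: base_url and the "file" field name are hypothetical.
import io
import uuid
import urllib.request

def build_upload_request(wav_bytes: bytes, filename: str,
                         base_url: str = "http://localhost:8080"):
    """Build (but do not send) a multipart POST for /api/audio/transcribe/upload."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    body.write(f"--{boundary}\r\n".encode())
    body.write(
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: audio/wav\r\n\r\n".encode()
    )
    body.write(wav_bytes)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return urllib.request.Request(
        f"{base_url}/api/audio/transcribe/upload",
        data=body.getvalue(),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )

# Build the request without sending it (no server assumed here).
req = build_upload_request(b"RIFF....WAVE", "clip.wav")
```

Sending it is then a single `urllib.request.urlopen(req)` once the ingress host/port are known from VOX_DASH_HOST / VOX_DASH_PORT.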

Surfaces (canonical)

| Surface | Role | Notes |
| --- | --- | --- |
| vox-audio-ingress binary | HTTP /api/audio/status, /api/audio/transcribe, /api/audio/transcribe/upload | Bind via VOX_DASH_HOST / VOX_DASH_PORT; workspace root from VOX_ORATIO_WORKSPACE or CWD. |
| MCP vox_oratio_transcribe, vox_oratio_listen | File-path STT inside the MCP workspace | Compatibility path for agents; same Oratio pipeline as the CLI. |
| MCP vox_speech_to_code | Orchestration: path or text → vox_generate_code (+ optional emit_trace_path JSONL) | Shares session_id / repair KPI metadata with codegen. |
| CLI vox oratio transcribe / listen | File + UX gates | Feature oratio. |
| CLI vox oratio record-transcribe | Default mic → temp WAV → transcribe | Feature oratio-mic (cpal + hound). |

OpenAPI mirror (Codex HTTP catalog): contracts/codex-api.openapi.yaml under /api/audio/*.

Platform clients (same contracts)

  • VS Code / Cursor (vox-vscode): Command Palette entries Vox: Oratio — … (vox.oratio.transcribeFile, vox.oratio.speechToCodeFile, vox.oratio.voiceCaptureTranscribe, vox.oratio.voiceCaptureSpeechToCode) and an Explorer context menu on audio files (case-insensitive extension match). Activation also includes onView:vox-sidebar.chat and onCommand entries for the contributed vox.* commands (including Oratio and inline-edit keybindings), so MCP + speech work without *.vox in the workspace. Files already under the workspace use a relative MCP path; picks from outside are copied to .vox/tmp/. Voice capture encodes mono 16-bit PCM WAV in the webview before making the same MCP calls. Alternatively, POST audio to vox-audio-ingress when a shared HTTP endpoint is configured.
  • Browser / web: MediaRecorder (or file upload) → POST /api/audio/transcribe/upload (or, in trusted environments, finalize the audio to disk and use the JSON transcribe endpoint).
  • Mobile: native capture → same upload contract; do not require the monorepo Docker image on-device (see mobile-edge-ai.md for inference ownership).
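All three clients converge on the same byte format: the doc states voice capture encodes mono 16-bit PCM WAV before upload. A minimal sketch of that encoding step, using the standard library (the 16 kHz sample rate and sine-tone input are illustrative assumptions, not part of the contract):

```python
# Sketch: package raw capture samples as a mono 16-bit PCM WAV blob,
# mirroring what the webview voice capture does before the MCP/HTTP call.
# Assumption: 16 kHz sample rate and the test tone are illustrative only.
import io
import math
import struct
import wave

def encode_mono_pcm16(samples: list, sample_rate: int = 16000) -> bytes:
    """Encode float samples in [-1.0, 1.0] as mono 16-bit PCM WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)         # mono
        w.setsampwidth(2)         # 16-bit samples
        w.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        w.writeframes(frames)
    return buf.getvalue()

# 100 ms of a 440 Hz test tone at 16 kHz
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
wav_bytes = encode_mono_pcm16(tone)
```

The resulting bytes can be posted to /api/audio/transcribe/upload or written to a temp file for the file-path MCP tools.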

Trace and correlation

  • Generate correlation IDs with vox_oratio::trace::new_correlation_id() and pass session_id through MCP for chat/model affinity.
  • Optional emit_trace_path on vox_speech_to_code appends one JSON object per call; fields align with contracts/speech-to-code/speech_trace.schema.json (plus codegen_meta for tooling).
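The append-only JSONL shape can be sketched as follows. This is a hedged illustration: uuid4 stands in for vox_oratio::trace::new_correlation_id(), and the record fields shown (correlation_id, session_id, transcript) are examples, not the authoritative speech_trace.schema.json field list.

```python
# Sketch: append one JSON object per vox_speech_to_code call to an
# emit_trace_path JSONL file. Assumptions: uuid4 stands in for
# vox_oratio::trace::new_correlation_id(); field names are illustrative.
import json
import uuid
from pathlib import Path

def append_trace(trace_path: Path, session_id: str, transcript: str) -> dict:
    """Append a single trace record as one JSON line and return it."""
    record = {
        "correlation_id": str(uuid.uuid4()),
        "session_id": session_id,
        "transcript": transcript,
    }
    with trace_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

path = Path("speech_trace.jsonl")
rec = append_trace(path, session_id="sess-123", transcript="rename foo to bar")
```

One object per line keeps the trace greppable and lets tooling tail the file without parsing prior calls.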