ADR 016: Oratio streaming Whisper and constrained decode
Status
Accepted.
Context
Oratio already supports offline Whisper transcription and chunked long-file processing. Product and extension flows require:
- wire-level partial transcript delivery while a user is speaking,
- stronger speech-to-code constraints than post-hoc reranking alone,
- explicit guidance on what stock Whisper can and cannot deliver at low latency.
Decision
- Keep Whisper/Candle as the default STT backend, and expose streaming over the wire using server-side partial events.
- Implement constrained decode inside the decoder loop via a logit-processor hook.
- Treat sub-second acoustic streaming as a quality/latency tradeoff mode, not a guarantee from stock Whisper.
Implementation shape
- Decoder hook: a `LogitProcessor` in `candle_engine`, called before suppress-token masking and token selection.
- Constraint tiers:
- additive hotword/lexicon token bias,
- explicit forbidden token masks,
- optional token-trie constraints for finite command vocab.
- Streaming transport:
- Streaming transport: `vox-audio-ingress` WebSocket endpoint (`/api/audio/transcribe/stream`) for PCM chunk ingest plus partial/final events.
- MCP/clients discover streaming endpoint metadata via `vox_oratio_status`.
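One plausible wire shape for the server-side events, sketched as newline-delimited JSON over WebSocket text frames (field names are assumptions, not the settled contract; the client sends PCM chunks as binary frames):

```json
{"type": "partial", "seq": 3, "text": "fn main", "start_ms": 0, "end_ms": 1240}
{"type": "final",   "seq": 9, "text": "fn main() {", "start_ms": 0, "end_ms": 2480}
```

`partial` events may be revised by later events covering the same audio span; only a `final` event is stable enough for clients to commit (e.g. insert into an editor buffer).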
Consequences
Positive:
- Better speech-to-code controllability without retraining.
- Shared streaming contract for CLI/editor/browser clients.
- Minimal change to existing offline pathways.
Tradeoffs:
- Token-trie constraints are approximate because BPE token boundaries do not align exactly with character-level grammars.
- True low-latency partials may regress word error rate (WER) relative to full-window decode.
- Single-process model mutex still limits concurrent decode sessions.
Follow-ups
- Add VAD-gated incremental decode policy knobs for production defaults.
- Add nightly/e2e streaming tests with deterministic fixtures.
- Evaluate alternate streaming ASR backend behind the same ingress contract if latency SLA requires it.