# Mens vision and multimodal inputs (research 2026)
## Executive summary
Vox today separates three layers that are easy to conflate:
- Orchestrator model selection — Remote catalogs (for example OpenRouter) expose `supports_vision` when upstream reports image input modalities. Prompt text can also trigger heuristics (`infer_prompt_capability_hints` in `vox-orchestrator`).
- Native Mens Candle QLoRA and `vox mens serve` / Schola — Decoder-only text generation with a Hugging Face tokenizer; no in-tree image encoder in the Candle inference engine.
- Mens training JSONL — `TrainingPair` in `vox-tensor` carries UTF-8 strings only (`prompt`, `response`, optional `turns[].content`). There is no first-class attachment field today.
Recommendation: Treat vision as an optional evidence pipeline that produces small structured JSON (rubric output, layout hashes, a11y snapshots) beside compiler metrics. Route raw multimodal inference to remote VLMs until `TrainingPair` (or a successor row type) and loaders are explicitly versioned and bounded.
## Ground truth in repository
| Concern | Location / behavior |
|---|---|
| Text-only inference enum | vox-populi: InferenceModel (Qwen2 / Qwen35 variants) in candle_inference_serve.rs — autoregressive text, KV cache, no vision tower. |
| JSONL row shape | vox-tensor data.rs: TrainingPair — no image_url, mime, or bytes_sha256 fields. |
| Vision routing heuristics | vox-orchestrator dei_shim/selection/resolve.rs: substring-based (requires_vision, requires_web_search) from prompt text only. |
| OpenRouter vision flag | vox-orchestrator catalog.rs: supports_vision from architecture.input_modalities containing "image". |
| Compiler + golden gate | vox-compiler tests golden_vox_examples.rs — parse, HIR, WebIR validate, Syntax-K; unrelated to pixels. |
| Screenshot / browser | vox-runtime browser builtins; MCP browser_screenshot — pixels leave the trust boundary unless policy wraps them. |
## Design directions
### A. Agent-to-agent handoff (near-term, low coupling)
- Coding agent produces `.vox` and compiler diagnostics (or a `VoxIrModule` path when emitted).
- Vision specialist (remote VLM) receives screenshot + fixed rubric and returns JSON validated against a small JSON Schema (widget list, visible errors, primary CTA, route hint).
- Store `vision_rubric.json` keyed by `fixture_id` and `sha3(screenshot bytes)` next to corpus batch reports; do not embed raw pixels in git-tracked JSONL.
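The storage key and rubric validation above can be sketched as follows. The rubric field names (`widgets`, `visible_errors`, `primary_cta`, `route_hint`) mirror the bullet's parenthetical but are illustrative assumptions, not an implemented contract, and the structural check stands in for a real JSON Schema validator.

```python
import hashlib
import json

# Hypothetical rubric shape: required keys mirror the fixed rubric the
# vision specialist fills in (names are assumptions, not a shipped schema).
REQUIRED_KEYS = {"widgets", "visible_errors", "primary_cta", "route_hint"}

def rubric_key(fixture_id: str, screenshot: bytes) -> str:
    """Key rubric output by fixture id plus SHA3 of the raw pixels,
    so the pixels themselves never live in git-tracked JSONL."""
    digest = hashlib.sha3_256(screenshot).hexdigest()
    return f"{fixture_id}:{digest}"

def validate_rubric(raw: str) -> dict:
    """Minimal structural check standing in for a JSON Schema validator."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"rubric missing keys: {sorted(missing)}")
    if not isinstance(data["widgets"], list):
        raise ValueError("widgets must be a list")
    return data

# Example VLM reply, validated before it is stored beside batch reports.
reply = json.dumps({
    "widgets": ["login_form", "submit_button"],
    "visible_errors": [],
    "primary_cta": "submit_button",
    "route_hint": "/login",
})
record = {rubric_key("fixture-001", b"\x89PNG..."): validate_rubric(reply)}
```

Keying by content hash means a re-rendered screenshot with identical pixels reuses the same rubric entry, while any pixel change forces a fresh evaluation.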
### B. Explicit task hints (orchestrator)
- Prefer client-supplied `requires_vision` and an `attachment_manifest` (MIME type, content hash, optional URI) over substring inference for high-stakes routes.
- When heuristics are used, log `hint_source: heuristic` vs `explicit` for later evaluation.
### C. TrainingPair v2 (research schema, not implemented here)
Document-only requirements for a future serde shape:
- Optional `attachments: [{ kind, mime, sha256, max_bytes, redaction_tier }]`.
- Version field `training_pair_schema` for loaders (`VOX_MENS_TRAIN_JSONL_STRICT=1` behavior must be defined per version).
- Interaction with HF chat templates for Qwen-class VL models (special image tokens) — see mens-qwen-family-migration-research-2026.md and Hugging Face `Qwen3_5Config` multimodal token ids in upstream docs.
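A document-only sketch of what a versioned loader could enforce. The field names (`attachments`, `training_pair_schema`, `max_bytes`, `redaction_tier`) come from the bullets above; the byte bound and version-gating behavior are assumptions standing in for decisions this research note leaves open.

```python
import json

MAX_ATTACHMENT_BYTES = 1 << 20  # illustrative bound, not a decided limit

def load_training_pair_v2(line: str, strict: bool = True) -> dict:
    """Sketch of a loader for a hypothetical TrainingPair v2 JSONL row;
    the training_pair_schema field gates which checks apply."""
    row = json.loads(line)
    version = row.get("training_pair_schema", 1)
    if version < 2:
        # v1 rows carry UTF-8 strings only; attachments are not allowed.
        if strict and "attachments" in row:
            raise ValueError("attachments require training_pair_schema >= 2")
        return row
    for att in row.get("attachments", []):
        for key in ("kind", "mime", "sha256", "max_bytes", "redaction_tier"):
            if key not in att:
                raise ValueError(f"attachment missing {key}")
        if att["max_bytes"] > MAX_ATTACHMENT_BYTES:
            raise ValueError("attachment exceeds byte bound")
    return row
```

Defaulting an absent version field to 1 keeps every existing row valid, which is the point of versioning the row rather than the file.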
### D. Cheaper than VL where possible
- Playwright accessibility tree or DOM snapshot JSON may answer many “what is on screen?” questions without a VLM; compare cost and flakiness before defaulting to vision models in CI.
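As a sketch of that cheaper path: a flattened accessibility-tree snapshot (here a hand-written JSON stand-in, not real Playwright output) can already answer simple presence questions without any VLM call.

```python
import json

# Hand-written stand-in for an accessibility-tree snapshot; the real
# shape comes from Playwright's accessibility APIs and will differ.
snapshot = json.loads("""
{"role": "WebArea", "name": "Login", "children": [
  {"role": "textbox", "name": "Email"},
  {"role": "button", "name": "Sign in"}]}
""")

def visible_names(node: dict) -> list:
    """Flatten the tree into (role, name) pairs — a cheap, deterministic
    answer to 'what is on screen?' for many CI assertions."""
    out = [(node["role"], node["name"])]
    for child in node.get("children", []):
        out.extend(visible_names(child))
    return out
```

An assertion like `("button", "Sign in") in visible_names(snapshot)` is deterministic and costs no model tokens, which is the baseline to beat before defaulting to vision models in CI.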
## Privacy, telemetry, artifacts
- Raw screenshots are workspace artifacts — follow workspace artifact retention and `vox ci artifact-audit` guidance in contributor governance.
- Any telemetry row that references vision must avoid embedding image bytes; align with telemetry trust SSOT and opt-in persistence flags.
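One way to honor the no-bytes rule, sketched with assumed field names and an assumed redaction precondition: a telemetry row references a screenshot only by content hash and size, never by pixels.

```python
import hashlib

def vision_telemetry_row(event: str, screenshot: bytes, redacted: bool) -> dict:
    """Reference the screenshot by hash only; raw bytes stay in the
    workspace artifact store and never enter a telemetry row."""
    if not redacted:
        raise ValueError("redaction must run before any hash is logged")
    return {
        "event": event,
        "screenshot_sha3": hashlib.sha3_256(screenshot).hexdigest(),
        "screenshot_bytes": len(screenshot),  # size only, no pixels
    }
```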
## See also
- GUI, v0/islands, vision, and Mens Qwen — virtuous-cycle implementation plan (2026) — execution waves and 50+ concrete work items.
- Vox corpus lab (research 2026) — tiers, batch lanes, eval harness sketch.
- Mens Qwen family migration (research 2026) — text vs multimodal configs upstream.
- Mens training data contract — `validate-batch`, quarantine, lanes.
- Vox source → Mens pipeline SSOT — lexer vs HF tokenizer separation.
- Mens training SSOT / reference — Candle QLoRA-first, serve matrix.
## Open questions
- Should `vox_vision_rubric` be a first-class mix lane in `mens/config/mix.yaml`, or a separate JSONL source consumed only by eval jobs?
- Who owns the JSON Schema for rubric output — `vox-corpus`, `vox-eval`, or `contracts/eval/`?
- Minimum redaction rules before any screenshot hash is logged to `research_metrics`.