# Mens vision and multimodal inputs (research 2026)
## Executive summary
Vox today separates three layers that are easy to conflate:
- Orchestrator model selection — Remote catalogs (for example OpenRouter) expose `supports_vision` when upstream reports image input modalities. Prompt text can also trigger heuristics (`infer_prompt_capability_hints` in `vox-orchestrator`).
- Native Mens Candle QLoRA and `vox mens serve` / Schola — Decoder-only text generation with a Hugging Face tokenizer; no in-tree image encoder in the Candle inference engine.
- Mens training JSONL — `TrainingPair` in `vox-tensor` carries UTF-8 strings only (`prompt`, `response`, optional `turns[].content`). There is no first-class attachment field today.
Recommendation: Treat vision as an optional evidence pipeline that produces small structured JSON (rubric output, layout hashes, a11y snapshots) beside compiler metrics. Route raw multimodal inference to remote VLMs until `TrainingPair` (or a successor row type) and loaders are explicitly versioned and bounded.
## Ground truth in repository
| Concern | Location / behavior |
|---|---|
| Text-only inference enum | vox-populi: InferenceModel (Qwen2 / Qwen35 variants) in candle_inference_serve.rs — autoregressive text, KV cache, no vision tower. |
| JSONL row shape | vox-tensor data.rs: TrainingPair — no image_url, mime, or bytes_sha256 fields. |
| Vision routing heuristics | vox-orchestrator dei_shim/selection/resolve.rs: substring-based (requires_vision, requires_web_search) from prompt text only. |
| OpenRouter vision flag | vox-orchestrator catalog.rs: supports_vision from architecture.input_modalities containing "image". |
| Compiler + golden gate | vox-compiler tests golden_vox_examples.rs — parse, HIR, WebIR validate, Syntax-K; unrelated to pixels. |
| Screenshot / browser | vox-runtime browser builtins; MCP browser_screenshot — pixels leave the trust boundary unless policy wraps them. |
## Design directions
### A. Agent-to-agent handoff (near-term, low coupling)
- Coding agent produces `.vox` and compiler diagnostics (or a `VoxIrModule` path when emitted).
- Vision specialist (remote VLM) receives screenshot + fixed rubric and returns JSON validated against a small JSON Schema (widget list, visible errors, primary CTA, route hint).
- Store `vision_rubric.json` keyed by `fixture_id` and `sha3(screenshot bytes)` next to corpus batch reports; do not embed raw pixels in git-tracked JSONL.
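The storage key and rubric validation above can be sketched as follows. The rubric field names (`widgets`, `visible_errors`, `primary_cta`, `route_hint`) mirror the bullet's parenthetical but are illustrative assumptions, not an implemented contract, and the structural check stands in for a real JSON Schema validator.

```python
import hashlib
import json

# Hypothetical rubric shape: required keys mirror the fixed rubric the
# vision specialist fills in (names are assumptions, not a shipped schema).
REQUIRED_KEYS = {"widgets", "visible_errors", "primary_cta", "route_hint"}

def rubric_key(fixture_id: str, screenshot: bytes) -> str:
    """Key rubric output by fixture id plus SHA3 of the raw pixels,
    so the pixels themselves never live in git-tracked JSONL."""
    digest = hashlib.sha3_256(screenshot).hexdigest()
    return f"{fixture_id}:{digest}"

def validate_rubric(raw: str) -> dict:
    """Minimal structural check standing in for a JSON Schema validator."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"rubric missing keys: {sorted(missing)}")
    if not isinstance(data["widgets"], list):
        raise ValueError("widgets must be a list")
    return data

# Example VLM reply, validated before it is stored beside batch reports.
reply = json.dumps({
    "widgets": ["login_form", "submit_button"],
    "visible_errors": [],
    "primary_cta": "submit_button",
    "route_hint": "/login",
})
record = {rubric_key("fixture-001", b"\x89PNG..."): validate_rubric(reply)}
```

Keying by content hash means a re-rendered screenshot with identical pixels reuses the same rubric entry, while any pixel change forces a fresh evaluation.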
### B. Explicit task hints (orchestrator)
- Prefer client-supplied `requires_vision` and an `attachment_manifest` (MIME type, content hash, optional URI) over substring inference for high-stakes routes.
- When heuristics are used, log `hint_source: heuristic` vs `explicit` for later evaluation.
### C. TrainingPair v2 (research schema, not implemented here)
Document-only requirements for a future serde shape:
- Optional `attachments: [{ kind, mime, sha256, max_bytes, redaction_tier }]`.
- Version field `training_pair_schema` for loaders (`VOX_MENS_TRAIN_JSONL_STRICT=1` behavior must be defined per version).
- Interaction with HF chat templates for Qwen-class VL models (special image tokens) — see mens-qwen-family-migration-research-2026.md and Hugging Face `Qwen3_5Config` multimodal token ids in upstream docs.
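A document-only sketch of what a versioned loader could enforce. The field names (`attachments`, `training_pair_schema`, `max_bytes`, `redaction_tier`) come from the bullets above; the byte bound and version-gating behavior are assumptions standing in for decisions this research note leaves open.

```python
import json

MAX_ATTACHMENT_BYTES = 1 << 20  # illustrative bound, not a decided limit

def load_training_pair_v2(line: str, strict: bool = True) -> dict:
    """Sketch of a loader for a hypothetical TrainingPair v2 JSONL row;
    the training_pair_schema field gates which checks apply."""
    row = json.loads(line)
    version = row.get("training_pair_schema", 1)
    if version < 2:
        # v1 rows carry UTF-8 strings only; attachments are not allowed.
        if strict and "attachments" in row:
            raise ValueError("attachments require training_pair_schema >= 2")
        return row
    for att in row.get("attachments", []):
        for key in ("kind", "mime", "sha256", "max_bytes", "redaction_tier"):
            if key not in att:
                raise ValueError(f"attachment missing {key}")
        if att["max_bytes"] > MAX_ATTACHMENT_BYTES:
            raise ValueError("attachment exceeds byte bound")
    return row
```

Defaulting an absent version field to 1 keeps every existing row valid, which is the point of versioning the row rather than the file.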
### D. Cheaper than VL where possible
- Playwright accessibility tree or DOM snapshot JSON may answer many “what is on screen?” questions without a VLM; compare cost and flakiness before defaulting to vision models in CI.
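As a sketch of that cheaper path: a flattened accessibility-tree snapshot (here a hand-written JSON stand-in, not real Playwright output) can already answer simple presence questions without any VLM call.

```python
import json

# Hand-written stand-in for an accessibility-tree snapshot; the real
# shape comes from Playwright's accessibility APIs and will differ.
snapshot = json.loads("""
{"role": "WebArea", "name": "Login", "children": [
  {"role": "textbox", "name": "Email"},
  {"role": "button", "name": "Sign in"}]}
""")

def visible_names(node: dict) -> list:
    """Flatten the tree into (role, name) pairs — a cheap, deterministic
    answer to 'what is on screen?' for many CI assertions."""
    out = [(node["role"], node["name"])]
    for child in node.get("children", []):
        out.extend(visible_names(child))
    return out
```

An assertion like `("button", "Sign in") in visible_names(snapshot)` is deterministic and costs no model tokens, which is the baseline to beat before defaulting to vision models in CI.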
## Privacy, telemetry, artifacts
- Raw screenshots are workspace artifacts — follow workspace artifact retention and `vox ci artifact-audit` guidance in contributor governance.
- Any telemetry row that references vision must avoid embedding image bytes; align with telemetry trust SSOT and opt-in persistence flags.
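One way to honor the no-bytes rule, sketched with assumed field names and an assumed redaction precondition: a telemetry row references a screenshot only by content hash and size, never by pixels.

```python
import hashlib

def vision_telemetry_row(event: str, screenshot: bytes, redacted: bool) -> dict:
    """Reference the screenshot by hash only; raw bytes stay in the
    workspace artifact store and never enter a telemetry row."""
    if not redacted:
        raise ValueError("redaction must run before any hash is logged")
    return {
        "event": event,
        "screenshot_sha3": hashlib.sha3_256(screenshot).hexdigest(),
        "screenshot_bytes": len(screenshot),  # size only, no pixels
    }
```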
## See also
- GUI, v0/islands, vision, and Mens Qwen — virtuous-cycle implementation plan (2026) — execution waves and 50+ concrete work items.
- Vox corpus lab (research 2026) — tiers, batch lanes, eval harness sketch.
- Mens Qwen family migration (research 2026) — text vs multimodal configs upstream.
- Mens training data contract — `validate-batch`, quarantine, lanes.
- Vox source → Mens pipeline SSOT — lexer vs HF tokenizer separation.
- Mens training SSOT / reference — Candle QLoRA-first, serve matrix.
## Open questions
- Should `vox_vision_rubric` be a first-class mix lane in `mens/config/mix.yaml`, or a separate JSONL source consumed only by eval jobs?
- Who owns the JSON Schema for rubric output — `vox-corpus`, `vox-eval`, or `contracts/eval/`?
- Minimum redaction rules before any screenshot hash is logged to `research_metrics`.