"Mens vision and multimodal inputs (research 2026)"

Mens vision and multimodal inputs (research 2026)

Executive summary

Vox today separates three layers that are easy to conflate:

  1. Orchestrator model selection — Remote catalogs (for example OpenRouter) expose supports_vision when upstream reports image input modalities. Prompt text can also trigger heuristics (infer_prompt_capability_hints in vox-orchestrator).
  2. Native Mens Candle QLoRA and vox mens serve / Schola — Decoder-only text generation with a Hugging Face tokenizer; no in-tree image encoder in the Candle inference engine.
3. Mens training JSONL — TrainingPair in vox-tensor carries UTF-8 strings only (prompt, response, optional turns[].content). There is no first-class attachment field today.

Recommendation: Treat vision as an optional evidence pipeline that produces small structured JSON (rubric output, layout hashes, a11y snapshots) beside compiler metrics. Route raw multimodal inference to remote VLMs until TrainingPair (or a successor row type) and loaders are explicitly versioned and bounded.

Ground truth in repository

| Concern | Location / behavior |
| --- | --- |
| Text-only inference enum | vox-populi: InferenceModel (Qwen2 / Qwen35 variants) in candle_inference_serve.rs — autoregressive text, KV cache, no vision tower. |
| JSONL row shape | vox-tensor data.rs: TrainingPair — no image_url, mime, or bytes_sha256 fields. |
| Vision routing heuristics | vox-orchestrator dei_shim/selection/resolve.rs: substring-based (requires_vision, requires_web_search) from prompt text only. |
| OpenRouter vision flag | vox-orchestrator catalog.rs: supports_vision from architecture.input_modalities containing "image". |
| Compiler + golden gate | vox-compiler tests golden_vox_examples.rs — parse, HIR, WebIR validate, Syntax-K; unrelated to pixels. |
| Screenshot / browser | vox-runtime browser builtins; MCP browser_screenshot — pixels leave the trust boundary unless policy wraps them. |

Design directions

A. Agent-to-agent handoff (near-term, low coupling)

  • Coding agent produces .vox and compiler diagnostics (or VoxIrModule path when emitted).
  • Vision specialist (remote VLM) receives screenshot + fixed rubric and returns JSON validated against a small JSON Schema (widget list, visible errors, primary CTA, route hint).
  • Store vision_rubric.json keyed by fixture_id and sha3(screenshot bytes) next to corpus batch reports; do not embed raw pixels in git-tracked JSONL.
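As a sketch, one vision_rubric.json record could look like the following. Only the rubric fields named above (widget list, visible errors, primary CTA, route hint) come from this document; the exact key names and the placeholder hash are illustrative, not a fixed schema:

```json
{
  "fixture_id": "login_form_01",
  "screenshot_sha3": "<sha3 of screenshot bytes>",
  "widgets": ["email_input", "password_input", "submit_button"],
  "visible_errors": [],
  "primary_cta": "submit_button",
  "route_hint": "/login"
}
```

Keeping only the hash in the record preserves the pixel boundary: the JSON is safe to commit next to corpus batch reports while the screenshot itself stays an untracked artifact.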

B. Explicit task hints (orchestrator)

  • Prefer client-supplied requires_vision and an attachment_manifest (MIME type, content hash, optional URI) over substring inference for high-stakes routes.
  • When heuristics are used, log hint_source: heuristic vs explicit for later evaluation.
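A minimal std-only sketch of that resolution order, assuming hypothetical names (HintSource and resolve_requires_vision are illustrative stand-ins, not the actual vox-orchestrator API; the keyword list is a placeholder for infer_prompt_capability_hints):

```rust
// Prefer an explicit client-supplied hint; fall back to a substring
// heuristic, and record which source decided so routing quality can be
// evaluated later (hint_source: explicit vs heuristic).
#[derive(Debug, PartialEq)]
enum HintSource {
    Explicit,  // client supplied requires_vision directly
    Heuristic, // inferred from prompt text only
}

fn resolve_requires_vision(explicit: Option<bool>, prompt: &str) -> (bool, HintSource) {
    if let Some(v) = explicit {
        // High-stakes routes should land here via an attachment_manifest.
        return (v, HintSource::Explicit);
    }
    // Stand-in for the substring heuristic over prompt text.
    let p = prompt.to_lowercase();
    let hit = ["screenshot", "image", "look at this picture"]
        .iter()
        .any(|kw| p.contains(*kw));
    (hit, HintSource::Heuristic)
}
```

Logging the returned HintSource alongside the routing decision is what makes the later heuristic-vs-explicit evaluation possible.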

C. TrainingPair v2 (research schema, not implemented here)

Document-only requirements for a future serde shape:

  • Optional attachments: [{ kind, mime, sha256, max_bytes, redaction_tier }].
  • Version field training_pair_schema for loaders (VOX_MENS_TRAIN_JSONL_STRICT=1 behavior must be defined per version).
  • Interaction with HF chat templates for Qwen-class VL models (special image tokens) — see mens-qwen-family-migration-research-2026.md and Hugging Face Qwen3_5Config multimodal token ids in upstream docs.
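To make the requirements concrete, a hypothetical std-only Rust shape (research sketch only — not the vox-tensor implementation; serde derives and chat-template handling omitted) might look like:

```rust
// Document-only sketch of a TrainingPair v2 row. Field names mirror the
// requirements above (attachments, training_pair_schema, redaction_tier);
// the real row type does not exist yet.
#[derive(Debug)]
struct AttachmentRef {
    kind: String,       // e.g. "screenshot"
    mime: String,       // e.g. "image/png"
    sha256: String,     // content hash; raw bytes never live in the row
    max_bytes: u64,     // declared upper bound for the referenced blob
    redaction_tier: u8, // policy tier applied before hashing
}

#[derive(Debug)]
struct TrainingPairV2 {
    training_pair_schema: u32, // loaders branch on this version
    prompt: String,
    response: String,
    attachments: Vec<AttachmentRef>,
}

// Example of the kind of strict-mode check a versioned loader might apply
// (the VOX_MENS_TRAIN_JSONL_STRICT=1 behavior would be defined per version).
fn validate_v2(pair: &TrainingPairV2, byte_cap: u64) -> Result<(), String> {
    if pair.training_pair_schema != 2 {
        return Err(format!("unsupported schema {}", pair.training_pair_schema));
    }
    for a in &pair.attachments {
        if a.max_bytes > byte_cap {
            return Err(format!("attachment {} exceeds byte cap", a.kind));
        }
        if a.sha256.len() != 64 || !a.sha256.chars().all(|c| c.is_ascii_hexdigit()) {
            return Err("sha256 must be 64 hex chars".into());
        }
    }
    Ok(())
}
```

The point of the explicit version field is that bounds like byte_cap can tighten per schema version without silently changing loader behavior for older rows.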

D. Cheaper than VL where possible

  • Playwright accessibility tree or DOM snapshot JSON may answer many “what is on screen?” questions without a VLM; compare cost and flakiness before defaulting to vision models in CI.
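For illustration, an accessibility-tree snapshot often answers "what is on screen?" deterministically and without a model call. The excerpt below is abbreviated and hypothetical; it follows the role/name/children convention of Playwright-style accessibility snapshots rather than any vox-runtime output:

```json
{
  "role": "form",
  "name": "Sign in",
  "children": [
    { "role": "textbox", "name": "Email" },
    { "role": "textbox", "name": "Password" },
    { "role": "button", "name": "Sign in" }
  ]
}
```

A snapshot like this can be diffed and asserted on in CI; a VLM verdict on the same screen is both costlier and flakier, which is the comparison the bullet above recommends making first.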

Privacy, telemetry, artifacts

  • Raw screenshots are workspace artifacts — follow workspace artifact retention and vox ci artifact-audit guidance in contributor governance.
  • Any telemetry row that references vision must avoid embedding image bytes; align with telemetry trust SSOT and opt-in persistence flags.

Open questions

  1. Should vox_vision_rubric be a first-class mix lane in mens/config/mix.yaml, or a separate JSONL source consumed only by eval jobs?
  2. Who owns JSON Schema for rubric output — vox-corpus, vox-eval, or contracts/eval/?
  3. Minimum redaction rules before any screenshot hash is logged to research_metrics.