Qwen3.5 Multimodal Phase 2 Backlog
This backlog starts only after native text Qwen3.5 support is green in CI/dogfood.
Scope boundary
- Phase 1 (current): native text-only Qwen3.5 (0.8B/2B/4B/9B) in train/merge/serve/gates.
- Phase 2 (this backlog): add multimodal (vision/video token path) for training and inference.
Work items
- Config and model layout extension
  - Extend multimodal config parsing in crates/vox-populi/src/mens/tensor/hf_load.rs for vision_config and token ids (vision_start_token_id, vision_end_token_id, image_token_id, video_token_id).
  - Add explicit architecture guard in preflight for text-only vs multimodal checkpoints.
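The architecture guard in this item could look roughly like the sketch below. It is only an illustration: `MultimodalTokenIds`, `CheckpointKind`, and `classify_checkpoint` are hypothetical names, not existing items in hf_load.rs, and the rule that a missing vision_config means text-only is an assumption.

```rust
/// Token ids listed in the backlog item. All optional here, on the assumption
/// that a text-only config may omit them.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct MultimodalTokenIds {
    pub vision_start_token_id: Option<u32>,
    pub vision_end_token_id: Option<u32>,
    pub image_token_id: Option<u32>,
    pub video_token_id: Option<u32>,
}

/// Hypothetical classification used by the preflight guard.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckpointKind {
    TextOnly,
    Multimodal,
}

/// Classify a checkpoint from its parsed config (hypothetical helper).
/// - no `vision_config`: text-only, token ids are ignored;
/// - `vision_config` plus all four ids: multimodal;
/// - `vision_config` with missing ids: rejected, naming the missing fields.
pub fn classify_checkpoint(
    has_vision_config: bool,
    ids: &MultimodalTokenIds,
) -> Result<CheckpointKind, String> {
    if !has_vision_config {
        return Ok(CheckpointKind::TextOnly);
    }
    let fields = [
        ("vision_start_token_id", ids.vision_start_token_id),
        ("vision_end_token_id", ids.vision_end_token_id),
        ("image_token_id", ids.image_token_id),
        ("video_token_id", ids.video_token_id),
    ];
    // Collect the names of every id that the config did not provide.
    let missing: Vec<&str> = fields
        .iter()
        .filter(|(_, value)| value.is_none())
        .map(|(name, _)| *name)
        .collect();
    if missing.is_empty() {
        Ok(CheckpointKind::Multimodal)
    } else {
        Err(format!(
            "vision_config is present but token ids are missing: {}",
            missing.join(", ")
        ))
    }
}
```

Returning the missing field names in the error is one way to meet the "actionable diagnostics" exit criterion. A partially specified multimodal config is rejected rather than silently treated as text-only.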
- Data contract and corpus pipeline
  - Extend vox_tensor::data::TrainingPair contract to include multimodal payload references and modality tags.
  - Add corpus extract/mix validation for multimodal source rows (required files, max media size, decode status).
  - Add deterministic JSONL schema checks in vox-cli corpus commands to reject malformed multimodal rows early.
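A minimal sketch of the row validation described in this item, assuming a row shape that is not taken from the codebase. `MultimodalPair`, `MediaRef`, `DecodeStatus`, and `validate_row` are hypothetical, and the real TrainingPair fields may differ. File existence is passed in as a flag so the check stays deterministic and needs no filesystem access.

```rust
/// Hypothetical modality tag for a media payload reference.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Modality {
    Image,
    Video,
}

/// Hypothetical decode result recorded by the corpus extract step.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DecodeStatus {
    Ok,
    Failed,
    NotAttempted,
}

/// Hypothetical reference to one media payload attached to a row.
#[derive(Debug, Clone)]
pub struct MediaRef {
    pub modality: Modality,
    pub path: String,
    pub size_bytes: u64,
    /// Filled in by the extract step; kept as data so validation is pure.
    pub file_exists: bool,
    pub decode_status: DecodeStatus,
}

/// Hypothetical multimodal extension of a prompt/response training pair.
#[derive(Debug, Clone)]
pub struct MultimodalPair {
    pub prompt: String,
    pub response: String,
    pub media: Vec<MediaRef>,
}

/// Return every problem found in a row; an empty vector means the row passes.
pub fn validate_row(row: &MultimodalPair, max_media_bytes: u64) -> Vec<String> {
    let mut problems = Vec::new();
    if row.prompt.trim().is_empty() {
        problems.push("prompt is empty".to_string());
    }
    if row.response.trim().is_empty() {
        problems.push("response is empty".to_string());
    }
    for (index, media) in row.media.iter().enumerate() {
        if !media.file_exists {
            problems.push(format!("media[{}]: required file not found: {}", index, media.path));
        }
        if media.size_bytes > max_media_bytes {
            problems.push(format!(
                "media[{}]: size {} bytes exceeds limit of {} bytes",
                index, media.size_bytes, max_media_bytes
            ));
        }
        if media.decode_status != DecodeStatus::Ok {
            problems.push(format!(
                "media[{}]: decode status is {:?}",
                index, media.decode_status
            ));
        }
    }
    problems
}
```

Collecting all problems instead of stopping at the first one lets a corpus command report a full list per row in a single pass.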
- Trainer graph integration
  - Add multimodal embedding ingestion in crates/vox-populi/src/mens/tensor/candle_qlora_train/mod.rs with strict feature gating.
  - Thread modality-aware masking and sequence assembly through training loop and validation.
  - Update manifest fields to include modality counters and multimodal preflight status.
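One possible reading of "modality-aware masking" is to exclude media placeholder tokens from the language-model loss. The sketch below assumes that interpretation and works on plain slices rather than tensors. `loss_mask` and `masked_mean` are hypothetical helpers, not functions from the trainer.

```rust
/// Build a 0/1 loss mask: positions holding a media placeholder token
/// (for example the image or video token id) get 0.0, all others 1.0.
pub fn loss_mask(token_ids: &[u32], media_token_ids: &[u32]) -> Vec<f32> {
    token_ids
        .iter()
        .map(|id| if media_token_ids.contains(id) { 0.0 } else { 1.0 })
        .collect()
}

/// Mean loss over unmasked positions only.
/// Returns None when the lengths differ, when every position is masked,
/// or when the resulting mean is not finite.
pub fn masked_mean(per_token_loss: &[f32], mask: &[f32]) -> Option<f32> {
    if per_token_loss.len() != mask.len() {
        return None;
    }
    let kept: f32 = mask.iter().sum();
    if kept == 0.0 {
        return None;
    }
    // Sum only the losses at kept positions, so masked values are never read
    // into the total.
    let total: f32 = per_token_loss
        .iter()
        .zip(mask)
        .filter(|(_, m)| **m > 0.0)
        .map(|(loss, _)| *loss)
        .sum();
    let mean = total / kept;
    if mean.is_finite() {
        Some(mean)
    } else {
        None
    }
}
```

Returning None for a non-finite mean gives the training loop a single place to detect the finite-loss condition that the nightly smoke run is meant to check.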
- Inference serve path
  - Extend crates/vox-populi/src/mens/tensor/candle_inference_serve.rs to accept multimodal prompt payloads.
  - Add modality-aware tokenization/packing and guardrails when requested modality is unsupported by loaded checkpoint.
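The serve-side guardrail could be sketched as follows, assuming the loaded checkpoint exposes simple capability flags. `Modality`, `CheckpointCaps`, and `check_request` are hypothetical names used only for illustration.

```rust
/// Hypothetical modality of one part of a prompt payload.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Modality {
    Text,
    Image,
    Video,
}

/// Hypothetical capability flags derived from the loaded checkpoint.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CheckpointCaps {
    pub image: bool,
    pub video: bool,
}

impl CheckpointCaps {
    /// Text is always supported; image and video depend on the checkpoint.
    pub fn supports(self, modality: Modality) -> bool {
        match modality {
            Modality::Text => true,
            Modality::Image => self.image,
            Modality::Video => self.video,
        }
    }
}

/// Reject a request before tokenization if it uses an unsupported modality.
pub fn check_request(requested: &[Modality], caps: CheckpointCaps) -> Result<(), String> {
    let unsupported: Vec<Modality> = requested
        .iter()
        .copied()
        .filter(|modality| !caps.supports(*modality))
        .collect();
    if unsupported.is_empty() {
        Ok(())
    } else {
        Err(format!(
            "loaded checkpoint does not support requested modalities: {:?}",
            unsupported
        ))
    }
}
```

Running this check before tokenization and packing means a text-only checkpoint fails fast with a clear message instead of producing output from ignored media.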
- Merge and artifact compatibility
  - Extend adapter metadata schema for multimodal capability flags.
  - Add merge validation for multimodal-sensitive keys and reject incomplete merges for multimodal checkpoints.
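As a rough illustration of the merge validation above, the check could compare merged tensor names against a list of required key prefixes. The prefix list is deliberately a parameter: which keys count as multimodal-sensitive is not specified in this backlog. Both helper names are hypothetical.

```rust
/// Return each required prefix that no merged tensor name starts with.
pub fn missing_prefixes(tensor_names: &[&str], required_prefixes: &[&str]) -> Vec<String> {
    required_prefixes
        .iter()
        .filter(|prefix| !tensor_names.iter().any(|name| name.starts_with(**prefix)))
        .map(|prefix| prefix.to_string())
        .collect()
}

/// Reject a merge of a multimodal checkpoint when required keys are absent.
/// Text-only merges skip the check entirely.
pub fn validate_merge(
    is_multimodal: bool,
    tensor_names: &[&str],
    required_prefixes: &[&str],
) -> Result<(), String> {
    if !is_multimodal {
        return Ok(());
    }
    let missing = missing_prefixes(tensor_names, required_prefixes);
    if missing.is_empty() {
        Ok(())
    } else {
        Err(format!(
            "incomplete multimodal merge, no tensors found for: {}",
            missing.join(", ")
        ))
    }
}
```

For example, with a placeholder prefix such as "visual." (chosen for illustration, not taken from a real checkpoint), a merged artifact that carries only language-model tensors would be rejected when the adapter metadata marks it as multimodal.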
- CI and regression coverage
  - Add synthetic multimodal fixture tests in crates/vox-populi/tests.
  - Add CI contract checks for multimodal schema + parser + preflight gates (without requiring large media artifacts).
  - Add optional nightly multimodal smoke for short-run finite-loss and artifact checks on GPU runners.
Exit criteria for Phase 2
- Multimodal preflight rejects bad checkpoints/data with actionable diagnostics.
- Multimodal train path runs with finite loss and writes checkpoints in the nightly smoke run.
- Serve path can load multimodal-enabled artifacts and run basic generation.
- CI includes deterministic multimodal contract tests and no regressions in text-only Qwen3.5 paths.