Mens local serving SSOT (Schola + orchestrator)
What this page is for
After vox mens train / vox-schola train (Candle QLoRA, default), the supported local inference server is vox-schola serve (also reached via vox mens serve --model <run_dir>, which spawns vox-schola). It loads the run directory (candle_qlora_adapter.safetensors, tokenizer.json, shards) and exposes:
- POST /v1/chat/completions — OpenAI Chat Completions
- POST /api/chat — Ollama-shaped chat (used by MCP vox-mcp when the provider is Ollama)
- POST /api/generate — Ollama-shaped generate (required for vox-ludus streaming and vox-runtime PopuliClient::generate)
- GET /api/tags — model list for probes
- GET /api/version — JSON including a cuda hint when --device is CUDA (for capability probes)
- POST /api/embeddings — 501 (not implemented; use Ollama.app or another stack for embeddings)
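As an illustration of the Ollama-shaped surface, here is a minimal sketch of the request bodies a client would send to the generate and chat endpoints. Field names follow the common Ollama HTTP conventions; the base URL and model name are placeholders, not values Schola mandates:

```python
import json

BASE = "http://127.0.0.1:11435"  # Schola's listen address (placeholder)

def generate_request(model: str, prompt: str, stream: bool = True) -> dict:
    """Body for POST /api/generate (Ollama-shaped generate)."""
    return {"model": model, "prompt": prompt, "stream": stream}

def chat_request(model: str, messages: list) -> dict:
    """Body for POST /api/chat (Ollama-shaped chat); the OpenAI
    /v1/chat/completions body carries the same model/messages core."""
    return {"model": model, "messages": messages}

print(json.dumps(generate_request("my-mens", "Hello")))
```

The same two builders cover the three chat/generate endpoints above; only the URL path and streaming framing differ.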
This is not the same process as Ollama.app on http://localhost:11434, but it speaks a compatible subset of Ollama HTTP so you can point POPULI_URL (or OLLAMA_URL) at Schola’s listen address.
Quick start
- Train (example):
  vox mens train --device cuda --output-dir mens/runs/latest
- Serve:
  vox-schola serve --model mens/runs/latest --port 11435 --model-name my-mens
  (or vox mens serve --model mens/runs/latest with the same effective flags where forwarded)
- Point clients at Schola:
  - POPULI_URL=http://127.0.0.1:11435 (takes precedence over OLLAMA_URL; see vox_config::inference::local_ollama_populi_base_url)
  - POPULI_MODEL=my-mens must match the name returned by GET /api/tags (Schola's --model-name, else the run directory's final path component)
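The URL precedence above (POPULI_URL wins over OLLAMA_URL, mirroring vox_config::inference::local_ollama_populi_base_url) can be sketched as follows; the final fallback to Ollama.app's default port is an assumption for illustration, not something this page guarantees:

```python
def populi_base_url(env: dict) -> str:
    """POPULI_URL takes precedence over OLLAMA_URL; falling back to
    Ollama.app's default port is an illustrative assumption."""
    return (env.get("POPULI_URL")
            or env.get("OLLAMA_URL")
            or "http://localhost:11434")

# Schola wins even when a stock Ollama URL is also set.
print(populi_base_url({"POPULI_URL": "http://127.0.0.1:11435",
                       "OLLAMA_URL": "http://localhost:11434"}))
```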
Orchestrator and agent-to-agent
The in-tree orchestrator’s AiTaskProcessor uses vox_ludus::FreeAiClient, which calls POST …/api/generate for the local Ollama lane. Schola implements /api/generate, so orchestrator streaming works when POPULI_URL targets Schola.
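Ollama-shaped /api/generate streams newline-delimited JSON records. A hedged sketch of consuming such a stream (the response/done field names follow the Ollama generate format; this is not FreeAiClient's actual code):

```python
import json
from typing import Iterable

def collect_stream(lines: Iterable[str]) -> str:
    """Concatenate the 'response' chunks of an Ollama-shaped NDJSON
    generate stream, stopping at the record marked done."""
    out = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        out.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(out)

sample = ['{"response": "Hel", "done": false}',
          '{"response": "lo", "done": true}']
print(collect_stream(sample))  # → Hello
```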
Vox.toml [mesh] (or legacy [mens]) can record a stable inference base for operators and tooling:
[mesh]
control_url = "http://127.0.0.1:9847" # Populi mesh control plane (optional)
inference_base_url = "http://127.0.0.1:11435" # Schola or Ollama-shaped server
This maps to OrchestratorConfig::populi_inference_base_url. Processes still read POPULI_URL from the environment today: when starting workers or daemons, set POPULI_URL to that value (or export VOX_ORCHESTRATOR_POPULI_INFERENCE_BASE_URL and copy it into POPULI_URL in your launcher). The config field is the SSOT for the intended URL in workspace TOML.
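A minimal launcher-side sketch of that copy, assuming only the two environment variable names given above (the non-clobbering behavior is a design choice of this sketch, not a documented guarantee):

```python
def propagate_populi_url(env: dict) -> dict:
    """Mirror VOX_ORCHESTRATOR_POPULI_INFERENCE_BASE_URL into
    POPULI_URL (which processes actually read today), without
    clobbering an explicitly set POPULI_URL."""
    base = env.get("VOX_ORCHESTRATOR_POPULI_INFERENCE_BASE_URL")
    if base and "POPULI_URL" not in env:
        env = {**env, "POPULI_URL": base}
    return env

env = propagate_populi_url(
    {"VOX_ORCHESTRATOR_POPULI_INFERENCE_BASE_URL": "http://127.0.0.1:11435"})
print(env["POPULI_URL"])
```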
The default model registry uses POPULI_MODEL for the local Ollama provider entry (ModelConfig::default); keep it aligned with Schola’s advertised model id.
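A quick alignment check between POPULI_MODEL and what the server advertises can be sketched like this; the payload shape (a models list with name fields) follows the Ollama /api/tags convention and is an assumption about Schola's exact response:

```python
def model_advertised(tags_payload: dict, wanted: str) -> bool:
    """True if `wanted` matches a model name in an Ollama-shaped
    GET /api/tags response: {"models": [{"name": ...}, ...]}."""
    return any(m.get("name") == wanted
               for m in tags_payload.get("models", []))

tags = {"models": [{"name": "my-mens"}]}
print(model_advertised(tags, "my-mens"))  # True
```

Running this check at worker startup catches a stale POPULI_MODEL before the first generate call fails.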
MCP
MCP’s Ollama bridge uses POST /api/chat, which Schola already implements. With OLLAMA_HOST (or an equivalent base-URL setting) pointing at Schola, MCP and Schola interoperate without code changes.
Machine-readable handoff
Training completion writes external_serving_handoff_v1.json in the run directory (schema: contracts/eval/external-serving-handoff.schema.json). vox mens merge-qlora / vox-schola merge write the same filename next to the merged shard’s parent directory for external (vLLM / HF / Ollama import) workflows.
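A sketch of consuming the handoff from a run directory: the filename comes from the text above, but the fields read here are illustrative placeholders, not the schema in contracts/eval/external-serving-handoff.schema.json:

```python
import json
import pathlib
import tempfile

HANDOFF = "external_serving_handoff_v1.json"

def read_handoff(run_dir: str) -> dict:
    """Load the machine-readable handoff written at training completion."""
    return json.loads((pathlib.Path(run_dir) / HANDOFF).read_text())

# Demo with a synthetic run directory; the JSON contents are made up.
with tempfile.TemporaryDirectory() as d:
    (pathlib.Path(d) / HANDOFF).write_text('{"schema": "illustrative"}')
    print(read_handoff(d)["schema"])
```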
Burn vox mens serve (execution-api)
A separate, Burn-checkpoint HTTP server exists behind execution-api for *.bin / merge-weights artifacts. That path is not the default QLoRA story; prefer Schola for trained QLoRA runs. See the Mens native training SSOT for the train → merge → serve matrix.