"Telemetry and research_metrics contract"

Row validation is enforced in code by validate_research_metric_row, which is called from append_research_metric. Repository-scoped producers should use TelemetryWriteOptions together with the METRIC_TYPE_* / SESSION_PREFIX_* / SESSION_ID_* constants in vox_db::research_metrics_contract.

Row shape

Table research_metrics columns: session_id, metric_type, metric_value (nullable REAL), metadata_json.

  • metric_value: optional scalar. SQL NULL means “no scalar” — APIs must not coerce NULL to 0.0 (aggregations skip nulls; see list_research_metrics_by_type).
  • metadata_json: structured payload; may include units and names that disambiguate mixed benchmarks.
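The NULL rule above can be illustrated with a minimal sketch that models SQL NULL as `None`. The function names here are hypothetical and are not part of vox_db; the second function exists only to show what goes wrong when NULL is coerced to 0.0.

```rust
/// Mean of the scalars that are actually present; `None` models SQL NULL.
/// Returns `None` when no row carries a scalar, instead of reporting 0.0.
fn mean_skipping_nulls(values: &[Option<f64>]) -> Option<f64> {
    let present: Vec<f64> = values.iter().flatten().copied().collect();
    if present.is_empty() {
        None
    } else {
        Some(present.iter().sum::<f64>() / present.len() as f64)
    }
}

/// Anti-pattern, shown for contrast only: the in-memory equivalent of
/// COALESCE(metric_value, 0.0). Rows without a scalar drag the mean down.
fn mean_coercing_nulls(values: &[Option<f64>]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let total: f64 = values.iter().map(|v| v.unwrap_or(0.0)).sum();
    Some(total / values.len() as f64)
}
```

For the rows `[Some(2.0), None, Some(4.0)]` the first function averages two values, while the second silently averages three, one of which never existed.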

Validation limits (writes)

| Field | Rule |
| --- | --- |
| session_id | Non-empty; max 512 UTF-8 characters. |
| metric_type | Non-empty; max 128 characters; characters must be ASCII alphanumeric or _, ., -, : (colon allows MCP-linked namespaces such as foo:bar). |
| metadata_json | Optional; if present, max 256 KiB serialized length. |
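A rough sketch of these write-side limits follows. It is not the real validate_research_metric_row: the function name, the error type, and the constant names are hypothetical. It also assumes that "512 UTF-8 characters" counts Unicode scalar values and that the 256 KiB cap applies to the serialized byte length; the actual implementation may count differently.

```rust
// Hypothetical constant names; the limits themselves come from the rules above.
const MAX_SESSION_ID_CHARS: usize = 512;
const MAX_METRIC_TYPE_CHARS: usize = 128;
const MAX_METADATA_JSON_BYTES: usize = 256 * 1024;

#[derive(Debug, PartialEq)]
enum RowError {
    EmptySessionId,
    SessionIdTooLong,
    EmptyMetricType,
    MetricTypeTooLong,
    MetricTypeBadChar(char),
    MetadataTooLarge,
}

/// ASCII alphanumeric plus `_`, `.`, `-`, `:`.
fn is_allowed_metric_type_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || matches!(c, '_' | '.' | '-' | ':')
}

/// Checks the three validated fields; `metadata_json` is the serialized payload.
fn check_row(
    session_id: &str,
    metric_type: &str,
    metadata_json: Option<&str>,
) -> Result<(), RowError> {
    if session_id.is_empty() {
        return Err(RowError::EmptySessionId);
    }
    // Assumption: the limit counts characters, not bytes.
    if session_id.chars().count() > MAX_SESSION_ID_CHARS {
        return Err(RowError::SessionIdTooLong);
    }
    if metric_type.is_empty() {
        return Err(RowError::EmptyMetricType);
    }
    if metric_type.chars().count() > MAX_METRIC_TYPE_CHARS {
        return Err(RowError::MetricTypeTooLong);
    }
    if let Some(bad) = metric_type.chars().find(|c| !is_allowed_metric_type_char(*c)) {
        return Err(RowError::MetricTypeBadChar(bad));
    }
    if let Some(json) = metadata_json {
        // Assumption: "serialized length" means bytes of the JSON text.
        if json.len() > MAX_METADATA_JSON_BYTES {
            return Err(RowError::MetadataTooLarge);
        }
    }
    Ok(())
}
```

Under these assumptions, `check_row("mcp:repo", "foo:bar", Some("{}"))` passes, while a metric type containing a space is rejected with the offending character.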

Session id namespaces (convention)

Producers should prefix session_id so rollups and dashboards can group without colliding:

| Prefix | Example | Typical producer |
| --- | --- | --- |
| bench: | bench:<repository_id> | CLI / build timings |
| syntaxk: | syntaxk:<repository_id> | Syntax-K eval fixtures |
| mcp: | mcp:<repository_id> | MCP Socrates / surface telemetry |
| mens: | mens:<repository_id> | Populi control-plane audit (populi_control_event) |
| workflow: | workflow:<repository_id> | Interpreted workflow journal (workflow_journal_entry, versioned event payloads from the workflow durability contract) |

Fixed session (no repository in id): hybrid memory fusion uses session socrates:retrieval and metric type memory_hybrid_fusion (see SESSION_ID_MEMORY_HYBRID_FUSION in the Rust module).

Questioning / linked metrics: MCP may use opaque session_key strings for questioning_event and vox_db_research_metric_linked (not forced through TelemetryWriteOptions); those rows still must satisfy validation caps above.
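The prefix convention can be sketched as a pair of helpers: one that builds a repository-scoped session_id and one that recovers the producer for grouping. The `Producer` enum and both function names are hypothetical; the real module exposes SESSION_PREFIX_* constants whose exact names are not reproduced here.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Producer {
    Bench,
    SyntaxK,
    Mcp,
    Mens,
    Workflow,
}

impl Producer {
    const ALL: [Producer; 5] = [
        Producer::Bench,
        Producer::SyntaxK,
        Producer::Mcp,
        Producer::Mens,
        Producer::Workflow,
    ];

    /// Prefix strings taken from the table above.
    fn prefix(self) -> &'static str {
        match self {
            Producer::Bench => "bench:",
            Producer::SyntaxK => "syntaxk:",
            Producer::Mcp => "mcp:",
            Producer::Mens => "mens:",
            Producer::Workflow => "workflow:",
        }
    }
}

/// Builds `<prefix><repository_id>`, e.g. `bench:my-repo`.
fn session_id_for(producer: Producer, repository_id: &str) -> String {
    format!("{}{}", producer.prefix(), repository_id)
}

/// Returns the producer whose prefix the session id starts with, if any.
/// Fixed or opaque session ids (such as `socrates:retrieval`) yield `None`.
fn producer_of(session_id: &str) -> Option<Producer> {
    Producer::ALL
        .iter()
        .copied()
        .find(|p| session_id.starts_with(p.prefix()))
}
```

A dashboard could group rows with `producer_of` and treat `None` as "fixed or opaque session", which keeps the hybrid memory fusion session and MCP session_key strings out of the repository-scoped buckets.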

Metric types (non-exhaustive)

| metric_type | Session prefix | Scalar semantics | Notes |
| --- | --- | --- | --- |
| benchmark_event | bench:<repository_id> | Optional; unit in metadata metric_value_unit | CLI build timings use seconds for wall time. |
| syntax_k_event | syntaxk:<repository_id> | Optional ratio / timing | Fixture id in metadata; optional support_metrics (representability / LLM surface / runtime projection summaries per contracts/eval/syntax-k-event.schema.json). |
| socrates_surface | mcp:<repository_id> | Hallucination-risk proxy | Prefer metadata for interpretability; eval summaries inject explicit denominators (below). |

socrates_surface aggregate metadata (record_socrates_eval_summary)

Rollups written to eval_runs include JSON with both raw counts and explicit denominators so downstream tools do not misread rates when some rows lack a scalar or parseable metadata:

  • rate_denominator: the literal string "parsed_metadata_rows"; the rates (answer_rate, abstain_rate) are computed over this count.
  • abstain_rate_denominator_n / answer_rate_denominator_n: both equal parsed_metadata_rows.
  • mean_proxy_denominator_n: equals rows_with_metric_value; the mean hallucination-risk proxy uses only rows where metric_value was non-NULL.
  • rows_total_n: equals sample_size, the count of all socrates_surface rows scanned.

Quality in eval_runs uses the mean proxy only when rows_with_metric_value > 0; otherwise quality is 0.0 (avoids implying a perfect score with no scalar signal).
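The three denominators and the quality fallback can be sketched as below. This is not record_socrates_eval_summary: the row and summary types are hypothetical, the `Outcome` enum stands in for whatever the parsed metadata actually encodes, and the mapping from mean proxy to quality is passed in as a closure because the source does not state it. Returning `None` for rates when no metadata parsed is also an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    Answered,
    Abstained,
    Other,
}

/// One socrates_surface row; `outcome` is `None` when metadata is missing
/// or unparseable, `metric_value` is `None` for SQL NULL.
struct SurfaceRow {
    metric_value: Option<f64>,
    outcome: Option<Outcome>,
}

#[derive(Debug, PartialEq)]
struct SurfaceSummary {
    rows_total_n: usize,
    parsed_metadata_rows: usize,
    rows_with_metric_value: usize,
    answer_rate: Option<f64>,
    abstain_rate: Option<f64>,
    mean_proxy: Option<f64>,
    quality: f64,
}

fn summarize(rows: &[SurfaceRow], quality_from_mean: impl Fn(f64) -> f64) -> SurfaceSummary {
    let parsed: Vec<Outcome> = rows.iter().filter_map(|r| r.outcome).collect();
    let scalars: Vec<f64> = rows.iter().filter_map(|r| r.metric_value).collect();

    // Rates use parsed_metadata_rows as the denominator, not the row total.
    let rate = |wanted: Outcome| {
        if parsed.is_empty() {
            None
        } else {
            let hits = parsed.iter().filter(|o| **o == wanted).count();
            Some(hits as f64 / parsed.len() as f64)
        }
    };

    // The mean proxy uses rows_with_metric_value as the denominator.
    let mean_proxy = if scalars.is_empty() {
        None
    } else {
        Some(scalars.iter().sum::<f64>() / scalars.len() as f64)
    };

    SurfaceSummary {
        rows_total_n: rows.len(),
        parsed_metadata_rows: parsed.len(),
        rows_with_metric_value: scalars.len(),
        answer_rate: rate(Outcome::Answered),
        abstain_rate: rate(Outcome::Abstained),
        mean_proxy,
        // No scalar signal means quality 0.0, never an implied perfect score.
        quality: mean_proxy.map(|m| quality_from_mean(m)).unwrap_or(0.0),
    }
}
```

Keeping the three counts separate is the point of the design: a row can contribute to rows_total_n without contributing to either the rate denominator or the mean denominator.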

benchmark_event metadata (BenchmarkEventMeta)

  • name: logical benchmark id (cargo_build_metrics, …).
  • metric_value_unit: when metric_value is set, the single source of truth (SSOT) for its unit (seconds, milliseconds, ratio, …).
  • details: free-form JSON (per-crate timings, pass/fail flags).
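On the consumer side, metric_value_unit is what makes mixed benchmark rows comparable. A minimal sketch of a reader that normalizes timing rows to seconds might look like this; the function name is hypothetical, and only the two time units named above are handled.

```rust
/// Normalizes a benchmark scalar to seconds using the unit from metadata.
/// Returns `None` when there is no scalar (SQL NULL), no unit, or a unit
/// that is not a time unit this sketch knows about (for example `ratio`).
fn wall_time_seconds(metric_value: Option<f64>, metric_value_unit: Option<&str>) -> Option<f64> {
    let value = metric_value?;
    match metric_value_unit? {
        "seconds" => Some(value),
        "milliseconds" => Some(value / 1000.0),
        _ => None,
    }
}
```

Refusing to guess a unit keeps a ratio-valued row from being plotted on a wall-time axis.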

Build timing producers (current)

  • vox ci build-timings (shallow lanes) writes benchmark_event name ci_build_timings with:
    • metric_value: total wall time in seconds,
    • metric_value_unit: seconds,
    • details: lane rows (lane, ok, ms) plus total_ms.
  • vox ci build-timings --deep writes structured rows to build_run / build_crate_sample / build_warning; on structured-write fallback it writes benchmark_event name cargo_build_metrics with metric_value_unit = seconds.
  • VOX_BENCHMARK_TELEMETRY=1 controls benchmark_event writes; structured build_* writes follow command persistence settings and VoxDB availability.

For cross-repo querying via MCP, benchmark_event may use name = "cross_repo_query" with metric_value_unit = "milliseconds" and details such as:

  • query_kind
  • trace_id
  • correlation_id
  • conversation_id
  • workspace_repository_id
  • target_repository_ids
  • source_plane
  • query_backend
  • result_count
  • skipped_count

Training JSONL (telemetry.jsonl)

Envelope per line: { "ts_ms", "event", "payload" }. Payload keys are defined in crates/vox-populi/src/mens/tensor/telemetry_schema.rs (e.g. eta_seconds_remaining, steps_per_sec_ema). The CLI viewer vox mens watch-telemetry must track this schema (guarded by vox ci data-ssot-guards).

Mens training KPI ownership (decision-driving)

  • Tier 1 (gate-driving):
    • tokens_per_sec (with tokens_per_sec_is_proxy when derived),
    • valid_tokens,
    • theoretical_tokens,
    • supervised_ratio_pct.
  • Tier 2 (diagnostic):
    • steps_per_sec_ema,
    • eta_seconds_remaining,
    • skip counters (skip_no_supervised_positions, skip_short_seq, ...).

Deprecation / compatibility window

  • Consumers should prefer canonical fields above.
  • Legacy aliases are still read with warnings (status / eval-gate paths), then normalized at read time.
  • steps_per_sec_ema as a throughput surrogate is considered deprecated for gates when tokens_per_sec is present.
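One way a gate could apply this preference is sketched below. The enum and function are hypothetical, and the fallback to steps_per_sec_ema when tokens_per_sec is absent is an assumption read from the last bullet, not a documented behavior of the eval-gate path.

```rust
#[derive(Debug, PartialEq)]
enum GateThroughput {
    /// Canonical Tier 1 signal; `is_proxy` mirrors tokens_per_sec_is_proxy.
    TokensPerSec { value: f64, is_proxy: bool },
    /// Deprecated surrogate, used only when tokens_per_sec is absent.
    DeprecatedStepsSurrogate { steps_per_sec_ema: f64 },
}

/// Picks the throughput signal a gate should read from one telemetry payload.
fn pick_gate_throughput(
    tokens_per_sec: Option<f64>,
    tokens_per_sec_is_proxy: bool,
    steps_per_sec_ema: Option<f64>,
) -> Option<GateThroughput> {
    match (tokens_per_sec, steps_per_sec_ema) {
        // tokens_per_sec always wins when present.
        (Some(value), _) => Some(GateThroughput::TokensPerSec {
            value,
            is_proxy: tokens_per_sec_is_proxy,
        }),
        (None, Some(ema)) => Some(GateThroughput::DeprecatedStepsSurrogate {
            steps_per_sec_ema: ema,
        }),
        (None, None) => None,
    }
}
```

Returning a distinct variant for the surrogate lets the caller log a deprecation warning without re-inspecting the payload.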

CI

  • vox ci data-ssot-guards: asserts that watch-telemetry references the schema keys and that the research_metrics list API avoids COALESCE(metric_value, 0.0).
  • Web IR structural gate: workflow sets VOX_WEBIR_VALIDATE=1 and runs cargo test -p vox-compiler --test web_ir_lower_emit (see .github/workflows/ci.yml).