TOESTUB self-healing architecture 2026

This page is the research-backed single source of truth (SSOT) for evolving TOESTUB from a regex-heavy static checker into a self-healing, self-protecting, LLM-aware quality system that feeds negative patterns into Populi/MENS training.

Why this exists

TOESTUB already has strong primitives (TokenMap, structured suppressions, run modes, schema contracts), but stub detection is still mostly literal and line-pattern driven. That shape is fast, but it is weak at semantic unfinished-work detection and at feeding continuous model feedback loops.

External research synthesis (2026)

What top systems do well

Most relevant imported patterns for TOESTUB

  1. Durable incremental analysis (rust-analyzer): volatile user files vs durable generated/vendor/config domains.
  2. Hermetic reproducibility (Trunk/Ruff): deterministic tool/rule/runtime versions in CI and local.
  3. Path/evidence explainability (CodeQL): structured evidence and optional path traces, not only plain-text rule messages.
  4. Rule lifecycle governance (Biome/Clippy): experimental -> shadow -> recommended -> strict.
  5. Hold-the-line rollout (Trunk/golangci-lint): strict on new deltas, gradual cleanup of legacy baseline.
  6. Config and suppression discipline (Ruff/golangci-lint): policy in data contracts, not ad hoc in detector code.
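Pattern 5 (hold-the-line) is the one with the most moving parts, so a minimal sketch may help. It assumes each finding carries a stable fingerprint (rule id plus normalized location/context); the fingerprint scheme shown is illustrative, not TOESTUB's real identity scheme.

```python
# Sketch of hold-the-line gating: pre-existing findings in the accepted
# baseline are tolerated, while any new delta fails the gate.

def partition_findings(findings, baseline_fingerprints):
    """Split findings into legacy (tolerated) and new (gated) sets."""
    legacy, new = [], []
    for finding in findings:
        if finding["fingerprint"] in baseline_fingerprints:
            legacy.append(finding)   # pre-existing debt: tracked, not blocking
        else:
            new.append(finding)      # new delta: held to the strict line
    return legacy, new

def gate_exit_code(findings, baseline_fingerprints):
    """CI fails only on findings absent from the accepted baseline."""
    _, new = partition_findings(findings, baseline_fingerprints)
    return 1 if new else 0
```

Cleanup of the legacy set then shrinks the baseline over time without ever blocking unrelated work.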

Current TOESTUB architectural baseline (in-repo)

Target architecture (self-healing TOESTUB)

flowchart TD
  sourceTree[WorkspaceSourceTree] --> scanner[Scanner]
  scanner --> fileIndex[FileIndexDurabilityTiered]
  fileIndex --> analysisCache[AnalysisContextCache]
  analysisCache --> lexical[LexicalFeatures]
  analysisCache --> ast[ASTFeatures]
  analysisCache --> graph[CallRefGraphFeatures]
  analysisCache --> history[HistoricalFindingFeatures]
  lexical --> scorer[EvidenceScoringModel]
  ast --> scorer
  graph --> scorer
  history --> scorer
  scorer --> findings[FindingsWithConfidenceEvidence]
  findings --> policy[PolicyGateThresholds]
  policy --> fixer[SafeUnsafeFixPlanner]
  fixer --> verify[TargetedVerification]
  verify --> learn[FeedbackCalibrationLoop]
  learn --> populi[PopuliNegativePatternFeed]
  populi --> mens[MENSTrainingCorpus]
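The EvidenceScoringModel and PolicyGate stages above can be sketched as follows; the logistic combination, channel weights, and threshold values are illustrative placeholders, not the real stub-policy contract.

```python
# Four feature channels (lexical, ast, graph, history) are combined into
# one confidence score in [0, 1]; the policy gate maps confidence to an
# effect. Weights/thresholds here are placeholder assumptions.
import math

WEIGHTS = {"lexical": 1.5, "ast": 2.0, "graph": 1.0, "history": 0.5}

def score(features, weights):
    """Logistic combination of per-channel feature scores."""
    z = sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def policy_gate(confidence, error_threshold=0.9, warn_threshold=0.6):
    if confidence >= error_threshold:
        return "error"
    if confidence >= warn_threshold:
        return "warn"
    return "ignore"
```

Keeping the scorer a pure function of named feature channels is what makes the later calibration loop (FeedbackCalibrationLoop) tractable: only the weights and thresholds move, not detector code.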

Do and do-not rules (LLM maintainability critical path)

Do

  • Keep detector logic deterministic and policy-driven through contract files.
  • Emit machine-usable evidence for each finding (confidence, evidence_kind, feature_values).
  • Separate fast lexical checks from slower semantic checks behind staged gates.
  • Require targeted verification before any autofix lands.
  • Keep suppressions structured, owner-tagged, and expiry-aware.
  • Maintain strict JSON schema versioning for all new TOESTUB outputs consumed by CI/MENS pipelines.
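As a concrete illustration of the evidence and versioning rules above, a finding record might look like the following; the confidence / evidence_kind / feature_values fields come from the list, while the remaining field names are assumptions, not the shipped schema.

```python
# Hedged sketch of a machine-usable finding record with strict schema
# versioning for downstream CI/MENS readers.
import json

def make_finding(rule_id, path, line, confidence, evidence_kind, feature_values):
    return {
        "schema_version": "toestub-finding.v1",  # versioned for CI/MENS consumers
        "rule_id": rule_id,
        "location": {"path": path, "line": line},
        "confidence": round(confidence, 3),
        "evidence_kind": evidence_kind,          # e.g. "lexical", "ast", "graph"
        "feature_values": feature_values,        # raw channel scores for calibration
    }

record = make_finding("noop_shell", "crates/core/src/lib.rs", 42,
                      0.87, "ast", {"body_stmt_count": 1, "returns_constant": True})
serialized = json.dumps(record, sort_keys=True)
```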

Do not

  • Do not expand keyword lists indefinitely to chase false negatives.
  • Do not bury exception logic in one-off in-code skips; move it to policy contracts.
  • Do not auto-apply unsafe fixes in CI.
  • Do not couple Populi/MENS ingestion directly to volatile internal structs; use explicit versioned contracts.
  • Do not regress rust_parse_failures budget for feature expansion.

LLM-specific anti-pattern taxonomy (for TOESTUB v2)

TOESTUB should detect these as first-class families, not just text tokens:

  1. No-op implementation shells: function exists, but no side effects, no state transition, no meaningful return.
  2. Behavior-claim mismatch: comments/docs claim completion while implementation evidence is thin.
  3. Hallucinated call surfaces: unresolved callsites with near-neighbor symbol hints indicating probable LLM fabrication.
  4. Adapter-only pass-through chains: wrappers that only relay inputs without semantic contribution across multiple layers.
  5. Dead branch saturation: complex conditionals with trivial branch bodies.
  6. Synthetic constant clusters: hard-coded values introduced in bulk edits without central policy references.
  7. Pseudo-refactors: renamed symbols with stale references across sibling modules.

Populi + MENS integration avenue

Objective

Use TOESTUB findings to generate negative training patterns and policy hardening examples so MENS learns to avoid recurrent LLM failure modes.

VoxDB persistence design (explicit)

This architecture should persist detector and remediation outcomes in VoxDB by reusing existing schema surfaces first, with minimal additive columns where needed.

Existing scaffolding to reuse

Proposed persistence model

  1. Run-level telemetry (reuse research_metrics, no new table initially)
    • session_id: toestub:<repository_id>
    • metric_type:
      • toestub_run_summary
      • toestub_rule_quality
      • toestub_remediation_outcome
      • toestub_training_feedback_export
    • metric_value: a single compact KPI (for example, a precision estimate or a normalized runtime_ms scalar)
    • metadata_json: structured payload containing run ids, policy digest, confidence histograms, FP/FN counters, remediation class totals, and export ids.
  2. State snapshots (reuse TOESTUB tables)
    • Keep full findings snapshots in toestub_baselines.findings_json.
    • Keep fix queue snapshots in toestub_task_queue.fix_suggestions_json.
    • Keep per-file detector cache in toestub_file_cache.
  3. Minimal additive extensions (preferred over new tables)
    • Add optional fields to existing TOESTUB tables for reproducibility and joins:
      • run_id
      • policy_digest
      • rules_digest
      • engine_mode (legacy/shadow/v2)
    • If adding columns is too disruptive for immediate rollout, include these in embedded JSON first, then promote to columns in a later schema baseline.
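Put together, a run-level event shaped for the existing research_metrics surface might look like this; the column names follow the list above, but the exact VoxDB row layout is assumed, not confirmed.

```python
# Sketch of a toestub_run_summary event: KPI scalar in metric_value,
# everything else (run ids, policy digest, counters) in metadata_json.
import json, time

def build_run_summary(repository_id, run_id, policy_digest, precision_estimate, counters):
    return {
        "session_id": f"toestub:{repository_id}",
        "metric_type": "toestub_run_summary",
        "metric_value": precision_estimate,   # compact KPI scalar
        "metadata_json": json.dumps({
            "schema_version": 1,
            "run_id": run_id,
            "policy_digest": policy_digest,
            "engine_mode": "shadow",          # legacy | shadow | v2
            "counters": counters,             # FP/FN and remediation class totals
            "recorded_at": int(time.time()),
        }),
    }
```

Because run_id and policy_digest ride inside metadata_json first, they can be promoted to typed columns later without breaking existing readers.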

Why this is preferred

  • avoids introducing yet another event table,
  • matches existing VoxDB telemetry conventions,
  • keeps compatibility with Codex/MCP readers already consuming research_metrics,
  • allows gradual hardening from JSON payloads to typed columns only where query pressure justifies it.

Query and maintenance guardrails

  • Add lightweight helper APIs in vox-db similar to record_benchmark_event:
    • record_toestub_run_summary
    • record_toestub_rule_quality
    • record_toestub_remediation_outcome
  • Keep the payload schema versioned in JSON (schema_version) so downstream readers do not become brittle.
  • Enforce retention/cleanup policy for noisy run telemetry (avoid unbounded growth).
  • Never store raw secrets or full file contents in telemetry payloads.
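A thin helper in the spirit of record_benchmark_event could enforce several of these guardrails at the write path; record_benchmark_event's real vox-db signature is not reproduced here, and the in-memory store, size budget, and retention sweep are illustrative assumptions.

```python
# Hedged sketch of record_toestub_run_summary: versioned payloads, a size
# cap against file-content dumps, and a simple retention sweep.
import json

MAX_PAYLOAD_BYTES = 16_384  # guardrail: telemetry stays compact, never raw files

def record_toestub_run_summary(store, session_id, metric_value, payload):
    if "schema_version" not in payload:
        raise ValueError("payload must carry schema_version")
    blob = json.dumps(payload)
    if len(blob.encode()) > MAX_PAYLOAD_BYTES:
        raise ValueError("payload exceeds telemetry size budget")
    store.append({"session_id": session_id,
                  "metric_type": "toestub_run_summary",
                  "metric_value": metric_value,
                  "metadata_json": blob})

def sweep_retention(store, keep_last):
    """Drop the oldest run telemetry beyond keep_last entries."""
    del store[:-keep_last]
```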

Integration strategy

  • Add a TOESTUB export contract for training feedback, e.g. contracts/toestub/training-feedback.v1.schema.json.
  • Emit records with:
    • rule_family
    • confidence
    • anonymized structural features
    • optional minimal code window
    • fix class (safe, review_required, reject)
    • outcome label after human/CI adjudication
  • In Populi pipeline, map these records into:
    • negative pattern rows (what to avoid),
    • counterexample rows (preferred correction patterns),
    • trajectory labels for recovery behavior.
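The record shape and the Populi-side routing can be sketched together; the outcome labels and row-kind names below are assumptions about the v1 contract, not its text.

```python
# Illustrative training-feedback record (fields follow the list above)
# plus the mapping into Populi training row kinds.

def feedback_record(rule_family, confidence, features, fix_class, outcome_label,
                    code_window=None):
    return {
        "schema_version": "training-feedback.v1",
        "rule_family": rule_family,
        "confidence": confidence,
        "structural_features": features,   # anonymized, no identifiers
        "code_window": code_window,        # optional minimal snippet
        "fix_class": fix_class,            # safe | review_required | reject
        "outcome_label": outcome_label,    # set after human/CI adjudication
    }

def map_for_populi(record):
    """Route an adjudicated record into a Populi training row kind."""
    if record["outcome_label"] == "true_positive":
        return "negative_pattern"      # what generated code must avoid
    if record["outcome_label"] == "fixed":
        return "counterexample"        # preferred correction pattern
    return "trajectory_label"          # recovery-behavior signal
```

Only adjudicated records (outcome_label set) should cross into Populi, per the poisoning mitigation below.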

Existing docs to align

Evolution model (converge to SSOT, avoid magic values)

Use a contract-first control surface:

  • stub-policy.v1.json: score weights, thresholds, risk multipliers.
  • suppression.v1.schema.json: keep owner/reason/expiry strict.
  • training-feedback.v1.json: immutable event feed to Populi.
  • toestub-run-json.v2.schema.json: add optional evidence summary and calibration stats.

Policy knobs should be loaded dynamically and fingerprinted in output metadata so runs are reproducible and auditable.
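Fingerprinting can be as simple as hashing a canonical JSON encoding of the loaded policy, so key order and whitespace never change the digest; the sha256: prefix convention is an assumption.

```python
# Sketch of policy fingerprinting for reproducible, auditable runs: the
# digest is stable under key reordering because the encoding is canonical.
import hashlib, json

def policy_digest(policy: dict) -> str:
    canonical = json.dumps(policy, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

policy = {"weights": {"ast": 2.0, "lexical": 1.5}, "thresholds": {"error": 0.9}}
digest = policy_digest(policy)
```

Emitting this digest in run output metadata lets any finding be traced back to the exact knob values that produced it.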

Adoption stages

  1. Stage 0 (shadow): new scorer runs in parallel, no gate effect.
  2. Stage 1 (assist): emits warnings with confidence/evidence.
  3. Stage 2 (balanced gate): high-confidence errors gate, medium-confidence warnings annotate.
  4. Stage 3 (self-heal safe): safe autofixes enabled with targeted verification.
  5. Stage 4 (training loop): Populi ingestion drives calibrated threshold updates under governance.
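The same scored finding should produce different CI effects per stage; the decision table below follows the stage list, but the thresholds are illustrative assumptions.

```python
# Minimal sketch of stage-aware gating for the rollout above.

def gate_effect(stage: int, confidence: float) -> str:
    if stage == 0:                      # shadow: observe only, no gate effect
        return "none"
    if stage == 1:                      # assist: annotate with confidence/evidence
        return "warn"
    # stage >= 2: balanced gate and beyond
    if confidence >= 0.9:
        return "error"                  # high confidence gates the build
    if confidence >= 0.6:
        return "warn"                   # medium confidence annotates
    return "none"
```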

Architecture risks and mitigations

  • Risk: semantic scoring increases runtime.
    Mitigation: two-phase pipeline; skip deep analysis for low-signal files.
  • Risk: overfitting to current codebase patterns.
    Mitigation: maintain curated TP/FP/FN fixtures + periodic drift review.
  • Risk: unsafe auto-remediation regressions.
    Mitigation: safe/unsafe fix classes + mandatory targeted tests + rollback.
  • Risk: training data poisoning from noisy findings.
    Mitigation: ingest only adjudicated findings with confidence and outcome labels.
  • Risk: event payload sprawl in generic research_metrics.
    Mitigation: strict payload schemas, version tags, and promotion of only high-value fields into typed columns.
  • Risk: schema churn from over-eager normalization.
    Mitigation: JSON-first for early iterations, then additive columns on proven query paths only.

Minimal success metrics (first promotion)

  • stub/placeholder false-positive rate reduced by at least 40% vs current baseline.
  • No increase in rust_parse_failures.
  • Mean TOESTUB runtime increase <= 20% for crates/ scan in audit mode.
  • At least one Populi ingestion path operational with schema-validated training feedback export.

References