TOESTUB self-healing architecture 2026
This page is the research-backed SSOT for evolving TOESTUB from a regex-heavy static checker into a self-healing, self-protecting, LLM-aware quality system that feeds negative patterns into Populi/MENS training.
Why this exists
TOESTUB already has strong primitives (TokenMap, structured suppressions, run modes, schema contracts), but stub detection is still mostly literal and line-pattern driven. That shape is good for speed but weak for semantic unfinished-work detection and weak for continuous model feedback loops.
External research synthesis (2026)
What top systems do well
- Ruff: performance-first unified toolchain, built-in caching, cascading monorepo config, broad rule coverage, fast autofix loops. Sources: Ruff docs, Ruff FAQ, Ruff configuration discovery.
- rust-analyzer + Salsa: lazy + incremental query graph with durability tiers and architecture invariants around API boundaries. Sources: Architecture, Three architectures blog, Durable incrementality.
- Trunk Code Quality: hermetic runtime/tool management, daemonized background precompute, hold-the-line gating, git-aware partial scans, plugin extensibility. Sources: Trunk code-quality overview, Trunk plugins.
- CodeQL: semantic extraction into queryable databases, path-problem traces, variant analysis at scale. Sources: About CodeQL, About queries, Path queries.
- Semgrep: practical custom-rule authoring with cross-file/cross-function dataflow and a mature language support matrix. Sources: Semgrep docs, Feature definitions, Language maturity summary.
- Biome / Clippy / golangci-lint: explicit safe-vs-unsafe fixes, rule domains/categories, rich suppression and false-positive controls, large-scale runner ergonomics. Sources: Biome linter, Clippy docs, golangci-lint false positives.
Most relevant imported patterns for TOESTUB
- Durable incremental analysis (rust-analyzer): volatile user files vs durable generated/vendor/config domains.
- Hermetic reproducibility (Trunk/Ruff): deterministic tool/rule/runtime versions in CI and local.
- Path/evidence explainability (CodeQL): structured evidence and optional path traces, not only plain-text rule messages.
- Rule lifecycle governance (Biome/Clippy): experimental -> shadow -> recommended -> strict.
- Hold-the-line rollout (Trunk/golangci-lint): strict on new deltas, gradual cleanup of legacy baseline.
- Config and suppression discipline (Ruff/golangci-lint): policy in data contracts, not ad hoc in detector code.
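The durability-tier pattern above can be sketched in a few lines. This is an illustrative classifier, not the actual TOESTUB file index: the durable prefixes shown here are assumptions, and real tiers would come from policy config rather than hard-coded lists.

```rust
// Sketch of durability tiering in the file index, following the
// rust-analyzer pattern: volatile user files are re-analyzed eagerly,
// while durable domains (vendor, generated, lockfiles) are assumed
// stable unless their inputs change.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Durability {
    Volatile,
    Durable,
}

fn classify(path: &str) -> Durability {
    // Domains here are illustrative; real tiers would come from config.
    let durable_prefixes = ["vendor/", "target/", "generated/"];
    if durable_prefixes.iter().any(|p| path.starts_with(p)) || path.ends_with(".lock") {
        Durability::Durable
    } else {
        Durability::Volatile
    }
}
```

In an incremental query graph, the tier decides how aggressively cached analysis for a file is invalidated between runs.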
Current TOESTUB architectural baseline (in-repo)
- Engine orchestrates scan -> per-file parse -> detector pass in `crates/vox-toestub/src/engine.rs`.
- Rust lexical classification for comments/strings in `crates/vox-toestub/src/analysis/token_map.rs`.
- Stub detector in `crates/vox-toestub/src/detectors/stub.rs` still relies on many lexical markers and local exceptions.
- Scanner exclusions in `crates/vox-toestub/src/scanner.rs`.
- Existing reporting/snapshot contracts in:
Target architecture (self-healing TOESTUB)
```mermaid
flowchart TD
    sourceTree[WorkspaceSourceTree] --> scanner[Scanner]
    scanner --> fileIndex[FileIndexDurabilityTiered]
    fileIndex --> analysisCache[AnalysisContextCache]
    analysisCache --> lexical[LexicalFeatures]
    analysisCache --> ast[ASTFeatures]
    analysisCache --> graph[CallRefGraphFeatures]
    analysisCache --> history[HistoricalFindingFeatures]
    lexical --> scorer[EvidenceScoringModel]
    ast --> scorer
    graph --> scorer
    history --> scorer
    scorer --> findings[FindingsWithConfidenceEvidence]
    findings --> policy[PolicyGateThresholds]
    policy --> fixer[SafeUnsafeFixPlanner]
    fixer --> verify[TargetedVerification]
    verify --> learn[FeedbackCalibrationLoop]
    learn --> populi[PopuliNegativePatternFeed]
    populi --> mens[MENSTrainingCorpus]
```
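The scorer-to-policy-gate hop in the diagram can be sketched as a weighted feature combination. All names and weights below are illustrative assumptions, not the actual TOESTUB types: weights would come from `stub-policy.v1.json` rather than code.

```rust
// Hypothetical sketch of the EvidenceScoringModel stage: feature groups
// from the analysis cache are combined into a weighted confidence score,
// then compared against policy thresholds at the gate.

#[derive(Debug)]
struct FeatureScores {
    lexical: f64, // e.g. marker density from TokenMap
    ast: f64,     // e.g. empty-body / trivial-return signals
    graph: f64,   // e.g. unresolved-callsite ratio
    history: f64, // e.g. prior adjudicated findings on this file
}

#[derive(Debug)]
struct ScoreWeights {
    lexical: f64,
    ast: f64,
    graph: f64,
    history: f64,
}

/// Weighted combination clamped to [0, 1]; weights are policy data,
/// never hard-coded in the detector.
fn confidence(f: &FeatureScores, w: &ScoreWeights) -> f64 {
    let raw = f.lexical * w.lexical
        + f.ast * w.ast
        + f.graph * w.graph
        + f.history * w.history;
    raw.clamp(0.0, 1.0)
}

/// Policy gate: high confidence errors, medium annotates, low passes.
fn gate(confidence: f64, error_threshold: f64, warn_threshold: f64) -> &'static str {
    if confidence >= error_threshold {
        "error"
    } else if confidence >= warn_threshold {
        "warn"
    } else {
        "pass"
    }
}
```

Because the scorer is a pure function of feature values and policy weights, the same finding is reproducible from its recorded evidence.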
Do and do-not rules (LLM maintainability critical path)
Do
- Keep detector logic deterministic and policy-driven through contract files.
- Emit machine-usable evidence for each finding (`confidence`, `evidence_kind`, `feature_values`).
- Separate fast lexical checks from slower semantic checks behind staged gates.
- Require targeted verification before any autofix lands.
- Keep suppressions structured, owner-tagged, and expiry-aware.
- Maintain strict JSON schema versioning for all new TOESTUB outputs consumed by CI/MENS pipelines.
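The machine-usable evidence rule above implies a concrete record shape. The struct below is an illustrative sketch, not the actual TOESTUB types; real output would flow through the versioned `toestub-run-json.v2` schema.

```rust
// Illustrative shape for machine-usable evidence attached to each
// finding, per the "Do" list: a confidence score, an evidence kind,
// and the named feature values that fed the scorer.
#[derive(Debug, Clone)]
struct FindingEvidence {
    confidence: f64,
    evidence_kind: String,              // e.g. "no_op_shell"
    feature_values: Vec<(String, f64)>, // named feature scores
}

impl FindingEvidence {
    /// Render a compact key=value summary for logs; machine consumers
    /// would read the structured JSON form instead.
    fn summary(&self) -> String {
        let feats: Vec<String> = self
            .feature_values
            .iter()
            .map(|(k, v)| format!("{k}={v:.2}"))
            .collect();
        format!(
            "{} conf={:.2} [{}]",
            self.evidence_kind,
            self.confidence,
            feats.join(", ")
        )
    }
}
```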
Do not
- Do not expand keyword lists indefinitely to chase false negatives.
- Do not bury exception logic as in-code one-off skips; move to policy contracts.
- Do not auto-apply unsafe fixes in CI.
- Do not couple Populi/MENS ingestion directly to volatile internal structs; use explicit versioned contracts.
- Do not regress the `rust_parse_failures` budget for the sake of feature expansion.
LLM-specific anti-pattern taxonomy (for TOESTUB v2)
TOESTUB should detect these as first-class families, not just text tokens:
- No-op implementation shells: function exists, but no side effects, no state transition, no meaningful return.
- Behavior-claim mismatch: comments/docs claim completion while implementation evidence is thin.
- Hallucinated call surfaces: unresolved callsites with near-neighbor symbol hints indicating probable LLM fabrication.
- Adapter-only pass-through chains: wrappers that only relay inputs without semantic contribution across multiple layers.
- Dead branch saturation: complex conditionals with trivial branch bodies.
- Synthetic constant clusters: hard-coded values introduced in bulk edits without central policy references.
- Pseudo-refactors: renamed symbols with stale references across sibling modules.
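To make the first family concrete, here is a toy heuristic for no-op implementation shells. It is a sketch only: a real detector would classify AST features from the analysis cache, not pre-extracted body text, and the trivial-statement list is an assumption.

```rust
// Toy heuristic for the "no-op implementation shell" family: a body
// that, after stripping blank lines and comments, is empty or contains
// only a single trivial statement.
fn looks_like_noop_shell(body: &str) -> bool {
    let stripped: Vec<&str> = body
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with("//"))
        .collect();
    match stripped.as_slice() {
        // no statements at all
        [] => true,
        // single trivial statement: bare return or default value
        [only] => matches!(
            *only,
            "return;" | "Ok(())" | "None" | "()" | "todo!()" | "unimplemented!()"
        ),
        _ => false,
    }
}
```

In the full pipeline this signal would only contribute a feature score; "function exists but does nothing" becomes an error only when the evidence scorer crosses the policy threshold.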
Populi + MENS integration avenue
Objective
Use TOESTUB findings to generate negative training patterns and policy hardening examples so MENS learns to avoid recurrent LLM failure modes.
VoxDB persistence design (explicit)
This architecture should persist detector and remediation outcomes in VoxDB by reusing existing schema surfaces first, with minimal additive columns where needed.
Existing scaffolding to reuse
- TOESTUB tables in the `toestub_build` domain: `toestub_task_queue`, `toestub_baselines`, `toestub_file_cache`, `toestub_suppressions`. Source: `crates/vox-db/src/schema/domains/toestub_build.rs`
- Generic telemetry/event table: `research_metrics` (`session_id`, `metric_type`, `metric_value`, `metadata_json`, `created_at`). Source: `crates/vox-db/src/schema/domains/agents.rs`
- Existing event-writing patterns: `benchmark_event` via `record_benchmark_event`; `populi_control_event` via `record_populi_control_event`.
Proposed persistence model
- Run-level telemetry (reuse `research_metrics`, no new table initially):
  - `session_id`: `toestub:<repository_id>`
  - `metric_type`: one of `toestub_run_summary`, `toestub_rule_quality`, `toestub_remediation_outcome`, `toestub_training_feedback_export`
  - `metric_value`: compact KPI (for example, a precision estimate or a normalized `runtime_ms` scalar)
  - `metadata_json`: structured payload containing run ids, policy digest, confidence histograms, FP/FN counters, remediation class totals, and export ids.
- State snapshots (reuse TOESTUB tables):
  - Keep full findings snapshots in `toestub_baselines.findings_json`.
  - Keep fix queue snapshots in `toestub_task_queue.fix_suggestions_json`.
  - Keep per-file detector cache in `toestub_file_cache`.
- Minimal additive extensions (preferred over new tables):
  - Add optional fields to existing TOESTUB tables for reproducibility and joins: `run_id`, `policy_digest`, `rules_digest`, `engine_mode` (legacy/shadow/v2).
  - If adding columns is too disruptive for immediate rollout, include these in embedded JSON first, then promote to columns in a later schema baseline.
Why this is preferred
- Avoids introducing yet another event table.
- Matches existing VoxDB telemetry conventions.
- Keeps compatibility with Codex/MCP readers already consuming `research_metrics`.
- Allows gradual hardening from JSON payloads to typed columns only where query pressure justifies it.
Query and maintenance guardrails
- Add lightweight helper APIs in `vox-db` similar to `record_benchmark_event`: `record_toestub_run_summary`, `record_toestub_rule_quality`, `record_toestub_remediation_outcome`.
- Keep the payload schema versioned in JSON (`schema_version`) to avoid brittle readers.
- Enforce a retention/cleanup policy for noisy run telemetry (avoid unbounded growth).
- Never store raw secrets or full file contents in telemetry payloads.
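A minimal sketch of how a `record_toestub_run_summary`-style helper could pack a run into a `research_metrics` row, under the conventions above. The struct, function, and field names are hypothetical; the real `vox-db` API may differ, and a real implementation would build `metadata_json` with a JSON serializer rather than string formatting.

```rust
// Hypothetical run summary destined for the generic research_metrics
// table: one compact KPI in metric_value, the structured remainder in
// a schema_version-tagged JSON payload.
struct ToestubRunSummary {
    run_id: String,
    policy_digest: String,
    findings_total: u32,
    precision_estimate: f64,
}

/// Returns (session_id, metric_type, metric_value, metadata_json),
/// mirroring the research_metrics column shape from the schema above.
fn to_research_metrics_row(repo_id: &str, s: &ToestubRunSummary) -> (String, String, f64, String) {
    let session_id = format!("toestub:{repo_id}");
    let metric_type = "toestub_run_summary".to_string();
    // metric_value holds the single compact KPI (precision estimate here).
    let metric_value = s.precision_estimate;
    // metadata_json is version-tagged so readers can evolve safely.
    let metadata_json = format!(
        "{{\"schema_version\":1,\"run_id\":\"{}\",\"policy_digest\":\"{}\",\"findings_total\":{}}}",
        s.run_id, s.policy_digest, s.findings_total
    );
    (session_id, metric_type, metric_value, metadata_json)
}
```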
Integration strategy
- Add a TOESTUB export contract for training feedback, e.g. `contracts/toestub/training-feedback.v1.schema.json`.
- Emit records with:
  - `rule_family`
  - `confidence`
  - anonymized structural features
  - optional minimal code window
  - fix class (`safe`, `review_required`, `reject`)
  - outcome label after human/CI adjudication
- In Populi pipeline, map these records into:
- negative pattern rows (what to avoid),
- counterexample rows (preferred correction patterns),
- trajectory labels for recovery behavior.
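The Populi-side mapping above can be sketched as a small classifier over adjudicated records. The fix-class values follow the proposed `training-feedback.v1` fields; the outcome labels and mapping rules are illustrative assumptions, not a finalized contract.

```rust
// Sketch of mapping adjudicated training-feedback records into the
// three Populi row kinds listed above. Unadjudicated or noisy findings
// map to None and are never ingested (poisoning guardrail).
#[derive(Debug, PartialEq)]
enum TrainingRow {
    NegativePattern,    // what to avoid
    Counterexample,     // preferred correction pattern
    RecoveryTrajectory, // labeled recovery behavior
}

fn map_feedback(fix_class: &str, outcome: &str) -> Option<TrainingRow> {
    match (fix_class, outcome) {
        // confirmed bad pattern with no accepted fix -> negative example
        ("reject", "confirmed") => Some(TrainingRow::NegativePattern),
        // safe fix applied and verified -> paired counterexample
        ("safe", "fix_verified") => Some(TrainingRow::Counterexample),
        // human-guided repair after review -> recovery trajectory label
        ("review_required", "fix_verified") => Some(TrainingRow::RecoveryTrajectory),
        // everything else is unadjudicated: do not ingest
        _ => None,
    }
}
```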
Existing docs to align
- docs/src/reference/populi.md
- docs/src/reference/mens-training.md
- docs/src/architecture/mens-training-ssot.md
Evolution model (converge to SSOT, avoid magic values)
Use a contract-first control surface:
- `stub-policy.v1.json`: score weights, thresholds, risk multipliers.
- `suppression.v1.schema.json`: keep owner/reason/expiry strict.
- `training-feedback.v1.json`: immutable event feed to Populi.
- `toestub-run-json.v2.schema.json`: add optional evidence summary and calibration stats.
Policy knobs should be loaded dynamically and fingerprinted in output metadata so runs are reproducible and auditable.
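Fingerprinting the loaded policy can be as simple as hashing its canonical serialized form and stamping the digest into run metadata. The sketch below uses std's `DefaultHasher` only to stay dependency-free; a real implementation would hash canonical JSON with a cryptographic function such as SHA-256 so digests are stable across builds.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fingerprint the canonical policy document so every run records which
/// exact knob values produced its findings. Illustration only: swap in
/// SHA-256 over canonical JSON for production use.
fn policy_digest(canonical_policy_json: &str) -> String {
    let mut h = DefaultHasher::new();
    canonical_policy_json.hash(&mut h);
    format!("{:016x}", h.finish())
}
```

Two runs with equal digests are guaranteed to have used identical policy inputs, which is what makes gate decisions auditable after the fact.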
Adoption stages
- Stage 0 (shadow): new scorer runs in parallel, no gate effect.
- Stage 1 (assist): emits warnings with confidence/evidence.
- Stage 2 (balanced gate): high-confidence errors gate, medium-confidence warnings annotate.
- Stage 3 (self-heal safe): safe autofixes enabled with targeted verification.
- Stage 4 (training loop): Populi ingestion drives calibrated threshold updates under governance.
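The stage ladder above implies a simple enforcement rule: no finding blocks CI before Stage 2. The enum and gating logic below are an illustrative sketch of that rule, not the actual engine code.

```rust
// Sketch tying adoption stages to gate behavior; stage names mirror the
// list above, enforcement details are illustrative.
#[derive(Clone, Copy, PartialEq)]
enum Stage {
    Shadow,       // Stage 0: parallel run, no gate effect
    Assist,       // Stage 1: warnings with confidence/evidence
    BalancedGate, // Stage 2: high-confidence errors gate
    SelfHealSafe, // Stage 3: safe autofixes with verification
    TrainingLoop, // Stage 4: calibrated thresholds under governance
}

/// Whether a finding at the given confidence blocks CI in each stage.
fn blocks_ci(stage: Stage, confidence: f64, error_threshold: f64) -> bool {
    match stage {
        // observe/annotate only: never gate
        Stage::Shadow | Stage::Assist => false,
        // from Stage 2 onward, high-confidence findings gate
        _ => confidence >= error_threshold,
    }
}
```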
Architecture risks and mitigations
- Risk: semantic scoring increases runtime. Mitigation: two-phase pipeline; skip deep analysis for low-signal files.
- Risk: overfitting to current codebase patterns. Mitigation: maintain curated TP/FP/FN fixtures plus periodic drift review.
- Risk: unsafe auto-remediation regressions. Mitigation: safe/unsafe fix classes, mandatory targeted tests, and rollback.
- Risk: training data poisoning from noisy findings. Mitigation: ingest only adjudicated findings with confidence and outcome labels.
- Risk: event payload sprawl in the generic `research_metrics` table. Mitigation: strict payload schemas, version tags, and promotion of only high-value fields into typed columns.
- Risk: schema churn from over-eager normalization. Mitigation: JSON-first for early iterations, then additive columns on proven query paths only.
Minimal success metrics (first promotion)
- `stub`/`placeholder` false-positive rate reduced by at least 40% vs the current baseline.
- No increase in `rust_parse_failures`.
- Mean TOESTUB runtime increase <= 20% for `crates/` scan in audit mode.
- At least one Populi ingestion path operational with schema-validated training feedback export.
References
- Ruff: docs, FAQ
- rust-analyzer: architecture, incrementality
- Trunk Code Quality: overview
- CodeQL: about, path queries
- Semgrep: docs, feature definitions
- Biome: linter
- Clippy: docs
- golangci-lint: configuration, false positives