Plan Adequacy Scoring: Heuristics vs. Semantic Validation
1. Context & Analyzed Systems
Evaluation of pre-execution Plan Adequacy signals:
- Minimum Token Count per task.
- Maximum Estimated Goal Complexity (heuristic cap at 9 tasks).
- "Structural Noise" via Task Count limits and "Flat DAG" penalties.
- Regex Vagueness Detection (e.g., blacklisted words like "TBD", "figure out", "remove").
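Taken together, these signals can be sketched as one scoring pass. This is a minimal illustration; the thresholds (`MIN_TOKENS`, `MAX_TASKS`) and the blacklist contents are assumptions standing in for the analyzed systems' actual values:

```python
import re

MIN_TOKENS = 20   # assumed minimum token count per task
MAX_TASKS = 9     # the heuristic complexity cap discussed below
VAGUE_TERMS = re.compile(r"\b(TBD|figure out|remove)\b", re.IGNORECASE)

def heuristic_adequacy(tasks: list[str], edges: list[tuple[int, int]]) -> list[str]:
    """Return the list of heuristic violations for a plan."""
    violations = []
    if len(tasks) > MAX_TASKS:
        violations.append(f"complexity cap exceeded: {len(tasks)} > {MAX_TASKS}")
    for i, task in enumerate(tasks):
        if len(task.split()) < MIN_TOKENS:
            violations.append(f"task {i}: below minimum token count")
        if VAGUE_TERMS.search(task):
            violations.append(f"task {i}: vague term matched")
    # "Flat DAG" penalty: a multi-task plan with no dependency edges at all
    if len(tasks) > 1 and not edges:
        violations.append("flat DAG: no dependency edges")
    return violations
```

Note that every check here operates on surface form only, which is precisely what the failure modes in the next section exploit.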
2. Empirical Findings & Failure Modes
Evaluation Hacking via Verbosity
Correlating text length or word count with architectural adequacy incentivizes "evaluation hacking".
- LLMs systematically mask hallucinated logic behind fluent verbosity.
- Dense, highly technical instructions (which are token-efficient precisely because they are dense) trigger false-positive blocks simply because they fall below arbitrary token minimums.
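The false-positive case is easy to reproduce. Assuming a 20-token minimum (an illustrative threshold), a correct, efficient instruction is blocked while low-content padding passes:

```python
MIN_TOKENS = 20  # assumed per-task minimum; the exact value is arbitrary either way

def passes_length_check(task: str) -> bool:
    return len(task.split()) >= MIN_TOKENS

dense = "Rotate TLS certs via ACME; reload nginx."  # adequate but efficient
padded = ("In this step we will, generally speaking, proceed to carefully "
          "consider rotating the certificates and then, as appropriate, "
          "thoughtfully reload the relevant web server components.")

# The dense plan is blocked; the padded, lower-content plan sails through.
```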
The Complexity Cap of 9 Is Psychologically Biased
- Arbitrarily capping estimated complexity at 9 misapplies Miller's law of human working memory ($7 \pm 2$ chunks).
- LLMs do not share human cognitive-load limits; their capacity maps to context-window and compute constraints. Compressing plan complexity to a human-memory threshold therefore discards heuristic signal rather than adding it.
The Limits of Keyword/Regex Validation
- Flagging vague terms (e.g., "TBD") misses semantic ambiguity, producing mass false negatives: implicitly vague technical filler passes as long as no blacklisted token appears.
- Keyword blocks for "destructive actions" (e.g., matching "delete"/"drop") are completely evaded by simple declarative or passive phrasing (e.g., "The production database's storage should be cleared"). This is a severe security vulnerability.
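The evasion is trivial to demonstrate. The pattern below is an illustrative stand-in for the kind of destructive-action blacklist described above:

```python
import re

# Assumed destructive-action blacklist of the kind described above
DESTRUCTIVE = re.compile(r"\b(delete|drop|truncate|destroy)\b", re.IGNORECASE)

imperative = "Drop the production database."
declarative = "The production database's storage should be cleared."

caught = DESTRUCTIVE.search(imperative) is not None   # True: keyword matched
evaded = DESTRUCTIVE.search(declarative) is None      # True: same action, no match
```

The second sentence describes exactly the same destructive action, yet contains no blacklisted token.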
Flattened Dependency Graphs (Flat DAGs)
- Identifying Flat DAGs correctly penalizes an LLM's failure to recognize chronological state dependencies.
- However, enforcing DAG depth purely syntactically causes the LLM to hallucinate arbitrary, non-functional dependency edges to game the evaluation module.
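A purely syntactic depth check, sketched below, cannot distinguish real dependencies from hallucinated ones: a fabricated linear chain satisfies it just as well as a genuine one.

```python
def dag_depth(n_tasks: int, edges: list[tuple[int, int]]) -> int:
    """Longest dependency chain (counted in edges) of a plan DAG."""
    children = {i: [] for i in range(n_tasks)}
    for src, dst in edges:
        children[src].append(dst)

    def depth(node: int) -> int:
        return max((depth(c) + 1 for c in children[node]), default=0)

    return max(depth(i) for i in range(n_tasks))

flat = dag_depth(4, [])                          # 0 -> penalized as a Flat DAG
gamed = dag_depth(4, [(0, 1), (1, 2), (2, 3)])   # 3 -> passes, even if every edge is fake
```

Nothing in `dag_depth` inspects what the tasks actually do, so the model is rewarded for inventing edges, not for modeling state.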
3. Validated Architectural Adjustments
- Shift to Programmatic Prompts / Preconditions: Avoid text heuristics. Force models to output structured actions accompanied by explicit precondition assertions (e.g., assert database_active == true). Fail adequacy if the precondition logic does not exist.
- LLMs-as-Formalizers (NL-PDDL): Evaluate natural language via formal semantic frameworks such as NL-PDDL. Use lifted regression to perform entailment checking, verifying mathematically whether the steps actually entail the desired final state.
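The precondition gate can be sketched as follows. The dict schema and the `precondition` field name are assumptions; the point is that adequacy fails structurally, not textually, when no parseable precondition logic is present:

```python
import ast

def has_precondition(action: dict) -> bool:
    """Adequacy gate: the action must carry parseable precondition logic."""
    pre = action.get("precondition")
    if not pre:
        return False
    try:
        tree = ast.parse(pre)
    except SyntaxError:
        return False
    # Require an actual assert statement, not just any text in the field
    return any(isinstance(node, ast.Assert) for node in ast.walk(tree))

ok = {"op": "clear_storage", "precondition": "assert database_active == True"}
bad = {"op": "clear_storage"}  # no precondition logic -> fail adequacy
```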
- Implement LLM-as-a-Judge Coverage Testing: Deprecate keyword regex. Use a fine-tuned evaluator LLM (Socratic Self-Refine) constrained by a rubric to identify missing dependencies, destructive actions framed in indirect or passive language, and entity-coverage gaps relative to the prompt.
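One way to wire the rubric-constrained judge is sketched below. The rubric wording and the `call_llm` callable are hypothetical stand-ins for whatever evaluator model and transport the deployment uses:

```python
import json

# Illustrative rubric; the three axes mirror the failure modes above.
RUBRIC = (
    "Score the plan 0-2 on each axis and return JSON only:\n"
    "1. dependencies: are chronological/state dependencies explicit?\n"
    "2. destructive: are state-destroying steps stated outright?\n"
    "3. entities: does every entity in the prompt appear in some step?\n"
    'Format: {"dependencies": n, "destructive": n, "entities": n}'
)

def judge_plan(prompt: str, plan: str, call_llm) -> dict:
    """Return rubric scores plus an overall adequacy verdict."""
    reply = call_llm(f"{RUBRIC}\n\nPROMPT:\n{prompt}\n\nPLAN:\n{plan}")
    scores = json.loads(reply)
    adequate = all(v >= 1 for v in scores.values())
    return {"adequate": adequate, **scores}
```

Unlike a regex, a rubric score of 0 on the "destructive" axis catches the declarative/passive phrasings that evade keyword blocks.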