Evaluating AI Plan Adequacy Heuristics

Plan Adequacy Scoring: Heuristics vs. Semantic Validation

1. Context & Analyzed Systems

Evaluation of pre-execution Plan Adequacy signals:

  • Minimum Token Count per task.
  • Maximum Estimated Goal Complexity (heuristic cap at 9 tasks).
  • "Structural Noise" via Task Count limits and "Flat DAG" penalties.
  • Regex Vagueness Detection (e.g., blacklisted words like "TBD", "figure out", "remove").
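The four signals above can be sketched as a single heuristic scorer. This is an illustrative reconstruction, not the evaluated system's actual code; the function name, thresholds, and blocklist contents are assumptions.

```python
import re

# Hypothetical thresholds; the evaluated system's real values may differ.
MIN_TOKENS_PER_TASK = 12      # minimum token count per task (assumed value)
MAX_COMPLEXITY = 9            # heuristic cap on estimated goal complexity
VAGUE_TERMS = re.compile(r"\b(TBD|figure out|remove)\b", re.IGNORECASE)

def score_plan(tasks, dependency_edges):
    """Return the list of heuristic flags raised against a plan.

    tasks: list of task description strings.
    dependency_edges: list of (parent_index, child_index) pairs.
    """
    flags = []
    for i, text in enumerate(tasks):
        if len(text.split()) < MIN_TOKENS_PER_TASK:
            flags.append(f"task {i}: below minimum token count")
        if VAGUE_TERMS.search(text):
            flags.append(f"task {i}: vague/blacklisted term")
    if len(tasks) > MAX_COMPLEXITY:
        flags.append("plan exceeds complexity cap")
    if tasks and not dependency_edges:
        flags.append("flat DAG: no inter-task dependencies declared")
    return flags
```

Every signal here is purely surface-level, which is what makes the failure modes in Section 2 possible.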

2. Empirical Findings & Failure Modes

Evaluation Hacking via Verbosity

Correlating text length or word count with architectural adequacy incentivizes "evaluation hacking".

  • LLMs systematically mask hallucinated logic with fluent verbosity.
  • Dense, highly technical instructions (which are efficient per token) trigger false-positive blocks simply because they fall below arbitrary token minimums.
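Both bullets can be demonstrated against a token-minimum check. The 12-word threshold and the sample strings are hypothetical, chosen only to make the asymmetry visible.

```python
# Hypothetical minimum-token heuristic (12-word threshold is an assumption).
MIN_TOKENS_PER_TASK = 12

def passes_token_minimum(task_text):
    return len(task_text.split()) >= MIN_TOKENS_PER_TASK

# A precise, dense instruction (7 words) vs. content-free verbose padding.
dense = "Rotate TLS certs via ACME; reload nginx."
padded = ("At this stage we will carefully proceed to figure out and "
          "handle whatever remaining items might possibly still need doing.")

assert not passes_token_minimum(dense)   # false-positive block on dense text
assert passes_token_minimum(padded)      # verbose filler sails through
```

The heuristic rewards exactly the fluent padding that masks hallucinated logic.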

A Complexity Cap of 9 Is Anthropocentrically Biased

  • Arbitrarily capping estimated complexity at a threshold of 9 is a misapplication of Miller's law of human working memory (7 ± 2).
  • LLMs do not suffer from human cognitive load limits; their algorithmic capabilities map to context-window and compute constraints. Because task lists can be merged or split at will, compressing a plan under the cap neutralizes the heuristic's signal entirely.
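The neutralization is mechanical: any plan over the cap can be "compressed" under it by concatenating steps, with no loss of actual complexity. A minimal sketch, with the cap value taken from the heuristic above:

```python
# Heuristic cap on task count (from the evaluated heuristic).
MAX_COMPLEXITY = 9

# A 12-step plan, then the same plan squeezed into 9 tasks by merging
# the last four steps into one compound task.
detailed_plan = [f"step {i}" for i in range(12)]
compressed = detailed_plan[:8] + ["; ".join(detailed_plan[8:])]

assert len(detailed_plan) > MAX_COMPLEXITY    # flagged by the cap
assert len(compressed) <= MAX_COMPLEXITY      # passes, identical work
```

The cap therefore measures list formatting, not goal complexity.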

The Limits of Keyword/Regex Validation

  • Flagging vague terms (e.g., TBD) misses semantic ambiguity, producing numerous false negatives on technical filler that is implicitly vague.
  • Keyword blocks for "destructive actions" (e.g., matching "delete/drop") are trivially evaded by declarative or passive constructions (e.g., "The production database's storage should be cleared"). This is a severe security vulnerability.
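The evasion is easy to reproduce. The blocklist below is a hypothetical stand-in for the kind of regex described above; the passive sentence is the example from the text.

```python
import re

# Hypothetical destructive-action blocklist of the kind described above.
DESTRUCTIVE = re.compile(r"\b(delete|drop|truncate)\b", re.IGNORECASE)

active = "Drop the users table in production."
passive = "The production database's storage should be cleared."

assert DESTRUCTIVE.search(active) is not None   # caught by the keyword match
assert DESTRUCTIVE.search(passive) is None      # same intent, evades the filter
```

The filter matches surface tokens, not the action's effect on state, so any paraphrase of the effect bypasses it.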

Flattened Dependency Graphs (Flat DAGs)

  • Identifying Flat DAGs correctly penalizes an LLM's failure to recognize chronological state dependencies.
  • However, enforcing DAG depth purely syntactically causes the LLM to hallucinate arbitrary, non-functional dependency edges to game the evaluation module.
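A purely syntactic depth check, and the gaming behavior, can be sketched as follows. The `dag_depth` helper is hypothetical and assumes edges arrive in a topologically safe order (true for the simple chain used here).

```python
# Hypothetical syntactic DAG-depth check: longest path over (parent, child)
# index pairs. Processing edges in sorted order is safe for chain-shaped
# graphs like the one below, though not for arbitrary DAGs.
def dag_depth(num_tasks, edges):
    depth = [0] * num_tasks
    for parent, child in sorted(edges):
        depth[child] = max(depth[child], depth[parent] + 1)
    return max(depth, default=0)

flat_edges = []                                  # honest but flat plan
gamed_edges = [(i, i + 1) for i in range(4)]     # hallucinated chain 0→1→2→3→4

assert dag_depth(5, flat_edges) == 0     # penalized as a Flat DAG
assert dag_depth(5, gamed_edges) == 4    # arbitrary edges satisfy the check
```

Nothing in the check distinguishes a real chronological dependency from a fabricated edge, so optimizing against it produces exactly the non-functional chains described above.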

3. Validated Architectural Adjustments

  1. Shift to Programmatic Prompts / Preconditions: Avoid text heuristics. Force models to output structured actions accompanied by explicit precondition assertions (e.g., assert database_active == True). Fail adequacy if the precondition logic is absent.
  2. LLMs-as-Formalizers (NL-PDDL): Evaluate natural language via formal semantic frameworks like NL-PDDL. Use lifted regression algorithms for entailment checking, verifying mathematically whether the steps actually entail the final desired state.
  3. Implement LLM-as-a-Judge Coverage Testing: Deprecate keyword regex. Use a fine-tuned evaluator LLM (Socratic Self-Refine) constrained by a rubric to identify missing dependencies, destructive actions framed indirectly, and entity coverage against the prompt.
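Adjustment 1 above can be sketched as a structural check: an action fails adequacy unless it carries at least one well-formed precondition expression. The action schema ("preconditions" as parseable boolean expressions) is an assumption for illustration.

```python
import ast

# Sketch of adjustment 1: reject a structured action unless it declares
# precondition logic that at least parses as a boolean expression.
# The dict schema used here is a hypothetical action format.
def adequate(action):
    preconditions = action.get("preconditions", [])
    if not preconditions:
        return False                      # fail adequacy: no precondition logic
    for expr in preconditions:
        try:
            ast.parse(expr, mode="eval")  # must be a well-formed expression
        except SyntaxError:
            return False
    return True

assert not adequate({"op": "clear_storage"})
assert adequate({"op": "clear_storage",
                 "preconditions": ["database_active == True",
                                   "backup_verified == True"]})
```

Unlike the text heuristics of Section 1, this gate cannot be satisfied by verbosity: the model must emit machine-checkable state assumptions, which a runtime can then evaluate before execution.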