Evaluating AI Plan Adequacy Heuristics

Plan Adequacy Scoring: Heuristics vs. Semantic Validation

1. Context & Analyzed Systems

Evaluation of pre-execution Plan Adequacy signals:

  • Minimum Token Count per task.
  • Maximum Estimated Goal Complexity (heuristic cap at 9 tasks).
  • "Structural Noise" via Task Count limits and "Flat DAG" penalties.
  • Regex Vagueness Detection (e.g., blacklisted words like "TBD", "figure out", "remove").
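The four signals above can be sketched as a single heuristic scorer. This is an illustrative reconstruction, not the evaluated system's actual code; the function name, thresholds, and blocklist contents are assumptions.

```python
import re

# Hypothetical thresholds; the evaluated system's real values may differ.
MIN_TOKENS_PER_TASK = 12      # minimum token count per task (assumed value)
MAX_COMPLEXITY = 9            # heuristic cap on estimated goal complexity
VAGUE_TERMS = re.compile(r"\b(TBD|figure out|remove)\b", re.IGNORECASE)

def score_plan(tasks, dependency_edges):
    """Return the list of heuristic flags raised against a plan.

    tasks: list of task description strings.
    dependency_edges: list of (parent_index, child_index) pairs.
    """
    flags = []
    for i, text in enumerate(tasks):
        if len(text.split()) < MIN_TOKENS_PER_TASK:
            flags.append(f"task {i}: below minimum token count")
        if VAGUE_TERMS.search(text):
            flags.append(f"task {i}: vague/blacklisted term")
    if len(tasks) > MAX_COMPLEXITY:
        flags.append("plan exceeds complexity cap")
    if tasks and not dependency_edges:
        flags.append("flat DAG: no inter-task dependencies declared")
    return flags
```

Every signal here is purely surface-level, which is what makes the failure modes in Section 2 possible.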

2. Empirical Findings & Failure Modes

Evaluation Hacking via Verbosity

Correlating text length or word count with architectural adequacy incentivizes "evaluation hacking".

  • LLMs systematically mask hallucinated logic with fluent verbosity.
  • Dense, highly technical instructions (which are efficient per token) trigger false-positive blocks simply because they fall below arbitrary token minimums.
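Both bullets can be demonstrated against a token-minimum check. The 12-word threshold and the sample strings are hypothetical, chosen only to make the asymmetry visible.

```python
# Hypothetical minimum-token heuristic (12-word threshold is an assumption).
MIN_TOKENS_PER_TASK = 12

def passes_token_minimum(task_text):
    return len(task_text.split()) >= MIN_TOKENS_PER_TASK

# A precise, dense instruction (7 words) vs. content-free verbose padding.
dense = "Rotate TLS certs via ACME; reload nginx."
padded = ("At this stage we will carefully proceed to figure out and "
          "handle whatever remaining items might possibly still need doing.")

assert not passes_token_minimum(dense)   # false-positive block on dense text
assert passes_token_minimum(padded)      # verbose filler sails through
```

The heuristic rewards exactly the fluent padding that masks hallucinated logic.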

A Complexity Cap of 9 Is Anthropocentrically Biased

  • Arbitrarily capping estimated complexity at a threshold of 9 is a misapplication of Miller's law of human working memory (7 ± 2).
  • LLMs do not suffer from human cognitive load limits; their algorithmic capabilities map to context-window and compute constraints. Because task lists can be merged or split at will, compressing a plan under the cap neutralizes the heuristic's signal entirely.
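The neutralization is mechanical: any plan over the cap can be "compressed" under it by concatenating steps, with no loss of actual complexity. A minimal sketch, with the cap value taken from the heuristic above:

```python
# Heuristic cap on task count (from the evaluated heuristic).
MAX_COMPLEXITY = 9

# A 12-step plan, then the same plan squeezed into 9 tasks by merging
# the last four steps into one compound task.
detailed_plan = [f"step {i}" for i in range(12)]
compressed = detailed_plan[:8] + ["; ".join(detailed_plan[8:])]

assert len(detailed_plan) > MAX_COMPLEXITY    # flagged by the cap
assert len(compressed) <= MAX_COMPLEXITY      # passes, identical work
```

The cap therefore measures list formatting, not goal complexity.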

The Limits of Keyword/Regex Validation

  • Flagging vague terms (e.g., TBD) misses semantic ambiguity, producing numerous false negatives on technical filler that is implicitly vague.
  • Keyword blocks for "destructive actions" (e.g., matching "delete/drop") are trivially evaded by declarative or passive constructions (e.g., "The production database's storage should be cleared"). This is a severe security vulnerability.
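The evasion is easy to reproduce. The blocklist below is a hypothetical stand-in for the kind of regex described above; the passive sentence is the example from the text.

```python
import re

# Hypothetical destructive-action blocklist of the kind described above.
DESTRUCTIVE = re.compile(r"\b(delete|drop|truncate)\b", re.IGNORECASE)

active = "Drop the users table in production."
passive = "The production database's storage should be cleared."

assert DESTRUCTIVE.search(active) is not None   # caught by the keyword match
assert DESTRUCTIVE.search(passive) is None      # same intent, evades the filter
```

The filter matches surface tokens, not the action's effect on state, so any paraphrase of the effect bypasses it.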

Flattened Dependency Graphs (Flat DAGs)

  • Identifying Flat DAGs correctly penalizes an LLM's failure to recognize chronological state dependencies.
  • However, enforcing DAG depth purely syntactically causes the LLM to hallucinate arbitrary, non-functional dependency edges to game the evaluation module.
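A purely syntactic depth check, and the gaming behavior, can be sketched as follows. The `dag_depth` helper is hypothetical and assumes edges arrive in a topologically safe order (true for the simple chain used here).

```python
# Hypothetical syntactic DAG-depth check: longest path over (parent, child)
# index pairs. Processing edges in sorted order is safe for chain-shaped
# graphs like the one below, though not for arbitrary DAGs.
def dag_depth(num_tasks, edges):
    depth = [0] * num_tasks
    for parent, child in sorted(edges):
        depth[child] = max(depth[child], depth[parent] + 1)
    return max(depth, default=0)

flat_edges = []                                  # honest but flat plan
gamed_edges = [(i, i + 1) for i in range(4)]     # hallucinated chain 0→1→2→3→4

assert dag_depth(5, flat_edges) == 0     # penalized as a Flat DAG
assert dag_depth(5, gamed_edges) == 4    # arbitrary edges satisfy the check
```

Nothing in the check distinguishes a real chronological dependency from a fabricated edge, so optimizing against it produces exactly the non-functional chains described above.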

3. Validated Architectural Adjustments

  1. Shift to Programmatic Prompts / Preconditions: Avoid text heuristics. Force models to output structured actions accompanied by explicit precondition assertions (e.g., assert database_active == True). Fail adequacy if the precondition logic is absent.
  2. LLMs-as-Formalizers (NL-PDDL): Evaluate natural language via formal semantic frameworks like NL-PDDL. Use lifted regression algorithms for entailment checking, verifying mathematically whether the steps actually entail the final desired state.
  3. Implement LLM-as-a-Judge Coverage Testing: Deprecate keyword regex. Use a fine-tuned evaluator LLM (Socratic Self-Refine) constrained by a rubric to identify missing dependencies, destructive actions framed indirectly, and entity coverage against the prompt.
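Adjustment 1 above can be sketched as a structural check: an action fails adequacy unless it carries at least one well-formed precondition expression. The action schema ("preconditions" as parseable boolean expressions) is an assumption for illustration.

```python
import ast

# Sketch of adjustment 1: reject a structured action unless it declares
# precondition logic that at least parses as a boolean expression.
# The dict schema used here is a hypothetical action format.
def adequate(action):
    preconditions = action.get("preconditions", [])
    if not preconditions:
        return False                      # fail adequacy: no precondition logic
    for expr in preconditions:
        try:
            ast.parse(expr, mode="eval")  # must be a well-formed expression
        except SyntaxError:
            return False
    return True

assert not adequate({"op": "clear_storage"})
assert adequate({"op": "clear_storage",
                 "preconditions": ["database_active == True",
                                   "backup_verified == True"]})
```

Unlike the text heuristics of Section 1, this gate cannot be satisfied by verbosity: the model must emit machine-checkable state assumptions, which a runtime can then evaluate before execution.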