SKILL.md
Agentic Evaluation
Use this skill when you are designing or implementing an evaluation loop that lets an agent assess and improve its own outputs through iteration — not when you are running a pre-existing test suite or doing a one-off review with no refinement cycle.
The core pattern is: Generate → Evaluate → Critique → Refine → Output, looping until a convergence condition is met or a max-iteration budget is exhausted.
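A minimal sketch of that loop, assuming the caller supplies three hypothetical callables (`generate`, `evaluate`, `refine`) and that scores fall in the range 0 to 1. None of these names come from this skill's reference files; the toy functions exist only so the sketch runs without a model.

```python
from typing import Callable, Tuple


def run_loop(
    task: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str], Tuple[float, str]],  # returns (score, critique)
    refine: Callable[[str, str], str],             # (output, critique) -> new output
    threshold: float = 0.8,
    max_iterations: int = 3,
) -> Tuple[str, float, int]:
    """Generate once, then evaluate and refine until the score meets
    the threshold or the iteration budget is exhausted."""
    output = generate(task)
    score, critique = evaluate(output)
    iterations = 1
    while score < threshold and iterations < max_iterations:
        output = refine(output, critique)
        score, critique = evaluate(output)
        iterations += 1
    return output, score, iterations


# Toy stand-ins (hypothetical): the score simply grows with output length.
def toy_generate(task: str) -> str:
    return task


def toy_evaluate(output: str) -> Tuple[float, str]:
    return min(len(output) / 10, 1.0), "add more detail"


def toy_refine(output: str, critique: str) -> str:
    return output + "!!"


result = run_loop("draft", toy_generate, toy_evaluate, toy_refine,
                  threshold=0.8, max_iterations=5)
```

In a real pipeline the three callables would wrap model calls. Keeping them as separate parameters makes it possible to swap the evaluator without touching the generator.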
Use this skill when
- Implementing a self-critique or reflection loop that feeds output quality back into generation.
- Building an evaluator-optimizer pipeline that separates generation from evaluation responsibilities.
- Designing LLM-as-judge scoring to compare or rank multiple candidate outputs.
- Adding rubric-based scoring with weighted dimensions to iterative generation.
- Setting iteration limits, convergence checks, or structured evaluation output contracts.
- The task requires measurable improvement across runs, not just a single-shot best effort.
Do not use this skill when
- You are running an existing test suite to verify code; use verification-before-completion.
- You are diagnosing a specific failure or bug, not evaluating output quality; use systematic-debugging.
- The goal is writing test coverage (unit tests, integration tests); use test-driven-development.
- You are reviewing a completed artifact once without a refinement loop (a single code review, an editorial pass, a PR check).
Routing boundary
| Situation | Use this skill? | Route instead |
|---|---|---|
| Designing a reflection loop with a score threshold and max iterations | Yes | — |
| Implementing LLM-as-judge comparison of two candidate outputs | Yes | — |
| Running npm test to confirm a fix works | No | verification-before-completion |
| Tracing why a specific assertion fails | No | systematic-debugging |
| Writing Jest or pytest test coverage for a module | No | test-driven-development |
| Reviewing a PR diff once, no iteration | No | review-comment-resolution |
Inputs to gather
Required before starting
- The skill or agent behavior to evaluate.
- The target metric: trigger accuracy, refusal rate, or behavioral assertion coverage.
Helpful if present
- Existing evals or trigger-queries files to extend.
First move
- Identify the skill or behavior to evaluate.
- Check whether a trigger-queries.json already exists; if so, load it to understand scope.
- Open the relevant reference file based on the evaluation type.
Navigation
The three evaluation strategy patterns (outcome-based, LLM-as-judge, rubric-based) and full Python examples are in references/patterns.md.
The implementation checklist — criteria, threshold, loop wiring, convergence, logging — is in assets/eval-checklist.md.
For a new implementation, start with the checklist to confirm your setup is complete, then use the patterns reference to choose and adapt an evaluation strategy.
Outputs
- Evaluation loop design with defined criteria, convergence check, and max iteration budget.
- Structured evaluation scores per iteration with input, output, and critique logged.
- Convergence or budget-exhaustion result confirming the loop terminated cleanly.
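One possible shape for the per-iteration record and the termination result is sketched below. The field names, the `stop_reason` helper, and the sample values are illustrative assumptions, not a format defined by this skill.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class IterationRecord:
    iteration: int
    input: str
    output: str
    score: float
    critique: str


def to_jsonl(records: List[IterationRecord]) -> str:
    """Serialize the trajectory as one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def stop_reason(records: List[IterationRecord], threshold: float,
                max_iterations: int) -> str:
    """Classify how the loop ended (labels are hypothetical)."""
    if records and records[-1].score >= threshold:
        return "converged"
    if len(records) >= max_iterations:
        return "budget_exhausted"
    return "incomplete"


# Made-up sample trajectory, for illustration only.
trajectory = [
    IterationRecord(1, "summarize report", "draft v1", 0.6, "missing key figures"),
    IterationRecord(2, "summarize report", "draft v2", 0.85, "acceptable"),
]
```

Writing one JSON object per line keeps the log appendable, so a loop that stops unexpectedly still leaves a readable history of the iterations that did finish.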
Workflow
See the body and references for agentic evaluation design and loop steps.
Guardrails
- Always set a max_iterations bound (3–5 is a safe default) before wiring up a refinement loop. Unbounded loops stall agents.
- Require structured output (JSON) from the evaluation step so the optimize step has a reliable signal to act on. Free-text critique is fragile.
- Add a convergence check: if the score does not improve between iterations, stop early. Oscillating loops that never converge waste budget.
- Log the full iteration trajectory. Evaluation loops are hard to debug post-hoc without a history of inputs, outputs, scores, and critiques.
- Define evaluation criteria before generating any output. Criteria added mid-loop drift and make scores incomparable across iterations.
- Keep the evaluate step isolated from the generate step. Blending them makes it hard to replace the evaluator or diagnose score instability.
- Handle evaluation parse failures gracefully — if the LLM judge returns malformed JSON, fall back to a safe default (treat as failing) rather than crashing the loop.
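Two of the guardrails above, the parse-failure fallback and the early exit on a stalled score, can be sketched as small pure functions. The judge output shape `{"score": ..., "critique": ...}`, the 0 to 1 score range, and the function names are assumptions made for illustration.

```python
import json
from typing import List, Tuple

# Safe default: a result that cannot be parsed is treated as failing.
FAILING: Tuple[float, str] = (0.0, "evaluation output could not be parsed")


def parse_evaluation(raw: str) -> Tuple[float, str]:
    """Parse judge output of the assumed form
    {"score": <number in [0, 1]>, "critique": <string>}."""
    try:
        data = json.loads(raw)
        score = float(data["score"])
        critique = str(data.get("critique", ""))
    except (ValueError, TypeError, KeyError, AttributeError):
        return FAILING
    if not 0.0 <= score <= 1.0:  # also rejects NaN
        return FAILING
    return score, critique


def should_stop(scores: List[float], threshold: float,
                max_iterations: int) -> bool:
    """Stop when the threshold is met, the budget is spent,
    or the latest score failed to improve on the previous one."""
    if not scores:
        return False
    if scores[-1] >= threshold:
        return True
    if len(scores) >= max_iterations:
        return True
    return len(scores) >= 2 and scores[-1] <= scores[-2]
```

Comparing only the last two scores is the simplest convergence rule. A loop that oscillates with a longer period would need a wider window, for example comparing against the best score seen so far.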
Validation
- should trigger: "I want to add a reflection loop to my code-generation agent so it self-critiques and reruns until the score exceeds 0.85"
- should not trigger: "Run the test suite and tell me if the build passes"
- should not trigger: "Why is this specific assertion failing in my TypeScript tests?"
After implementing an evaluation loop, confirm:
- max_iterations is set and respected by the loop
- Evaluate step returns structured output (JSON or equivalent)
- Convergence check exits early when score does not improve
- All iterations are logged with input, output, score, and critique
- Parse-failure fallback is present on the evaluate step
- Criteria are defined before any generation begins
Examples
- "Add a self-critique loop to my report-generation agent that retries up to three times if the rubric score is below 0.8."
- "Implement an evaluator-optimizer where a separate LLM judge scores code clarity and the generator rewrites until it passes."
- "Build a rubric-based evaluator with accuracy, completeness, and style dimensions that returns a weighted score as JSON."
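As a rough illustration of the third prompt, the sketch below combines per-dimension scores into a weighted overall score and returns it as JSON. The weights and the output keys (`dimensions`, `overall`) are hypothetical. Only the dimension names come from the prompt.

```python
import json
from typing import Dict

# Hypothetical weights; dimension names follow the example prompt.
WEIGHTS: Dict[str, float] = {"accuracy": 0.5, "completeness": 0.3, "style": 0.2}


def weighted_score(dimension_scores: Dict[str, float],
                   weights: Dict[str, float] = WEIGHTS) -> str:
    """Return a JSON string holding the per-dimension scores and
    their weight-normalized average."""
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    overall = sum(weights[name] * dimension_scores[name]
                  for name in weights) / total_weight
    return json.dumps({"dimensions": dimension_scores,
                       "overall": round(overall, 4)})


report = json.loads(
    weighted_score({"accuracy": 1.0, "completeness": 0.5, "style": 0.5})
)
```

Dividing by the total weight means the weights do not have to sum to 1. Raising on a missing dimension keeps an incomplete judgment from being silently scored.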
Reference files
- references/patterns.md: The three evaluation strategy patterns (outcome-based, LLM-as-judge, rubric-based) with annotated Python examples and a best-practices table.
- assets/eval-checklist.md: Implementation checklist covering setup, loop wiring, convergence, logging, and safety items to confirm before shipping.