agentic-eval Agent Skill | mattriley.tools

SKILL.md

Agentic Evaluation

Use this skill when you are designing or implementing an evaluation loop that lets an agent assess and improve its own outputs through iteration — not when you are running a pre-existing test suite or doing a one-off review with no refinement cycle.

The core pattern is: Generate → Evaluate → Critique → Refine → Output, looping until a convergence condition is met or a max-iteration budget is exhausted.
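
The loop above can be sketched roughly as follows. This is a minimal illustration, not the skill's reference implementation: `refine_loop`, the `generate` and `evaluate` callables, and the toy functions are all hypothetical names, and the default threshold of 0.8 is an assumption.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures (assumptions, not part of the skill):
#   generate(previous_output, critique) -> new output
#   evaluate(output) -> (score in [0, 1], critique text)
GenerateFn = Callable[[Optional[str], Optional[str]], str]
EvaluateFn = Callable[[str], Tuple[float, str]]


def refine_loop(
    generate: GenerateFn,
    evaluate: EvaluateFn,
    threshold: float = 0.8,
    max_iterations: int = 3,
) -> Tuple[str, float, int]:
    """Generate, evaluate, and refine until the threshold is met, the score
    stops improving, or the iteration budget runs out."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    best_output, best_score = "", float("-inf")
    previous_output, previous_score, critique = None, None, None
    iterations_used = 0
    for iterations_used in range(1, max_iterations + 1):
        output = generate(previous_output, critique)  # first pass gets None, None
        score, critique = evaluate(output)
        if score > best_score:
            best_output, best_score = output, score
        if score >= threshold:
            break  # convergence condition met
        if previous_score is not None and score <= previous_score:
            break  # no improvement since the last iteration, stop early
        previous_output, previous_score = output, score
    return best_output, best_score, iterations_used


# Toy stand-ins so the sketch runs without an LLM.
def toy_generate(previous: Optional[str], critique: Optional[str]) -> str:
    return "draft" if previous is None else previous + "!"


def toy_evaluate(output: str) -> Tuple[float, str]:
    return min(1.0, 0.5 + 0.2 * output.count("!")), "add emphasis"


result = refine_loop(toy_generate, toy_evaluate, threshold=0.8, max_iterations=5)
```

Returning the best-scoring output rather than the last one is a design choice made here so that a late regression does not discard an earlier, better draft.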

Use this skill when

Implementing a self-critique or reflection loop that feeds output quality back into generation.
Building an evaluator-optimizer pipeline that separates generation from evaluation responsibilities.
Designing LLM-as-judge scoring to compare or rank multiple candidate outputs.
Adding rubric-based scoring with weighted dimensions to iterative generation.
Setting iteration limits, convergence checks, or structured evaluation output contracts.
The task requires measurable improvement across runs, not just a single-shot best effort.
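
As an illustration of the LLM-as-judge case, the sketch below ranks candidate outputs by a judge score. The `rank_candidates` helper and the `stub_judge` function are hypothetical; in a real pipeline the judge would presumably wrap a model call, which is not shown here.

```python
from typing import Callable, List, Tuple


def rank_candidates(
    candidates: List[str], judge: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Score every candidate with the judge and return them best first.

    sorted() is stable, so candidates with equal scores keep their input order.
    """
    scored = [(candidate, judge(candidate)) for candidate in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def stub_judge(text: str) -> float:
    """Hypothetical stand-in for an LLM judge: rewards answers that give a reason."""
    return 1.0 if "because" in text else 0.0


ranking = rank_candidates(
    ["Use a cache.", "Use a cache because reads dominate writes."],
    stub_judge,
)
best_candidate = ranking[0][0]
```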

Do not use this skill when

You are running an existing test suite to verify code — use verification-before-completion.
You are diagnosing a specific failure or bug, not evaluating output quality — use systematic-debugging.
The goal is writing test coverage (unit tests, integration tests) — use test-driven-development.
You are reviewing a completed artifact once without a refinement loop (a single code review, an editorial pass, a PR check).

Routing boundary

Situation	Use this skill?	Route instead
Designing a reflection loop with a score threshold and max iterations	Yes	—
Implementing LLM-as-judge comparison of two candidate outputs	Yes	—
Running `npm test` to confirm a fix works	No	`verification-before-completion`
Tracing why a specific assertion fails	No	`systematic-debugging`
Writing Jest or pytest test coverage for a module	No	`test-driven-development`
Reviewing a PR diff once, no iteration	No	`review-comment-resolution`

Inputs to gather

Required before starting

The skill or agent behavior to evaluate.
The target metric: trigger accuracy, refusal rate, or behavioral assertion coverage.

Helpful if present

Existing evals or trigger-queries files to extend.

First move

Identify the skill or behavior to evaluate.
Check whether a trigger-queries.json already exists; if so, load it to understand scope.
Open the relevant reference file based on the evaluation type.
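
The second step might look like the sketch below. It assumes the file sits directly in the skill's directory and makes no assumption about its schema, since neither is specified here; `load_trigger_queries` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_trigger_queries(skill_dir: str) -> Optional[Any]:
    """Return the parsed trigger-queries.json if it exists, else None.

    The location (directly under skill_dir) is an assumption for this sketch.
    """
    path = Path(skill_dir) / "trigger-queries.json"
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

A `None` result signals that there is no existing scope to extend, so the evaluation set starts from scratch.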

Navigation

The three evaluation strategy patterns (outcome-based, LLM-as-judge, rubric-based) and full Python examples are in references/patterns.md.

The implementation checklist — criteria, threshold, loop wiring, convergence, logging — is in assets/eval-checklist.md.

For a new implementation, start with the checklist to confirm your setup is complete, then use the patterns reference to choose and adapt an evaluation strategy.

Outputs

Evaluation loop design with defined criteria, convergence check, and max iteration budget.
Structured evaluation scores per iteration with input, output, and critique logged.
Convergence or budget-exhaustion result confirming the loop terminated cleanly.
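
One possible shape for the per-iteration record is sketched below. The `IterationRecord` class, its field names, and the JSON Lines serialization are assumptions chosen for illustration; the skill only requires that input, output, score, and critique are logged.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class IterationRecord:
    """One logged step of the loop (field names are hypothetical)."""
    iteration: int
    input_text: str
    output: str
    score: float
    critique: str


def trajectory_to_jsonl(records: List[IterationRecord]) -> str:
    """Serialize the trajectory as one JSON object per line."""
    return "\n".join(json.dumps(asdict(record), sort_keys=True) for record in records)


trajectory = [
    IterationRecord(1, "summarize the report", "first draft", 0.6, "missing the conclusion"),
    IterationRecord(2, "summarize the report", "second draft", 0.85, "meets the rubric"),
]
log_text = trajectory_to_jsonl(trajectory)
```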

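Workflow

Confirm the setup against assets/eval-checklist.md, choose and adapt an evaluation strategy from references/patterns.md, then run the core loop (Generate → Evaluate → Critique → Refine → Output) until the convergence condition is met or the max-iteration budget is exhausted.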

Guardrails

Always set a max_iterations bound (3–5 is a safe default) before wiring up a refinement loop. Unbounded loops stall agents.
Require structured output (JSON) from the evaluation step so the refine step has a reliable signal to act on. Free-text critique is fragile.
Add a convergence check: if the score does not improve between iterations, stop early. Oscillating loops that never converge waste budget.
Log the full iteration trajectory. Evaluation loops are hard to debug post-hoc without a history of inputs, outputs, scores, and critiques.
Define evaluation criteria before generating any output. Criteria added mid-loop cause the rubric to drift and make scores incomparable across iterations.
Keep the evaluate step isolated from the generate step. Blending them makes it hard to replace the evaluator or diagnose score instability.
Handle evaluation parse failures gracefully — if the LLM judge returns malformed JSON, fall back to a safe default (treat as failing) rather than crashing the loop.
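
The parse-failure guardrail can be sketched as below. The expected keys (`score`, `passed`, `critique`) and the 0 to 1 score range are assumptions about the evaluation contract, and `parse_evaluation` is a hypothetical helper, not part of any library.

```python
import json
from typing import Any, Dict

# Safe default: anything that cannot be parsed is treated as a failing evaluation.
FAILING_EVALUATION: Dict[str, Any] = {
    "score": 0.0,
    "passed": False,
    "critique": "evaluation output could not be parsed",
}


def parse_evaluation(raw: str) -> Dict[str, Any]:
    """Parse the judge's JSON reply, falling back to a failing result on any error."""
    try:
        data = json.loads(raw)
        score = float(data["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return dict(FAILING_EVALUATION)
    if not 0.0 <= score <= 1.0:  # also rejects NaN, which compares as False
        return dict(FAILING_EVALUATION)
    return {
        "score": score,
        "passed": bool(data.get("passed", False)),
        "critique": str(data.get("critique", "")),
    }
```

Returning a copy of the failing default keeps the loop running and lets the next iteration proceed as if the output had simply scored poorly.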

Validation

should trigger: "I want to add a reflection loop to my code-generation agent so it self-critiques and reruns until the score exceeds 0.85"
should not trigger: "Run the test suite and tell me if the build passes"
should not trigger: "Why is this specific assertion failing in my TypeScript tests?"
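
Trigger queries like these can be scored as a simple accuracy figure. In the sketch below, `trigger_accuracy` and `keyword_router` are hypothetical; the keyword router is only a toy stand-in for whatever mechanism actually decides whether the skill fires.

```python
from typing import Callable, List, Tuple

# (query, should_trigger) pairs taken from the validation queries above.
CASES: List[Tuple[str, bool]] = [
    ("I want to add a reflection loop to my code-generation agent so it "
     "self-critiques and reruns until the score exceeds 0.85", True),
    ("Run the test suite and tell me if the build passes", False),
    ("Why is this specific assertion failing in my TypeScript tests?", False),
]


def trigger_accuracy(cases: List[Tuple[str, bool]], router: Callable[[str], bool]) -> float:
    """Fraction of queries for which the router's decision matches the expectation."""
    if not cases:
        raise ValueError("at least one case is required")
    correct = sum(1 for query, expected in cases if router(query) == expected)
    return correct / len(cases)


def keyword_router(query: str) -> bool:
    """Toy router (an assumption for illustration): fires on a few key phrases."""
    lowered = query.lower()
    return any(phrase in lowered for phrase in ("reflection loop", "self-critique", "evaluator"))


accuracy = trigger_accuracy(CASES, keyword_router)
```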

After implementing an evaluation loop, confirm:

max_iterations is set and respected by the loop
Evaluate step returns structured output (JSON or equivalent)
Convergence check exits early when score does not improve
All iterations are logged with input, output, score, and critique
Parse-failure fallback is present on the evaluate step
Criteria are defined before any generation begins
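
Several of these checks can be automated against a logged trajectory. The sketch below assumes each iteration is logged as a dict with `input`, `output`, `score`, and `critique` keys; that shape and the `check_trajectory` helper are hypothetical.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("input", "output", "score", "critique")


def check_trajectory(trajectory: List[Dict[str, Any]], max_iterations: int) -> List[str]:
    """Return a list of checklist violations; an empty list means all checks passed."""
    problems: List[str] = []
    if len(trajectory) > max_iterations:
        problems.append(
            f"ran {len(trajectory)} iterations, exceeding max_iterations={max_iterations}"
        )
    for number, record in enumerate(trajectory, start=1):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            problems.append(f"iteration {number} is missing: {', '.join(missing)}")
    # A non-improving score is fine on the final iteration (the loop stopped),
    # but not on an earlier one (the loop should have exited early).
    scores = [record.get("score") for record in trajectory]
    for index in range(1, len(scores) - 1):
        previous, current = scores[index - 1], scores[index]
        if isinstance(previous, (int, float)) and isinstance(current, (int, float)):
            if current <= previous:
                problems.append(
                    f"loop continued after iteration {index + 1} failed to improve"
                )
    return problems
```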

Examples

"Add a self-critique loop to my report-generation agent that retries up to three times if the rubric score is below 0.8."
"Implement an evaluator-optimizer where a separate LLM judge scores code clarity and the generator rewrites until it passes."
"Build a rubric-based evaluator with accuracy, completeness, and style dimensions that returns a weighted score as JSON."

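The third request could be answered with something like the sketch below. The weights, the 0 to 1 scale for each dimension, and the `weighted_rubric_score` helper are assumptions for illustration; only the three dimension names come from the example prompt.

```python
import json
from typing import Dict, Optional

# Assumed weights; any positive weights work because the score is normalized.
DEFAULT_WEIGHTS: Dict[str, float] = {"accuracy": 0.5, "completeness": 0.3, "style": 0.2}


def weighted_rubric_score(
    dimension_scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> str:
    """Combine per-dimension scores into a weighted score and return it as JSON."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    missing = sorted(set(weights) - set(dimension_scores))
    if missing:
        raise ValueError(f"missing rubric dimensions: {', '.join(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(weights[name] * dimension_scores[name] for name in weights) / total_weight
    return json.dumps(
        {
            "dimensions": {name: dimension_scores[name] for name in weights},
            "weighted_score": round(weighted, 4),
        }
    )


report = weighted_rubric_score({"accuracy": 0.9, "completeness": 0.8, "style": 0.6})
```

Dividing by the total weight means the weights do not have to sum to exactly 1, which keeps scores comparable if a dimension's weight is later adjusted.
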
Reference files

references/patterns.md — The three evaluation strategy patterns (outcome-based, LLM-as-judge, rubric-based) with annotated Python examples and a best-practices table.
assets/eval-checklist.md — Implementation checklist: setup, loop wiring, convergence, logging, and safety items to confirm before shipping.