
An Agent Output Verification Rubric You Can Paste on Every PR

Use a fill-in agent output verification rubric that scores a coding agent's PR pass/fail on correctness, scope, and side effects. Copy the verdict contract.

PromptsCart Team · June 29, 2026 · Updated June 29, 2026 · 7 min read

A coding agent opens a pull request. It fixed the bug. It also renamed a variable in an unrelated file, bumped a dependency, and deleted a test that was "redundant." CI is green. The review still takes you twenty minutes because there's no fixed thing to check against. An agent output verification rubric turns that twenty minutes into a two-minute pass/fail with the scope creep flagged automatically.

The existing guidance lists metrics. It doesn't hand you a rubric. This post gives you a fill-in template with a verdict contract you can paste on every PR an agent opens, scoring three things that decide whether agent code is safe to merge: correctness, scope, and side effects.

The third one is where agents quietly hurt you. So that's where the rubric pushes hardest.

Why a metrics catalog isn't a rubric

There's no shortage of lists. LangChain's agent evaluation readiness checklist tells you what to think about, Confident AI's complete guide catalogs metrics, and this Dev.to post on lightweight evals argues for skipping frameworks. All useful background. None ships the artifact: a fill-in rubric with a pass/fail verdict contract you can apply to a specific diff in front of you right now.

That's the gap. A list of metrics is something to read. A rubric is something to run. The difference is whether you can paste it under a PR and get a verdict in the same shape every time.

The stance here: scope adherence is the most under-checked axis in agent review, and it's the one that bites in production. A correct fix that also touches eight unrelated files is a worse PR than an incomplete fix that touches one. Correctness gets all the attention. Scope is where agents go feral.

What you can do with this rubric

Score a coding agent's PR pass/fail before a human reads the full diff.
Catch scope creep: files changed that the task never mentioned.
Flag side effects like deleted tests, changed configs, or new dependencies.
Produce a verdict you can paste directly into a PR comment.
Apply the same rubric across every agent and every repo for consistency.
Decide merge-versus-revise in one read instead of a full manual audit.

Anatomy of the rubric

The rubric takes the task, the diff, and the agent's own summary, then scores three axes and emits a verdict.

Variables:
  {{task_description}} – what the agent was asked to do
  {{diff}}             – the full change, all files
  {{agent_summary}}    – what the agent says it did
  {{protected_paths}}  – files/dirs that must not change

Prompt:
  Role: reviewer gating an agent's PR.
  Task: score correctness, scope, side-effects. Pass/fail each.
  A change to any protected path is an automatic FAIL.

Output contract (restate on the final line):
  correctness: PASS|FAIL + evidence line
  scope_adherence: PASS|FAIL + list of out-of-scope files
  side_effects: PASS|FAIL + deleted tests / new deps / config
  verdict: MERGE | REVISE | REJECT
  required_fixes: ordered list (empty if MERGE)

The {{protected_paths}} variable encodes the lines an agent must never cross: CI config, migrations, auth. A touch there fails the rubric regardless of how clean the rest looks.
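If you fill the template from a script rather than by hand, the substitution step might look like the sketch below. It is a minimal illustration, not the shipped template: the section labels inside `RUBRIC_TEMPLATE`, the `fill_rubric` name, and the "(not provided)" fallback for an omitted summary are all assumptions made for the example.

```python
import re

# Condensed copy of the rubric above. The section labels ("Task description:",
# "Diff:", and so on) are hypothetical framing added for this sketch.
RUBRIC_TEMPLATE = """\
Role: reviewer gating an agent's PR.
Task: score correctness, scope, side-effects. Pass/fail each.
A change to any protected path is an automatic FAIL.

Task description:
{{task_description}}

Protected paths:
{{protected_paths}}

Agent summary:
{{agent_summary}}

Diff:
{{diff}}

Output contract:
correctness: PASS|FAIL + evidence line
scope_adherence: PASS|FAIL + list of out-of-scope files
side_effects: PASS|FAIL + deleted tests / new deps / config
verdict: MERGE | REVISE | REJECT
required_fixes: ordered list (empty if MERGE)
"""

REQUIRED = ("task_description", "diff", "protected_paths")
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_rubric(template: str, values: dict[str, str]) -> str:
    """Substitute {{name}} placeholders in a single pass."""
    missing = [name for name in REQUIRED if not values.get(name, "").strip()]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    # One regex pass over the template: placeholder-looking text that appears
    # inside the diff itself is inserted verbatim and never re-expanded.
    return PLACEHOLDER.sub(
        lambda m: values.get(m.group(1), "(not provided)"), template
    )
```

Refusing to build the prompt when a required variable is empty keeps a fuzzy task or a missing diff from producing a confident-looking verdict.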

How to use the rubric

1. State the task narrowly

Fill {{task_description}} with exactly what was requested, no more. "Fix the null check in parseDate." If the task is fuzzy, scope adherence becomes unscorable, because you can't flag scope creep without a boundary to creep past.

2. Paste the full diff

{{diff}} is every file the agent touched, not just the headline change. The whole point is catching the files the agent changed that the task never mentioned.
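To see the full set of touched files before you read a single hunk, you can pull the paths out of the diff headers. A rough sketch, assuming the diff is in git's `diff --git a/... b/...` format; the `changed_files` name is hypothetical, and paths that themselves contain " b/" would confuse the pattern.

```python
import re

# Header line git emits once per changed file.
DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def changed_files(diff_text: str) -> list[str]:
    """Return every path named in a git diff header, in first-seen order."""
    files: list[str] = []
    for line in diff_text.splitlines():
        match = DIFF_HEADER.match(line)
        if not match:
            continue
        # Old and new paths differ on a rename, so record both.
        for path in (match.group(1), match.group(2)):
            if path not in files:
                files.append(path)
    return files
```

Hunk lines always start with `+`, `-`, a space, or `@@`, so matching on the header prefix does not pick up file content by accident.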

3. Include the agent's own summary

{{agent_summary}} lets the rubric compare what the agent claims it did against what the diff shows. Gaps between the two are a reliable smell.

4. List the protected paths

{{protected_paths}} is your tripwire set. Anything in here that changes is an automatic FAIL, no matter the justification. This is the scope adherence check with teeth.

5. Read the verdict and the required fixes

A MERGE verdict still earns a glance. A REVISE comes with an ordered fix list you can hand straight back to the agent. A REJECT means start over. The verdict contract makes the next action obvious.

Rubric-craft patterns

Evidence per verdict, never a bare grade. Every pass/fail carries an evidence line: the file and reason. A bare FAIL is unactionable. "scope_adherence: FAIL, touched config/ci.yml and auth/session.ts, neither in task" tells the agent exactly what to undo.

For each axis, output the verdict AND one evidence line
naming the specific files or lines. A verdict without
evidence is invalid; mark it NEEDS-REVIEW instead.
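The same rule can be enforced on the reviewer's reply after the fact. The sketch below assumes each axis comes back on its own line in the form `axis: PASS|FAIL, evidence`, as in the example above; `parse_axes` is a hypothetical helper, and treating a missing axis as NEEDS-REVIEW is an assumption of this sketch rather than part of the rubric.

```python
import re

AXES = ("correctness", "scope_adherence", "side_effects")
# axis name, a PASS/FAIL grade, optional separator, then the evidence text
AXIS_LINE = re.compile(r"^(\w+):\s*(PASS|FAIL)\b[\s,;:+-]*(.*)$")


def parse_axes(reply: str) -> dict[str, tuple[str, str]]:
    """Map each axis to (grade, evidence); bare grades become NEEDS-REVIEW."""
    # Assumption: an axis the reply never mentions is unverified, not passed.
    results = {axis: ("NEEDS-REVIEW", "") for axis in AXES}
    for line in reply.splitlines():
        match = AXIS_LINE.match(line.strip())
        if not match or match.group(1) not in AXES:
            continue
        axis, grade, evidence = match.group(1), match.group(2), match.group(3).strip()
        if not evidence:
            grade = "NEEDS-REVIEW"  # a verdict without evidence is invalid
        results[axis] = (grade, evidence)
    return results
```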

Protected paths as a hard gate. A pass/fail rubric should never let an auth or migration change slide because the feature looked good. Make protected-path violations short-circuit to FAIL before the other axes are even scored.
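One way to get that short-circuit is to run the path check in plain code before any model call. This is a sketch under assumptions: `is_protected` and `gate` are hypothetical names, the returned dict shape is invented for illustration, and protected entries are treated as literal files or directories, not glob patterns.

```python
from pathlib import PurePosixPath
from typing import Callable


def is_protected(path: str, protected_paths: list[str]) -> bool:
    """True if path is a protected entry or sits inside a protected directory."""
    candidate = PurePosixPath(path)
    for entry in protected_paths:
        guard = PurePosixPath(entry)
        # Comparing path components (not string prefixes) keeps "auth/"
        # from matching an unrelated "authz/" directory.
        if candidate == guard or guard in candidate.parents:
            return True
    return False


def gate(
    changed: list[str],
    protected_paths: list[str],
    score_axes: Callable[[], dict],
) -> dict:
    """Fail fast on protected paths; otherwise defer to the axis scorer."""
    violations = [f for f in changed if is_protected(f, protected_paths)]
    if violations:
        # Short-circuit: score_axes (for example, a model call) never runs.
        return {
            "result": "FAIL",
            "evidence": "protected paths touched: " + ", ".join(violations),
        }
    return score_axes()
```

Keeping the gate outside the prompt means a persuasive agent summary can't talk the reviewer out of the failure.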

Restate the verdict schema last. Across a long diff, GPT-4o tends to forget the three-axis structure and write a paragraph review. Claude holds it better. Restate correctness / scope_adherence / side_effects / verdict on the final line so the contract survives the long input.

The side-effects axis is where agents hide damage

A passing test suite doesn't prove an agent behaved. It may have deleted a flaky test, removed an assertion, or pinned a dependency to turn red into green. The side_effects axis exists to catch exactly this: scan for removed tests, weakened assertions, new dependencies, and config edits the task never asked for. Green CI plus a clean side-effects score is the real merge signal, not green CI alone.

Variables you'll set

| Variable | Required | What it is |
| --- | --- | --- |
| `{{task_description}}` | Yes | Exactly what the agent was asked to do |
| `{{diff}}` | Yes | The full change across all files |
| `{{agent_summary}}` | No | The agent's own description of its work |
| `{{protected_paths}}` | Yes | Files/dirs that must never change |

Getting started

Write the task narrowly enough that scope creep is detectable.
Paste the complete diff, not just the headline files.
List your protected paths.
Run the rubric and read the verdict plus evidence lines.
Hand the required_fixes list back to the agent on a REVISE.
Treat any protected-path FAIL as non-negotiable.
Standardize the rubric across repos. The Agent Output Verification Rubric ships this exact template with the three-axis scoring and the protected-path gate built in.

Get the Agent Output Verification Rubric →

A verification rubric is one piece of a real agent-ops loop. Pair it with the Agent Eval Harness Builder for the test set the agent runs against, and the LLM Eval System Design playbook when you need a judge for the open-ended outputs the rubric doesn't cover.

Skip the setup

The Agent Output Verification Rubric does this end-to-end: a {{protected_paths}} variable hard-gates the files an agent must never touch, and the verdict contract scores correctness, scope, and side effects with an evidence line per axis. It's part of The Complete AI Prompts Bundle, a one-time lifetime license to the whole catalog plus every later pack, sensible if you review more than one agent's output regularly.


The judging side of this, for free-form generations rather than diffs, lives in the LLM-as-a-judge grader template. And to stop your rubric from quietly grading differently after a model update, see prompt regression testing.

Browse the agent-ops prompt packs →

FAQ

Common questions

What is an agent output verification rubric?

It's a fixed checklist that scores a coding agent's output pass/fail across correctness, scope adherence, and side effects, then emits a single verdict you can paste on a pull request. Unlike a metrics catalog, it's a fill-in artifact: same rubric, same shape of verdict, every PR.

What should an agent rubric actually check?

Three things at minimum: does the change do what was asked (correctness), does it do only what was asked (scope adherence), and does it touch anything it shouldn't (side effects like deleted tests, changed configs, new dependencies). Each scored pass/fail with evidence, not a vibe.

How is this different from an LLM-as-a-judge prompt?

A judge grades free-form text answers. A verification rubric is tuned for code changes: it checks scope creep, unrequested file edits, and silent side effects that a general grader misses. Use the rubric for agent PRs and the judge for open-ended generations.
