An Agent Output Verification Rubric You Can Paste on Every PR
Use a fill-in agent output verification rubric that scores a coding agent's PR pass/fail on correctness, scope, and side effects. Copy the verdict contract.
A coding agent opens a pull request. It fixed the bug. It also renamed a variable in an unrelated file, bumped a dependency, and deleted a test that was "redundant." CI is green. The review takes you twenty minutes because there's no fixed thing to check against. An agent output verification rubric turns that twenty minutes into a two-minute pass/fail with the scope creep flagged automatically.
The existing guidance lists metrics. It doesn't hand you a rubric. This post gives you a fill-in template with a verdict contract you can paste on every PR an agent opens, scoring three things that decide whether agent code is safe to merge: correctness, scope, and side effects.
The third one is where agents quietly hurt you. So that's where the rubric pushes hardest.
Why a metrics catalog isn't a rubric
There's no shortage of lists. LangChain's agent evaluation readiness checklist tells you what to think about, Confident AI's complete guide catalogs metrics, and a Dev.to post on lightweight evals argues for skipping frameworks. All useful background. None ships the artifact: a fill-in rubric with a pass/fail verdict contract you can apply to the specific diff in front of you right now.
That's the gap. A list of metrics is something to read. A rubric is something to run. The difference is whether you can paste it under a PR and get a verdict in the same shape every time.
The stance here: scope adherence is the most under-checked axis in agent review, and it's the one that bites in production. A correct fix that also touches eight unrelated files is a worse PR than an incomplete fix that touches one. Correctness gets all the attention. Scope is where agents go feral.
What you can do with this rubric
- Score a coding agent's PR pass/fail before a human reads the full diff.
- Catch scope creep: files changed that the task never mentioned.
- Flag side effects like deleted tests, changed configs, or new dependencies.
- Produce a verdict you can paste directly into a PR comment.
- Apply the same rubric across every agent and every repo for consistency.
- Decide merge-versus-revise in one read instead of a full manual audit.
Anatomy of the rubric
The rubric takes the task, the diff, and the agent's own summary, then scores three axes and emits a verdict.
Variables:
{{task_description}} – what the agent was asked to do
{{diff}} – the full change, all files
{{agent_summary}} – what the agent says it did
{{protected_paths}} – files/dirs that must not change
Prompt:
Role: reviewer gating an agent's PR.
Task: score correctness, scope, side-effects. Pass/fail each.
A change to any protected path is an automatic FAIL.
Output contract (restate on the final line):
correctness: PASS|FAIL + evidence line
scope_adherence: PASS|FAIL + list of out-of-scope files
side_effects: PASS|FAIL + deleted tests / new deps / config
verdict: MERGE | REVISE | REJECT
required_fixes: ordered list (empty if MERGE)
The {{protected_paths}} variable encodes the lines an agent must never cross: CI config, migrations, auth. A touch there fails the rubric regardless of how clean the rest looks.
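Filling the template can be sketched in a few lines of Python. This is a minimal illustration, not part of the rubric itself: the function name `fill_rubric`, the `OPTIONAL_DEFAULTS` mapping, and the choice to raise on a missing variable are all assumptions. It treats {{agent_summary}} as optional, matching the variables table in this post, and refuses to render if any other placeholder is left unfilled.

```python
import re

# Matches placeholders of the form {{variable_name}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

# Assumption: only the agent's summary may be omitted.
OPTIONAL_DEFAULTS = {"agent_summary": "(not provided)"}


def fill_rubric(template: str, values: dict[str, str]) -> str:
    """Substitute every {{name}} placeholder, failing loudly on a gap."""
    merged = {**OPTIONAL_DEFAULTS, **values}

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in merged:
            # An unfilled required variable would make the rubric unscorable.
            raise KeyError(f"missing rubric variable: {name}")
        return merged[name]

    return PLACEHOLDER.sub(substitute, template)
```

Failing on a missing variable is a deliberate choice in this sketch: a rubric sent with a literal `{{protected_paths}}` in it has no tripwire at all.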
How to use the rubric
1. State the task narrowly
Fill {{task_description}} with exactly what was requested, no more. "Fix the null check in parseDate." If the task is fuzzy, scope adherence becomes unscorable, because you can't flag scope creep without a boundary to creep past.
2. Paste the full diff
{{diff}} is every file the agent touched, not just the headline change. The whole point is catching the files the agent changed that the task never mentioned.
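If the diff comes from `git diff`, the list of touched files can be pulled out before the rubric even runs. The sketch below assumes git's unified diff format with `diff --git a/... b/...` headers and paths that contain no spaces; `changed_files` and `out_of_scope` are hypothetical helper names, and the set of expected files is something you supply by hand from the task.

```python
import re

# Header line git emits once per changed file.
DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def changed_files(diff_text: str) -> list[str]:
    """Return every file path touched by the diff, in first-seen order."""
    files: list[str] = []
    for line in diff_text.splitlines():
        match = DIFF_HEADER.match(line)
        if match and match.group(2) not in files:
            files.append(match.group(2))
    return files


def out_of_scope(files: list[str], expected: set[str]) -> list[str]:
    """Files the agent changed that the task never mentioned."""
    return [path for path in files if path not in expected]
```

For a task like "Fix the null check in parseDate," the expected set might hold a single source file, and anything else `out_of_scope` returns is a candidate for the scope_adherence evidence line.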
3. Include the agent's own summary
{{agent_summary}} lets the rubric compare what the agent claims it did against what the diff shows. Gaps between the two are a reliable smell.
4. List the protected paths
{{protected_paths}} is your tripwire set. Anything in here that changes is an automatic FAIL, no matter the justification. This is the scope adherence check with teeth.
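The tripwire can also be enforced deterministically, outside the model, so a protected-path hit never depends on the reviewer prompt noticing it. A minimal sketch, assuming protected paths are written as shell-style glob patterns; the function name and the example patterns are illustrative, not part of the rubric.

```python
from fnmatch import fnmatchcase


def protected_violations(changed: list[str], patterns: list[str]) -> list[str]:
    """Return changed paths that match any protected glob pattern.

    fnmatchcase is used so matching is case-sensitive on every platform.
    Note that '*' in fnmatch also matches '/', so 'auth/*' covers nested files.
    """
    return [
        path
        for path in changed
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


def gate(changed: list[str], patterns: list[str]) -> str:
    """Automatic FAIL on any protected-path change, otherwise continue scoring."""
    return "FAIL" if protected_violations(changed, patterns) else "CONTINUE"
```

Running this check before calling the model means a violation can skip the scoring step entirely and go straight back to the agent.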
5. Read the verdict and the required fixes
A MERGE verdict still earns a glance. A REVISE comes with an ordered fix list you can hand straight back to the agent. A REJECT means start over. The verdict contract makes the next action obvious.
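Reading the verdict can be automated once the output follows the contract. The sketch below assumes the model emits `verdict:` and `required_fixes:` at the start of a line and numbers its fixes as `1.` or `1)`; those formatting details, and the names `read_verdict`, `required_fixes`, and `NEXT_ACTION`, are assumptions rather than something the contract guarantees. It takes the last `verdict:` line, since the contract is restated at the end.

```python
import re

VERDICT_LINE = re.compile(r"^verdict:\s*(MERGE|REVISE|REJECT)\b", re.MULTILINE)
FIX_ITEM = re.compile(r"^\s*\d+[.)]\s+(.*\S)\s*$")

# Next step per verdict, as described in the step above.
NEXT_ACTION = {
    "MERGE": "glance over the diff, then merge",
    "REVISE": "hand required_fixes back to the agent",
    "REJECT": "start over",
}


def read_verdict(review_text: str) -> str:
    """Return the final verdict, or NEEDS-REVIEW if none is present."""
    matches = VERDICT_LINE.findall(review_text)
    return matches[-1] if matches else "NEEDS-REVIEW"


def required_fixes(review_text: str) -> list[str]:
    """Collect the numbered items that follow the required_fixes: line."""
    fixes: list[str] = []
    in_section = False
    for line in review_text.splitlines():
        if line.strip().lower().startswith("required_fixes:"):
            in_section = True
            continue
        if in_section:
            match = FIX_ITEM.match(line)
            if match:
                fixes.append(match.group(1))
            elif line.strip():
                in_section = False  # first non-list line ends the section
    return fixes
```

Falling back to NEEDS-REVIEW when no verdict line is found keeps a malformed review from being read as a silent pass.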
Rubric-craft patterns
Evidence per verdict, never a bare grade. Every pass/fail carries an evidence line: the file and reason. A bare FAIL is unactionable. "scope_adherence: FAIL, touched config/ci.yml and auth/session.ts, neither in task" tells the agent exactly what to undo.
For each axis, output the verdict AND one evidence line
naming the specific files or lines. A verdict without
evidence is invalid; mark it NEEDS-REVIEW instead.
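The same rule can be checked on the way out, in case the model returns a bare grade anyway. A minimal sketch, assuming each axis appears on its own line as `axis: PASS|FAIL` followed by free-text evidence; the function name `grade_axes` and the accepted separators are illustrative assumptions.

```python
import re

AXES = ("correctness", "scope_adherence", "side_effects")

# axis name, PASS or FAIL, optional separator, then the evidence text.
AXIS_LINE = re.compile(r"^(\w+):\s*(PASS|FAIL)\b[\s,:;-]*(.*)$")


def grade_axes(review_text: str) -> dict[str, str]:
    """Map each axis to PASS, FAIL, or NEEDS-REVIEW.

    An axis that is missing, or graded without evidence, is NEEDS-REVIEW.
    """
    grades = {axis: "NEEDS-REVIEW" for axis in AXES}
    for line in review_text.splitlines():
        match = AXIS_LINE.match(line.strip())
        if match and match.group(1) in AXES:
            name, grade, evidence = match.groups()
            grades[name] = grade if evidence.strip() else "NEEDS-REVIEW"
    return grades
```

Starting every axis at NEEDS-REVIEW means an axis the model skipped is treated the same way as one it graded without evidence.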
Protected paths as a hard gate. A pass/fail agent rubric should never let an auth or migration change slide because the feature looked good. Make protected-path violations short-circuit to FAIL before the other axes are even scored.
Restate the verdict schema last. Across a long diff, GPT-4o tends to forget the three-axis structure and write a paragraph review. Claude holds it better. Restate correctness / scope_adherence / side_effects / verdict on the final line so the contract survives the long input.
A passing test suite doesn't prove an agent behaved. It may have deleted a flaky test, removed an assertion, or pinned a dependency to turn red into green. The side_effects axis exists to catch exactly this: scan for removed tests, weakened assertions, new dependencies, and config edits the task never asked for. Green CI plus a clean side-effects score is the real merge signal, not green CI alone.
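A rough pre-scan of the diff can surface these signals as evidence for the side_effects axis. The sketch below is a heuristic under stated assumptions: git-style unified diffs, test files recognizable by path conventions, assertions recognizable by `assert...` or `expect` keywords, and a short hypothetical list of dependency manifests. It will miss cases and raise false positives, so treat its findings as prompts for the reviewer, not as a verdict.

```python
import re

# Assumed naming conventions for test files; adjust per repo.
TEST_PATH = re.compile(
    r"(^|/)(tests?|__tests__)/|(_test|\.test|\.spec)\.\w+$|(^|/)test_[^/]+$"
)
ASSERTION = re.compile(r"\b(assert\w*|expect)\b")
# Assumed dependency manifests; extend for your ecosystem.
MANIFESTS = ("package.json", "requirements.txt", "pyproject.toml", "go.mod", "Cargo.toml")


def scan_side_effects(diff_text: str) -> list[str]:
    """Return human-readable findings for the side_effects axis."""
    findings: list[str] = []
    current = None  # path of the file whose hunk is being read
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            current = line.split(" b/", 1)[-1]
        elif line.startswith("deleted file mode"):
            if current and TEST_PATH.search(current):
                findings.append(f"deleted test file: {current}")
        elif line.startswith("-") and not line.startswith("---"):
            # A removed line that contained an assertion keyword.
            if ASSERTION.search(line):
                findings.append(f"removed assertion in {current}")
        elif line.startswith("+") and not line.startswith("+++"):
            # Any added line in a manifest may be a new or re-pinned dependency.
            if current and current.rsplit("/", 1)[-1] in MANIFESTS:
                findings.append(f"manifest line added in {current}: {line[1:].strip()}")
    return findings
```

An empty result is not proof of a clean change; it only means none of these particular patterns appeared in the diff.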
Variables you'll set
| Variable | Required | What it is |
|---|---|---|
| {{task_description}} | Yes | Exactly what the agent was asked to do |
| {{diff}} | Yes | The full change across all files |
| {{agent_summary}} | No | The agent's own description of its work |
| {{protected_paths}} | Yes | Files/dirs that must never change |
Getting started
- Write the task narrowly enough that scope creep is detectable.
- Paste the complete diff, not just the headline files.
- List your protected paths.
- Run the rubric and read the verdict plus evidence lines.
- Hand the required_fixes list back to the agent on a REVISE.
- Treat any protected-path FAIL as non-negotiable.
- Standardize the rubric across repos. The Agent Output Verification Rubric ships this exact template with the three-axis scoring and the protected-path gate built in.
A verification rubric is one piece of a real agent-ops loop. Pair it with the Agent Eval Harness Builder for the test set the agent runs against, and the LLM Eval System Design playbook when you need a judge for the open-ended outputs the rubric doesn't cover.
The Agent Output Verification Rubric does this end-to-end: a {{protected_paths}} variable hard-gates the files an agent must never touch, and the verdict contract scores correctness, scope, and side effects with an evidence line per axis. It's part of The Complete AI Prompts Bundle, a one-time lifetime license to the whole catalog plus every later pack, sensible if you review more than one agent's output regularly.
The judging side of this, for free-form generations rather than diffs, lives in the LLM-as-a-judge grader template. And to stop your rubric from quietly grading differently after a model update, see prompt regression testing.