How to Evaluate AI Coding Agent Output Without Building a Harness
No eval code needed to grade an agent's diff. Learn how to evaluate AI coding agent output with a pasteable rubric for outcome, process, and style. Copy it.
The agent finished. The diff looks plausible. Now what? Most teams either merge on a glance or stand up an eval harness they never finish. There's a middle path almost nobody ships: a prompt that grades the change. Learning to evaluate AI coding agent output with a rubric, not a codebase, is the cheapest reliability win available.
Everything that ranks for this topic assumes you'll write eval code. Braintrust's agent-evaluation article covers trajectory evals. The Medium tutorial on building an eval harness walks through the build. The awesome-harness-engineering list collects more of the same. All of it points at code. None of it gives a reviewer a prompt to grade a diff against outcome, process, and style without standing up infrastructure.
That's the gap a verification rubric fills. Paste the diff, paste the goal, get a verdict.
What a verification rubric produces
A verification rubric is a prompt that grades one agent change against fixed axes and returns a pass or fail per axis with the evidence behind each call. It's a structured second opinion you can run in seconds.
The job it does on every diff:
- Check the outcome: did the change actually meet the stated goal
- Check the process: did the agent stay inside the constraints and scope
- Check the style: does the code match the conventions already in the repo
- Cite evidence for each verdict, not just a thumbs up
- Catch the silent scope creep (the drive-by refactor nobody asked for)
- Return a clear merge / fix / reject signal
- Apply the same standard whether Claude, ChatGPT, or Gemini wrote the code
The three axes matter because they fail independently. A change can hit the goal (outcome pass) while quietly rewriting an unrelated module (process fail). Grading them separately surfaces exactly that.
The anatomy of the rubric prompt
The prompt takes the goal, the diff, and the repo conventions, then returns a per-axis verdict table.
Variables → {{goal}}, {{diff}}, {{conventions}}
Prompt → role: strict verifier grading a single agent change
task: judge outcome, process, style independently
rule: every verdict cites evidence from the diff
Output → table: axis | verdict (pass/fail) | evidence | required fix
Put the verdict contract last. On a long {{diff}}, a rubric format stated only at the top gets outweighed by more recent tokens, and the model reverts to a friendly summary. Restating the table format on the final line keeps the verdicts crisp, which matters most on GPT-4o.
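If you keep the rubric in a script rather than a notes file, the anatomy above can be sketched as a template with the table contract on its final line. This is a minimal sketch: the template wording, the `RUBRIC_TEMPLATE` name, and the `render_rubric` helper are illustrative assumptions, not the text of any shipped rubric.

```python
import re

# Hypothetical template wording. Only the three variables and the
# output table columns come from the anatomy described above.
RUBRIC_TEMPLATE = """You are a strict verifier grading a single agent change.
Judge outcome, process, and style independently.
Every verdict must cite evidence from the diff.

GOAL:
{{goal}}

CONVENTIONS:
{{conventions}}

DIFF:
{{diff}}

Return only a table: axis | verdict (pass/fail) | evidence | required fix"""


def render_rubric(goal: str, diff: str, conventions: str = "none provided") -> str:
    """Fill the three variables in one pass; the verdict contract stays last."""
    values = {"goal": goal, "diff": diff, "conventions": conventions}
    # A function replacement inserts each value literally, so backslashes
    # or braces inside a diff are not reinterpreted.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], RUBRIC_TEMPLATE)
```

Because the diff sits above the contract, the last thing the model reads is the table format, however long the diff runs.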
1. State the goal the agent was given
Paste the original task verbatim. The rubric can't grade outcome against a goal it doesn't have. Vague goal in, vague verdict out.
2. Fill the variables
Drop the goal into {{goal}}, the diff into {{diff}}, and a short conventions note into {{conventions}} (test style, naming, error handling).
3. Run the rubric
You get a table with a verdict per axis and the evidence each verdict rests on. Read the evidence, not just the pass/fail. A pass with weak evidence ("seems fine") is really an unsure verdict wearing a green checkmark, and on a fail-closed rubric that should resolve to a fail. The evidence column is the part that keeps the grader honest. Skim it the way you'd skim a junior reviewer's reasoning: not to second-guess every call, but to catch the one where the justification doesn't actually support the verdict.
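If you want the table in a form you can sort or log, a small parser is enough. The sketch below assumes the model returned a pipe-delimited table with the four columns in the order axis, verdict, evidence, required fix; the `Verdict` class and `parse_verdict_table` function are hypothetical names, and an evidence cell that itself contains a pipe character would need sturdier handling.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    axis: str
    verdict: str        # "pass" or "fail"
    evidence: str
    required_fix: str


def parse_verdict_table(text: str) -> list[Verdict]:
    """Turn a pipe-delimited verdict table into one record per axis."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 4:
            continue  # not a four-column row
        axis, verdict, evidence, fix = cells
        if verdict.lower() not in ("pass", "fail"):
            continue  # skips the header and separator rows
        rows.append(Verdict(axis.lower(), verdict.lower(), evidence, fix))
    return rows
```

With records in hand, printing each axis next to its evidence makes the weak justifications easy to spot.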
4. Act on the weakest axis
One fail is enough to send it back. A process fail (touched files it shouldn't) is often more dangerous than a style fail, even though it's less visible. Style you'll notice on read. Scope creep hides in files you didn't think to open, which is exactly why the rubric grades them as a separate axis instead of folding everything into one "looks good" verdict. Read the evidence line for any failing axis before you decide how to push back. A process fail with the evidence "edited auth/session.ts, not named in the goal" is a different conversation than a style fail over a variable name, and the fix is usually faster too: revert the out-of-scope change rather than rewrite the logic.
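The "one fail sends it back" rule is simple enough to write down. In this sketch the mapping from failing axes to a signal is an assumption made for illustration: an outcome fail maps to reject, any other fail maps to fix, and an axis with no verdict counts as a fail. The `merge_signal` name is hypothetical.

```python
AXES = ("outcome", "process", "style")


def merge_signal(verdicts: dict[str, str]) -> tuple[str, list[str]]:
    """Return (signal, failing axes) from a mapping of axis to 'pass' or 'fail'."""
    # A missing axis is treated as a fail rather than silently skipped.
    failing = [axis for axis in AXES if verdicts.get(axis) != "pass"]
    if not failing:
        return "merge", []
    # Assumption: a change that misses its goal is rejected outright;
    # process or style problems alone are sent back for a fix.
    if "outcome" in failing:
        return "reject", failing
    return "fix", failing
```

For example, `merge_signal({"outcome": "pass", "process": "fail", "style": "pass"})` returns `("fix", ["process"])`, which tells you which evidence line to read first.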
5. Re-grade after the fix
Re-run on the corrected diff. The rubric is cheap, so grade every iteration, not just the first.
The change that meets its goal while quietly editing files outside scope is the one that bites you later. A rubric that grades process separately from outcome catches it. A single "looks good?" prompt never will, because the diff genuinely does solve the problem you asked about.
Prompt-craft patterns for honest grading
Two patterns keep the rubric from rubber-stamping, plus a hard rule.
Evidence-required verdicts. No verdict without a quote from the diff.
For each axis, output:
- verdict: pass or fail
- evidence: a specific line or change from {{diff}} that justifies it
A verdict with no evidence is invalid; default that axis to fail.
Define what fail looks like. Lenient models pass everything unless you show them a failing example.
Process FAIL examples: edited files unrelated to {{goal}};
added a dependency not requested; changed public API without being asked.
The opinion that'll save you the most grief: make the rubric fail closed. If the model can't find evidence to pass an axis, that axis fails, full stop. Most people build rubrics that pass unless something looks wrong, which means a model that's unsure waves the change through. Flip it. Unsure equals fail. You'd rather re-examine a fine diff than merge a broken one because the grader hedged. The asymmetry of cost is the whole argument.
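Fail-closed can also be enforced after the model answers, not just requested in the prompt. The sketch below downgrades any pass whose evidence is empty or is not an exact quote from the diff. Exact substring matching is a deliberately strict assumption (a model that paraphrases will fail), and `fail_closed` is a hypothetical helper name.

```python
def fail_closed(verdict: str, evidence: str, diff: str) -> str:
    """Return 'pass' only when a pass verdict quotes text found in the diff."""
    if verdict.strip().lower() != "pass":
        return "fail"  # anything that is not an explicit pass fails
    quote = evidence.strip().strip("\"'`")  # drop surrounding quote marks
    if not quote:
        return "fail"  # no evidence, no pass
    if quote not in diff:
        return "fail"  # evidence is not a quote from the diff
    return "pass"
```

Under this check, a pass justified by "seems fine" resolves to a fail, because that phrase appears nowhere in the change.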
Variables you'll set
| Variable | Required | What it is |
|---|---|---|
| {{goal}} | Yes | The exact task the agent was asked to do |
| {{diff}} | Yes | The full change the agent produced |
| {{conventions}} | No | Repo conventions: test style, naming, error handling |
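The Required column translates into a quick pre-flight check before you run anything. A minimal sketch, assuming you hold the values in a plain dictionary; `REQUIRED` and `missing_variables` are illustrative names.

```python
# Mirrors the table above: goal and diff are required, conventions is optional.
REQUIRED = {"goal": True, "diff": True, "conventions": False}


def missing_variables(values: dict[str, str]) -> list[str]:
    """Names of required variables that are absent or blank."""
    return [
        name
        for name, required in REQUIRED.items()
        if required and not values.get(name, "").strip()
    ]
```

An empty list means the rubric has what it needs. A missing goal in particular is worth stopping for, since the outcome axis has nothing to grade against without it.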
A caveat on trust: a rubric prompt is a reviewer, not an oracle. It can misjudge a clever change as a fail, or miss a subtle bug the diff hides. Use it to focus human attention, especially on the failing axis, rather than as the final gate. And spot-check its calibration after a model update, because grading strictness drifts between versions.
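One way to spot-check calibration is to keep a handful of diffs whose correct verdicts you already know and compare the rubric's answers against them after each model update. The source does not prescribe a method, so treat this as one possible sketch; the golden-set structure and the `calibration_drift` name are assumptions.

```python
Verdicts = dict[str, dict[str, str]]  # case name -> {axis: "pass" or "fail"}


def calibration_drift(expected: Verdicts, observed: Verdicts) -> list[tuple[str, str, str, str]]:
    """List (case, axis, expected, observed) for every verdict that changed."""
    drift = []
    for case, axes in expected.items():
        for axis, want in axes.items():
            # A case or axis the grader skipped is reported as "missing".
            got = observed.get(case, {}).get(axis, "missing")
            if got != want:
                drift.append((case, axis, want, got))
    return drift
```

An empty result means the grader still agrees with your known answers. A run of fail-to-pass changes is the signal that strictness has loosened.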
Getting started
- Copy the exact goal the agent was given, word for word.
- Generate the full diff of the agent's change.
- Note three or four conventions the change should respect.
- Paste the rubric prompt and fill {{goal}}, {{diff}}, and {{conventions}}.
- Read the per-axis verdicts and their evidence. Send back on any fail.
- Re-grade the corrected diff until all three axes pass with evidence.
- Save the rubric so every agent change meets the same bar. The Agent Code Output Verification Rubric ships this three-axis grader ready to paste.
The Agent Code Output Verification Rubric does this end-to-end: a {{diff}} variable feeds a verifier that grades outcome, process, and style on separate pass/fail rows with required evidence per verdict, fails closed when evidence is missing, and restates the verdict format for GPT-4o so the table holds. It's part of The Complete AI Prompts Bundle, a one-time lifetime license to the whole catalog plus every pack added later, which is the better deal once you run more than one of these agent jobs.
Grading is easier when the agent had a plan to grade against, which is why the task decomposition prompt for coding agents pairs so well with this: each subtask's done-test becomes a verdict. And a rubric grades the change after the fact, while spec-driven development prompts for agents shape the change before it's written. Still deciding whether a pack beats DIY? How to choose a reusable AI prompt pack lays it out.