How to Evaluate AI Coding Agent Output Without Building a Harness

No eval code needed to grade an agent's diff. Learn how to evaluate AI coding agent output with a pasteable rubric for outcome, process, and style. Copy it.

PromptsCart Team · December 5, 2025 · Updated June 14, 2026 · 7 min read

The agent finished. The diff looks plausible. Now what? Most teams either merge on a glance or stand up an eval harness they never finish. There's a middle path almost nobody ships: a prompt that grades the change. Learning to evaluate AI coding agent output with a rubric, not a codebase, is the cheapest reliability win available.

Everything that currently ranks for this topic assumes you'll write eval code. Braintrust's agent-evaluation article covers trajectory evals. The Medium tutorial on building an eval harness walks through the build. The awesome-harness-engineering list collects more of the same. All of it points at code. None of it gives a reviewer a prompt to grade a diff against outcome, process, and style without standing up infrastructure.

That's the gap a verification rubric fills. Paste the diff, paste the goal, get a verdict.

What a verification rubric produces

A verification rubric is a prompt that grades one agent change against fixed axes and returns a pass or fail per axis with the evidence behind each call. It's a structured second opinion you can run in seconds.

The job it does on every diff:

  • Check the outcome: did the change actually meet the stated goal
  • Check the process: did the agent stay inside the constraints and scope
  • Check the style: does the code match the conventions already in the repo
  • Cite evidence for each verdict, not just a thumbs up
  • Catch the silent scope creep (the drive-by refactor nobody asked for)
  • Return a clear merge / fix / reject signal
  • Apply the same standard whether Claude, ChatGPT, or Gemini wrote the code

The three axes matter because they fail independently. A change can hit the goal (outcome pass) while quietly rewriting an unrelated module (process fail). Grading them separately surfaces exactly that.

The anatomy of the rubric prompt

The prompt takes the goal, the diff, and the repo conventions, then returns a per-axis verdict table.

Variables → {{goal}}, {{diff}}, {{conventions}}
Prompt    → role: strict verifier grading a single agent change
            task: judge outcome, process, style independently
            rule: every verdict cites evidence from the diff
Output    → table: axis | verdict (pass/fail) | evidence | required fix

Put the verdict contract last. On a long {{diff}}, a rubric format stated only at the top gets outweighed by more recent tokens and the model reverts to a friendly summary. Restating the table format on the final line keeps the verdicts crisp, which matters most on GPT-4o.
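If you'd rather assemble the prompt in a script than by hand, the layout can be sketched in a few lines of Python. This is a minimal sketch, not the shipped rubric: the prompt wording, `RUBRIC_TEMPLATE`, `VERDICT_CONTRACT`, and `build_rubric_prompt` are all hypothetical names chosen for illustration. The only thing it takes from the text above is the ordering, with the output contract as the final line.

```python
import re

# Hypothetical prompt wording: the article specifies the role, task, rule, and
# output columns, but not the exact text.
VERDICT_CONTRACT = (
    "Return a table: axis | verdict (pass/fail) | evidence | required fix"
)

RUBRIC_TEMPLATE = """\
You are a strict verifier grading a single agent change.
Judge outcome, process, and style independently.
Every verdict must cite evidence from the diff.

GOAL:
{{goal}}

CONVENTIONS:
{{conventions}}

DIFF:
{{diff}}

{{contract}}"""


def build_rubric_prompt(goal: str, diff: str, conventions: str = "none provided") -> str:
    """Fill the template in one pass, leaving the verdict contract as the last line."""
    if not goal.strip() or not diff.strip():
        raise ValueError("goal and diff are required")
    values = {
        "goal": goal.strip(),
        "diff": diff.strip(),
        "conventions": conventions.strip(),
        "contract": VERDICT_CONTRACT,
    }
    # A single re.sub pass means placeholder-like text inside the diff
    # (for example a literal "{{goal}}") is never re-expanded.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], RUBRIC_TEMPLATE)
```

Filling every placeholder in one pass, rather than chaining `str.replace` calls, is a deliberate choice here: diffs of templating code often contain `{{...}}` themselves.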

1. State the goal the agent was given

Paste the original task verbatim. The rubric can't grade outcome against a goal it doesn't have. Vague goal in, vague verdict out.

2. Fill the variables

Drop the goal into {{goal}}, the diff into {{diff}}, and a short conventions note into {{conventions}} (test style, naming, error handling).

3. Run the rubric

You get a table with a verdict per axis and the evidence each verdict rests on. Read the evidence, not just the pass/fail. A pass with weak evidence ("seems fine") is really an unsure verdict wearing a green checkmark, and on a fail-closed rubric that should resolve to a fail. The evidence column is the part that keeps the grader honest. Skim it the way you'd skim a junior reviewer's reasoning: not to second-guess every call, but to catch the one where the justification doesn't actually support the verdict.
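For anyone who wants to triage the table in a script, here is a rough sketch of that reading step. It assumes the grader returns plain pipe-delimited rows in the four-column format described earlier; the `VAGUE_EVIDENCE` phrase list and the function names are hypothetical, and you would tune the list to whatever filler your model actually produces.

```python
from dataclasses import dataclass

AXES = ("outcome", "process", "style")
# Hypothetical list of filler phrases that do not count as evidence.
VAGUE_EVIDENCE = ("seems fine", "looks good", "looks fine", "n/a", "none")


@dataclass
class Verdict:
    axis: str
    verdict: str
    evidence: str
    required_fix: str


def parse_verdict_table(text: str) -> dict[str, Verdict]:
    """Parse 'axis | verdict | evidence | required fix' rows, skipping anything else."""
    rows: dict[str, Verdict] = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Rows with the wrong cell count (including evidence that itself
        # contains a pipe) are dropped, so that axis later fails closed.
        if len(cells) != 4:
            continue
        axis = cells[0].lower()
        if axis in AXES:
            rows[axis] = Verdict(axis, cells[1].lower(), cells[2], cells[3])
    return rows


def effective_verdicts(rows: dict[str, Verdict]) -> dict[str, str]:
    """Downgrade missing rows and passes with weak evidence to fail."""
    result = {}
    for axis in AXES:
        row = rows.get(axis)
        weak = row is None or row.evidence.lower().strip("\"'. ") in ("",) + VAGUE_EVIDENCE
        result[axis] = "pass" if row and row.verdict == "pass" and not weak else "fail"
    return result
```

A phrase list is a blunt instrument. It catches the obvious green-checkmark filler, not a confident-sounding justification that doesn't support the verdict, so it narrows what you read rather than replacing the read.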

4. Act on the weakest axis

One fail is enough to send it back. A process fail (touched files it shouldn't) is often more dangerous than a style fail, even though it's less visible. Style you'll notice on read. Scope creep hides in files you didn't think to open, which is exactly why the rubric grades them as a separate axis instead of folding everything into one "looks good" verdict. Read the evidence line for any failing axis before you decide how to push back. A process fail with the evidence "edited auth/session.ts, not named in the goal" is a different conversation than a style fail over a variable name, and the fix is usually faster too: revert the out-of-scope change rather than rewrite the logic.
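A cheap cross-check on the process axis is to list the files the diff touches yourself and compare them with the files you expected. The sketch below is a complement to the rubric, not part of it. It assumes a git-style unified diff (with `diff --git` headers), and the `allowed` set is something you supply by hand from the goal; both function names are hypothetical.

```python
def touched_files(diff: str) -> set[str]:
    """Collect file paths from the ---/+++ headers of a git-style unified diff."""
    files: set[str] = set()
    in_header = True  # file headers appear before the first hunk of each file
    for line in diff.splitlines():
        if line.startswith("diff --git "):
            in_header = True
        elif line.startswith("@@"):
            in_header = False
        elif in_header and line.startswith(("--- ", "+++ ")):
            path = line[4:].split("\t")[0].strip()
            if path == "/dev/null":  # added or deleted file: the other side has the path
                continue
            if path.startswith(("a/", "b/")):
                path = path[2:]
            files.add(path)
    return files


def out_of_scope(diff: str, allowed: set[str]) -> list[str]:
    """Return touched files that are not in the reviewer-supplied allow list."""
    return sorted(touched_files(diff) - allowed)
```

With a diff that edits both `api/retry.py` and `auth/session.ts`, `out_of_scope(diff, {"api/retry.py"})` would return `["auth/session.ts"]`, which is the same evidence line you'd hope the rubric produces on its own.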

5. Re-grade after the fix

Re-run on the corrected diff. The rubric is cheap, so grade every iteration, not just the first.

Outcome pass, process fail is the dangerous one

The change that meets its goal while quietly editing files outside scope is the one that bites you later. A rubric that grades process separately from outcome catches it. A single "looks good?" prompt never will, because the diff genuinely does solve the problem you asked about.

Prompt-craft patterns for honest grading

Two patterns keep the rubric from rubber-stamping, plus a hard rule.

Evidence-required verdicts. No verdict without a quote from the diff.

For each axis, output:
- verdict: pass or fail
- evidence: a specific line or change from {{diff}} that justifies it
A verdict with no evidence is invalid; default that axis to fail.

Define what fail looks like. Lenient models pass everything unless you show them a failing example.

Process FAIL examples: edited files unrelated to {{goal}};
added a dependency not requested; changed public API without being asked.

The opinion that'll save you the most grief: make the rubric fail closed. If the model can't find evidence to pass an axis, that axis fails, full stop. Most people build rubrics that pass unless something looks wrong, which means a model that's unsure waves the change through. Flip it. Unsure equals fail. You'd rather re-examine a fine diff than merge a broken one because the grader hedged. The asymmetry of cost is the whole argument.
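The fail-closed rule can also be enforced after the fact, outside the prompt. Here's a minimal sketch under one loud assumption: that you've told the grader to wrap every quote from the diff in backticks. That convention is hypothetical, as are the function names. A pass only survives if at least one backticked quote appears verbatim in the diff.

```python
import re


def evidence_is_grounded(evidence: str, diff: str) -> bool:
    """True if at least one backtick-quoted snippet in the evidence appears in the diff."""
    # Assumption: the grader was instructed to wrap diff quotes in backticks.
    quotes = re.findall(r"`([^`]+)`", evidence)
    return any(q.strip() and q.strip() in diff for q in quotes)


def fail_closed(verdict: str, evidence: str, diff: str) -> str:
    """Only an explicit pass backed by a real quote from the diff stays a pass."""
    if verdict.strip().lower() == "pass" and evidence_is_grounded(evidence, diff):
        return "pass"
    return "fail"  # unsure, unquoted, or misquoted all resolve to fail
```

Note the shape of the function: `"fail"` is the default return and `"pass"` is the special case. That ordering is the asymmetry-of-cost argument written as control flow.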

Variables you'll set

Variable | Required | What it is
{{goal}} | Yes | The exact task the agent was asked to do
{{diff}} | Yes | The full change the agent produced
{{conventions}} | No | Repo conventions: test style, naming, error handling

A caveat on trust: a rubric prompt is a reviewer, not an oracle. It can misjudge a clever change as a fail, or miss a subtle bug the diff hides. Use it to focus human attention, especially on the failing axis, rather than as the final gate. And spot-check its calibration after a model update, because grading strictness drifts between versions.
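One way to run that spot-check is to keep a handful of diffs you've already graded by hand and compare the rubric's verdicts against them after each model update. The sketch below assumes you store both sets of verdicts as nested dicts keyed by a case id and axis; the data shape, the case ids, and `calibration_report` are all hypothetical.

```python
def calibration_report(
    expected: dict[str, dict[str, str]],
    graded: dict[str, dict[str, str]],
) -> dict:
    """Compare grader verdicts with hand labels; list passes a human marked as fail."""
    total = 0
    agree = 0
    false_passes = []
    for case_id, axes in expected.items():
        for axis, want in axes.items():
            # A verdict the grader never produced counts as fail (fail closed).
            got = graded.get(case_id, {}).get(axis, "fail")
            total += 1
            if got == want:
                agree += 1
            elif got == "pass":
                false_passes.append((case_id, axis))
    return {
        "agreement": agree / total if total else 0.0,
        "false_passes": false_passes,
    }
```

The report separates false passes from the overall agreement rate because they're the expensive direction: a grader that has drifted strict costs you a re-read, while one that has drifted lenient waves broken changes through.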

Getting started

  1. Copy the exact goal the agent was given, word for word.
  2. Generate the full diff of the agent's change.
  3. Note three or four conventions the change should respect.
  4. Paste the rubric prompt and fill {{goal}}, {{diff}}, {{conventions}}.
  5. Read the per-axis verdicts and their evidence. Send back on any fail.
  6. Re-grade the corrected diff until all three axes pass with evidence.
  7. Save the rubric so every agent change meets the same bar. The Agent Code Output Verification Rubric ships this three-axis grader ready to paste.
Skip the setup

The Agent Code Output Verification Rubric does this end-to-end: a {{diff}} variable feeds a verifier that grades outcome, process, and style on separate pass/fail rows with required evidence per verdict, fails closed when evidence is missing, and restates the verdict format for GPT-4o so the table holds. It's part of The Complete AI Prompts Bundle, a one-time lifetime license to the whole catalog plus every pack added later, which is the better deal once you run more than one of these agent jobs.

Grading is easier when the agent had a plan to grade against, which is why the task decomposition prompt for coding agents pairs so well with this: each subtask's done-test becomes a verdict. And a rubric grades the change after the fact, while spec-driven development prompts for agents shape the change before it's written. Still deciding whether a pack beats DIY? How to choose a reusable AI prompt pack lays it out.

FAQ

Common questions

How do you evaluate AI coding agent output without a harness?
Use a verification rubric prompt. It grades the agent's diff against three axes: outcome (did it meet the goal), process (did it follow constraints and touch only what it should), and style (does it match the codebase). You paste the diff and the goal, and the rubric returns a pass/fail per axis with evidence, no eval code required.

When do you need a real eval harness instead of a rubric?
When you're testing the agent itself across many tasks, repeatedly, you want a coded harness with fixtures. For grading a single change before you merge it, a rubric prompt is faster and needs no setup. The rubric handles the per-PR decision; the harness handles the regression suite.

Does the verification rubric work across models?
Yes. Claude follows a per-axis pass/fail table under an output heading well. GPT-4o needs the verdict format restated near the end of the prompt. Gemini is lenient by default, so the rubric must define what a fail looks like with concrete examples, or everything passes.