Skip to main content
CODE REVIEWAI PROMPTSCLAUDE PROMPTSPROMPT DESIGN

Write an AI Code Review Prompt That Actually Finds Bugs

Most AI code review prompts return polite nothing. Here's how to build an ai code review prompt with a locked output contract that catches real defects.

PPromptsCart Team·June 5, 2026·Updated June 14, 2026·7 min read

A developer pastes a 400-line diff into ChatGPT, types "review this," and gets back three friendly paragraphs ending in "overall this looks solid." The off-by-one in the pagination loop is still there. So is the unhandled null from the new API call. The model didn't lie. It was never told what a bug is.

That's the gap most advice about an ai code review prompt skips right past. The popular lists hand you a wall of instructions and hope. They don't lock what the model returns, so the same prompt produces a checklist one run and a pep talk the next. A review you can't trust is worse than no review, because it lulls you into shipping.

This post is about building a code review prompt that finds defects on purpose, returns them in a shape you can paste into a PR, and behaves the same way on Tuesday as it did on Monday.

Why generic review prompts return polite nothing

Ask a model to "review this code" and you've handed it an undefined job. Review for what? Correctness? Style? Security? Performance? It picks whatever feels safe, which is usually a summary plus encouragement.

Three things go wrong, every time:

  1. No dimensions. The model doesn't know correctness, security, and test coverage are all in scope, so it covers one and calls it done.
  2. No severity scale. Everything reads as equally important, so a missing semicolon and a SQL injection land in the same bland paragraph.
  3. No output shape. Free-form prose can't be sorted, filtered, or pasted as line comments. You read it once and lose it.

Compare that to how a senior reviewer actually works. They scan for a fixed set of failure modes, rank what they find, and write each comment as "here, this severity, because, fix it like this." That structure is the whole job. A prompt that doesn't encode it is just chat.

The one-line test

If you strip the output contract out of your review prompt and the result still reads identically, your prompt has no contract. The model is improvising the format every run, and improvised formats drift.

Anatomy of a review prompt that holds

The fix is structural. A working review prompt is three layers stacked: the dimensions to check, the severity scale to rank them, and the output contract that forces every finding into the same row.

Variables → {{diff}}, {{repo_conventions}}, {{focus_areas}}
Role      → You are a senior reviewer enforcing a fixed standard.
Dimensions→ correctness, security, error handling, tests, readability
Severity  → blocking | major | minor | nit
Output    → one row per finding:
            FILE:LINE | SEVERITY | DIMENSION | why it's wrong | suggested fix
            then a blocking-issues count and a merge verdict

The {{diff}} variable is what you reuse. The rest is the standard, written once. Notice the output contract sits at the end. That placement isn't cosmetic. Models weight the most recent tokens, so the format spec belongs after the pasted diff, not before it. Put the contract first and a long diff buries it; the model forgets the shape it promised by the time it's done reading.

Model behavior: Claude vs GPT-4o for review

The same prompt doesn't behave the same way across models, and pretending it does is how reviews silently degrade.

BehaviorClaudeGPT-4o
Honors a ## Output format headingReliably, even across 500-line diffsDrifts to prose after ~5 files unless restated
Severity disciplineHolds the scale you definedInflates everything to "major" without anchors
Long-diff staminaKeeps the row format to the endNeeds the contract repeated on the final line
Refusing to invent issuesStrong with an explicit "report nothing if clean" lineWill pad with nits to look thorough

The practical takeaway: a system prompt that locks the standard once works better on Claude. On GPT-4o, restate the output contract near the end of every call. Both need an explicit instruction that finding nothing is a valid result, or they'll manufacture filler to seem useful.

This is the kind of model-by-model detail the generic prompt lists never give you, and it's exactly why a packaged, tested prompt beats a copied snippet.

Prompt-craft patterns that lift bug yield

Pattern 1: anchor the severity scale

Don't just name the levels. Define them, or the model invents its own.

Severity:
- blocking: data loss, security hole, crash, or wrong result for valid input
- major: likely bug under common conditions, or missing error handling
- minor: works but fragile; would fail review at a strict shop
- nit: style or naming; never blocks merge

With anchors, the model stops calling a variable rename "major." Without them, severity is noise.

Pattern 2: force evidence per finding

Make every row cite the line and the reason. "Looks risky" isn't a finding. "Line 84 dereferences user before the null check added on line 79 runs" is. Requiring a why column kills the model's instinct to gesture vaguely.

Pattern 3: separate the verdict from the findings

End with a hard count: blocking issues found, merge verdict (BLOCK or APPROVE). That single line is what a CI gate or a teammate reads first. Bury it in prose and it's useless.

Opinion worth holding

Stop asking the model to "be thorough." Thoroughness isn't a quality you can request; it's a structure you impose. A tight five-dimension checklist with anchored severities finds more real bugs than a paragraph begging for diligence, and it costs fewer tokens.

Variables you'll set

VariableRequiredWhat it is
{{diff}}YesThe unified diff or full file under review
{{repo_conventions}}NoRepo-specific rules the review must enforce
{{focus_areas}}NoNarrow the review (e.g. "concurrency, auth")
{{severity_threshold}}NoThe level that blocks merge; defaults to blocking

Keep the variable set small. A review prompt with twelve knobs is one nobody fills in correctly. Four is plenty, and only one is mandatory.

Getting started

  1. Write the five dimensions your team actually cares about. Not ten. Five.
  2. Define each severity level with a concrete anchor, not a synonym.
  3. Put {{diff}} near the top, the output contract on the last line.
  4. Add the line: "If no issues at or above the threshold, return an empty findings table and APPROVE."
  5. Test it on a diff you already know has a bug. Did it catch it? Did it rank it right?
  6. Save the working version as a system prompt so every reviewer, human or agent, runs the same standard. The Code Review Policy System Prompt packages exactly this: four prompts covering dimensions, severity tiers, repo-specific must-checks, and the actionable comment format.
See the Code Review Policy System Prompt

If your job is the whole pull request rather than a single file, the Pull Request Review Workflow Pack turns a full diff into a prioritised report across architecture, defect risk, security, and test gaps.

Skip the setup

The Code Review Policy System Prompt does this end-to-end. It encodes the severity tiers and the location-severity-rationale-fix comment format as a reusable system prompt, so a fresh agent or a new hire reviews to the same bar from day one. It's part of The Complete AI Prompts Bundle, a one-time lifetime license to the whole catalog plus every pack added later, which pays off fast if you also run security audits or CI-gate reviews.

Get the Code Review Policy System Prompt

A review prompt is only as good as the standard it carries. Once you've got correctness and severity locked, the next jobs are narrower: the diff-to-verdict workflow in the PR review prompt template, and the vulnerability-specific lens in the security code review prompt mapped to CWE. Different jobs, same principle: name the dimensions, anchor the severity, lock the output.

Browse the developer prompt packs
FAQ

Common questions

What makes a good AI code review prompt?
A good ai code review prompt names the review dimensions, fixes a severity scale, and locks the output to one row per finding (location, severity, why, suggested fix). Without that contract the model returns vague prose that nobody can act on.
Why does ChatGPT say my code looks great when it has bugs?
General chat defaults to agreeable summaries. The model wasn't told what counts as a defect or what severity means, so it reaches for praise. Give it explicit dimensions and a blocking-versus-advisory threshold and the tone changes immediately.
Does Claude or GPT-4o review code better?
Claude follows a structured severity rubric in a system prompt more consistently across long diffs. GPT-4o needs the output format restated near the end of the prompt or it drifts back to narrative after the first few files.
Stop reading. Start shipping.

Get the prompt packs this guide is built on

Ready-to-paste prompts with documented variables and worked examples for ChatGPT, Claude, and Gemini. One-time payment, own it forever.