Write an AI Code Review Prompt That Actually Finds Bugs
Most AI code review prompts return polite nothing. Here's how to build an ai code review prompt with a locked output contract that catches real defects.
A developer pastes a 400-line diff into ChatGPT, types "review this," and gets back three friendly paragraphs ending in "overall this looks solid." The off-by-one in the pagination loop is still there. So is the unhandled null from the new API call. The model didn't lie. It was never told what a bug is.
That's the gap most advice about an ai code review prompt skips right past. The popular lists hand you a wall of instructions and hope. They don't lock what the model returns, so the same prompt produces a checklist one run and a pep talk the next. A review you can't trust is worse than no review, because it lulls you into shipping.
This post is about building a code review prompt that finds defects on purpose, returns them in a shape you can paste into a PR, and behaves the same way on Tuesday as it did on Monday.
Why generic review prompts return polite nothing
Ask a model to "review this code" and you've handed it an undefined job. Review for what? Correctness? Style? Security? Performance? It picks whatever feels safe, which is usually a summary plus encouragement.
Three things go wrong, every time:
- No dimensions. The model doesn't know correctness, security, and test coverage are all in scope, so it covers one and calls it done.
- No severity scale. Everything reads as equally important, so a missing semicolon and a SQL injection land in the same bland paragraph.
- No output shape. Free-form prose can't be sorted, filtered, or pasted as line comments. You read it once and lose it.
Compare that to how a senior reviewer actually works. They scan for a fixed set of failure modes, rank what they find, and write each comment as "here, this severity, because, fix it like this." That structure is the whole job. A prompt that doesn't encode it is just chat.
If you strip the output contract out of your review prompt and the result still reads identically, your prompt has no contract. The model is improvising the format every run, and improvised formats drift.
Anatomy of a review prompt that holds
The fix is structural. A working review prompt is three layers stacked: the dimensions to check, the severity scale to rank them, and the output contract that forces every finding into the same row.
Variables → {{diff}}, {{repo_conventions}}, {{focus_areas}}
Role → You are a senior reviewer enforcing a fixed standard.
Dimensions→ correctness, security, error handling, tests, readability
Severity → blocking | major | minor | nit
Output → one row per finding:
FILE:LINE | SEVERITY | DIMENSION | why it's wrong | suggested fix
then a blocking-issues count and a merge verdict
The {{diff}} variable is what you reuse. The rest is the standard, written once. Notice the output contract sits at the end. That placement isn't cosmetic. Models weight the most recent tokens, so the format spec belongs after the pasted diff, not before it. Put the contract first and a long diff buries it; the model forgets the shape it promised by the time it's done reading.
Model behavior: Claude vs GPT-4o for review
The same prompt doesn't behave the same way across models, and pretending it does is how reviews silently degrade.
| Behavior | Claude | GPT-4o |
|---|---|---|
Honors a ## Output format heading | Reliably, even across 500-line diffs | Drifts to prose after ~5 files unless restated |
| Severity discipline | Holds the scale you defined | Inflates everything to "major" without anchors |
| Long-diff stamina | Keeps the row format to the end | Needs the contract repeated on the final line |
| Refusing to invent issues | Strong with an explicit "report nothing if clean" line | Will pad with nits to look thorough |
The practical takeaway: a system prompt that locks the standard once works better on Claude. On GPT-4o, restate the output contract near the end of every call. Both need an explicit instruction that finding nothing is a valid result, or they'll manufacture filler to seem useful.
This is the kind of model-by-model detail the generic prompt lists never give you, and it's exactly why a packaged, tested prompt beats a copied snippet.
Prompt-craft patterns that lift bug yield
Pattern 1: anchor the severity scale
Don't just name the levels. Define them, or the model invents its own.
Severity:
- blocking: data loss, security hole, crash, or wrong result for valid input
- major: likely bug under common conditions, or missing error handling
- minor: works but fragile; would fail review at a strict shop
- nit: style or naming; never blocks merge
With anchors, the model stops calling a variable rename "major." Without them, severity is noise.
Pattern 2: force evidence per finding
Make every row cite the line and the reason. "Looks risky" isn't a finding. "Line 84 dereferences user before the null check added on line 79 runs" is. Requiring a why column kills the model's instinct to gesture vaguely.
Pattern 3: separate the verdict from the findings
End with a hard count: blocking issues found, merge verdict (BLOCK or APPROVE). That single line is what a CI gate or a teammate reads first. Bury it in prose and it's useless.
Stop asking the model to "be thorough." Thoroughness isn't a quality you can request; it's a structure you impose. A tight five-dimension checklist with anchored severities finds more real bugs than a paragraph begging for diligence, and it costs fewer tokens.
Variables you'll set
| Variable | Required | What it is |
|---|---|---|
{{diff}} | Yes | The unified diff or full file under review |
{{repo_conventions}} | No | Repo-specific rules the review must enforce |
{{focus_areas}} | No | Narrow the review (e.g. "concurrency, auth") |
{{severity_threshold}} | No | The level that blocks merge; defaults to blocking |
Keep the variable set small. A review prompt with twelve knobs is one nobody fills in correctly. Four is plenty, and only one is mandatory.
Getting started
- Write the five dimensions your team actually cares about. Not ten. Five.
- Define each severity level with a concrete anchor, not a synonym.
- Put
{{diff}}near the top, the output contract on the last line. - Add the line: "If no issues at or above the threshold, return an empty findings table and APPROVE."
- Test it on a diff you already know has a bug. Did it catch it? Did it rank it right?
- Save the working version as a system prompt so every reviewer, human or agent, runs the same standard. The Code Review Policy System Prompt packages exactly this: four prompts covering dimensions, severity tiers, repo-specific must-checks, and the actionable comment format.
If your job is the whole pull request rather than a single file, the Pull Request Review Workflow Pack turns a full diff into a prioritised report across architecture, defect risk, security, and test gaps.
The Code Review Policy System Prompt does this end-to-end. It encodes the severity tiers and the location-severity-rationale-fix comment format as a reusable system prompt, so a fresh agent or a new hire reviews to the same bar from day one. It's part of The Complete AI Prompts Bundle, a one-time lifetime license to the whole catalog plus every pack added later, which pays off fast if you also run security audits or CI-gate reviews.
A review prompt is only as good as the standard it carries. Once you've got correctness and severity locked, the next jobs are narrower: the diff-to-verdict workflow in the PR review prompt template, and the vulnerability-specific lens in the security code review prompt mapped to CWE. Different jobs, same principle: name the dimensions, anchor the severity, lock the output.
Browse the developer prompt packs →Common questions
What makes a good AI code review prompt?
Why does ChatGPT say my code looks great when it has bugs?
Does Claude or GPT-4o review code better?
Get the prompt packs this guide is built on
Ready-to-paste prompts with documented variables and worked examples for ChatGPT, Claude, and Gemini. One-time payment, own it forever.
More prompt guides

A Production Readiness Review Prompt That Grades a Service
A service ships, and two weeks later it pages someone at 3 a.m. because nobody asked whether it had alerting before launch. The production readiness review checklist exists to catch that. Most teams k…

An AI PR Review Prompt Template for Clean Diffs
The difference between a PR review that catches the regression and one that waves it through usually isn't the model. It's whether the prompt has a workflow or just a wish. "Review this pull request"…

The AI Prompt to Review a Pull Request (With a Findings Contract)
A pull request review prompt that you retype from scratch every time isn't a workflow. It's a habit you'll skip the moment you're busy. The reusable version, with a real AI security code review prompt…