Turn an API Design Review Checklist into a Scored Prompt
A static API design review checklist tells you what to look at. A scored prompt rubric grades the design and emits pass/fail per dimension with evidence from the spec.
A team ships an API, consumers integrate, and six months later everyone's stuck with getUserData, fetchUserInfo, and user_details as three endpoints that return overlapping shapes. The design review was a meeting where someone skimmed the spec and said "looks fine." There was a checklist somewhere. Nobody scored against it.
That's the weakness of a prose API design review checklist: it lists what to look at and leaves the judgment to whoever's in the room that day. The same spec passes on Monday and fails on Friday depending on who's reviewing. The fix is to make the checklist executable: a scored prompt rubric that grades each dimension, cites evidence from the spec, and ranks fixes by how much they'll hurt consumers.
Why a static checklist drifts
A checklist is a memory aid, not a standard. It tells you to "check error handling" but not what good looks like or how much it matters. Three problems follow:
- No weights. A naming nit and a missing pagination contract sit as equal checkboxes, so reviewers spend the same energy on both.
- No evidence. "Versioning: OK" records an opinion with nothing behind it. Six months on, nobody knows why it was OK.
- No ranking. The output is a flat list of ticks and crosses, so the team fixes the easy things and ships the painful ones.
A scored rubric closes each gap. Weighted dimensions force priority. An evidence requirement ties every score to a line in the spec. A risk-ranked fix list puts consumer pain first.
An API design review rubric is a prompt that scores a spec across weighted dimensions — consistency, usability, versioning, scalability, operations — and returns a pass/fail per dimension with cited evidence and a prioritised fix list. The weights and evidence are what make it repeatable instead of subjective.
Anatomy of the scored rubric prompt
Variables → {{api_spec}}, {{consumer_scenarios}}, {{weights}}
Role → API reviewer applying a fixed weighted rubric.
Dimensions → consistency, usability, versioning, scalability, operations
Per dimension → score 1-5, weight, evidence (quote the spec), pass/fail
Composite → weighted score and an overall verdict
Fix list → ordered by consumer pain, each with the dimension it lifts
The {{consumer_scenarios}} variable is the one teams forget. A design that scores well in the abstract can still be miserable for the actual integration paths. Feed the model "a mobile client paginating 10k records on a slow connection" and pagination problems that looked theoretical become blocking. Evidence beats taste, and consumer scenarios are where the evidence lives.
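The "Per dimension" and "Composite" lines of the anatomy reduce to a small amount of arithmetic once the model's output has been parsed. The sketch below is one way to hold that result, not the rubric's actual implementation. The `DimensionScore` class, the pass threshold of 3, the percentage-style weights, and the rule that one failing dimension fails the review are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumption: a dimension passes at 3 or above on the 1-5 scale.
PASS_THRESHOLD = 3


@dataclass
class DimensionScore:
    name: str
    score: int      # 1-5, as returned by the model
    weight: float   # relative weight; any positive scale works
    evidence: str   # verbatim quote from the spec

    @property
    def passed(self) -> bool:
        return self.score >= PASS_THRESHOLD


def composite(scores: list[DimensionScore]) -> float:
    """Weighted mean of the dimension scores, on the same 1-5 scale."""
    total_weight = sum(d.weight for d in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(d.score * d.weight for d in scores) / total_weight


def verdict(scores: list[DimensionScore]) -> str:
    """Assumption: the review passes only if every dimension passes."""
    return "pass" if all(d.passed for d in scores) else "fail"


# Hypothetical review result; the evidence strings are placeholders.
example = [
    DimensionScore("consistency", 2, 25, "GET /getUserData ... GET /user_details"),
    DimensionScore("usability", 4, 25, "placeholder quote"),
    DimensionScore("versioning", 3, 20, "placeholder quote"),
    DimensionScore("scalability", 4, 15, "placeholder quote"),
    DimensionScore("operations", 5, 15, "placeholder quote"),
]
print(composite(example), verdict(example))  # 3.45 fail
```

Note that the composite lands above the threshold while the verdict is still "fail". That is the reason to keep the per-dimension pass/fail alongside the weighted score: a strong average should not hide a failing dimension.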
Model behavior when scoring a spec
Scoring is a different task from free-form review, and models handle it differently.
Claude is steady at holding a 1-to-5 scale with anchors and at quoting the spec as evidence rather than paraphrasing. GPT-4o scores fluently too, but without anchored criteria it clusters everything around 3-4, which makes the rubric useless. Both models will invent spec details that aren't there if the spec is incomplete, so an explicit "if the spec doesn't say, score it as a gap, don't assume" instruction keeps them honest. That one line prevents the most common failure: a generous score for behavior the API never actually documents.
| Behavior | Claude | GPT-4o |
|---|---|---|
| Holds an anchored 1-5 scale | Reliable | Clusters mid-scale without anchors |
| Quotes spec as evidence | Strong | Paraphrases unless told to quote |
| Invents missing detail | Rare with the "score gaps as gaps" rule | Needs the rule stated explicitly |
| Ranks fixes by impact | Good with consumer scenarios | Good with consumer scenarios |
Rank the fix list by consumer pain, never by reviewer effort. The temptation is to lead with quick wins, but a quick win that no consumer feels is theater. The pagination contract that's annoying to add but saves every mobile client belongs at the top, even though it's the hard one.
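If the model returns the fix list as structured data, the ordering rule can also be enforced in code rather than trusted. A minimal sketch, assuming each fix carries a hypothetical `consumer_pain` rating from 1 to 5 and an `effort` rating that is recorded but deliberately ignored when sorting:

```python
from dataclasses import dataclass


@dataclass
class Fix:
    title: str
    dimension: str       # the rubric dimension this fix lifts
    consumer_pain: int   # assumed scale: 1 (nobody notices) to 5 (blocks clients)
    effort: int          # assumed scale: 1 (quick win) to 5 (hard); not used for ranking


def rank_fixes(fixes: list[Fix]) -> list[Fix]:
    """Order fixes by consumer pain, highest first.

    Effort is intentionally left out of the sort key. sorted() is stable,
    so fixes with equal pain keep the order the reviewer gave them.
    """
    return sorted(fixes, key=lambda f: f.consumer_pain, reverse=True)


# Hypothetical fix list: the quick win comes first in the input.
fixes = [
    Fix("Rename user_details to match the other endpoints", "consistency", 2, 1),
    Fix("Add a pagination contract to list endpoints", "scalability", 5, 4),
]
for fix in rank_fixes(fixes):
    print(fix.consumer_pain, fix.title)
```

The hard pagination fix sorts to the top even though the rename is cheaper, which is the behavior the paragraph above asks for.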
Prompt-craft patterns for design rubrics
Pattern 1: anchor every score
5: meets REST/HTTP best practice with no gaps a consumer would hit
3: usable but with a documented rough edge
1: actively misleading or guaranteed to break a common client
Without anchors, a 3 means nothing. With them, two reviewers (or two runs) land in the same place.
Pattern 2: require a spec quote per finding
Make the model paste the offending line. "Inconsistent error format" is an opinion; the two different error bodies quoted side by side is proof. Quotes also make the review auditable later.
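A quote requirement is only useful if the quotes are real, and that part can be checked mechanically. The sketch below assumes the model's findings have been parsed into dictionaries with a hypothetical `evidence` key, and flags any finding whose quote does not appear in the spec once whitespace is normalized. It catches invented or paraphrased evidence; it does not judge whether a genuine quote actually supports the score.

```python
import re


def normalize(text: str) -> str:
    """Collapse all runs of whitespace so re-indented quotes still match."""
    return re.sub(r"\s+", " ", text).strip()


def unverified_findings(spec: str, findings: list[dict]) -> list[dict]:
    """Return findings whose evidence is missing, empty, or not found in the spec."""
    spec_norm = normalize(spec)
    flagged = []
    for finding in findings:
        quote = normalize(finding.get("evidence", ""))
        if not quote or quote not in spec_norm:
            flagged.append(finding)
    return flagged


# Hypothetical spec fragment and findings.
spec = """
paths:
  /users:
    get:
      summary: List users
"""
findings = [
    {"dimension": "usability", "evidence": "get:\n  summary: List users"},
    {"dimension": "scalability", "evidence": "limit: 100"},  # not in the spec
    {"dimension": "versioning"},                             # no quote at all
]
for finding in unverified_findings(spec, findings):
    print("unverified:", finding["dimension"])
```

Findings that come back flagged are candidates for a re-run or a manual look, since they are where a generous score for undocumented behavior tends to hide.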
Pattern 3: separate score from fix
Score the design as it is. Then, separately, list what would raise each failing dimension. Mixing the two produces hedged scores ("it's a 3, but if you fixed X it'd be a 5"), which defeats the grading.
Variables you'll set
| Variable | Required | What it is |
|---|---|---|
| {{api_spec}} | Yes | The OpenAPI doc, schema, or endpoint definitions |
| {{consumer_scenarios}} | No | Real integration paths to score usability against |
| {{weights}} | No | Per-dimension weights if the defaults don't fit |
| {{standard}} | No | House style or REST conventions to enforce |
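To show how the required and optional variables interact, here is a rough sketch of filling a rubric template in code. The template wording, the `build_prompt` helper, the equal default weights, and the fallback strings are all assumptions for illustration; the packaged rubric may word and default these differently.

```python
import re

# Assumption: equal weights when none are supplied.
DEFAULT_WEIGHTS = (
    "consistency 20, usability 20, versioning 20, scalability 20, operations 20"
)

# Hypothetical template text built from the anatomy described in this guide.
TEMPLATE = """You are an API reviewer applying a fixed weighted rubric.
Dimensions and weights: {{weights}}
Standard to enforce: {{standard}}
Consumer scenarios: {{consumer_scenarios}}
For each dimension give: score 1-5, weight, evidence (quote the spec), pass/fail.
Score undocumented behavior as a gap; never assume intent.
Then give a composite score, an overall verdict, and a fix list ordered by consumer pain.

Spec:
{{api_spec}}
"""


def build_prompt(
    api_spec: str,
    consumer_scenarios: str = "",
    weights: str = "",
    standard: str = "",
) -> str:
    """Fill the template; only api_spec is required."""
    if not api_spec.strip():
        raise ValueError("api_spec is required")
    values = {
        "api_spec": api_spec,
        "consumer_scenarios": consumer_scenarios or "none provided",
        "weights": weights or DEFAULT_WEIGHTS,
        "standard": standard or "general REST/HTTP conventions",
    }
    # Single-pass substitution, so braces inside the spec are never re-expanded.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], TEMPLATE)


prompt = build_prompt(
    api_spec="openapi: 3.0.0",
    consumer_scenarios="a mobile client paginating 10k records on a slow connection",
)
print(prompt)
```

Failing fast on an empty spec keeps a blank `{{api_spec}}` from producing a confident review of nothing, and explicit fallbacks make it visible in the prompt when an optional variable was left out.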
Getting started
- Fix your five dimensions and their weights before you read a single spec.
- Anchor the 1-to-5 scale with concrete descriptions, not adjectives.
- Paste the spec into {{api_spec}} and the real integration paths into {{consumer_scenarios}}.
- Add "score undocumented behavior as a gap; never assume intent."
- Read the fix list. Is the highest-consumer-pain item first, even if it's the hard one?
- Save the rubric so every API clears the same bar. The API Design Review Evaluation Rubric ships this: five scored dimensions with weights and a pass/fail verdict, each backed by cited spec evidence, ending in a fix list ordered by consumer pain.
A design rubric checks the shape of the API. To guard against breaking that shape later, the API Contract Test Harness Pack generates tests that fail the moment a breaking change reaches the producer.
The API Design Review Evaluation Rubric does this end-to-end. It scores consistency, usability, versioning, scalability, and operations against explicit criteria with weights, cites the spec for every score, and outputs a prioritised fix list — so a review is a graded artifact, not a meeting opinion. It's part of The Complete AI Prompts Bundle, a one-time lifetime license to the whole catalog plus future packs, worth it once you're reviewing more than one service's API.
A weighted rubric is the same machine whether you're grading an API or a whole repository. For the codebase-level version, see the repo health scorecard prompt. And if your API review is one gate inside a larger PR flow, the AI PR review prompt template shows how the verdict slots in.