A Production Readiness Review Prompt That Grades a Service
Turn your production readiness review checklist into a prompt that scores reliability and security, then returns a ranked gap list. Copy the rubric today.
A service ships, and two weeks later it pages someone at 3 a.m. because nobody asked whether it had alerting before launch. The production readiness review checklist exists to catch that. Most teams keep one as a static doc, tick a few boxes, and move on. The boxes don't grade anything.
A prompt-driven version reads your service description and scores it. Reliability, observability, scalability, security, operational readiness: each gets a grade, an evidence line, and a pass-or-fail verdict. The gaps come back ranked by impact. That's the difference between a checklist you skim and one that tells you what's going to break.
The published org checklists are thorough and completely static. None of them grade a specific service or hand you a remediation plan. That's the gap.
What a production readiness rubric covers
A production readiness review checklist is a pre-launch assessment that scores a service across the dimensions that predict whether it survives contact with real traffic. As a prompt, it turns those dimensions into a weighted rubric with explicit criteria.
The five dimensions worth grading:
- Reliability: failure modes, retries, graceful degradation, SLOs
- Observability: logs, metrics, traces, and alerts that fire before users notice
- Scalability: load behavior, resource limits, the obvious bottleneck
- Security: authn/authz, secret handling, the exposed surface
- Operational readiness: runbooks, on-call, rollback path
Each one is a question a launching team has to answer before go-live. Bundle them into one rubric and you answer all five in a single pass.
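The five dimensions can be held as data so the same rubric text goes into every review. This is a minimal sketch: the `Dimension` class, the `rubric_as_text` helper, and the weight values are illustrative assumptions, not numbers from this guide. Only the dimension names and criteria come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    weight: float               # assumed weight; tune to your own incident history
    criteria: tuple[str, ...]   # criteria taken from the dimension list


# Weights are hypothetical and chosen to sum to 1.0.
RUBRIC = (
    Dimension("reliability", 0.30,
              ("failure modes", "retries", "graceful degradation", "SLOs")),
    Dimension("observability", 0.15,
              ("logs", "metrics", "traces", "alerts")),
    Dimension("scalability", 0.15,
              ("load behavior", "resource limits", "bottleneck")),
    Dimension("security", 0.30,
              ("authn/authz", "secret handling", "exposed surface")),
    Dimension("operational readiness", 0.10,
              ("runbooks", "on-call", "rollback path")),
)


def rubric_as_text(rubric=RUBRIC) -> str:
    """Render the rubric as bullet lines that can be pasted into a prompt."""
    return "\n".join(
        f"- {d.name} (weight {d.weight:.2f}): {', '.join(d.criteria)}"
        for d in rubric
    )


if __name__ == "__main__":
    print(rubric_as_text())
```

Keeping the rubric in one place means a change to a weight or criterion reaches every review, instead of drifting across copies of the prompt.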
Anatomy: rubric in, scored verdict out
The prompt frames the model as a launch reviewer, takes the service description in a variable, and locks the output to per-dimension scores.
Variables
{{service_description}} — architecture, deps, traffic profile
{{launch_context}} — internal tool vs public API, expected load
Prompt
Role: You are an SRE running a production readiness review.
Task: Score {{service_description}} against the five dimensions.
Weight reliability and security highest. Cite evidence
for every score; if evidence is missing, score it a gap.
Output contract
For each dimension:
  evidence: what in the description justifies the score
  score: 1-5
  gaps: what's missing or risky
overall: PASS | CONDITIONAL | BLOCK
remediation: ranked list, each with effort estimate
The evidence field does the heavy lifting. When the description doesn't mention alerting, the model can't cite any, so observability scores low automatically. Absence becomes a gap instead of a generous benefit of the doubt.
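The "no evidence means a gap" rule can also be enforced after the model responds, as a backstop. The sketch below assumes the model was asked to return JSON keyed by dimension name with `evidence`, `score`, and `gaps` fields. That JSON shape and the `enforce_evidence` helper are assumptions for illustration, not part of the prompt itself.

```python
import json

DIMENSIONS = (
    "reliability",
    "observability",
    "scalability",
    "security",
    "operational readiness",
)


def enforce_evidence(review: dict) -> dict:
    """Return a copy where any dimension without cited evidence scores 1."""
    checked = {}
    for name in DIMENSIONS:
        # A dimension missing from the response is treated as an empty entry.
        entry = dict(review.get(name, {}))
        evidence = str(entry.get("evidence", "")).strip()
        gaps = list(entry.get("gaps", []))
        if not evidence:
            entry["score"] = 1
            gaps.append("no evidence cited in the service description")
        entry["evidence"] = evidence
        entry["gaps"] = gaps
        checked[name] = entry
    return checked


if __name__ == "__main__":
    raw = """{
      "reliability": {"evidence": "retries with backoff on the DB client",
                      "score": 4, "gaps": []},
      "observability": {"evidence": "", "score": 3, "gaps": []}
    }"""
    for name, entry in enforce_evidence(json.loads(raw)).items():
        print(name, entry["score"], entry["gaps"])
```

A post-check like this does not replace the instruction in the prompt. It only catches responses where the model handed out a score without citing anything.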
The most common mistake is letting the model assume good defaults. Instruct it explicitly: if the service description doesn't state that something exists, treat it as absent and score it down. A readiness review that gives credit for unstated capabilities isn't a review. It's wishful thinking.
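One way to keep that instruction from being forgotten is to bake it into the template and fill the variables programmatically. This is a sketch under assumptions: the `render` helper is hypothetical, the prompt wording is lightly adapted from the anatomy above, and where `{{launch_context}}` sits in the prompt is my own choice.

```python
import re

# Adapted from the prompt anatomy; the last two Task lines add the
# explicit "treat unstated capabilities as absent" instruction.
PROMPT_TEMPLATE = """\
Role: You are an SRE running a production readiness review.
Task: Score the service below against the five dimensions.
Weight reliability and security highest. Cite evidence
for every score; if evidence is missing, score it a gap.
If the description does not state that something exists,
treat it as absent and score it down.

Launch context: {{launch_context}}

Service description:
{{service_description}}
"""


def render(template: str, variables: dict[str, str]) -> str:
    """Replace {{name}} placeholders; fail loudly on a missing or blank value."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        value = variables.get(name, "")
        if not value.strip():
            raise ValueError(f"missing required variable: {name}")
        return value

    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


if __name__ == "__main__":
    print(render(PROMPT_TEMPLATE, {
        "launch_context": "public API, expected load not yet measured",
        "service_description": "Stateless HTTP service in front of Postgres.",
    }))
```

Raising on a blank variable matters here: an empty service description would otherwise produce a prompt the model answers with guesses.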
Step-by-step: grading a service
1. Write the service description
A few paragraphs: what it does, its dependencies, expected traffic, how it's deployed. The richer {{service_description}} is, the less the model guesses.
2. Set the launch context
An internal cron job and a public payments API don't share a bar. {{launch_context}} tells the rubric how hard to grade.
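If you want the bar to be explicit rather than left to the model's judgment, the launch context can map to minimum scores. The context names and thresholds below are hypothetical placeholders. This is a sketch of the idea, not a recommended policy.

```python
# Hypothetical minimum scores (1-5) per launch context.
MIN_SCORES = {
    "internal_tool": {"reliability": 2, "security": 2},
    "public_api": {"reliability": 4, "security": 4, "observability": 3},
}


def failing_dimensions(scores: dict[str, int], context: str) -> list[str]:
    """List the dimensions that fall below the bar for this launch context."""
    bar = MIN_SCORES[context]
    # A dimension with no score is treated as 1, consistent with
    # "absent means gap".
    return sorted(
        name for name, minimum in bar.items()
        if scores.get(name, 1) < minimum
    )


if __name__ == "__main__":
    scores = {"reliability": 3, "security": 3, "observability": 2}
    print(failing_dimensions(scores, "internal_tool"))
    print(failing_dimensions(scores, "public_api"))
```

The same scores clear the internal bar and miss the public one, which is the point of passing the context in.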
3. Run the rubric
You get five scored dimensions, each with evidence and gaps, plus an overall verdict.
4. Read CONDITIONAL carefully
CONDITIONAL is the most useful verdict. It means launchable with named conditions. Those conditions are your pre-launch task list.
5. Work the remediation plan
The ranked remediation list is the output you act on. Highest-impact, lowest-effort gaps rise to the top.
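The ordering described here, highest impact first with lower effort breaking ties, is a two-key sort. The `Gap` class and its 1-5 scales are assumptions for illustration. The prompt itself only asks for a ranked list with effort estimates.

```python
from dataclasses import dataclass


@dataclass
class Gap:
    title: str
    impact: int  # assumed scale: 1 (minor) to 5 (likely incident)
    effort: int  # assumed scale: 1 (hours) to 5 (weeks)


def rank(gaps: list[Gap]) -> list[Gap]:
    """Sort by impact descending, then by effort ascending."""
    return sorted(gaps, key=lambda g: (-g.impact, g.effort))


if __name__ == "__main__":
    gaps = [
        Gap("write rollback runbook", impact=3, effort=1),
        Gap("add paging alerts on error rate", impact=5, effort=2),
        Gap("rotate hard-coded secrets", impact=5, effort=1),
    ]
    for gap in rank(gaps):
        print(gap.impact, gap.effort, gap.title)
```

Re-sorting the model's list yourself is a cheap check that its ranking matches the impact and effort values it reported.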
Patterns that keep the scoring honest
Weight the dimensions explicitly. A security gap on a public API should outweigh a docs gap. State the weights in the prompt so the overall verdict reflects real risk, not an unweighted average.
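To see why the weights matter, compare a weighted score with a plain average on the same grades. The weights, the blocking rule, and the 4.0 PASS threshold below are all hypothetical values chosen for the sketch.

```python
# Hypothetical weights; reliability and security are weighted highest.
WEIGHTS = {
    "reliability": 0.30,
    "security": 0.30,
    "observability": 0.15,
    "scalability": 0.15,
    "operational readiness": 0.10,
}


def weighted_score(scores: dict[str, int]) -> float:
    """Weighted mean of the 1-5 dimension scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def verdict(scores: dict[str, int]) -> str:
    # Assumed rule: a score of 2 or lower on a top-weighted dimension
    # blocks the launch regardless of the other grades.
    if min(scores["reliability"], scores["security"]) <= 2:
        return "BLOCK"
    return "PASS" if weighted_score(scores) >= 4.0 else "CONDITIONAL"


if __name__ == "__main__":
    scores = {
        "reliability": 5,
        "security": 2,
        "observability": 5,
        "scalability": 5,
        "operational readiness": 5,
    }
    unweighted = sum(scores.values()) / len(scores)
    print(f"unweighted {unweighted:.2f}, weighted {weighted_score(scores):.2f}")
    print(verdict(scores))
```

In this example the plain average is 4.4 and the weighted score is roughly 4.1, both of which look launchable. The explicit blocking rule is what turns the weak security grade into a BLOCK.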
Force evidence before score. Order the output so evidence comes before score. A model that writes the justification first scores more consistently than one that picks a number then backfills a reason.
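If the model returns JSON, the field order can be checked mechanically. This sketch assumes a response keyed by dimension, with each entry carrying `evidence` and `score` fields. The shape and the `evidence_precedes_score` helper are assumptions. It relies on Python's `json.loads` preserving the key order of the input.

```python
import json


def evidence_precedes_score(raw_json: str) -> bool:
    """True only if every dimension lists evidence before score."""
    review = json.loads(raw_json)
    for entry in review.values():
        keys = list(entry)  # dict keys keep the order from the JSON text
        if "evidence" not in keys or "score" not in keys:
            return False
        if keys.index("evidence") > keys.index("score"):
            return False
    return True


if __name__ == "__main__":
    good = '{"security": {"evidence": "mTLS between services", "score": 4}}'
    bad = '{"security": {"score": 4, "evidence": "mTLS between services"}}'
    print(evidence_precedes_score(good))
    print(evidence_precedes_score(bad))
```

A failed check is a signal to re-run the review rather than proof the score is wrong. It only tells you the model did not follow the ordering you asked for.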
Separate score from remediation. Grade first, fix second. Mixing them produces a verdict contaminated by optimism about how easy the fixes are. Keep the two phases distinct in the contract.
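One way to keep the phases distinct is to run them as two separate calls, with the grades frozen before remediation is requested. In this sketch `call_model` is a placeholder for whatever client you use, and the phase prompts are illustrative wording, not the packaged rubric.

```python
from typing import Callable


def two_phase_review(
    service_description: str,
    call_model: Callable[[str], str],
) -> dict[str, str]:
    """Grade first, then ask for fixes against the already-fixed grades."""
    grading_prompt = (
        "Phase 1. Grade only. Do not propose fixes.\n"
        "Score the service against the five dimensions, "
        "citing evidence before each score.\n\n"
        f"Service description:\n{service_description}"
    )
    grades = call_model(grading_prompt)

    remediation_prompt = (
        "Phase 2. The grades below are final; do not change any score.\n"
        "Produce a ranked remediation list with an effort estimate "
        "for each item.\n\n"
        f"Grades:\n{grades}"
    )
    plan = call_model(remediation_prompt)
    return {"grades": grades, "remediation": plan}


if __name__ == "__main__":
    # Stand-in model so the sketch runs without any API.
    def fake_model(prompt: str) -> str:
        return "GRADES" if prompt.startswith("Phase 1") else "PLAN"

    print(two_phase_review("Stateless HTTP service.", fake_model))
```

Because the second call only sees the finished grades, how easy a fix looks cannot leak back into the score.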
Variables you'll set
| Variable | Required | What it is |
|---|---|---|
| {{service_description}} | Yes | Architecture, dependencies, traffic, deploy model |
| {{launch_context}} | Yes | Internal tool vs public API; expected load |
| {{org_standards}} | No | Your team's specific must-haves to fold into the rubric |
An opinion worth holding
The unweighted readiness checklist is a comfort blanket. Ten dimensions, all equal, all green, ship it. But a service can pass nine boxes and still take down production on the one that mattered. Weight the rubric toward the dimensions that actually cause incidents on your stack, usually reliability and security, and accept a lower score elsewhere. A blunt all-equal checklist hides the one risk you should've blocked on.
Getting started
- Copy the rubric anatomy into your model of choice.
- Write a real {{service_description}} for something you're about to launch.
- Set {{launch_context}} honestly.
- Run it and read the overall verdict.
- Treat every CONDITIONAL condition as a pre-launch task.
- Re-run after fixes to confirm the verdict flips to PASS.
For a packaged version with the weights and evidence-review checklist already built, the Production Readiness Review Rubric scores all five dimensions and turns failing scores into a prioritized remediation plan with effort estimates and owners.
The Production Readiness Review Rubric does this end-to-end: a weighted five-dimension rubric with a structured evidence-review checklist that grounds every score in observable facts, plus a remediation plan you can hand to owners. It's part of The Complete AI Prompts Bundle, a one-time lifetime license to the whole catalog plus every pack added later, worth it if you review more than one service a quarter.
If you want to grade the codebase too, not just the running service, the Repo Health Scorecard Rubric scores tests, docs, CI, and dependencies on the same evidence-first model. For more on reusable rubric design, read how to choose a reusable AI prompt pack and the related repo health scorecard prompt for any codebase.