A Repo Health Scorecard Prompt for Any Codebase
What a repo health scorecard prompt is and how to build one that scores tests, docs, CI, and dependencies — then turns the score into a fundable improvement plan.
A new engineer joins, clones the repo, and spends two days just figuring out how to run it. The tests pass locally but not in CI, the README documents a setup that changed a year ago, and three dependencies are two majors behind. None of this is in a ticket. It's the ambient cost of an unhealthy repo, and nobody measured it until it slowed the team down.
A repo health scorecard prompt measures it on purpose. It scores the repository across the dimensions that actually predict friction — tests, docs, CI, dependencies, structure — and turns the result into a plan a lead can fund. The tools and metric definitions you'll find searching this topic stop at "here's what a health score is." They don't give you a prompt that reads your repo and produces one.
What a repo health scorecard prompt is
It's a reusable prompt that applies a weighted rubric to a codebase and returns a score per dimension, a composite, and a ranked list of what to fix first. The rubric is fixed; the repo is the variable. That's what lets the same prompt grade a Python service and a TypeScript monorepo without rewriting it.
The dimensions that matter aren't mysterious:
- Tests. Do they exist, run, and cover the risky paths?
- Docs. Can someone set up and contribute without asking?
- CI. Is it green, fast, and actually gating?
- Dependencies. Current, pinned, and free of known risk?
- Structure. Can a human or an agent navigate it?
A repo health scorecard prompt scores a repository across weighted dimensions and outputs a number per dimension plus a sequenced improvement plan. It works on any codebase because the standard lives in the rubric, not in the repo, so the same prompt grades wildly different projects on the same scale.
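To make the arithmetic behind "a score per dimension, a composite, and a ranked list" concrete, here is a minimal sketch. The weights, the 0 to 5 scale, and the sample scores are all assumptions for illustration; the article names the dimensions but does not fix any of those values.

```python
# Hypothetical weights and a hypothetical 0-5 scale; tune both to your team.
WEIGHTS = {"tests": 0.25, "docs": 0.20, "ci": 0.20, "dependencies": 0.20, "structure": 0.15}
MAX_SCORE = 5


def composite(scores: dict[str, float]) -> float:
    """Weighted average of the per-dimension scores."""
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / sum(WEIGHTS.values())


def fix_first(scores: dict[str, float]) -> list[str]:
    """Dimensions ordered by weighted shortfall from the top of the scale, largest first."""
    return sorted(WEIGHTS, key=lambda d: WEIGHTS[d] * (MAX_SCORE - scores[d]), reverse=True)


# Made-up scores for an imaginary Python service.
python_service = {"tests": 2, "docs": 4, "ci": 3, "dependencies": 1, "structure": 5}
print(round(composite(python_service), 2))
print(fix_first(python_service))
```

The same two functions would grade a TypeScript monorepo unchanged, since only the `scores` dictionary varies between repos.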
Why a scanning tool isn't the whole answer
Tools like OpenSSF Scorecard are genuinely useful, and they measure things a prompt shouldn't try to: branch protection, signed releases, pinned dependencies. But they only see what's mechanically checkable. They can't read your README and judge whether a stranger could follow it. They can't tell you the test suite is technically present but covers only the happy path.
That subjective layer is where a scorecard prompt earns its keep. It reads the structure and the docs the way a reviewer would, scores the soft dimensions, and writes a sentence per score that a manager can act on. Run both: the scanner for the mechanical signals, the prompt for the judgment ones.
Agent-readiness: the dimension nobody scores
Here's the one most health checks miss. Agent-readiness scores how well an AI coding assistant can work in the repo today. That means clear module boundaries, documented conventions, a build that runs from a clean clone, and tests an agent can execute and read. It sounds futuristic, but it's just hygiene with a new name. A repo an agent can navigate is almost always one a new hire can navigate too, because both are defeated by the same things: hidden setup steps, undocumented conventions, and a test suite that only the original author can run.
Model behavior when scoring a repo
Scoring a whole repo stresses the model's tendency to guess at things it can't see.
Claude is comfortable saying "this dimension can't be scored from what's provided" rather than inventing a number, which keeps the scorecard honest. GPT-4o scores fluently but will confidently rate test coverage it never actually saw unless you require evidence per score. Both behave far better when you feed them the repo tree and key files rather than asking them to imagine a typical repo. The fix is the same as for any rubric: anchor the scale, demand evidence, and explicitly allow "insufficient information" as a result so the model stops bluffing.
| Behavior | Claude | GPT-4o |
|---|---|---|
| Admits when it can't score a dimension | Comfortable | Bluffs unless evidence is required |
| Holds an anchored scale | Reliable | Clusters mid-range without anchors |
| Scores agent-readiness coherently | Strong | Strong with the dimension defined |
| Produces a sequenced plan | Good with effort-vs-impact framing | Good with the framing |
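One way to enforce "demand evidence" and "allow insufficient information" is to check the model's reply before trusting it. The sketch below assumes you asked for JSON shaped like `{"tests": {"score": 2, "evidence": "..."}}`; that schema, the 0 to 5 range, and the function name `check_scores` are hypothetical choices, not something any model produces by default.

```python
import json

INSUFFICIENT = "insufficient information"


def check_scores(raw_reply: str, low: int = 0, high: int = 5) -> dict[str, list[str]]:
    """Sort each dimension of a JSON reply into accepted, unscored, or rejected."""
    report: dict[str, list[str]] = {"accepted": [], "unscored": [], "rejected": []}
    for dimension, entry in json.loads(raw_reply).items():
        score = entry.get("score")
        evidence = (entry.get("evidence") or "").strip()
        is_number = isinstance(score, int) and not isinstance(score, bool)
        if score == INSUFFICIENT:
            # An honest "can't tell" is a valid result, not an error.
            report["unscored"].append(dimension)
        elif is_number and low <= score <= high and evidence:
            report["accepted"].append(dimension)
        else:
            # A number with no evidence, or a value off the scale, is treated as a bluff.
            report["rejected"].append(dimension)
    return report
```

Rejected dimensions can be sent back to the model with a request for the missing evidence, while unscored ones tell you which files to add to the input.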
A health score without a funded plan is a vanity metric. The number's only job is to justify the work that follows. So weight the rubric toward what unblocks the team, sequence the fixes by impact over effort, and attach owners. A scorecard that ends in "you're a 6/10" and nothing else gets admired once and ignored forever.
Prompt-craft patterns for repo scoring
Pattern 1: feed structure, don't ask for imagination
Give the model the directory tree, the README, the CI config, and the dependency manifest. A repo scored from these is grounded. A repo scored from "assume a normal project" is fiction.
Pattern 2: require evidence per dimension
Each score cites what it's based on: "tests scored 2 — tests/ exists but covers only utils, no integration tests for the API layer." That sentence is what makes the score defensible and the fix obvious.
Pattern 3: end in a sequenced plan, not a number
The composite score is the headline; the plan is the product. Order fixes by impact over effort, suggest an owner, and give each an acceptance criterion. Now it's fundable.
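Ordering by impact over effort is a one-line sort once each fix carries both estimates. This sketch assumes 1 to 5 estimates for impact and effort and a hypothetical `Fix` record; the sample fixes, owners, and acceptance criteria are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fix:
    title: str
    impact: int      # hypothetical 1-5 estimate of how much friction the fix removes
    effort: int      # hypothetical 1-5 estimate of the work involved
    owner: str
    acceptance: str  # the check that says the fix is done


def sequence(fixes: list[Fix]) -> list[Fix]:
    """Highest impact-per-effort first; ties go to the cheaper fix."""
    return sorted(fixes, key=lambda f: (-f.impact / f.effort, f.effort))


plan = sequence([
    Fix("Upgrade dependencies that are two majors behind", 3, 4, "platform",
        "manifest pins current majors and CI is green"),
    Fix("Make CI run the same test command as local", 5, 2, "backend",
        "one documented command passes locally and in CI"),
    Fix("Rewrite README setup steps", 4, 1, "docs",
        "a new hire runs the service from a clean clone"),
])
for step, fix in enumerate(plan, start=1):
    print(f"{step}. {fix.title} (owner: {fix.owner}; done when {fix.acceptance})")
```

Dividing impact by effort is only one reasonable ranking; a team that is capacity-constrained might sort by effort alone.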
Variables you'll set
| Variable | Required | What it is |
|---|---|---|
| {{repo_tree}} | Yes | The directory structure and key file list |
| {{key_files}} | Yes | README, CI config, dependency manifest, sample tests |
| {{weights}} | No | Per-dimension weights if the defaults don't fit |
| {{context}} | No | Team size or stage, to calibrate what "healthy" means |
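Filling these variables can be done with a small substitution helper that refuses to run when a required one is missing. This is a sketch under assumptions: the template text, the fallback strings for the optional variables, and the function name `render` are hypothetical, and only the four variable names come from the table above.

```python
import re

REQUIRED = {"repo_tree", "key_files"}
# Hypothetical fallbacks used when an optional variable is not supplied.
OPTIONAL_DEFAULTS = {"weights": "use the default weights", "context": "not provided"}


def render(template: str, values: dict[str, str]) -> str:
    """Replace {{name}} placeholders, failing fast on missing required variables."""
    provided = {name for name, value in values.items() if value.strip()}
    missing = REQUIRED - provided
    if missing:
        raise ValueError(f"missing required variables: {sorted(missing)}")
    merged = {**OPTIONAL_DEFAULTS, **values}
    # Unknown placeholders are left untouched so they stay visible in the output.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: merged.get(m.group(1), m.group(0)), template)


# Hypothetical template fragment, not the text of any published prompt.
template = "Repo tree:\n{{repo_tree}}\n\nKey files:\n{{key_files}}\n\nContext: {{context}}"
print(render(template, {"repo_tree": "src/\ntests/", "key_files": "--- README.md ---\n# Demo"}))
```

Failing on a missing required variable is deliberate: a prompt sent without the tree or key files is the "assume a normal project" case the patterns above warn against.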
Getting started
- Pick five weighted dimensions. Tests, docs, CI, dependencies, structure is a strong default; add agent-readiness if AI assistants touch the repo.
- Anchor the scale and require an evidence sentence per score.
- Feed the model the tree and key files; don't make it guess.
- Allow "insufficient information" so it stops bluffing when data is missing.
- Demand a sequenced plan with owners and acceptance criteria, not just a number.
- Save the rubric so every repo gets scored the same way. The Repo Health Scorecard Rubric ships this: five weighted dimensions including agent-readiness, gaps ranked by business impact and effort, and a sequenced improvement plan with owner suggestions and acceptance criteria.
Before a service goes live, the related Production Readiness Review Rubric scores reliability, observability, scalability, and security with the same weighted-rubric approach and a pass/fail verdict.
The Repo Health Scorecard Rubric does this end-to-end. It scores five weighted dimensions in one consistent pass, includes the agent-readiness dimension most checks skip, and turns the result into a prioritised, fundable plan with owners and acceptance criteria, so the score actually drives work. It's part of The Complete AI Prompts Bundle, a one-time lifetime license to the whole catalog and every pack added later, which pays off once you're grading more than one repo.
The weighted-rubric pattern here is the same engine that grades an API surface — see the API design review checklist as a scored prompt for that sibling. And if you're weighing whether a packaged rubric beats writing your own, how to choose a reusable AI prompt pack lays out what to look for.