Build an LLM Eval Harness With a Prompt That Designs the Eval Set First
Want to build an LLM eval harness without writing YAML first? Use a prompt that designs the eval set, scoring, and regression cases before any harness code.
Most "build an eval harness" guides start at the wrong end. They open with YAML config, a runner library, and a metrics dashboard, then leave the actual eval set as an exercise for the reader. But the runner is the easy 20%. The hard part, the part that decides whether your evals catch anything, is designing what to test and how to score it. If you want to build an LLM eval harness that earns its keep, the eval set comes first and the code comes second.
This is a prompt-first approach. The way to build an LLM eval harness that holds up is to design the set before the runner: a prompt samples the tasks worth testing, writes pass/fail criteria with real anchors, and defines how a fresh failure becomes a permanent regression case. Then you point promptfoo, EleutherAI's lm-evaluation-harness, or a fifteen-line script at the set you designed.
That ordering is the whole argument. Get it backwards and you ship a harness that runs fast and measures nothing.
Why most attempts to build an LLM eval harness test the wrong things
The framework essays make this look like an infrastructure problem. The 12-metric framework in a Towards Data Science piece is thoughtful but code-heavy, and the promptfoo walkthrough on Dev.to assumes you already know which 184 prompts to test and what "good" looks like. Academic tooling like EleutherAI's lm-evaluation-harness ships dozens of benchmarks, but those are general-purpose; they don't know your product's failure modes.
None of them designs the eval set for your job. That's the gap. A benchmark you didn't design tests something adjacent to what you ship. The result feels rigorous and catches nothing real.
Here's the stance worth defending: a small eval set you designed by hand, scored against criteria you can defend, beats a big borrowed benchmark every time. Twenty tasks that mirror your actual usage tell you more than two thousand that don't.
What you can do with this prompt
- Turn a description of your agent's job into a representative set of test tasks.
- Generate pass/fail scoring criteria with concrete anchors, not vague "quality" scores.
- Surface edge cases and known failure modes you'd forget to test.
- Convert a production failure into a regression case you keep forever.
- Produce an eval set in a format any runner can consume.
- Decide which metrics actually matter for this agent before writing config.
Anatomy of the prompt
The prompt takes your agent's job and constraints, then emits a structured eval set you can hand to a runner.
Variables:
{{agent_job}} – what the agent is supposed to do
{{example_inputs}} – 3-5 real inputs the agent sees
{{failure_examples}} – known bad outputs, if any
{{scoring_axes}} – correctness, scope, side-effects, etc.
Prompt:
Role: eval engineer designing a test set for a coding agent.
Task: produce the eval SET, not harness code.
Sample tasks across the input distribution + edge cases.
Output contract (restate on the final line):
For each eval case:
- id
- input
- expected behavior (not exact text)
- pass criteria (binary, anchored)
- failure modes to watch
Plus: a rule for promoting new failures to regression cases.
The {{failure_examples}} variable is what makes this an agent eval harness prompt rather than a generic test generator. Real failures teach the eval set where the model actually breaks.
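To make the output contract concrete, here is a minimal sketch of one eval case as a Python dataclass. The field names follow the contract above; the `is_regression` flag, the `validate_case` helper, and the sample case are hypothetical additions for illustration, not part of the prompt itself.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One eval case, mirroring the fields in the output contract."""
    id: str
    input: str
    expected_behavior: str       # a description of behavior, not exact text
    pass_criteria: list[str]     # each criterion should be binary and anchored
    failure_modes: list[str] = field(default_factory=list)
    is_regression: bool = False  # hypothetical flag for promoted failures


def validate_case(case: EvalCase) -> list[str]:
    """Return a list of problems; an empty list means the case is usable."""
    problems = []
    if not case.id.strip():
        problems.append("missing id")
    if not case.expected_behavior.strip():
        problems.append("missing expected behavior")
    if not case.pass_criteria:
        problems.append("no pass criteria")
    # An empty `input` is allowed on purpose: empty input is a valid edge case.
    return problems


example = EvalCase(
    id="refactor-001",
    input="Rename helper `_fmt` to `_format` in utils.py",
    expected_behavior="Renames the helper and updates every call site",
    pass_criteria=["existing tests pass", "public signatures unchanged"],
    failure_modes=["edits files outside the stated scope"],
)
print(validate_case(example))  # prints []
```

Validating the model's output against a fixed shape like this is one way to catch a case that came back without pass criteria before it reaches a runner.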
Step-by-step usage
1. Describe the job, not the model
Fill {{agent_job}} with what success looks like for a real user: "refactor a function while preserving its public signature and passing existing tests." Specific jobs produce specific evals. Vague jobs produce vague ones.
2. Paste real inputs
{{example_inputs}} should be three to five inputs the agent genuinely sees, copied from logs if you have them. The model uses these to infer the input distribution and sample around it.
3. Add any known failures
If you've watched the agent break, paste those cases into {{failure_examples}}. The prompt turns each into a pass/fail eval case so the same bug can't slip back.
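A regression-promotion rule can be as small as the sketch below. It assumes cases are stored as plain dicts and that a failure is identified by its input text; the `promote_failure` name, the id scheme, and the dict keys are illustrative choices, not something the prompt prescribes.

```python
import hashlib


def promote_failure(cases, failing_input, observed_output, pass_criterion):
    """Append a regression case for a failure unless that input is already covered.

    Returns True if a new case was added, False if it was a duplicate.
    """
    # Derive a stable id from the input so the same failure is never added twice.
    digest = hashlib.sha256(failing_input.encode("utf-8")).hexdigest()[:8]
    case_id = f"regression-{digest}"
    if any(case["id"] == case_id for case in cases):
        return False
    cases.append({
        "id": case_id,
        "input": failing_input,
        "expected_behavior": "does not repeat the observed failure",
        "pass_criteria": [pass_criterion],
        "failure_modes": [observed_output],  # keep the bad output for reference
        "is_regression": True,
    })
    return True


eval_cases = []
added = promote_failure(
    eval_cases,
    failing_input="Refactor parse() in loader.py",
    observed_output="changed the public signature of parse()",
    pass_criterion="public signature of parse() is unchanged",
)
print(added, len(eval_cases))  # prints True 1
```

The design choice worth copying is the deduplication: a promoted case should be permanent, so adding the same failure twice should be a no-op rather than a second copy.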
4. Run it and read the criteria critically
The output is an eval set, not gospel. Read each pass criterion and ask: is this binary and anchored, or is it secretly subjective? Rewrite any that say "the output is good." Good LLM eval set design has no judgment calls hiding in the rubric.
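A crude first pass at that critical read can be automated. The sketch below flags criteria containing words that usually signal a judgment call; the word list is an assumed starting point, and a clean result does not prove a criterion is anchored, so a human still reads the rubric.

```python
import re

# Assumed starter list of words that tend to hide a judgment call; extend as needed.
VAGUE_TERMS = ("good", "quality", "reasonable", "appropriate", "helpful", "accurate")
_VAGUE_RE = re.compile(r"\b(" + "|".join(VAGUE_TERMS) + r")\b", re.IGNORECASE)


def flag_vague(criteria):
    """Return (criterion, matched_terms) pairs for criteria that need a rewrite."""
    flagged = []
    for criterion in criteria:
        hits = sorted({match.lower() for match in _VAGUE_RE.findall(criterion)})
        if hits:
            flagged.append((criterion, hits))
    return flagged


criteria = [
    "The output is good.",
    "The summary is accurate and helpful.",
    "Names all three entities from the input and adds no entity not present.",
]
for criterion, hits in flag_vague(criteria):
    print(f"REWRITE ({', '.join(hits)}): {criterion}")
```

With the sample list, the first two criteria are flagged and the third passes through untouched.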
5. Feed the set to a runner
Now pick promptfoo, a notebook, or a script. The set you designed slots in. You skipped the trap of building infrastructure around criteria you hadn't thought through.
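For a sense of how small the script option can be, here is a minimal runner sketch. It assumes each case carries a Python callable under a hypothetical `check` key that encodes its pass criterion, and it uses a stand-in `echo_agent` where a real model call would go.

```python
def run_evals(cases, agent):
    """Run each case through `agent` and apply its check; return per-case results."""
    results = []
    for case in cases:
        try:
            output = agent(case["input"])
            passed = bool(case["check"](case["input"], output))
        except Exception as exc:  # a crash counts as a failure, not a skipped case
            output, passed = f"error: {exc}", False
        results.append({"id": case["id"], "passed": passed, "output": output})
    return results


def echo_agent(text):
    """Stand-in agent for illustration only; swap in a real model call."""
    return text.upper()


cases = [
    {"id": "upper-001", "input": "hello", "check": lambda inp, out: out == "HELLO"},
    {"id": "empty-001", "input": "", "check": lambda inp, out: out == ""},
    {"id": "fail-001", "input": "abc", "check": lambda inp, out: out == "abc"},
]
results = run_evals(cases, echo_agent)
failed = [r["id"] for r in results if not r["passed"]]
print(f"{len(results) - len(failed)}/{len(results)} passed; failed: {failed}")
# prints: 2/3 passed; failed: ['fail-001']
```

Everything that matters lives in `cases`. The loop is the replaceable part, which is why it can stay this short.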
Prompt-craft patterns for eval design
Anchor every criterion. A pass criterion that says "the summary is accurate" is unscorable. One that says "names all three entities from the input and adds no entity not present" can be checked by a human or an LLM judge in seconds. Anchored eval scoring criteria are the difference between a real harness and theater.
For each pass criterion, write it so two reviewers would
agree on PASS/FAIL without discussion. If they might
disagree, the criterion is too vague. Rewrite it.
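The entity criterion above translates almost directly into a check. This sketch assumes entity detection by case-insensitive substring match against a known vocabulary, which is a deliberately crude stand-in for a real entity extractor; the function name and the sample company names are hypothetical.

```python
def entities_criterion(summary, expected_entities, known_entities):
    """PASS only if every expected entity is named and no other known entity appears."""
    text = summary.lower()
    missing = [e for e in expected_entities if e.lower() not in text]
    extras = [
        e for e in known_entities
        if e not in expected_entities and e.lower() in text
    ]
    return {"passed": not missing and not extras, "missing": missing, "extras": extras}


expected = ["Acme", "Globex", "Initech"]
known = expected + ["Hooli"]  # vocabulary of entities the check can recognize

ok = entities_criterion("Acme sued Globex over the Initech contract.", expected, known)
bad = entities_criterion("Acme and Globex partnered with Hooli.", expected, known)
print(ok["passed"], bad["passed"], bad["missing"], bad["extras"])
# prints: True False ['Initech'] ['Hooli']
```

Two reviewers running this get the same answer, and a failing result says which half of the criterion broke.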
Sample the distribution plus the edges. Models fail at the boundaries: empty input, the longest realistic input, the malformed case. Tell the prompt to allocate cases to both the common path and the edges, so the regression cases from failures land where the agent actually breaks.
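If you want the split between common path and edges to be explicit rather than left to the model, a small allocation helper is one option. The 40% edge share and the `allocate_cases` name are assumptions for illustration; pick a share that matches where your agent has actually failed.

```python
def allocate_cases(total, edge_kinds, edge_share=0.4):
    """Split `total` cases between the common path and the named edge kinds."""
    if not edge_kinds:
        return {"common": total}
    if total < len(edge_kinds) + 1:
        raise ValueError("need at least one case per edge kind plus one common case")
    # Give every edge kind at least one case, and keep at least one common case.
    edge_total = max(len(edge_kinds), round(total * edge_share))
    edge_total = min(edge_total, total - 1)
    base, extra = divmod(edge_total, len(edge_kinds))
    plan = {"common": total - edge_total}
    for index, kind in enumerate(edge_kinds):
        plan[kind] = base + (1 if index < extra else 0)
    return plan


plan = allocate_cases(20, ["empty", "longest", "malformed"])
print(plan)
# prints: {'common': 12, 'empty': 3, 'longest': 3, 'malformed': 2}
```

The resulting counts can be pasted into the prompt as an instruction, so the generated set cannot quietly skip the boundaries.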
Restate the output contract last. On long inputs, GPT-4o drifts toward prose unless the case schema is repeated near the end of the prompt. Claude holds the schema better but still benefits from a ## Output format heading. Put the contract at the close, not the open.
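One way to enforce that ordering is to assemble the prompt in code so the contract cannot end up anywhere but the end. This is a sketch: the `assemble_prompt` helper is hypothetical, and the contract wording is condensed from the prompt shown earlier.

```python
OUTPUT_CONTRACT = """## Output format
For each eval case, return: id, input, expected behavior (not exact text),
pass criteria (binary, anchored), failure modes to watch.
Then state the rule for promoting new failures to regression cases."""


def assemble_prompt(instructions, long_context, contract=OUTPUT_CONTRACT):
    """Put the long material in the middle and the output contract at the very end."""
    parts = [instructions.strip(), long_context.strip(), contract.strip()]
    return "\n\n".join(part for part in parts if part)


prompt = assemble_prompt(
    instructions="Role: eval engineer designing a test set for a coding agent.",
    long_context="Example inputs:\n1. ...\n2. ...\n3. ...",
)
print(prompt.splitlines()[-1])
```

However long the pasted inputs grow, the last thing the model reads is still the schema.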
Teams that swap runners every year keep their eval sets the whole time. The set encodes what you've learned about how the agent fails. The runner is plumbing. So invest the careful thinking in the cases and criteria, and treat the harness code as replaceable. A borrowed benchmark inverts this, and you end up maintaining infrastructure for tests that never mapped to your product.
Variables you'll set
| Variable | Required | What it is |
|---|---|---|
| {{agent_job}} | Yes | What the agent is supposed to accomplish |
| {{example_inputs}} | Yes | 3–5 real inputs from actual usage |
| {{failure_examples}} | No | Known bad outputs to convert to regression cases |
| {{scoring_axes}} | Yes | The dimensions you score on |
| {{eval_count}} | No | How many cases to generate |
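Filling the variables can be done by hand, but a small renderer catches a forgotten required field before the prompt is sent. The sketch below takes the required and optional split from the table above; the default values for the optional variables and the `render` helper are assumptions, not something the prompt defines.

```python
import re

REQUIRED = {"agent_job", "example_inputs", "scoring_axes"}
# Assumed defaults for the optional variables; adjust to taste.
OPTIONAL_DEFAULTS = {"failure_examples": "none provided", "eval_count": "20"}
_PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render(template, values):
    """Substitute {{name}} placeholders, failing loudly on missing required values."""
    missing = sorted(
        name for name in REQUIRED if not str(values.get(name, "")).strip()
    )
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")
    merged = {**OPTIONAL_DEFAULTS, **values}

    def substitute(match):
        name = match.group(1)
        if name not in merged:
            raise KeyError(f"unknown placeholder: {name}")
        return str(merged[name])

    return _PLACEHOLDER.sub(substitute, template)


values = {
    "agent_job": "refactor a function while preserving its public signature",
    "example_inputs": "three inputs copied from logs",
    "scoring_axes": "correctness, scope, side-effects",
}
text = render("Job: {{agent_job}}\nKnown failures: {{failure_examples}}", values)
print(text)
```

Here `{{failure_examples}}` falls back to its assumed default because it is optional, while leaving out `{{agent_job}}` would raise before any model call is made.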
Getting started
- Write the {{agent_job}} in one concrete sentence.
- Pull three to five real inputs into {{example_inputs}}.
- Add any known failures you've seen.
- Run the prompt and read every pass criterion for hidden subjectivity.
- Rewrite the vague ones until two reviewers would agree.
- Add the regression-promotion rule to your team's process.
- Feed the set to a runner. The Agent Eval Harness Builder playbook walks the full design pass with the case schema and the regression-promotion rule already structured.
Eval design rarely stops at the set. Once you have cases, you need a way to score open-ended outputs, which is where an LLM Eval System Design playbook picks up, and a way to score a coding agent's PRs pass/fail, which the Agent Output Verification Rubric handles.
The Agent Eval Harness Builder does this end-to-end: a {{failure_examples}} variable turns real breaks into regression cases, and the output contract locks each eval case to an anchored, binary pass criterion so your harness measures something defensible. It's part of The Complete AI Prompts Bundle, a one-time lifetime license to the whole catalog plus every pack added later, worth it if you run more than one eval-design or agent-ops job.
The judging side of evals deserves its own careful prompt, covered in the LLM-as-a-judge grader template. And once your set exists, you'll want to stop it from silently drifting as prompts change, which is exactly what prompt regression testing is for.
Browse the agent-ops prompt packs →
Common questions
How do I build an LLM eval harness from scratch?
Do I need promptfoo or a framework to evaluate prompts?
What goes into a good LLM eval set?