Reduce AI Agent Token Cost With a Prompt That Profiles the Transcript
Reduce AI agent token cost with a prompt that profiles a transcript per step and returns the exact context-trimming and model-routing edits to make. Copy it.
The bill arrives and the agent worked fine, so nobody looks closely. Then it doubles. The instinct is to switch to a cheaper model everywhere, which tanks quality on the steps that needed the smart one. To reduce AI agent token cost without breaking the agent, you profile the transcript first and cut where the spend actually is, not where you guess it is.
Every explainer tells you why agents cost money. Almost none gives you a prompt that reads your transcript and returns the specific edits to make. That's the gap this post fills: a profiling prompt that breaks a run down step by step, attributes tokens and cost to each, and names the context-trimming and model-routing changes worth making.
Profiling beats guessing. Always. The expensive step is rarely the one you'd bet on.
Why efforts to reduce AI agent token cost miss the real spend
The cost isn't linear. Most agents re-send the full conversation plus tool output on every step, so by turn twenty the model is re-reading turns one through nineteen for the twentieth time. Augment's explainer on the agent loop calls out this roughly quadratic growth, fast.io's optimization tactics lists levers, and Morph's five-lever cost piece walks the categories. All correct on the theory. None hands you a prompt that profiles your transcript and tells you which step to fix first.
That's the difference between reading about cost and reducing it. A tactics list doesn't know that 60% of your spend is one tool that dumps a 40KB JSON blob into context every call.
Here's the take: re-sending full context every step is the default and it's almost always wrong past a few turns. The biggest single win in most agents isn't a cheaper model. It's carrying less forward. Summarize the old turns, drop the stale tool output, and the quadratic curve flattens.
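To see the curve rather than take it on faith, here is a minimal sketch of the arithmetic. It assumes every turn adds the same number of tokens and that older turns collapse into one fixed-size summary; the function names and the numbers are illustrative, and the cost of generating the summary itself is ignored.

```python
def total_input_tokens(steps: int, tokens_per_turn: int) -> int:
    """Input tokens billed when every step re-sends all prior turns in full."""
    # Step k sends the k - 1 earlier turns plus its own new turn.
    return sum(k * tokens_per_turn for k in range(1, steps + 1))


def total_with_summary(steps: int, tokens_per_turn: int,
                       summary_tokens: int, keep_last: int) -> int:
    """Input tokens billed when only the last `keep_last` turns go verbatim
    and everything older is replaced by one fixed-size summary."""
    total = 0
    for k in range(1, steps + 1):
        verbatim_turns = min(k, keep_last)
        summary = summary_tokens if k > keep_last else 0
        total += verbatim_turns * tokens_per_turn + summary
    return total


full = total_input_tokens(steps=20, tokens_per_turn=500)
trimmed = total_with_summary(steps=20, tokens_per_turn=500,
                             summary_tokens=300, keep_last=3)
print(f"full history: {full} input tokens")
print(f"summarized:   {trimmed} input tokens")
```

Under these assumptions the full-history run grows with the square of the step count, while the summarized run grows linearly once the window fills.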
What you can do with this prompt
- Break an agent run into steps with token and cost attributed to each.
- Spot the one or two steps eating most of the budget.
- Get specific context-trimming edits, not generic advice to "use less context."
- Identify steps safe to route to a cheaper model without quality loss.
- Catch redundant tool output that's re-sent every turn.
- Estimate the savings before you change a line of the agent.
Anatomy of the prompt
The prompt takes a transcript with token counts, profiles each step, then returns ranked edits.
Variables:
{{transcript}} – the agent run, step by step
{{token_counts}} – tokens per step if available
{{model_pricing}} – per-model input/output rates
{{quality_critical_steps}} – steps that must stay sharp
Prompt:
Role: cost engineer profiling an agent transcript.
Task: attribute tokens/cost per step, then rank edits
by savings. Never recommend cutting quality-critical steps.
Output contract (restate on the final line):
Per step: tokens_in, tokens_out, est_cost, % of total
Then: ranked edits:
- edit (trim context / route model / drop tool output)
- est_savings
- quality_risk: LOW | MEDIUM | HIGH
The {{quality_critical_steps}} variable is the guardrail. It stops the profiler from "saving" money by routing the one reasoning step that actually needs the frontier model.
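If you fill the variables in code rather than by hand, a small renderer is enough. This is a sketch under assumptions: the post defines the variables and the contract but not where each variable sits in the prompt, so the layout of `TEMPLATE`, the `render` helper, and the defaults for the optional variables are all hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

# Hypothetical layout: inputs first, output contract restated at the end.
TEMPLATE = """Role: cost engineer profiling an agent transcript.
Task: attribute tokens/cost per step, then rank edits by savings.
Never recommend cutting quality-critical steps.

Transcript:
{{transcript}}

Token counts:
{{token_counts}}

Model pricing:
{{model_pricing}}

Quality-critical steps:
{{quality_critical_steps}}

Output contract:
Per step: tokens_in, tokens_out, est_cost, % of total
Then ranked edits: edit, est_savings, quality_risk (LOW | MEDIUM | HIGH)
"""

# Assumed fallbacks for the two variables the post treats as optional.
OPTIONAL_DEFAULTS = {
    "token_counts": "not logged, estimate them",
    "model_pricing": "not provided",
}


def render(template: str, values: dict) -> str:
    """Fill {{name}} placeholders; fail loudly if a required one is missing."""
    merged = {**OPTIONAL_DEFAULTS, **values}
    missing = sorted(set(PLACEHOLDER.findall(template)) - set(merged))
    if missing:
        raise KeyError(f"unfilled variables: {missing}")
    return PLACEHOLDER.sub(lambda m: merged[m.group(1)], template)


prompt = render(TEMPLATE, {
    "transcript": "step 1: plan\nstep 2: search_tool\nstep 3: format",
    "quality_critical_steps": "plan",
})
print(prompt)
```

Raising on a missing `{{quality_critical_steps}}` is deliberate in this sketch: a profile run without the guardrail is the one you least want to send by accident.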
Step-by-step usage
1. Capture a real transcript
Fill {{transcript}} with an actual run, every step including tool calls and their output. A synthetic transcript profiles a fiction. The whole value is in seeing where real runs bloat.
2. Add token counts if you have them
{{token_counts}} makes the profile exact. If your platform logs tokens per call, paste them. If not, the prompt estimates, which is rough but still ranks the steps correctly most of the time.
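When only some steps have logged counts, you can fill the gaps yourself before pasting. The sketch below assumes the common rule of thumb of roughly four characters per token for English text, which is an approximation and not a real tokenizer; `estimate_tokens` and `fill_token_counts` are hypothetical helpers.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough estimate only: assumes about 4 characters per token."""
    return math.ceil(len(text) / chars_per_token)


def fill_token_counts(steps: dict, logged: dict) -> dict:
    """Prefer logged counts; fall back to the estimate for unlogged steps.

    steps:  {step_name: text sent in that step}
    logged: {step_name: token count recorded by the platform}
    """
    return {
        name: logged.get(name, estimate_tokens(text))
        for name, text in steps.items()
    }


steps = {"plan": "a" * 40, "search_tool": "b" * 4001}
counts = fill_token_counts(steps, logged={"plan": 12})
print(counts)
```

The estimate will be off in absolute terms, especially for JSON or code, but a step that is ten times larger in characters will still rank far above the rest.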
3. Mark the steps that can't be cheapened
{{quality_critical_steps}} protects the hard reasoning. List the steps where a smaller model would visibly hurt output. Everything else is fair game for routing.
4. Read the per-step breakdown first
Before the edits, read the attribution. The "% of total" column usually surprises people. One tool's output, or one over-stuffed system prompt re-sent every turn, tends to dominate. That's your target.
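The attribution itself is simple arithmetic, and it helps to sanity-check the prompt's numbers against your own. A minimal sketch, assuming per-million-token pricing; the `Step` class, the model names, and the prices are made up for illustration and are not real rates.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    model: str
    tokens_in: int
    tokens_out: int


# Hypothetical prices in USD per million tokens, as (input, output).
PRICING = {"large": (3.00, 15.00), "small": (0.25, 1.25)}


def step_cost(step: Step, pricing: dict) -> float:
    rate_in, rate_out = pricing[step.model]
    return (step.tokens_in * rate_in + step.tokens_out * rate_out) / 1_000_000


def profile(steps: list, pricing: dict) -> list:
    """Return one row per step, most expensive first."""
    costs = [step_cost(s, pricing) for s in steps]
    total = sum(costs)
    rows = [
        {
            "step": s.name,
            "tokens_in": s.tokens_in,
            "tokens_out": s.tokens_out,
            "est_cost": c,
            "pct_of_total": 100 * c / total if total else 0.0,
        }
        for s, c in zip(steps, costs)
    ]
    return sorted(rows, key=lambda r: r["est_cost"], reverse=True)


run = [
    Step("plan", "large", 2_000, 500),
    Step("search_tool", "large", 40_000, 200),
    Step("format", "large", 3_000, 300),
]
for row in profile(run, PRICING):
    print(f"{row['step']:<12} ${row['est_cost']:.4f} {row['pct_of_total']:5.1f}%")
```

In this invented run the tool step dominates purely on input tokens, which is the pattern the "% of total" column tends to expose.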
5. Apply edits in ranked order
The ranked list puts the biggest savings with the lowest quality risk first. Apply the LOW-risk context trims before touching model routing, then re-profile to confirm the curve actually flattened. The fastest way to reduce AI agent token cost is to act on the top one or two rows, not to spread thin edits across every step.
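If you parse the edit list into records, the ordering described above can be reproduced with a sort key. This is one interpretation, not the prompt's own ranking logic: it sorts by risk first, puts context trims ahead of routing within a risk level, then orders by savings. The field names and sample edits are hypothetical.

```python
RISK_ORDER = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


def rank_edits(edits: list) -> list:
    """Lowest risk first, trims before routing, bigger savings first."""
    return sorted(
        edits,
        key=lambda e: (
            RISK_ORDER[e["quality_risk"]],
            e["kind"] == "route_model",  # False sorts before True
            -e["est_savings"],
        ),
    )


edits = [
    {"edit": "route formatter to small model", "kind": "route_model",
     "est_savings": 0.02, "quality_risk": "LOW"},
    {"edit": "drop search_tool JSON after use", "kind": "trim_context",
     "est_savings": 0.09, "quality_risk": "LOW"},
    {"edit": "route planner to small model", "kind": "route_model",
     "est_savings": 0.12, "quality_risk": "HIGH"},
    {"edit": "summarize old turns", "kind": "trim_context",
     "est_savings": 0.03, "quality_risk": "MEDIUM"},
]
for e in rank_edits(edits):
    print(f"{e['quality_risk']:<6} ${e['est_savings']:.2f}  {e['edit']}")
```

Note that the largest saving in the sample lands last because it carries HIGH risk, which matches the advice to act on the safe rows first.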
Cost-craft patterns
Trim what you carry, not what you generate. The advice everyone gives is to switch models. The bigger lever is usually context. Summarize old turns into a few lines and drop verbose tool output once it's been used, so each step carries less forward.
For each step, check what context is re-sent from prior
steps. Flag any tool output > 1KB re-sent unchanged across
3+ steps as a context-trimming candidate.
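The same check can be run mechanically before you hand the transcript to the model. A sketch, assuming you can extract the tool outputs present in each step's context as plain strings; the input shape and the `find_trim_candidates` helper are assumptions, while the thresholds mirror the instruction above.

```python
import hashlib
from collections import defaultdict


def find_trim_candidates(steps: list, min_bytes: int = 1024,
                         min_steps: int = 3) -> list:
    """Flag tool outputs larger than `min_bytes` that appear unchanged
    in the context of at least `min_steps` steps.

    steps: one list per step, holding the tool-output strings in its context.
    """
    seen = defaultdict(set)  # content digest -> indices of steps carrying it
    sizes = {}
    for index, outputs in enumerate(steps):
        for text in outputs:
            raw = text.encode("utf-8")
            if len(raw) <= min_bytes:
                continue
            digest = hashlib.sha256(raw).hexdigest()
            seen[digest].add(index)
            sizes[digest] = len(raw)
    return [
        {"digest": digest[:12], "bytes": sizes[digest], "steps": sorted(indices)}
        for digest, indices in seen.items()
        if len(indices) >= min_steps
    ]


blob = "x" * 2000   # stands in for a large JSON tool result
note = "y" * 100    # small output, below the threshold
context_per_step = [[blob], [blob, note], [blob, note], [note]]
print(find_trim_candidates(context_per_step))
```

Hashing the content means an output only counts as re-sent when it is byte-for-byte identical, so a tool result that was already truncated or summarized is not flagged.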
Route by step difficulty, not by default. Not every step needs the frontier model. Classification and routing steps run fine on a smaller one. Match the model to the step's difficulty rather than paying top rate for the whole loop.
For each step, judge whether a smaller model would produce
an equivalent result. Mark routing candidates LOW risk only
if the step is classification, extraction, or formatting.
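Once the profile has named the routing candidates, the rule is small enough to express directly. A minimal sketch, assuming you label each step with a kind yourself; the model names `large` and `small` and the `route` helper are placeholders, not a real SDK call.

```python
# Step kinds the routing instruction treats as LOW risk.
SAFE_KINDS = {"classification", "extraction", "formatting"}


def route(step_name: str, kind: str, quality_critical: set,
          strong: str = "large", cheap: str = "small") -> str:
    """Pick a model for a step; protected steps always stay on the strong one."""
    if step_name in quality_critical:
        return strong
    return cheap if kind in SAFE_KINDS else strong


protected = {"plan"}
steps = [
    ("plan", "reasoning"),
    ("classify_intent", "classification"),
    ("extract_fields", "extraction"),
    ("final_answer", "formatting"),
]
for name, kind in steps:
    print(f"{name:<16} -> {route(name, kind, protected)}")
```

The protected set is checked before the kind, so a step you marked as quality-critical stays on the strong model even if its kind looks cheap.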
Restate the contract last. On a long transcript, GPT-4o loses the per-step table structure unless the schema is repeated near the end; Claude holds it longer but still benefits from a ## Output format heading. Put the breakdown schema at the close so the profile stays structured across a big paste.
Switching models is the visible lever, so it gets the attention. The quiet one is context that grows every turn. An agent that carries its full history forward pays for the first message again on every step of the run. Summarizing old turns and dropping used tool output usually beats a model downgrade, and it doesn't touch quality on the steps that matter. Profile before you assume the model is the problem.
Variables you'll set
| Variable | Required | What it is |
|---|---|---|
| {{transcript}} | Yes | A real agent run, step by step |
| {{token_counts}} | No | Tokens per step if your platform logs them |
| {{model_pricing}} | No | Per-model input/output rates for cost math |
| {{quality_critical_steps}} | Yes | Steps that must stay on the strong model |
Getting started
- Capture a real transcript, tool calls and all.
- Paste token counts if you have them.
- Mark the quality-critical steps so they're protected.
- Read the per-step attribution before the edits.
- Apply LOW-risk context trims first.
- Re-profile to confirm the spend actually dropped.
- Then consider model routing on the safe steps.
The Token Cost Estimator ships this profiler with the per-step attribution and the ranked, risk-tagged edit list already built.
Cost work pairs with the rest of running agents in production. Before you trim, you want to know the agent still behaves, which the Agent Output Verification Rubric checks, and the Agent Eval Harness Builder gives you the eval set to confirm a cheaper model didn't quietly degrade quality.
The Token Cost Estimator does this end-to-end: a {{quality_critical_steps}} variable guards the steps you can't cheapen, and the output contract returns a per-step cost breakdown plus a ranked, risk-tagged edit list so you cut spend where it's safe. It's part of The Complete AI Prompts Bundle, a one-time lifetime license to the whole catalog plus future packs, which adds up if you run more than one agent in production.
If you're deciding whether a reusable profiling prompt beats writing your own each time, how to choose a reusable AI prompt pack lays out the call. And since a cheaper model can quietly change behavior, lock that down with prompt regression testing before you ship the cost cut.