
Reduce AI Agent Token Cost With a Prompt That Profiles the Transcript

Reduce AI agent token cost with a prompt that profiles a transcript per step and returns the exact context-trimming and model-routing edits to make. Copy it.

PromptsCart Team · July 1, 2026 · Updated July 1, 2026 · 7 min read

The bill arrives and the agent worked fine, so nobody looks closely. Then it doubles. The instinct is to switch to a cheaper model everywhere, which tanks quality on the steps that needed the smart one. To reduce AI agent token cost without breaking the agent, you profile the transcript first and cut where the spend actually is, not where you guess it is.

Every explainer tells you why agents cost money. Almost none gives you a prompt that reads your transcript and returns the specific edits to make. That's the gap this post fills: a profiling prompt that breaks a run down step by step, attributes tokens and cost to each, and names the context-trimming and model-routing changes worth making.

Profiling beats guessing. Always. The expensive step is rarely the one you'd bet on.

Why efforts to reduce AI agent token cost miss the real spend

The cost isn't linear. Most agents re-send the full conversation plus tool output on every step, so by turn twenty the model is re-reading turn one for the twentieth time. Augment's explainer on the agent loop calls out this roughly quadratic growth, fast.io's optimization-tactics piece lists the levers, and Morph's five-lever cost piece walks the categories. All correct on the theory. None hands you a prompt that profiles your transcript and tells you which step to fix first.

That's the difference between reading about cost and reducing it. A tactics list doesn't know that 60% of your spend is one tool that dumps a 40KB JSON blob into context every call.

Here's the take: re-sending full context every step is the default and it's almost always wrong past a few turns. The biggest single win in most agents isn't a cheaper model. It's carrying less forward. Summarize the old turns, drop the stale tool output, and the quadratic curve flattens.
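Here is a minimal sketch of that trimming policy in Python. It assumes a hypothetical Turn record and a summarize callable you supply (in practice, probably a call to a cheap model). Everything older than the last keep_last turns is folded into one summary turn, and old tool output is dropped on the assumption it has already been used. It's one way to carry less forward, not the only one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    role: str     # hypothetical roles: "user", "assistant", or "tool"
    content: str


def trim_context(
    turns: list[Turn],
    keep_last: int,
    summarize: Callable[[list[Turn]], str],
) -> list[Turn]:
    """Return the context to send on the next step.

    The last `keep_last` turns are carried verbatim. Older tool output is
    dropped (assumed already consumed), and older user/assistant turns are
    collapsed into a single summary turn produced by `summarize`.
    """
    split = max(len(turns) - keep_last, 0)
    old, recent = turns[:split], turns[split:]
    if not old:
        return list(recent)
    conversational = [t for t in old if t.role != "tool"]
    summary = Turn("assistant", "Summary of earlier steps: " + summarize(conversational))
    return [summary] + list(recent)


history = [
    Turn("user", "Find overdue invoices"),
    Turn("assistant", "Calling the invoices tool"),
    Turn("tool", '{"invoices": [{"id": 1}, {"id": 2}]}'),  # bulky output, already used
    Turn("assistant", "Found 2 overdue; drafting reminders"),
    Turn("user", "Send them"),
]
# Stub summarizer for illustration; a real one would call a model.
trimmed = trim_context(history, keep_last=2, summarize=lambda ts: f"{len(ts)} earlier turns")
```

Choosing keep_last is the real decision: too small and the agent loses the thread, too large and the curve bends back up. Re-profile after changing it.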

What you can do with this prompt

  • Break an agent run into steps with token and cost attributed to each.
  • Spot the one or two steps eating most of the budget.
  • Get specific context-trimming edits, not generic advice to "use less context."
  • Identify steps safe to route to a cheaper model without quality loss.
  • Catch redundant tool output that's re-sent every turn.
  • Estimate the savings before you change a line of the agent.

Anatomy of the prompt

The prompt takes a transcript with token counts, profiles each step, then returns ranked edits.

Variables:
  {{transcript}}      – the agent run, step by step
  {{token_counts}}    – tokens per step if available
  {{model_pricing}}   – per-model input/output rates
  {{quality_critical_steps}} – steps that must stay sharp

Prompt:
  Role: cost engineer profiling an agent transcript.
  Task: attribute tokens/cost per step, then rank edits
  by savings. Never recommend cutting quality-critical steps.

Output contract (restate on the final line):
  Per step: tokens_in, tokens_out, est_cost, % of total
  Then: ranked edits:
    - edit (trim context / route model / drop tool output)
    - est_savings
    - quality_risk: LOW | MEDIUM | HIGH
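If you fill the template programmatically, a small renderer keeps the required variables honest. This is a hedged sketch: render_prompt, REQUIRED_VARS, and the placeholder text for missing optional values are illustrative names, and the required/optional split mirrors the Required column of the variables table in this post.

```python
import re

# Required vs optional follows the "Variables you'll set" table.
REQUIRED_VARS = {"transcript", "quality_critical_steps"}


def render_prompt(template: str, values: dict[str, str]) -> str:
    """Substitute {{name}} placeholders; fail fast if a required one is missing."""
    missing = REQUIRED_VARS - set(values)
    if missing:
        raise ValueError(f"missing required variables: {sorted(missing)}")
    # Optional variables that weren't supplied are stated explicitly, so the
    # profiler knows to estimate rather than invent numbers.
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: values.get(m.group(1), "(not provided; estimate)"),
        template,
    )


template = (
    "Role: cost engineer profiling an agent transcript.\n"
    "Transcript:\n{{transcript}}\n"
    "Token counts: {{token_counts}}\n"
    "Model pricing: {{model_pricing}}\n"
    "Quality-critical steps: {{quality_critical_steps}}"
)
prompt = render_prompt(template, {
    "transcript": "step 1: plan\nstep 2: call search tool",
    "quality_critical_steps": "step 1",
})
```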

The {{quality_critical_steps}} variable is the guardrail. It stops the profiler from "saving" money by routing the one reasoning step that actually needs the frontier model.
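The prompt states the guardrail, but a model can still slip. If you parse the profiler's output into records, re-checking it in code is cheap insurance. A sketch, assuming a hypothetical Edit record shaped like the output contract. It only blocks model-routing edits on protected steps, matching "must stay on the strong model"; tighten it if you also want context trims kept off those steps.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    step: str
    action: str          # "trim context", "route model", or "drop tool output"
    est_savings: float   # dollars per run
    quality_risk: str    # "LOW", "MEDIUM", or "HIGH"


def enforce_guardrail(edits: list[Edit], quality_critical_steps: set[str]) -> list[Edit]:
    """Drop any model-routing edit that targets a protected step."""
    return [
        e for e in edits
        if not (e.step in quality_critical_steps and e.action == "route model")
    ]


proposed = [
    Edit("plan", "route model", 0.20, "MEDIUM"),     # would downgrade the reasoning step
    Edit("plan", "trim context", 0.10, "LOW"),
    Edit("classify", "route model", 0.05, "LOW"),
]
kept = enforce_guardrail(proposed, quality_critical_steps={"plan"})
```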

Step-by-step usage

1. Capture a real transcript

Fill {{transcript}} with an actual run, every step including tool calls and their output. A synthetic transcript profiles a fiction. The whole value is in seeing where real runs bloat.

2. Add token counts if you have them

{{token_counts}} makes the profile exact. If your platform logs tokens per call, paste them. If not, the prompt estimates, which is rough but still ranks the steps correctly most of the time.
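When your platform doesn't log tokens, a character-based estimate is usually enough to rank steps, because ranking only needs relative sizes. A minimal sketch, assuming the common rule of thumb of roughly four characters per token for English text; estimate_tokens is a hypothetical helper, not a tokenizer.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Approximate token count from character length.

    ~4 characters per token is a rule of thumb for English prose. JSON, code,
    and non-English text often need more tokens per character, so this tends
    to undercount them. Use your provider's tokenizer for exact numbers.
    """
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


steps = {
    "plan": "Decide which tools to call for the refund request.",
    "search_tool_output": '{"orders": []}' * 500,  # stand-in for a bulky tool dump
}
estimates = {name: estimate_tokens(body) for name, body in steps.items()}
```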

3. Mark the steps that can't be cheapened

{{quality_critical_steps}} protects the hard reasoning. List the steps where a smaller model would visibly hurt output. Everything else is fair game for routing.

4. Read the per-step breakdown first

Before the edits, read the attribution. The "% of total" column usually surprises people. One tool's output, or one over-stuffed system prompt re-sent every turn, tends to dominate. That's your target.
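The attribution itself is simple arithmetic once you have token counts and rates. A sketch, assuming pricing is quoted in dollars per million input and output tokens; StepUsage, attribute_cost, and the rates below are placeholders, not any provider's real prices.

```python
from dataclasses import dataclass


@dataclass
class StepUsage:
    name: str
    model: str
    tokens_in: int
    tokens_out: int


def attribute_cost(
    steps: list[StepUsage],
    pricing: dict[str, tuple[float, float]],
) -> list[tuple[str, float, float]]:
    """Return (name, est_cost, pct_of_total) rows, most expensive first.

    pricing maps model -> (input $ per 1M tokens, output $ per 1M tokens).
    """
    rows = []
    for s in steps:
        rate_in, rate_out = pricing[s.model]
        cost = (s.tokens_in * rate_in + s.tokens_out * rate_out) / 1_000_000
        rows.append((s.name, cost))
    total = sum(cost for _, cost in rows)
    rows.sort(key=lambda r: r[1], reverse=True)
    return [(name, cost, 100 * cost / total if total else 0.0) for name, cost in rows]


# Placeholder rates for illustration only; substitute your provider's pricing.
pricing = {"large": (2.0, 10.0), "small": (0.5, 1.0)}
report = attribute_cost(
    [StepUsage("plan", "large", 100_000, 10_000), StepUsage("fetch", "small", 200_000, 0)],
    pricing,
)
```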

5. Apply edits in ranked order

The ranked list puts the biggest savings with the lowest quality risk first. Apply the LOW-risk context trims before touching model routing, then re-profile to confirm the curve actually flattened. The fastest way to reduce AI agent token cost is to act on the top one or two rows, not to spread thin edits across every step.
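One reasonable way to express that ordering in code is to sort on risk first and savings second. A sketch with hypothetical edit records: the dictionary keys follow the output contract, and the edits themselves are made-up examples.

```python
RISK_ORDER = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


def rank_edits(edits: list[dict]) -> list[dict]:
    """Lowest quality risk first; within a risk level, biggest savings first."""
    return sorted(edits, key=lambda e: (RISK_ORDER[e["quality_risk"]], -e["est_savings"]))


edits = [
    {"edit": "route summarizer to small model", "est_savings": 0.40, "quality_risk": "MEDIUM"},
    {"edit": "drop search JSON after step 3", "est_savings": 0.90, "quality_risk": "LOW"},
    {"edit": "summarize turns older than 5", "est_savings": 0.30, "quality_risk": "LOW"},
]
top_two = rank_edits(edits)[:2]  # act on these, then re-profile
```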

Cost-craft patterns

Trim what you carry, not what you generate. The advice everyone gives is to switch models, but the bigger lever is usually context. Summarize old turns into a few lines and drop verbose tool output once it's been used, so each step carries less forward.

For each step, check what context is re-sent from prior
steps. Flag any tool output > 1KB re-sent unchanged across
3+ steps as a context-trimming candidate.
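The same check can run over your logs before you ever call the profiler. A minimal sketch, assuming you can recover which tool-output blocks went into each step's context; it treats 1KB as 1,024 bytes and uses a SHA-256 digest to spot byte-identical re-sends. find_trim_candidates is a hypothetical helper.

```python
import hashlib
from collections import defaultdict


def find_trim_candidates(
    step_contexts: list[list[str]],
    min_bytes: int = 1024,
    min_steps: int = 3,
) -> list[tuple[str, int, list[int]]]:
    """Flag tool output over min_bytes re-sent unchanged in >= min_steps steps.

    step_contexts[i] holds the tool-output blocks included in the context
    sent at step i. Returns (sha256 digest, size in bytes, step indices).
    """
    steps_by_digest: dict[str, set[int]] = defaultdict(set)
    size_by_digest: dict[str, int] = {}
    for i, blocks in enumerate(step_contexts):
        for block in blocks:
            data = block.encode("utf-8")
            if len(data) <= min_bytes:
                continue  # small outputs aren't worth trimming
            digest = hashlib.sha256(data).hexdigest()  # identical bytes -> same digest
            steps_by_digest[digest].add(i)
            size_by_digest[digest] = len(data)
    return [
        (digest, size_by_digest[digest], sorted(steps))
        for digest, steps in steps_by_digest.items()
        if len(steps) >= min_steps
    ]


search_json = '{"results": "' + "x" * 2000 + '"}'  # a ~2KB tool dump
candidates = find_trim_candidates([[search_json], [search_json, "ok"], [search_json], ["ok"]])
```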

Route by step difficulty, not by default. Not every step needs the frontier model. Classification and routing steps run fine on a smaller one. Match the model to the step's difficulty rather than paying top rate for the whole loop.

For each step, judge whether a smaller model would produce
an equivalent result. Mark routing candidates LOW risk only
if the step is classification, extraction, or formatting.
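Written as a rule, that check is a few lines. A sketch with hypothetical step kinds; treating every other kind as MEDIUM is an assumption, a conservative default until an eval shows the smaller model holds up.

```python
LOW_RISK_KINDS = {"classification", "extraction", "formatting"}


def routing_risk(step_kind: str, quality_critical: bool) -> str:
    """Risk of routing a step to a smaller model.

    Protected steps are HIGH (they stay on the strong model); only
    classification, extraction, and formatting qualify as LOW; anything
    else defaults to MEDIUM.
    """
    if quality_critical:
        return "HIGH"
    if step_kind in LOW_RISK_KINDS:
        return "LOW"
    return "MEDIUM"


plan = {
    "classify_intent": routing_risk("classification", quality_critical=False),
    "draft_reply": routing_risk("generation", quality_critical=False),
    "root_cause_analysis": routing_risk("reasoning", quality_critical=True),
}
```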

Restate the contract last. On a long transcript, GPT-4o loses the per-step table structure unless the schema is repeated near the end; Claude holds it longer but still benefits from a ## Output format heading. Put the breakdown schema at the close so the profile stays structured across a big paste.
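A sketch of that layout, assuming you assemble the prompt yourself: the schema goes under an ## Output format heading near the top and is repeated as the final section, after the transcript. assemble_prompt is a hypothetical helper.

```python
def assemble_prompt(instructions: str, transcript: str, schema: str) -> str:
    """State the schema up front, then restate it as the very last section,
    so it sits next to where generation starts even after a long paste."""
    return "\n\n".join([
        instructions,
        "## Output format\n" + schema,
        "## Transcript\n" + transcript,
        "## Output format\n" + schema,
    ])


schema = "Per step: tokens_in, tokens_out, est_cost, % of total"
prompt = assemble_prompt(
    "Role: cost engineer profiling an agent transcript.",
    "\n".join(f"step {i}: tool call output" for i in range(1, 201)),
    schema,
)
```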

The cheapest token is the one you don't re-send

Switching models is the visible lever, so it gets the attention. The quiet one is context that grows every turn. An agent that carries its full history forward pays for the first message hundreds of times in a long run. Summarizing old turns and dropping used tool output usually beats a model downgrade, and it doesn't touch quality on the steps that matter. Profile before you assume the model is the problem.
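To put numbers on it, here is a small Python sketch of the cumulative input bill. It assumes each step adds a fixed number of new tokens and ignores system prompts, output tokens, and any prompt-caching discounts. With full carry-forward, step i pays for every step before it, so the first step's tokens are billed once per step.

```python
def cumulative_input_tokens(new_tokens_per_step: list[int], carry_forward: bool = True) -> int:
    """Total input tokens billed across a run.

    With carry_forward=True every step re-sends everything before it, so
    step i pays for steps 0..i. With carry_forward=False each step sends
    only its own new tokens (an idealized lower bound).
    """
    total = 0
    context = 0
    for added in new_tokens_per_step:
        context = context + added if carry_forward else added
        total += context
    return total


run = [1_000] * 20  # 20 steps, each adding ~1,000 new tokens
full = cumulative_input_tokens(run)                          # 210,000
minimal = cumulative_input_tokens(run, carry_forward=False)  # 20,000
```

Twenty 1,000-token steps cost 210,000 input tokens with full carry-forward versus 20,000 if each step sent only its new tokens, and the gap grows quadratically as the run gets longer.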

Variables you'll set

  Variable                     Required   What it is
  {{transcript}}               Yes        A real agent run, step by step
  {{token_counts}}             No         Tokens per step if your platform logs them
  {{model_pricing}}            No         Per-model input/output rates for cost math
  {{quality_critical_steps}}   Yes        Steps that must stay on the strong model

Getting started

  1. Capture a real transcript, tool calls and all.
  2. Paste token counts if you have them.
  3. Mark the quality-critical steps so they're protected.
  4. Read the per-step attribution before the edits.
  5. Apply LOW-risk context trims first.
  6. Re-profile to confirm the spend actually dropped.
  7. Then consider model routing on the safe steps.

The Token Cost Estimator ships this profiler with the per-step attribution and the ranked, risk-tagged edit list already built.
Get the Token Cost Estimator

Cost work pairs with the rest of running agents in production. Before you trim, you want to know the agent still behaves, which the Agent Output Verification Rubric checks, and the Agent Eval Harness Builder gives you the eval set to confirm a cheaper model didn't quietly degrade quality.

Skip the setup

The Token Cost Estimator does this end-to-end: a {{quality_critical_steps}} variable guards the steps you can't cheapen, and the output contract returns a per-step cost breakdown plus a ranked, risk-tagged edit list so you cut spend where it's safe. It's part of The Complete AI Prompts Bundle, a one-time lifetime license to the whole catalog plus future packs, which adds up if you run more than one agent in production.


If you're deciding whether a reusable profiling prompt beats writing your own each time, how to choose a reusable AI prompt pack lays out the call. And since a cheaper model can quietly change behavior, lock that down with prompt regression testing before you ship the cost cut.

FAQ

Common questions

How do I reduce AI agent token cost?
Profile a real transcript first. Find which steps burn the most tokens, then apply targeted edits: trim the context you re-send each turn, route cheap steps to a smaller model, and cut redundant tool output. Profiling before cutting is what separates real savings from guesswork.
Why do AI agents cost so much per run?
Most agents re-send the full conversation and tool history on every step, so the context grows with each step and the total tokens billed across the run grow roughly quadratically. A long agent loop pays for the same early tokens dozens of times. Trimming what you carry forward is usually the biggest lever.
Should I route some agent steps to a cheaper model?
Often yes. Classification, routing, and short extraction steps rarely need the frontier model. Routing those to a smaller model while keeping the hard reasoning on the expensive one can cut cost a lot without hurting quality, but profile first to see where the spend actually is.