Write a CI Failure Diagnosis Prompt That Reads Pipeline Logs

Turn a wall of red CI logs into a root-cause verdict. A reusable CI failure diagnosis prompt that separates real failures from flake. Copy the structure.

PromptsCart Team · May 8, 2026 · Updated June 14, 2026 · 7 min read

A failing pipeline drops a few hundred lines of red into your terminal, and somewhere in there is the one stack frame that matters. Finding it is the job. Most of the time you're scrolling past dependency-install noise, retry spam, and a timeout that wasn't the real cause. A good CI failure diagnosis prompt does that scroll for you and hands back the line that broke the build.

This isn't about a clever one-off question. It's about a reusable prompt with a locked output contract, so the diagnosis comes back in the same shape every run and you can route it straight to a ticket. Paste logs, get a verdict. Same fields, every time.

The pages that rank for this query mostly don't do that. They give you a list of prompts to try or a tutorial on wiring a custom tool. Useful once. Not reusable.

What a CI diagnosis prompt actually does

A CI failure diagnosis prompt is a reusable instruction that takes pipeline logs as input and returns a structured root cause: the real error, a failure-type label, and a concrete fix. The value isn't the model's cleverness. It's the contract that makes every run comparable.

Here's what you can hand it:

  • A single failing job's log from GitHub Actions, GitLab CI, or CircleCI
  • A multi-stage pipeline where stage three failed but stage two emitted warnings
  • A test suite log where you can't tell if the failure is real or a timing race
  • A docker build that died on a layer with a misleading exit code
  • A deploy step that failed on a transient registry timeout

The recurring job underneath all of these is the same: read noise, find signal, decide what to do. That's worth packaging once and reusing on every red build.

The anatomy: variables, prompt, output contract

A reusable diagnosis prompt has three parts. The pasted log goes in a variable. The instructions frame the model as a build engineer. The output contract locks the shape.

Variables
  {{pipeline_log}}   — the raw failing-job log
  {{stack}}          — e.g. "GitHub Actions, Node 20, Jest"

Prompt
  Role: You are a senior build engineer triaging a red CI run.
  Task: Read {{pipeline_log}}. Find the first failure that
        caused the run to fail. Ignore retries and downstream noise.

Output contract (return exactly these fields)
  failure_type:   real_failure | infra_flake | test_flake
  evidence_line:  the exact log line that proves the label
  root_cause:     one sentence
  fix:            file-specific, actionable
  confidence:     high | medium | low

Notice the evidence_line field. That's the part that stops the model from confidently inventing a cause. If it has to quote the line, it can't hand-wave.
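The contract only pays off if something checks it. Here is a minimal Python sketch of that check, assuming the model replies with one `field: value` pair per line. The name `parse_verdict` and the line-based reply format are assumptions for illustration; the prompt above does not prescribe them.

```python
# Field names and allowed labels are copied from the output contract above.
REQUIRED = ["failure_type", "evidence_line", "root_cause", "fix", "confidence"]
ALLOWED = {
    "failure_type": {"real_failure", "infra_flake", "test_flake"},
    "confidence": {"high", "medium", "low"},
}


def parse_verdict(reply: str) -> dict[str, str]:
    """Parse a 'field: value' reply and reject anything off-contract."""
    verdict: dict[str, str] = {}
    for line in reply.splitlines():
        # Split on the first colon only, so quoted log lines keep their colons.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in REQUIRED and key not in verdict:
            verdict[key] = value.strip()
    missing = [field for field in REQUIRED if field not in verdict]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for field, allowed in ALLOWED.items():
        if verdict[field] not in allowed:
            raise ValueError(f"{field} must be one of {sorted(allowed)}")
    return verdict
```

A reply that drifts into freeform prose fails loudly here instead of landing in a ticket.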

Why the output contract goes last

Models tend to weight the end of a long prompt more heavily than the start. When you put the contract first and a 400-line {{pipeline_log}} after it, the contract gets buried and the model drifts toward freeform prose. Put the pasted log in the middle and the contract on the final lines. Claude tends to honor a trailing ## Output format heading reliably; GPT-4o often needs the field list restated on the very last line or it starts narrating.
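The ordering is easy to enforce in code. This is a sketch of one way to assemble the prompt so the log sits in the middle and the contract is the last thing the model reads. The `build_prompt` name, the `Stack:` line, and the `<pipeline_log>` delimiters are assumptions, not a required format.

```python
# The contract text mirrors the fields listed in the anatomy section.
CONTRACT = """## Output format
Return exactly these fields, one per line:
failure_type: real_failure | infra_flake | test_flake
evidence_line: the exact log line that proves the label
root_cause: one sentence
fix: file-specific, actionable
confidence: high | medium | low"""


def build_prompt(pipeline_log: str, stack: str) -> str:
    """Assemble role and task first, the log in the middle, the contract last."""
    sections = [
        "Role: You are a senior build engineer triaging a red CI run.",
        f"Stack: {stack}",
        "Task: Read the log below. Find the first failure that caused "
        "the run to fail. Ignore retries and downstream noise.",
        # Hypothetical delimiters so the model can tell log from instructions.
        "<pipeline_log>\n" + pipeline_log.strip() + "\n</pipeline_log>",
        CONTRACT,
    ]
    return "\n\n".join(sections)
```

If a given model still narrates, the last element of `sections` is the place to restate the field list.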

Trim the log before you paste

Don't paste the entire build. Paste the failing step plus roughly 20 lines of context above it. Long logs cost tokens and bury the signal, and models do worse at finding the real error when 90% of the input is npm install chatter. Less context, sharper diagnosis.
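Trimming can be scripted if you'd rather not eyeball it. The sketch below keeps the first line that looks like a failure plus the 20 lines above it. The regex markers are a guess at what a failure looks like, and the few trailing lines kept for stack traces (`tail`) are an addition of this sketch; both are assumptions to tune for your CI.

```python
import re

# Hypothetical failure markers (an assumption); tune them to your own CI output.
FAILURE_PATTERN = re.compile(
    r"\b(error|failed|fatal)\b|exit code [1-9]", re.IGNORECASE
)


def trim_log(log: str, context: int = 20, tail: int = 5) -> str:
    """Keep the first failure-looking line, `context` lines above it, `tail` below."""
    lines = log.splitlines()
    for i, line in enumerate(lines):
        if FAILURE_PATTERN.search(line):
            return "\n".join(lines[max(0, i - context): i + tail + 1])
    # No marker matched: fall back to the last `context` lines of the log.
    return "\n".join(lines[max(0, len(lines) - context):])
```

This is a heuristic. The first match can itself be retry noise, so check the trimmed excerpt before you paste it.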

Step-by-step: from red build to routed ticket

1. Grab the failing job log

Copy the log from the failing step only. Most CI UIs let you expand and copy a single job.

2. Fill the variables

Drop the log into {{pipeline_log}} and name your {{stack}}. The stack hint matters: a Jest flake reads differently from a flaky Cypress run, and the model uses that hint to label test_flake correctly.

3. Run and read the verdict

You get five fields back. Read failure_type first. If it says infra_flake with a registry-timeout evidence line, you re-run. If it says real_failure, you read the fix.
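That read-the-label-first habit can be written down as a small decision function. It's a sketch: the action names are made up for illustration, and the `test_flake` branch is an assumption, since the steps above only spell out the infra-flake and real-failure cases.

```python
def next_action(verdict: dict[str, str]) -> str:
    """Map a verdict's failure_type to the next manual step."""
    failure_type = verdict["failure_type"]
    if failure_type == "infra_flake":
        return "rerun"  # e.g. a transient registry timeout
    if failure_type == "real_failure":
        return "read_fix"  # move on to the fix field
    if failure_type == "test_flake":
        return "rerun_and_track"  # assumption: re-run, and note the flaky test
    raise ValueError(f"unknown failure_type: {failure_type!r}")
```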

4. Route it

Because the output is structured, you can paste it into a GitHub issue or a Slack thread without reshaping it. That's the whole point of the contract.
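As a rough sketch of what "without reshaping" looks like, the function below turns the five fields into a title and a plain-text body you could paste into an issue or a thread. `format_ticket` and the `job_name` parameter are hypothetical; nothing here calls the GitHub or Slack APIs.

```python
# Field order follows the output contract.
FIELDS = ["failure_type", "evidence_line", "root_cause", "fix", "confidence"]


def format_ticket(verdict: dict[str, str], job_name: str) -> tuple[str, str]:
    """Return (title, body) as plain text, fields in contract order."""
    title = f"[{verdict['failure_type']}] {job_name}: {verdict['root_cause']}"
    body = "\n".join(f"{field}: {verdict[field]}" for field in FIELDS)
    return title, body
```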

5. Iterate on misses

If the model mislabels a real failure as flake, add one few-shot example of that exact pattern to the prompt. One good example beats a paragraph of new instructions.
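Here is one way the few-shot example could be slotted in, sketched in Python. The example goes after the instructions and before the new log, and the contract stays on the final lines. The function name, the section labels, and the parameter names are all assumptions.

```python
def build_prompt_with_example(
    instructions: str,
    example_log: str,
    example_verdict: dict[str, str],
    pipeline_log: str,
    contract: str,
) -> str:
    """Place one worked example before the log to diagnose; contract stays last."""
    verdict_text = "\n".join(f"{k}: {v}" for k, v in example_verdict.items())
    example = (
        "Example log:\n" + example_log.strip()
        + "\n\nExample verdict:\n" + verdict_text
    )
    return "\n\n".join(
        [instructions, example, "Log to diagnose:\n" + pipeline_log.strip(), contract]
    )
```

Use a log excerpt from a real miss on your own pipeline as `example_log`, paired with the verdict you wanted to see.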

Prompt-craft patterns that make the difference

Force a label before a fix. Make the model commit to failure_type before it writes the fix. A model that has already decided "this is flake" is far less likely to then propose a code change. Ordering the fields this way guards against contradictory output.

Demand evidence, not assertion. The evidence_line field is a refusal boundary in disguise. If the model can't find a line to quote, that's a signal the log doesn't contain the real failure and you need to widen the context.
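The evidence check is mechanical enough to automate. A minimal sketch, assuming you still have the log text you pasted: confirm the quoted line actually appears in it. `evidence_is_grounded` is a hypothetical helper name.

```python
def evidence_is_grounded(evidence_line: str, pipeline_log: str) -> bool:
    """True if the quoted evidence appears verbatim inside some line of the log."""
    # Models sometimes wrap the quote in quotes or backticks; strip those.
    needle = evidence_line.strip().strip("\"'`")
    if not needle:
        return False
    return any(needle in line for line in pipeline_log.splitlines())
```

A `False` here means the quote was invented or paraphrased. Treat it the same as a missing evidence line and widen the context.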

Separate diagnosis from action. A single prompt that diagnoses is reusable everywhere. A prompt that also opens issues and posts to Slack needs connected tools and belongs in a multi-prompt harness, not a copy-paste box. Keep the read-only diagnosis prompt portable; let the harness handle the side effects.

Variables you'll set

Variable             Required   What it is
{{pipeline_log}}     Yes        The raw log from the failing CI job
{{stack}}            No         Build stack hint, e.g. "GitLab CI, Python 3.12, pytest"
{{recent_changes}}   No         The PR diff or commit range, if you want the model to correlate the failure with recent changes
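If you fill the template from a script rather than by hand, the required/optional split can be enforced. This sketch assumes `{{name}}` placeholders as shown above. The placeholder text used for unset optional variables is an assumption.

```python
import re

REQUIRED = {"pipeline_log"}
# Assumed fallback text for optional variables that were not supplied.
OPTIONAL_DEFAULTS = {"stack": "not specified", "recent_changes": "not provided"}


def fill_template(template: str, values: dict[str, str]) -> str:
    """Substitute {{name}} placeholders; fail if a required variable is missing."""
    missing = REQUIRED - values.keys()
    if missing:
        raise ValueError(f"missing required variables: {sorted(missing)}")
    merged = {**OPTIONAL_DEFAULTS, **values}

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in merged:
            raise KeyError(f"unknown variable: {name}")
        return merged[name]

    # A function replacement avoids backslash-escape surprises in log text.
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)
```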

An opinion worth holding

Most "AI debugs your CI" content sells you a tool build. You don't need one to start. A locked output contract on a plain prompt gets you most of the value with zero infrastructure, and you can run it in whatever chat window you already have open. Wire up the automation later, once you've confirmed the prompt actually labels flake correctly on your stack. Tooling first is backwards. The prompt is the product; the harness is plumbing.

Getting started

  1. Copy the anatomy above into your chat model of choice.
  2. Paste a recent failing log into {{pipeline_log}}.
  3. Name your {{stack}} so the flake detection has context.
  4. Run it and check whether failure_type matches your read.
  5. Add one few-shot example for any pattern it gets wrong.
  6. Once it's reliable, graduate to a harness that posts the verdict automatically.

When you're ready to skip the manual paste, the CI Failure Diagnosis Harness Agent Pack runs this as four connected prompts that read the logs, separate real failures from flake, and post the diagnosis to GitHub and Slack on their own.

Pair this with the Flaky Test Detection Harness Agent Pack when the diagnosis keeps coming back test_flake and you need to quarantine the offenders. For the broader picture on packaging reusable prompts, see how to choose a reusable AI prompt pack and the companion piece on the flaky test detection prompt for CI pipelines.

FAQ

Common questions

What is a CI failure diagnosis prompt?

It's a reusable prompt that takes raw CI/CD pipeline logs as input and returns a structured verdict: the real error, whether it's a real failure, an infra flake, or a test flake, and a file-specific fix. The output contract keeps every diagnosis in the same shape so you can route it to a ticket or a Slack message.

Can ChatGPT or Claude diagnose a broken pipeline from logs?

Yes, if you give the model the failing job's log and a tight output contract. Claude handles long multi-stage logs better when you put the contract last; GPT-4o needs the schema restated on the final line. Paste only the failing step plus roughly 20 lines of context above it, not the whole build.

How do you stop the model from guessing on flaky tests?

Add an explicit classification field (real failure versus infra flake versus test flake) and force the model to cite the log line that justifies the label. A prompt that has to point at evidence guesses far less than one that just emits a fix.