Flaky Test Detection and Remediation

Jun 28, 2026
Paula Hingel

Category-aware pre-merge analysis detects flaky tests by combining repeated execution with historical failure-rate tracking. Differential coverage, order-dependency checks, and pass/fail history expose non-determinism while the failure still carries diagnostic value.

TL;DR

Flaky tests break CI confidence not when they fail, but when they pass on rerun. That same-commit flip normalizes non-determinism rather than resolving it, starting the retry-quarantine-trust-erosion cycle that eventually turns the whole suite into noise. Root-cause-mapped detection (pairing differential coverage, order-dependency checks, and flip-rate history) prevents this from becoming a permanent quarantine.

Flaky test detection protects pre-merge CI confidence by comparing repeated outcomes, execution order, and failure history before non-deterministic failures reach the main branch. The failure mode is familiar: a developer pushes a fix, watches CI fail, clicks rerun, and gets green on the same unchanged commit. At that point, the pipeline has stopped returning a clear answer on whether the code is broken.

Manual reruns hit a wall quickly. A single pipeline investigation averages $5.67 in developer time, and rare flakes can require up to 10,000 reruns to confirm. For pre-merge investigations that span multiple files, Augment Cosmos's Context Engine connects failing tests to codebase-wide dependency context, surfacing the shared-state and async-wait paths that isolated test logs don't expose.

Why Flaky Tests Break CI Confidence Faster Than Real Bugs

A flaky test fails, then passes on the same code revision, so the test result no longer maps cleanly to a defect. That ambiguity is the real damage. Once developers learn they can rerun their way to green, they stop treating red builds as evidence, and the suite loses its ability to catch real regressions.

Four Root Cause Categories Behind Flaky Tests

Tests become flaky when outcomes depend on factors outside the test's control: timing, shared state, thread ordering, or infrastructure differences. These categories behave differently, so a single detection approach won't catch them all. Research across Java and Python suites consistently shows that async wait and concurrency dominate in Java, while order dependency is the leading cause in Python. A detector built for one language profile misses significant categories in another.

| Category | Mechanism | Prevalence Signal | Detection Need | Typical Fix |
| --- | --- | --- | --- | --- |
| Asynchronous Wait | Test asserts before async work completes | 45% of studied flaky tests (Luo et al.) | Timing and ordering checks | Add or adjust waitFor calls |
| Concurrency / Race | Incorrect assumptions about thread ordering | 20% of studied flaky tests (Luo et al.) | Timing and ordering checks | Add locks; make execution deterministic |
| Test Order Dependency | Shared mutable state pollutes later runs | The majority of causes in Python suites | Order permutations and shared-state analysis | Fix setup/teardown; isolate state |
| Environment / Platform | Behavior differs across CI infrastructure | ~28% of Python flaky tests | CI infrastructure comparison | Infrastructure-specific |

Asynchronous Wait

Async wait flakiness happens when a test asserts before the operation it depends on has finished. The test isn't wrong in theory; the ordering it expects is correct. The problem is that the code doesn't enforce that ordering, so the outcome depends on timing. In practice, the fix is usually straightforward: replace a fixed sleep with a waitFor call that blocks until the condition is actually true. The underlying issue is internal to the test logic in the vast majority of cases, not a dependency on external resources.
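A minimal sketch of the sleep-to-wait replacement, in Python. The `wait_for` helper and `start_background_job` are hypothetical names written for this illustration, not the API of any particular test framework, and the timeout and polling interval are assumed values.

```python
import threading
import time


def wait_for(condition, timeout=2.0, interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    # Final check so a condition that became true during the last sleep counts.
    return condition()


def start_background_job(result, delay=0.05):
    """Hypothetical async work: fills `result` after `delay` seconds."""
    def work():
        time.sleep(delay)
        result["status"] = "done"

    thread = threading.Thread(target=work)
    thread.start()
    return thread


def test_job_completes():
    result = {}
    thread = start_background_job(result)
    # Flaky version: time.sleep(0.05) followed by the assert races the worker.
    # Deterministic version: block until the condition actually holds.
    assert wait_for(lambda: result.get("status") == "done")
    thread.join()
```

The test no longer encodes a guess about how long the work takes; it only encodes an upper bound after which the failure is treated as real.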

Concurrency and Race Conditions

Concurrency flakiness occurs when the test assumes a thread ordering that the system doesn't guarantee. The test passes when threads interleave correctly and fails when they don't. One important diagnostic distinction: in roughly a third of concurrency-flake cases, the non-determinism lies in the production code under test, not in the test itself. That changes the fix entirely: patching the test would mask a real defect.
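As a sketch of the case where the race lives in production code, consider a hypothetical `Counter` class with a read-modify-write update. The class, the worker count, and the increment count are all assumptions made for illustration. The fix here goes into the production class rather than the test.

```python
import threading


class Counter:
    """Hypothetical production class with a read-modify-write update."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, two threads can read the same value and one
        # update is lost. Holding the lock makes the update atomic.
        with self._lock:
            current = self.value
            self.value = current + 1


def run_workers(counter, workers=8, increments=1000):
    """Call `increment` from several threads and wait for all of them."""
    def work():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


def test_counter_is_thread_safe():
    assert run_workers(Counter()) == 8 * 1000
```

Loosening the assertion in the test would make the build green while leaving lost updates in production, which is the masking risk described above.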

Test Order Dependency and Shared Mutable State

Order dependency occurs when one test leaves the shared state dirty, and a later test fails as a result. The failing test looks broken in isolation, but if you run it first, it passes. This category is particularly prevalent in Python suites, where research across more than 1,000 projects found that it accounts for the majority of flaky tests. The fix almost always lives in setup and teardown: either the polluting test isn't cleaning up after itself, or the victim isn't initializing its own state cleanly enough to be order-independent.
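The polluter and victim relationship can be sketched with the standard library's `unittest`. `FEATURE_FLAGS`, the test classes, and `run_in_order` are hypothetical names for illustration; the point is the snapshot-and-restore in `setUp` and `tearDown`.

```python
import unittest

# Hypothetical module-level state shared by every test in the process.
FEATURE_FLAGS = {"beta_checkout": False}


class PolluterTest(unittest.TestCase):
    def setUp(self):
        # Snapshot shared state so tearDown can restore it.
        self._saved = dict(FEATURE_FLAGS)

    def tearDown(self):
        # Without this restore, the flag stays enabled for later tests.
        FEATURE_FLAGS.clear()
        FEATURE_FLAGS.update(self._saved)

    def test_beta_checkout_enabled(self):
        FEATURE_FLAGS["beta_checkout"] = True
        self.assertTrue(FEATURE_FLAGS["beta_checkout"])


class VictimTest(unittest.TestCase):
    def test_default_checkout(self):
        # Passes in any order only because the polluter cleans up.
        self.assertFalse(FEATURE_FLAGS["beta_checkout"])


def run_in_order(*cases):
    """Run test classes in the given order and report overall success."""
    suite = unittest.TestSuite()
    loader = unittest.TestLoader()
    for case in cases:
        suite.addTests(loader.loadTestsFromTestCase(case))
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()
```

Calling `run_in_order(PolluterTest, VictimTest)` and `run_in_order(VictimTest, PolluterTest)` should both succeed; removing the `tearDown` restore makes only the first ordering fail.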

Environment and Platform Dependencies

Environmental flakiness occurs when a test passes locally but consistently fails on CI. The cause is usually a genuine difference in infrastructure: Linux kernel version, parallelism settings, ephemeral runners that don't persist state between jobs. Unlike the other categories, this one often requires infrastructure changes rather than test code fixes.

How to Detect Flaky Tests Systematically Before They Erode CI Confidence

No single detection technique covers every category. DeFlaker catches tests that failed without touching changed code (a strong signal of flakiness) but misses order-dependent flakes entirely. iDFlakies finds order dependency by running the suite in different orders, but the cost scales with each permutation. ML-based predictors like FlakeFlagger can flag risky tests before they fail, though they do not confirm anything until a failure is actually observed. Matching the approach to the situation is where most teams go wrong.

| Tool / Technique | Requires Reruns | Detects Order-Dependent | CI Overhead | Key Limitation |
| --- | --- | --- | --- | --- |
| DeFlaker | No | No | 4.6% per run | Misses order-dependent flakes |
| iDFlakies | Yes | Yes | Repeated original-order and modified-order suite executions | Cost grows with each order permutation |
| FlakeFlagger | No | No | Feature collection per run; no extra reruns | Predictive, not failure-confirming |
| FLASH | Yes | No | Convergence-determined | Specific to probabilistic/ML code |
| Retry/flip detection | Yes | Partial | Per-build reruns | Detects only observed fail/pass flips |

Differential Coverage Detection With DeFlaker

DeFlaker works by comparing which code a failing test actually executed against what changed in the current commit. If the test failed but never touched any changed code, the failure is unlikely to be caused by the change, so it's flagged as flaky. This avoids reruns entirely, which makes it fast. In practice, it found 87 previously unknown flaky tests across 10 of the 96 TravisCI projects studied, with a 1.5% false-alarm rate and 4.6% runtime overhead. The hard limitation is structural: because it relies on coverage of changed code, it can't detect tests that are flaky due to order dependency rather than code changes. Pair it with RootFinder to compare logs from passing and failing runs when you need to go deeper.
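The decision rule can be sketched in a few lines of Python. This is an illustration of the differential-coverage idea only, not DeFlaker's implementation; the function names, the dictionary shape (file path mapped to a set of line numbers), and the file paths in the usage note are assumptions.

```python
def touches_changed_code(covered, changed):
    """Return True if any line executed by the test was changed in the commit.

    Both arguments map a file path to a set of line numbers.
    """
    return any(covered.get(path, set()) & lines for path, lines in changed.items())


def classify_failure(covered, changed):
    """Label a failing test using the differential-coverage rule."""
    if touches_changed_code(covered, changed):
        return "possible regression"
    return "likely flaky"
```

For example, a failing test that executed only `src/cart.py` while the commit changed only `src/payment.py` would be labeled "likely flaky", because the two line sets never intersect.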

Order-Dependency Detection With iDFlakies

iDFlakies finds order-dependent tests by running the suite in its original order, then in shuffled orders, and flagging any test that passes in one arrangement but fails in another. It's the right technique when you suspect shared state is causing intermittent failures, particularly in any codebase with many global fixtures. iDFlakies itself targets Java projects built with Maven, so Python-heavy suites need an equivalent test-order randomizer to apply the same approach. The cost is real, though: each additional order permutation adds a full suite execution, so it's most practical for suites that already run multiple orderings as part of their normal CI configuration.

Order-dependency diagnosis often requires tracing paths for setup, teardown, fixtures, and shared state across the repository. With Augment Cosmos's Context Engine, teams can follow those paths beyond the isolated failing test log. Teams working on context-driven quality assurance will find the same cross-repo tracing applies directly to this root cause work.

ML Prediction With FlakeFlagger and Flakify

FlakeFlagger takes a different approach: rather than waiting for a test to fail and then diagnosing it, it uses a trained model to predict which tests are likely to be flaky before any failure occurs. The advantage is that it adds no rerun overhead. The limitation is that it can't confirm flakiness: it identifies risk, not evidence. For teams where even a small rerun budget is expensive, prediction gives you a prioritized list of where to invest investigation effort.

The rerun ceiling is worth understanding. Developers have rerun suspected flakes up to 1,000 times trying to reproduce intermittent failures, and some rare flakes only surface after 10,000 runs. Prediction methods are most valuable precisely for these hard-to-confirm cases.
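A short calculation shows why the ceiling sits so high. Assuming each run fails independently with a fixed probability (a simplification, since real flakes often depend on load or ordering), the chance of seeing at least one failure in n runs is 1 - (1 - p)^n. The function names below are illustrative.

```python
import math


def detection_probability(failure_rate, runs):
    """Chance of seeing at least one failure in `runs` independent executions."""
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, confidence=0.95):
    """Smallest number of runs that exposes the flake with the given confidence."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be between 0 and 1 (exclusive)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))
```

Under that assumption, a test that fails once in 1,000 runs needs roughly 3,000 reruns to be caught with 95% confidence, and the required count grows in inverse proportion as the failure rate shrinks.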

Statistical Pass/Fail History Analysis

Statistical detection is the simplest approach: track whether a test passes after failing on the same commit. GitLab defines a test as flaky when it fails and then passes on the same SHA; TeamCity watches for status flips on the same code revision. It requires no instrumentation beyond your existing CI history, though it only catches flakes that have already been observed flipping; it won't surface tests that are flaky but haven't yet shown a visible flip in your pipeline.
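A minimal sketch of same-commit flip detection over CI history. The record shape `(test_name, sha, passed)` and both function names are assumptions for illustration; for brevity the check ignores the order in which the fail and pass occurred.

```python
from collections import defaultdict


def find_flips(history):
    """Return tests that both failed and passed on the same commit SHA.

    `history` is an iterable of (test_name, sha, passed) records.
    """
    outcomes = defaultdict(set)
    for test, sha, passed in history:
        outcomes[(test, sha)].add(passed)
    return sorted({test for (test, _sha), seen in outcomes.items() if len(seen) == 2})


def flip_rate(history, test):
    """Fraction of commits on which `test` showed both outcomes."""
    by_sha = defaultdict(set)
    for name, sha, passed in history:
        if name == test:
            by_sha[sha].add(passed)
    if not by_sha:
        return 0.0
    return sum(1 for seen in by_sha.values() if len(seen) == 2) / len(by_sha)
```

A test that only ever ran once per commit never appears in `find_flips`, which is the blind spot noted above.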

Flip-rate triage works best when you can cross-reference it with issue tracking and code history. Augment Cosmos integrates with GitHub, GitLab, Jira, Linear, Slack, and other systems via MCP-backed integrations spanning 100+ third-party services. Teams comparing continuous integration tools should look specifically at how each one surfaces and retains flip-rate history across builds.

The Retry, Quarantine, Trust-Erosion Cycle That Teams Keep Rediscovering

Most teams don't decide to let their CI suite degrade. It happens gradually. A test starts failing intermittently, someone clicks rerun, it goes green, and the investigation stops there. That pattern repeats until reruns become the default response to any failure, at which point the suite has stopped functioning as a decision system.

| Stage | Local Benefit | Hidden Cost | Signal Lost | Exit Criterion |
| --- | --- | --- | --- | --- |
| Retries | Fewer visible red builds | Original non-determinism remains unresolved | Same-commit fail/pass evidence becomes normalized | Diagnose repeated execution without code changes |
| Quarantine | The main pipeline keeps moving | Muted or skipped tests stop gating merges | Coverage signal disappears during quarantine | Assign owner, ticket, and time-bounded repair |
| Trust collapse | Developers avoid blocking on noisy failures | Red builds are treated as noise | CI can no longer distinguish flakes from regressions | Restore reliable gating through repair and bounded quarantine |

Stage One: Retries Become the Default Response

Retrying a failing test feels like the pragmatic call in the moment. It unblocks the build and moves the team forward. But it normalizes the underlying nondeterminism without addressing it. Every test that passes only on reruns is silently broken.

Stage Two: Quarantine Escalates the Coverage Loss

When retries aren't enough, teams start muting or skipping the worst offenders. The pipeline keeps moving, but those tests have stopped gating merges. Whatever defects they were designed to catch can now ship undetected until the test is eventually fixed, or forgotten.

Stage Three: Trust in the Whole Suite Collapses

The end state is a pipeline that developers have learned to ignore. When red builds are noise, engineers stop treating them as evidence. The suite still runs; it just no longer changes anyone's behavior. That is the worst possible outcome for a test suite.

Flaky Test Remediation: Fixing the Test vs. Fixing the Non-Determinism

Locating the non-determinism is the work. Most flaky-test fixes touch only the test code, but a meaningful share involves production code. Misclassifying which one you're dealing with is how you end up masking a real defect with a test patch.

The Core Distinction Between Test Bug and Production Bug

The question to ask before touching any code is: where does the non-determinism actually live? If a concurrency test fails intermittently, it might be that the test is asserting on a race it shouldn't care about. Or it might be that the production code has a genuine race condition, and the test is correctly exposing it. These call for completely different fixes.

| Category | Meaning | Detection Signal | Fix Target | Risk If Misclassified |
| --- | --- | --- | --- | --- |
| Application Bug | Production code misbehaved; test functioned correctly | Failure exposes production non-determinism | Production code | A test-only fix masks a production defect |
| Test Bug | Incorrect selector, wrong expected value, missing wait | Failure comes from the test code | Test code | Non-determinism remains unresolved |
| Flaky Failure | Intermittent timing, environment, or non-determinism | Test sometimes passes without code changes | Depends on root cause | Retries hide the defect signal |
| Environment Issue | Infrastructure failure: network timeout, missing credentials | Local-pass and CI-fail behavior | Infrastructure | CI signal remains noisy |

When a flaky-test fix changes only test code, automated PR review can provide a useful second check. Augment Code Review achieved 59% F-score in Augment's AI code review benchmark across seven platforms, with the highest precision and recall among tools evaluated. It reviews changes against codebase context, architectural patterns, and team standards.

Fix Strategies Mapped to Root Cause

The fix category follows from the root cause category. Async wait issues are resolved by adjusting waitFor. Concurrency issues are split between adding locks and making the execution path deterministic. Order-dependency fixes almost always live in setup and teardown: either the polluting test isn't cleaning up, or the victim isn't establishing its own clean state.

  • Reproduce the inconsistent behavior by running the test repeatedly without changing code.
  • Classify the failure: async wait, concurrency, order dependency, or environment drift.
  • Determine whether the non-determinism lies in the test code, the production code, or the infrastructure.
  • Apply the fix before retries or quarantine remove the original failure context.
  • Keep the test visible until the pass/fail history confirms the repair held.
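The first step in the list above, reproducing without code changes, can be sketched as a small rerun harness. `rerun` and `verdict` are hypothetical helpers, and the default of 100 runs is an assumed budget rather than a recommendation.

```python
def rerun(test, runs=100):
    """Run `test` repeatedly without changing code; count passes and failures."""
    passed = failed = 0
    for _ in range(runs):
        try:
            test()
            passed += 1
        except AssertionError:
            failed += 1
    return passed, failed


def verdict(passed, failed):
    """Classify the rerun result before moving on to root-cause classification."""
    if passed and failed:
        return "flaky"
    return "consistently failing" if failed else "consistently passing"
```

A "consistently passing" verdict is not proof of stability; it only means the flake did not surface within the rerun budget.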

Some flaky tests resist diagnosis. Lam et al. documented cases where neither researchers nor developers could identify the root cause, and the applied fixes were likely educated guesses. The same root-cause discipline applies when working through regression testing practices more broadly; the diagnostic sequence is the same, whether the failure is intermittent or consistent.

Quarantine Flaky Tests as a Waypoint, Not a Destination

Quarantine is a tool for managing the pipeline while a fix is in progress, not a permanent state. There's an important operational distinction between muting and skipping: a muted test still runs and still reports its result, so you can detect when it stabilizes. A skipped test disappears entirely. Buildkite recommends muting for exactly this reason: you preserve the signal even while removing it from the gate.
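The muting behavior can be sketched as a merge gate that still records every result. This is not Buildkite's API; `gate`, `stabilized`, and the 20-pass streak are assumptions for illustration.

```python
def gate(results, muted=frozenset()):
    """Decide whether the build passes.

    `results` maps test name -> passed. Muted tests still run and still
    appear in `results`, but their failures do not block the merge.
    """
    blocking = [name for name, ok in results.items() if not ok and name not in muted]
    return len(blocking) == 0, blocking


def stabilized(recent_outcomes, required_passes=20):
    """A muted test becomes a candidate to un-mute after a streak of passes."""
    tail = recent_outcomes[-required_passes:]
    return len(tail) == required_passes and all(tail)
```

A skipped test would never appear in `results` at all, so there would be nothing for `stabilized` to evaluate.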

Operational guardrails keep quarantine bounded:

  • Quarantine automatically when a test exceeds a defined failure rate threshold and create a tracking ticket with an owner.
  • Run quarantined tests in a separate CI job so the main pipeline stays reliable while results are still collected.
  • Set a flakiness budget (fewer than 2% of tests flaky at any time) and treat overruns as a priority.
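The guardrails above can be expressed as simple checks. The 5% failure-rate threshold and 20-run minimum are assumed values, `open_ticket` is a hypothetical stand-in for an issue-tracker call, and only the 2% budget comes from the list.

```python
def should_quarantine(failures, runs, threshold=0.05, min_runs=20):
    """Quarantine once the observed failure rate exceeds the threshold."""
    if runs < min_runs:
        return False  # too little history to decide
    return failures / runs > threshold


def open_ticket(test_name, owner):
    """Hypothetical tracking record; a real setup would call an issue tracker."""
    return {"test": test_name, "owner": owner, "status": "open"}


def over_budget(flaky_tests, total_tests, budget=0.02):
    """True when the share of flaky tests is no longer under the budget."""
    return total_tests > 0 and flaky_tests / total_tests >= budget
```

Tying `should_quarantine` to `open_ticket` in the same code path is one way to ensure no test enters quarantine without an owner.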

Two examples from industry are worth noting. Microsoft's quarantine system runs three phases: inference (identifying flaky tests from test telemetry), bug filing (assigning ownership and notifying developers), and mitigation (quarantining known flaky tests so they don't block pipelines). Critically, tests are removed from quarantine once a developer closes the associated bug, which avoids permanent coverage loss. Meta built a Probabilistic Flakiness Score after recognizing that all real-world tests are flaky to some degree, and that the right question is not whether a test is flaky but how flaky it is. The score provides teams with a way to monitor test reliability over time and react quickly to regressions within the suite.

Where Manual Detection Hits Its Ceiling at Enterprise Scale

At a certain suite size, manual investigation of flaky tests becomes infeasible. A five-year industrial case study of roughly 30 developers found that flaky tests consumed at least 2.5% of total productive developer time, and that's in a team where people were actively trying to manage them. The investigation cost per failure averaged $5.67, while an automated rerun costs approximately 0.02 cents in computing. The cost gap is extreme, but the bigger problem is that reruns don't produce a diagnosis. They produce a pass/fail observation, which tells you nothing about what caused the failure or whether it will recur.
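The cost gap can be worked through directly from the two figures quoted above. The constant and function names are illustrative; `Decimal` is used so the arithmetic is exact.

```python
from decimal import Decimal

INVESTIGATION_COST = Decimal("5.67")  # dollars per manual investigation
RERUN_COST = Decimal("0.0002")        # dollars per rerun, i.e. 0.02 cents


def reruns_per_investigation():
    """How many automated reruns cost the same as one manual investigation."""
    return INVESTIGATION_COST / RERUN_COST


def rerun_cost(runs):
    """Compute cost in dollars of `runs` automated reruns."""
    return RERUN_COST * runs
```

One investigation costs as much as 28,350 reruns, and even 10,000 reruns come to about $2.00 in compute. The rerun is cheap, but it still returns only a pass/fail observation.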

| Scale pressure | Evidence signal | Why manual triage fails | Automation need | Risk if ignored |
| --- | --- | --- | --- | --- |
| Developer time | At least 2.5% of productive time | Human investigation becomes the bottleneck | Track failure history automatically | CI confidence erodes |
| Investigation cost | $5.67 per failure | One-off analysis costs more than reruns | Prioritize expensive failures | Red builds become noise |
| Compute cost | ~0.02 cents per flaky-test rerun | Cheap reruns normalize ignored failures | Compare repeated outcomes | Same-commit flips are hidden |
| Rerun ceiling | Up to 10,000 reruns | Rare flakes still escape confirmation | Add prediction and history signals | Rare flakes stay undetected |
| Diagnostic scope | Runs, runners, orders, and environments | Logs show only a single run | Compare behavior across contexts | Root causes remain unresolved |

Logs record a single run. Diagnosing flakiness requires comparing behavior across runs, runners, orders, and environments, which is exactly what logs don't show. Flaky test rate is one of the more telling code quality metrics to track precisely because it degrades CI reliability in ways that other signals don't capture.

Augment Cosmos's Parallel Tool Calls lets agents run coverage checks, log comparisons, and order-dependency scans simultaneously rather than sequentially, returning results faster while keeping investigation steps explicit. Augment Cosmos can also compare repeated CI outcomes before quarantine decisions are made, turning one-off investigations into auditable workflows with shared context across the team's codebase history.

Category-Aware Detection: Where to Start This Sprint

Start with differential coverage for non-order-dependent flakes: it adds minimal overhead and requires no reruns. Layer in order-dependency checks if your suite is Python-heavy or relies on shared fixtures. Then set up flip-rate tracking across commits so quarantine decisions are driven by data rather than developer memory. That sequencing matters because reruns cost fractions of a cent per run while developer investigation averages $5.67 per failure.

Augment Cosmos with Context Engine connects repeated CI outcomes to code, test, and infrastructure context during verifier-stage investigation, keeping root-cause checks close to the pre-deploy decision point.

Written by

Paula Hingel

Paula writes about the patterns that make AI coding agents actually work — spec-driven development, multi-agent orchestration, and the context engineering layer most teams skip. Her guides draw on real build examples and focus on what changes when you move from a single AI assistant to a full agentic codebase.
