Frequently Asked Questions About Flaky Test Detection

What is the difference between detecting and fixing a flaky test?

Detection identifies which tests produce inconsistent results, while fixing addresses the root cause. Tools like DeFlaker and iDFlakies flag flaky tests but do not repair them. Fixing requires diagnosing whether non-determinism lives in test code, production code, or infrastructure.

How many times should engineering teams rerun a test to confirm flakiness?

There is no fixed number. Run the failing test repeatedly without changing code. If the test sometimes passes, it is likely flaky; if it fails every time, it is more likely a bug. Rare flakes can require thousands of reruns to surface, so prediction methods help prioritize which tests deserve that rerun budget.

Does flaky test detection belong in pre-deploy CI or post-deploy monitoring?

Use flaky test detection in CI verification before non-deterministic failures reach the main branch. Post-deploy observability tracks production metrics, events, logs, and traces, which belong to a separate domain.

Why do tests pass locally but fail in CI?

Local-pass and CI-fail behavior signals environment or platform dependency. Runner differences expose non-determinism absent on developer laptops: parallelism, ephemeral runners, and shared infrastructure all introduce conditions that local environments don't replicate. CI can also surface real latent defects, such as test pollution or race conditions.

Should engineering teams quarantine or delete a flaky test?

Use quarantine as a temporary holding state, then fix the test. Deletion results in permanent coverage loss, while indefinite quarantine creates silent coverage gaps. Prefer muting over skipping so the test still runs and reports results.

Flaky Test Detection and Remediation

Category-aware pre-merge analysis detects flaky tests by combining repeated execution with historical failure-rate tracking. Differential coverage, order-dependency checks, and pass/fail history expose non-determinism while the failure still carries diagnostic value.

TL;DR

Flaky tests break CI confidence not when they fail, but when they pass on rerun. That same-commit flip normalizes non-determinism rather than resolving it, starting the retry-quarantine-trust-erosion cycle that eventually turns the whole suite into noise. Root-cause-mapped detection (pairing differential coverage, order-dependency checks, and flip-rate history) keeps that cycle from ending in permanent quarantine.

Flaky test detection protects pre-merge CI confidence by comparing repeated outcomes, execution order, and failure history before non-deterministic failures reach the main branch. The failure mode is familiar: a developer pushes a fix, watches CI fail, clicks rerun without changing anything, and gets green on the same commit. At that point, the pipeline has stopped returning a clear answer on whether the code is broken.

Manual reruns hit a wall quickly. A single pipeline investigation averages $5.67 in developer time, and rare flakes can require up to 10,000 reruns to confirm. For pre-merge investigations that span multiple files, Augment Cosmos's Context Engine connects failing tests to codebase-wide dependency context, surfacing the shared-state and async-wait paths that isolated test logs don't expose.

Why Flaky Tests Break CI Confidence Faster Than Real Bugs

A flaky test fails, then passes on the same code revision, so the test result no longer maps cleanly to a defect. That ambiguity is the real damage. Once developers learn they can rerun their way to green, they stop treating red builds as evidence, and the suite loses its ability to catch real regressions.

Four Root Cause Categories Behind Flaky Tests

Tests become flaky when outcomes depend on factors outside the test's control: timing, shared state, thread ordering, or infrastructure differences. These categories behave differently, so a single detection approach won't catch them all. Research across Java and Python suites consistently shows that async wait and concurrency dominate in Java, while order dependency is the leading cause in Python. A detector built for one language profile misses significant categories in another.

Category	Mechanism	Prevalence Signal	Detection Need	Typical Fix
Asynchronous Wait	Test asserts before async work completes	45% of studied flaky tests (Luo et al.)	Timing and ordering checks	Add or adjust waitFor calls
Concurrency / Race	Incorrect assumptions about thread ordering	20% of studied flaky tests (Luo et al.)	Timing and ordering checks	Add locks; make execution deterministic
Test Order Dependency	Shared mutable state pollutes later runs	The majority of causes in Python suites	Order permutations and shared-state analysis	Fix setup/teardown; isolate state
Environment / Platform	Behavior differs across CI infrastructure	~28% of Python flaky tests	CI infrastructure comparison	Infrastructure-specific

Asynchronous Wait

Async wait flakiness happens when a test asserts before the operation it depends on has finished. The test isn't wrong in theory; the ordering it expects is correct. The problem is that the code doesn't enforce that ordering, so the outcome depends on timing. In practice, the fix is usually straightforward: replace a fixed sleep with a waitFor call that blocks until the condition is actually true. The underlying issue is internal to the test logic in the vast majority of cases, not a dependency on external resources.
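The swap from a fixed sleep to a condition-based wait can be sketched in Python as follows. This is a minimal illustration, not a specific framework's API: `wait_for`, `BackgroundJob`, and the timeout values are hypothetical names and numbers chosen for the example.

```python
import threading
import time


def wait_for(condition, timeout=2.0, interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one last check at the deadline


class BackgroundJob:
    """Hypothetical async operation that finishes after a delay."""

    def __init__(self, delay):
        self.done = False
        self._timer = threading.Timer(delay, self._finish)

    def _finish(self):
        self.done = True

    def start(self):
        self._timer.start()


def test_job_completes():
    job = BackgroundJob(delay=0.05)
    job.start()
    # Flaky version: time.sleep(0.05); assert job.done
    # Condition-based version: block until the state the test needs is true.
    assert wait_for(lambda: job.done), "job did not finish within the timeout"
```

The timeout becomes an upper bound rather than a guess at how long the work takes, so a slow runner delays the test instead of failing it.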

Concurrency and Race Conditions

Concurrency flakiness occurs when the test assumes a thread ordering that the system doesn't guarantee. The test passes when threads interleave correctly and fails when they don't. One important diagnostic distinction: in roughly a third of concurrency-flake cases, the non-determinism lies in the production code under test, not in the test itself. That changes the fix entirely: patching the test would mask a real defect.
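A small Python sketch of the case where the race lives in production code: the test below is correct, and the fix belongs in the class under test. `Counter` and `run_workers` are hypothetical names used only for illustration.

```python
import threading


class Counter:
    """Hypothetical production class. Without the lock, `value += 1` is a
    read-modify-write that can lose updates when threads interleave."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:  # the fix lives in production code, not in the test
            self.value += 1


def run_workers(counter, workers=8, increments=1000):
    """Increment `counter` from several threads and return the final value."""
    def work():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


def test_counter_is_thread_safe():
    assert run_workers(Counter()) == 8 * 1000
```

Loosening the assertion (for example, accepting any value close to 8,000) would make the test pass while leaving the lost-update defect in place, which is the masking risk described above.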

Test Order Dependency and Shared Mutable State

Order dependency occurs when one test leaves the shared state dirty, and a later test fails as a result. The failing test looks broken in isolation, but if you run it first, it passes. This category is particularly prevalent in Python suites, where research across more than 1,000 projects found that it accounts for the majority of flaky tests. The fix almost always lives in setup and teardown: either the polluting test isn't cleaning up after itself, or the victim isn't initializing its own state cleanly enough to be order-independent.
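The polluter and victim relationship can be reproduced in a few lines of Python. The sketch below uses a module-level dictionary as the shared state; `CACHE`, the test functions, and `run_in_order` are all hypothetical names for illustration.

```python
# Module-level state shared by every test: the source of the pollution.
CACHE = {}


def polluter_test():
    CACHE["user"] = "admin"  # writes shared state and never cleans up
    assert CACHE["user"] == "admin"


def victim_test():
    # Implicitly assumes the cache is empty when it starts.
    assert "user" not in CACHE


def victim_test_fixed():
    CACHE.clear()  # establishes its own clean state first
    assert "user" not in CACHE


def passes(test):
    try:
        test()
        return True
    except AssertionError:
        return False


def run_in_order(tests):
    """Run tests against a fresh cache and return {name: passed}."""
    CACHE.clear()
    return {t.__name__: passes(t) for t in tests}
```

Calling `run_in_order([victim_test, polluter_test])` reports the victim as passing, while `run_in_order([polluter_test, victim_test])` reports it as failing. `victim_test_fixed` passes in either position because it no longer depends on what ran before it.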

Environment and Platform Dependencies

Environmental flakiness occurs when a test passes locally but consistently fails on CI. The cause is usually a genuine difference in infrastructure: Linux kernel version, parallelism settings, ephemeral runners that don't persist state between jobs. Unlike the other categories, this one often requires infrastructure changes rather than test code fixes.

How to Detect Flaky Tests Systematically Before They Erode CI Confidence

No single detection technique covers every category. DeFlaker catches tests that failed without touching changed code (a strong signal of flakiness) but misses order-dependent flakes entirely. iDFlakies finds order dependency by running the suite in different orders, but the cost scales with each permutation. ML-based predictors like FlakeFlagger can flag risky tests before they fail, though they do not confirm anything until a failure is actually observed. Understanding which approach fits which situation is where most teams go wrong.

Tool / Technique	Requires Reruns	Detects Order-Dependent	CI Overhead	Key Limitation
DeFlaker	No	No	4.6% per run	Misses order-dependent flakes
iDFlakies	Yes	Yes	Repeated original-order and modified-order suite executions	Cost grows with each order permutation
FlakeFlagger	No	No	Feature collection per run; no extra reruns	Predictive, not failure-confirming
FLASH	Yes	No	Convergence-determined	Specific to probabilistic/ML code
Retry/flip detection	Yes	Partial	Per-build reruns	Detects only observed fail/pass flips

Differential Coverage Detection With DeFlaker

DeFlaker works by comparing which code a failing test actually executed against what changed in the current commit. If the test failed but never touched any changed code, the failure can't be caused by the change, so it's flagged as flaky. This avoids reruns entirely, which makes it fast. In practice, it found 87 previously unknown flaky tests across 10 of the 96 TravisCI projects studied, with a 1.5% false-alarm rate and 4.6% runtime overhead. The hard limitation is structural: because it relies on coverage of changed code, it can't detect tests that are flaky due to order dependency rather than code changes. Pair it with RootFinder to compare logs from passing and failing runs when you need to go deeper.
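The core comparison can be sketched in Python as a set intersection. This is a simplified illustration of the idea, not DeFlaker's implementation: the function name, the `(file, line)` representation, and the sample paths are assumptions made for the example.

```python
def classify_failure(covered_lines, changed_lines):
    """Label a failing test by comparing its coverage with the commit's diff.

    Both arguments are sets of (file_path, line_number) pairs.
    """
    touched = covered_lines & changed_lines
    if touched:
        # The failing test executed changed code, so the change may be at fault.
        return "possible regression", touched
    # The failing test never ran any changed code.
    return "likely flaky", set()


# Hypothetical data: lines changed in the commit, and lines each test executed.
changed = {("billing/invoice.py", 42), ("billing/invoice.py", 43)}
test_a_coverage = {("billing/invoice.py", 42), ("billing/tax.py", 7)}
test_b_coverage = {("auth/session.py", 12), ("auth/session.py", 13)}
```

Here a failure in the test with `test_a_coverage` would be labeled a possible regression, while a failure in the test with `test_b_coverage` would be labeled likely flaky. An order-dependent failure in a test that does touch changed code falls into the first bucket, which is the structural blind spot described above.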

Order-Dependency Detection With iDFlakies

iDFlakies finds order-dependent tests by running the suite in its original order, then in shuffled orders, and flagging any test that passes in one arrangement but fails in another. It's the right tool when you suspect shared state is causing intermittent failures, particularly in Python-heavy suites or in any codebase with many global fixtures. The cost is real, though: each additional order permutation adds a full suite execution, so it's most practical for suites that already run multiple orderings as part of their normal CI configuration.
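The original-order-then-shuffled-order loop can be sketched in Python. This is an illustration of the technique rather than iDFlakies itself: `find_order_dependent`, the `run_suite` callable, and the toy suite are hypothetical.

```python
import random


def find_order_dependent(tests, run_suite, rounds=10, seed=0):
    """Return names of tests whose outcome differs across orderings.

    `tests` is a list of test names; `run_suite(order)` is a hypothetical
    callable that runs them in the given order and returns {name: passed}.
    """
    rng = random.Random(seed)          # fixed seed keeps the shuffles reproducible
    baseline = run_suite(list(tests))  # original order first
    suspects = set()
    for _ in range(rounds):
        order = list(tests)
        rng.shuffle(order)
        results = run_suite(order)     # one full suite execution per round
        suspects.update(n for n in tests if results[n] != baseline[n])
    return sorted(suspects)


def fake_run_suite(order):
    """Toy suite: test_victim fails whenever test_polluter ran before it."""
    polluted = False
    results = {}
    for name in order:
        if name == "test_polluter":
            polluted = True
            results[name] = True
        elif name == "test_victim":
            results[name] = not polluted
        else:
            results[name] = True
    return results


names = ["test_victim", "test_polluter", "test_other"]
```

With enough rounds, `find_order_dependent(names, fake_run_suite, rounds=64)` should report only `test_victim`. The `rounds` parameter is where the cost shows up: each round is one more full suite run.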

Order-dependency diagnosis often requires tracing paths for setup, teardown, fixtures, and shared state across the repository. With Augment Cosmos's Context Engine, teams can follow those paths beyond the isolated failing test log. Teams working on context-driven quality assurance will find the same cross-repo tracing applies directly to this root cause work.

ML Prediction With FlakeFlagger and Flakify

FlakeFlagger takes a different approach: rather than waiting for a test to fail and then diagnosing it, it uses a trained model to predict which tests are likely to be flaky before any failure occurs. The advantage is that it adds no rerun overhead. The limitation is that it can't confirm flakiness: it identifies risk, not evidence. For teams where even a small rerun budget is expensive, prediction gives you a prioritized list of where to invest investigation effort.

The rerun ceiling is worth understanding. Developers have rerun suspected flakes up to 1,000 times trying to reproduce intermittent failures, and some rare flakes only surface after 10,000 runs. Prediction methods are most valuable precisely for these hard-to-confirm cases.
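A back-of-the-envelope calculation shows why the ceiling sits where it does. Assuming each rerun fails independently with probability p (a simplifying assumption, since real flakes often correlate with load or ordering), the chance of seeing at least one failure in n reruns is 1 - (1 - p)^n. The Python sketch below solves for n; the function name and the 95% confidence default are illustrative choices.

```python
import math


def reruns_needed(failure_rate, confidence=0.95):
    """Smallest n with 1 - (1 - failure_rate)**n >= confidence,
    assuming each rerun fails independently with the same probability."""
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be between 0 and 1 (exclusive)")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))
```

Under this model, a test that fails one run in ten needs 29 reruns for a 95% chance of reproducing the failure, while a test that fails about 3 runs in 10,000 needs on the order of 10,000 reruns.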

Statistical Pass/Fail History Analysis

Statistical detection is the simplest approach: track whether a test passes after failing on the same commit. GitLab defines a test as flaky when it fails and then passes on the same SHA; TeamCity watches for status flips on the same code revision. It requires no instrumentation beyond your existing CI history, though it only catches flakes that have already been observed flipping; it won't surface tests that are flaky but haven't yet shown a visible flip in your pipeline.
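Flip detection over existing CI history can be sketched in Python as a grouping problem. The record shape `(test_name, commit_sha, passed)` is a hypothetical export format, not a GitLab or TeamCity API, and the sample SHAs are made up.

```python
from collections import defaultdict


def flip_rates(runs):
    """Compute, per test, the share of commits with both a fail and a pass.

    `runs` is an iterable of (test_name, commit_sha, passed) tuples.
    """
    outcomes = defaultdict(lambda: defaultdict(set))
    for test, sha, passed in runs:
        outcomes[test][sha].add(passed)

    rates = {}
    for test, by_sha in outcomes.items():
        flipped = sum(1 for seen in by_sha.values() if seen == {True, False})
        rates[test] = flipped / len(by_sha)
    return rates


history = [
    ("test_checkout", "a1f3", False),
    ("test_checkout", "a1f3", True),   # same SHA, fail then pass: a flip
    ("test_checkout", "b7c9", True),
    ("test_login", "a1f3", False),
    ("test_login", "a1f3", False),     # consistent failure: not a flip
]
```

For this sample, `flip_rates(history)` gives `test_checkout` a rate of 0.5 and `test_login` a rate of 0.0. A test that has never been rerun on a failing SHA cannot register a flip, which is the blind spot noted above.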

Flip-rate triage works best when you can cross-reference it with issue tracking and code history. Augment Cosmos integrates with GitHub, GitLab, Jira, Linear, Slack, and other systems via MCP-backed integrations spanning 100+ third-party services. Teams comparing continuous integration tools should look specifically at how each one surfaces and retains flip-rate history across builds.

The Retry, Quarantine, Trust-Erosion Cycle That Teams Keep Rediscovering

Most teams don't decide to let their CI suite degrade. It happens gradually. A test starts failing intermittently, someone clicks rerun, it goes green, and the investigation stops there. That pattern repeats until reruns become the default response to any failure, at which point the suite has stopped functioning as a decision system.

Stage	Local Benefit	Hidden Cost	Signal Lost	Exit Criterion
Retries	Fewer visible red builds	Original non-determinism remains unresolved	Same-commit fail/pass evidence becomes normalized	Diagnose repeated execution without code changes
Quarantine	The main pipeline keeps moving	Muted or skipped tests stop gating merges	Coverage signal disappears during quarantine	Assign owner, ticket, and time-bounded repair
Trust collapse	Developers avoid blocking on noisy failures	Red builds are treated as noise	CI can no longer distinguish flakes from regressions	Restore reliable gating through repair and bounded quarantine

Stage One: Retries Become the Default Response

Retrying a failing test feels like the pragmatic call in the moment. It unblocks the build and moves the team forward. But it normalizes the underlying nondeterminism without addressing it. Every test that passes only on reruns is silently broken.

Stage Two: Quarantine Escalates the Coverage Loss

When retries aren't enough, teams start muting or skipping the worst offenders. The pipeline keeps moving, but those tests have stopped gating merges. Whatever defects they were designed to catch can now ship undetected until the test is eventually fixed, or forgotten.

Stage Three: Trust in the Whole Suite Collapses

The end state is a pipeline that developers have learned to ignore. When red builds are noise, engineers stop treating them as evidence. The suite still runs; it just no longer changes anyone's behavior. That is the worst possible outcome for a test suite.

Flaky Test Remediation: Fixing the Test vs. Fixing the Non-Determinism

Locating the non-determinism is the work. Most flaky-test fixes touch only the test code, but a meaningful share involves production code. Misclassifying which one you're dealing with is how you end up masking a real defect with a test patch.

The Core Distinction Between Test Bug and Production Bug

The question to ask before touching any code is: where does the non-determinism actually live? If a concurrency test fails intermittently, it might be that the test is asserting on a race it shouldn't care about. Or it might be that the production code has a genuine race condition, and the test is correctly exposing it. These call for completely different fixes.

Category	Meaning	Detection Signal	Fix Target	Risk If Misclassified
Application Bug	Production code misbehaved; test functioned correctly	Failure exposes production non-determinism	Production code	A test-only fix masks a production defect
Test Bug	Incorrect selector, wrong expected value, missing wait	Failure comes from the test code	Test code	Non-determinism remains unresolved
Flaky Failure	Intermittent timing, environment, or non-determinism	Test sometimes passes without code changes	Depends on root cause	Retries hide the defect signal
Environment Issue	Infrastructure failure: network timeout, missing credentials	Local-pass and CI-fail behavior	Infrastructure	CI signal remains noisy

When a flaky-test fix changes only test code, automated PR review can provide a useful second check. Augment Code Review achieved 59% F-score in Augment's AI code review benchmark across seven platforms, with the highest precision and recall among tools evaluated. It reviews changes against codebase context, architectural patterns, and team standards.

Fix Strategies Mapped to Root Cause

The fix category follows from the root cause category. Async wait issues are resolved by adjusting waitFor. Concurrency issues are split between adding locks and making the execution path deterministic. Order-dependency fixes almost always live in setup and teardown: either the polluting test isn't cleaning up, or the victim isn't establishing its own clean state.

Reproduce the inconsistent behavior by running the test repeatedly without changing code.
Classify the failure: async wait, concurrency, order dependency, or environment drift.
Determine whether the non-determinism lies in the test code, the production code, or the infrastructure.
Apply the fix before retries or quarantine erase the original failure context.
Keep the test visible until the pass/fail history confirms the repair held.
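The first step of the sequence above, rerunning on unchanged code, can be sketched as a small Python harness. `rerun` and `triage` are hypothetical helpers; the harness treats an `AssertionError` as a failure and lets any other exception propagate.

```python
from collections import Counter


def rerun(test, times=100):
    """Run `test` repeatedly on unchanged code and tally the outcomes."""
    tally = Counter()
    for _ in range(times):
        try:
            test()
            tally["pass"] += 1
        except AssertionError:
            tally["fail"] += 1
    return tally


def triage(tally):
    """Map a tally to the next step in the diagnostic sequence."""
    if tally["pass"] and tally["fail"]:
        return "flaky: classify the root cause"
    if tally["fail"]:
        return "consistent failure: treat as a bug"
    return "not reproduced: more reruns or history needed"
```

A mixed tally is the only outcome that confirms flakiness. An all-pass tally does not clear the test, since a rare flake may simply not have surfaced within the rerun budget.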

Some flaky tests resist diagnosis. Lam et al. documented cases where neither researchers nor developers could identify the root cause, and the applied fixes were likely educated guesses. The same root-cause discipline applies when working through regression testing practices more broadly; the diagnostic sequence is the same, whether the failure is intermittent or consistent.

Quarantine Flaky Tests as a Waypoint, Not a Destination

Quarantine is a tool for managing the pipeline while a fix is in progress, not a permanent state. There's an important operational distinction between muting and skipping: a muted test still runs and still reports its result, so you can detect when it stabilizes. A skipped test disappears entirely. Buildkite recommends muting for exactly this reason: you preserve the signal even while removing it from the gate.

Operational guardrails keep quarantine bounded:

Quarantine automatically when a test exceeds a defined failure rate threshold and create a tracking ticket with an owner.
Run quarantined tests in a separate CI job so the main pipeline stays reliable while results are still collected.
Set a flakiness budget (fewer than 2% of tests flaky at any time) and treat overruns as a priority.
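The guardrails above can be expressed as a small policy check in Python. This is a sketch under stated assumptions: the 5% failure-rate threshold is an illustrative value (the text does not specify one), the 2% budget comes from the list above, and `Candidate` and `plan_quarantine` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Failure history for one test over a recent window of CI runs."""
    name: str
    runs: int
    failures: int

    @property
    def failure_rate(self):
        return self.failures / self.runs if self.runs else 0.0


def plan_quarantine(candidates, total_tests, threshold=0.05, budget=0.02):
    """Pick tests to quarantine and check the suite-wide flakiness budget.

    `threshold` is an assumed per-test failure-rate cutoff; `budget` is the
    maximum share of the suite allowed to be flaky at any time.
    """
    to_quarantine = [c.name for c in candidates if c.failure_rate > threshold]
    share = len(to_quarantine) / total_tests
    return {
        "quarantine": to_quarantine,   # each entry still needs a ticket and an owner
        "flaky_share": share,
        "over_budget": share >= budget,
    }
```

In a real pipeline the returned names would feed ticket creation and the separate CI job, and an `over_budget` result would move repair work ahead of new quarantines.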

Two examples from industry are worth noting. Microsoft's quarantine system runs three phases: inference (identifying flaky tests from test telemetry), bug filing (assigning ownership and notifying developers), and mitigation (quarantining known flaky tests so they don't block pipelines). Critically, tests are removed from quarantine once a developer closes the associated bug, which avoids permanent coverage loss. Meta built a Probabilistic Flakiness Score after recognizing that all real-world tests are flaky to some degree, and that the right question is not whether a test is flaky but how flaky it is. The score gives teams a way to monitor test reliability over time and react quickly to regressions within the suite.

Where Manual Detection Hits Its Ceiling at Enterprise Scale

At a certain suite size, manual investigation of flaky tests becomes infeasible. A five-year industrial case study of roughly 30 developers found that flaky tests consumed at least 2.5% of total productive developer time, and that's in a team where people were actively trying to manage them. The investigation cost per failure averaged $5.67, while an automated rerun costs approximately 0.02 cents in computing. The cost gap is extreme, but the bigger problem is that reruns don't produce a diagnosis. They produce a pass/fail observation, which tells you nothing about what caused the failure or whether it will recur.

Scale pressure	Evidence signal	Why manual triage fails	Automation need	Risk if ignored
Developer time	At least 2.5% of productive time	Human investigation becomes the bottleneck	Track failure history automatically	CI confidence erodes
Investigation cost	$5.67 per failure	One-off analysis costs more than reruns	Prioritize expensive failures	Red builds become noise
Compute cost	~0.02 cents per flaky-test rerun	Cheap reruns normalize ignored failures	Compare repeated outcomes	Same-commit flips are hidden
Rerun ceiling	Up to 10,000 reruns	Rare flakes still escape confirmation	Add prediction and history signals	Rare flakes stay unconfirmed
Diagnostic scope	Runs, runners, orders, and environments	Logs show only a single run	Compare behavior across contexts	Root causes remain unresolved

Logs record a single run. Diagnosing flakiness requires comparing behavior across runs, runners, orders, and environments, which is exactly what logs don't show. Flaky test rate is one of the more telling code quality metrics to track precisely because it degrades CI reliability in ways that other signals don't capture.

Augment Cosmos's Parallel Tool Calls lets agents run coverage checks, log comparisons, and order-dependency scans simultaneously rather than sequentially, returning results faster while keeping investigation steps explicit. Augment Cosmos can also compare repeated CI outcomes before quarantine decisions are made, turning one-off investigations into auditable workflows with shared context across the team's codebase history.

Category-Aware Detection: Where to Start This Sprint

Start with differential coverage for non-order-dependent flakes: it adds minimal overhead and requires no reruns. Layer in order-dependency checks if your suite is Python-heavy or relies on shared fixtures. Then set up flip-rate tracking across commits so quarantine decisions are driven by data rather than developer memory. That sequencing matters because reruns cost fractions of a cent per run while developer investigation averages $5.67 per failure.

Augment Cosmos with Context Engine connects repeated CI outcomes to code, test, and infrastructure context during verifier-stage investigation, keeping root-cause checks close to the pre-deploy decision point.

Written by

Paula Hingel
