
CI: The Devils In The Details

Ah, it all started out so well.

You bootstrapped your CI alongside the application. Tests came in as features shipped. You tweaked performance here and there. Commits flowed through and everything moved with pleasing alacrity. Every failure meant a real regression. Every green build was just business as usual. You didn’t have to think about it.

And then one day, you got an unexpected failure.

“Hm, that’s weird. Oh, it looks like this new service you added just wasn’t fully warmed up when CI ran this time. It’s fine, the rerun should be green.”
And it is. Nothing to worry about.

Then it happens again a week later. Fixing the startup sequence properly would take time, and honestly, the service doesn’t restart that often in production anyway. Hard to justify burning a day on this when rerunning CI costs you 15 extra minutes and you’re already late on other work.

Three months later you wake up and CI has turned into a treasure-hoarding, fire-breathing dragon.

The pipeline takes four hours and fails randomly. That lazy service is now just one of ten intermittent issues stacked into a wall of sadness. Every run, you flip a coin for each of them, and if any single flip lands the wrong way, you get the dreaded red circle and deployment is blocked until you fix it yourself.

A Little Bit of Uncertainty

That first flaky failure from the intro didn’t feel like a turning point. It felt like noise. The system still worked. Most builds were green. Progress continued.

The problem is scale, and it’s just probability.

If each test has an independent flake rate p per run, and you run n tests, then:

  • Probability a given test doesn’t flake: 1 - p
  • Probability none of the n tests flake: (1 - p)^n
  • Probability at least one test flakes (so the pipeline goes red “for no reason”):
    1 - (1 - p)^n

Now plug in something that sounds harmless:

  • p = 0.01 (1% flake rate)
  • n = 300 tests

P(at least one flake) = 1 - 0.99^300

A quick approximation helps: 0.99^300 ≈ e^(300 * ln 0.99) ≈ e^(-3.015) ≈ 0.049

So:

1 - 0.99^300 ≈ 1 - 0.049 = 0.951

That’s about a 95% chance that any given pipeline run hits at least one flaky failure.
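
If you want to play with the curve yourself, a few lines of Python reproduce it. The flake rates and test counts below are illustrative, not measurements from any particular suite:

  # Probability that at least one of n independent tests flakes in a
  # single pipeline run: 1 - (1 - p)^n. The values of p and n here are
  # illustrative only.
  def p_red_for_no_reason(p: float, n: int) -> float:
      return 1 - (1 - p) ** n

  for p in (0.001, 0.005, 0.01):
      for n in (100, 300, 1000):
          print(f"p={p:.3f}, n={n:4d} -> {p_red_for_no_reason(p, n):.1%}")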

Nothing obviously broke. You just crossed an invisible threshold where “rare” stopped meaning rare.

This is also why it creeps up on companies. Products grow, test counts grow, and teams add tests under deadline pressure. Each individual test is “mostly fine.” Nobody experiences the full probability curve until CI becomes a daily tax.

And ownership is usually split. Feature teams add tests. A platform team owns the CI machinery. Reviewers focus on correctness, not long-term flake rates. No one is watching a single number that says “your background uncertainty is rising.”

By the time people agree that “CI feels flaky lately,” the math has already turned against you. You’ve slowly created a system where random failure has become the default.

How Pipelines Quietly Get Slower

Flakiness erodes trust. Runtime erodes patience. And just like uncertainty, slowness rarely shows up all at once.

No one decides, “Let’s make CI take four hours.” It grows a few minutes at a time.

A product expands, and the tests expand with it. Someone adds one more end-to-end scenario because a bug slipped through last week. Then another. Then a more “realistic” dataset gets pulled in because the tiny fixtures no longer reflect production. Before long, what used to be a quick integration check is simulating half your business in one go.

This is where teams quietly forget what CI is actually for.

CI is there to tell you whether your latest change still integrates correctly with the rest of the system. It’s a coordination checkpoint. It is not the place to prove that every algorithm behaves correctly across millions of data combinations or that every edge case in a heavy processing pipeline has been exercised.

But those concerns are real, so they creep in. A team working on a compute-heavy component adds larger and larger datasets to CI “just to be safe.” Another adds long-running validation jobs that used to be run ad hoc. Over time, CI becomes the dumping ground for every kind of confidence check, from fast integration signals to exhaustive correctness testing.

The cost multiplies quietly. A test that used to process 50 records now processes 50,000. A job that used to validate structure now runs full statistical analysis. Nothing looks outrageous in isolation. Each change is justified by a real incident or a reasonable fear.

Then workflows start getting stitched together.

Bugs appear in the seams, so tests become longer journeys: create → transform → sync → export → notify. Instead of many small, parallel checks, you get fewer but longer chains. To avoid repeating setup, tests begin sharing state, environments, and data. That reduces duplication but also reduces parallelism and increases runtime.

Because the slowdown is gradual, people adapt. They stop expecting CI to be fast. They context-switch while waiting. Feedback loops stretch from minutes to hours, and that quietly changes how development feels.

By the time someone finally graphs pipeline duration, there’s no single culprit to fix. CI didn’t get slow because of one bad decision. It got slow because it slowly stopped being a lean integration signal and turned into a giant, do-everything test harness.

Compounding the Problems

Individually, slowness and flakiness are frustrating. Together, they’re brutal.

A four-hour pipeline that is rock solid is painful, but at least it’s predictable. You wait, you get an answer, you move on. A fast pipeline that fails randomly is annoying, but reruns are cheap, so people tolerate it.

Now combine the two.

Take a pipeline that runs for 90 minutes and has, say, a 30% chance of failing due to flaky tests or environmental noise. The expected number of runs before you get a clean pass isn’t one, it’s 1 / 0.7 ≈ 1.43. That means your “90-minute pipeline” is actually a two-hour pipeline on average, and that’s before any real failures.

Push that failure rate to 50% and things get ugly fast. Now the expected number of runs is 1 / 0.5 = 2. Your 90-minute pipeline quietly turns into a three-hour feedback loop. And that’s just the average. Some unlucky commits will need three or four full runs before they go green.
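
A small sketch makes the compounding concrete. It assumes each run fails spuriously with an independent probability f, so the number of attempts until a clean pass follows a geometric distribution with mean 1 / (1 - f); the 90-minute runtime is just the example from above:

  # Expected CI wall-clock time under spurious failures, assuming each
  # run fails "for no reason" independently with probability f.
  def expected_runs(f: float) -> float:
      # Mean of a geometric distribution: attempts until the first success.
      return 1 / (1 - f)

  def p_more_than(k: int, f: float) -> float:
      # Probability a commit needs more than k full runs to go green.
      return f ** k

  RUNTIME_MIN = 90  # the 90-minute pipeline from the example above

  for f in (0.3, 0.5):
      runs = expected_runs(f)
      print(f"f={f:.0%}: {runs:.2f} runs on average, "
            f"~{runs * RUNTIME_MIN:.0f} min wall clock, "
            f"P(more than 3 runs) = {p_more_than(3, f):.0%}")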

This is usually where developers start to give up. They stop wanting to push small changes because passing CI will take longer than the work itself. Why spend two hours building something if getting it through CI is likely to cost four?

So they disengage. Work piles up on feature branches. Integration happens in big, painful batches every few weeks, where someone blocks out a day just to wrestle changes through the pipeline.

Technically, CI still exists. Jobs still run. Dashboards are still green and red.

The C is completely silent at that point.

Keeping the Dragon Small

None of this is fixed with a tool switch or a heroic cleanup sprint. CI health requires ongoing maintenance.

If you don’t track how it behaves over time, you only notice problems once they’re already painful. CI needs the same kind of visibility you expect from production systems.

Some metrics that are especially good at exposing slow, quiet decay:

  • Outcome rate per pipeline run: what share of runs pass, fail due to real regressions, fail due to likely flakiness (the same commit goes green on rerun), or fail due to infrastructure or stage instability.
  • Rerun rate: how often pipelines are rerun, and how many attempts it takes on average to get a green build.
  • Pipeline duration: median and p75 runtime, ideally split into queue time versus actual execution time.
  • Daily release rate to any environment, so you can see if delivery is slowing down.
  • Production delivery frequency: deploys per day and per month.
  • Change failure rate: how often production deploys lead to rollback, hotfix, or incident.
  • Lead time for changes: time from merge to running in production, tracked as median and p75.

If some of these sound familiar, they should. They overlap heavily with the DORA metrics: deployment frequency, lead time for changes, and change failure rate are all downstream of CI behavior. When CI gets slower or less trustworthy, those numbers drift almost immediately. Watching them gives you an external signal that your internal pipeline is starting to struggle.
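
As a sketch of how these numbers might be pulled out of raw data, here's one way to compute a few of them from per-run records. The PipelineRun fields are hypothetical; map them onto whatever your CI system actually exports:

  # Hypothetical sketch: derive a few CI health metrics from raw
  # pipeline-run records. The record fields are made up for
  # illustration; real CI systems expose equivalents via their APIs.
  from dataclasses import dataclass
  from statistics import median, quantiles

  @dataclass
  class PipelineRun:
      commit: str          # commit the run was triggered for
      passed: bool         # did the run go green?
      duration_min: float  # execution time in minutes

  def ci_metrics(runs: list[PipelineRun]) -> dict:
      by_commit: dict[str, list[PipelineRun]] = {}
      for r in runs:
          by_commit.setdefault(r.commit, []).append(r)

      # A commit that failed at least once but also went green on the
      # same code is counted as a likely flake.
      likely_flaky = sum(
          1 for rs in by_commit.values()
          if any(not r.passed for r in rs) and any(r.passed for r in rs)
      )
      reruns = sum(len(rs) - 1 for rs in by_commit.values())
      durations = [r.duration_min for r in runs]

      return {
          "likely_flaky_commit_rate": likely_flaky / len(by_commit),
          "avg_runs_per_commit": len(runs) / len(by_commit),
          "rerun_rate": reruns / len(runs),
          "median_duration_min": median(durations),
          "p75_duration_min": quantiles(durations, n=4)[2],
      }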

Flaky tests need an owner and a policy. Not “we’ll get to it someday,” but “this is a bug.” Some teams quarantine new tests until they’ve proven stable. Others run nightly stability jobs that hammer the suite repeatedly just to surface non-determinism early. The details vary, but the principle is the same: instability is not background noise, it’s technical debt with interest.
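
As a deliberately simple illustration of that kind of stability job, the sketch below reruns one test command many times overnight and flags it if it behaves non-deterministically. The command and iteration count are placeholders, not a real project's setup:

  # Minimal nightly stability job sketch: rerun the same test command
  # repeatedly and flag it as flaky if it both passes and fails.
  # TEST_CMD and ITERATIONS are placeholders.
  import subprocess

  TEST_CMD = ["pytest", "tests/integration/test_sync.py"]  # hypothetical target
  ITERATIONS = 50

  def stability_check(cmd: list[str], iterations: int) -> None:
      failures = sum(
          subprocess.run(cmd, capture_output=True).returncode != 0
          for _ in range(iterations)
      )
      if failures == 0:
          print("Stable across all runs")
      elif failures == iterations:
          print("Consistently failing: a real regression, not flakiness")
      else:
          print(f"FLAKY: failed {failures}/{iterations} runs; quarantine it and file a bug")

  if __name__ == "__main__":
      stability_check(TEST_CMD, ITERATIONS)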

Runtime needs the same discipline. Integration CI should stay focused on fast signals about whether the system still works together. Heavy data processing, exhaustive correctness checks, and long-running validations belong in separate pipelines with different expectations. If everything is critical, nothing is fast.

The goal is a pipeline that is fast, trusted, and boring enough that developers barely think about it — not one that tries to prove everything.

That’s when CI is doing its job.

✦ ✦ ✦