
Why your A/B test winner didn't hold up

5 min read | By David Arzumanian

You ran the test. The variant won. You shipped it. A month later the metric is flat, or worse, and nobody can explain where the lift went. This is one of the most common and most expensive failures in product experimentation, and it almost always comes down to one or more of four culprits.

Here is how to tell which one bit you, and how to stop it happening again.

1. You stopped too early (peeking)

The most common cause is peeking: watching the test daily and stopping the moment the p-value dips below 0.05. It feels responsible. It is statistically disastrous.

A fixed-horizon test is only valid if you decide the sample size up front and look once, at the end. Every interim look you act on is another roll of the dice, and stopping on the first significant result can inflate your false-positive rate from the nominal 5% to somewhere between 20% and 30% after a few weeks of daily checks. The "winner" you shipped was often just a lucky high point in the noise, and it regressed as soon as more data arrived. In this case the extra data was production traffic.
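To see the inflation for yourself, here is a minimal simulation sketch. It assumes an A/A test (no true difference between arms), a normally distributed per-user metric with unit variance, and a naive two-sided z-test at every look. The function name and its parameters are illustrative, not taken from any library.

```python
import math
import random
from statistics import NormalDist

def peeking_false_positive_rate(n_looks, n_sims=2000, batch=100, alpha=0.05, seed=0):
    """Share of A/A tests declared 'significant' when you stop at the first p < alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    false_positives = 0
    for _ in range(n_sims):
        diff_sum = 0.0  # running sum of (variant - control) over paired users
        n = 0           # users per arm seen so far
        for _ in range(n_looks):
            # Sum of `batch` differences of two unit-variance metrics ~ N(0, 2 * batch)
            diff_sum += rng.gauss(0.0, math.sqrt(2 * batch))
            n += batch
            z = diff_sum / math.sqrt(2 * n)
            if abs(z) > z_crit:  # the "peek and ship" rule
                false_positives += 1
                break
    return false_positives / n_sims

print("one look at the end:", peeking_false_positive_rate(n_looks=1))
print("twenty daily looks: ", peeking_false_positive_rate(n_looks=20))
```

Under these assumptions the single-look rate should sit near the nominal 5%, while twenty looks typically pushes it several times higher. Nothing about the variant changed between the two runs; only the stopping rule did.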

2. Regression to the mean

Closely related: extreme early results pull back toward the truth as the sample grows. If you greenlight on a small sample showing a huge lift, the most likely next step is that the lift shrinks. The bigger the early effect is relative to your typical effects, the more suspicious you should be, not less.
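A small simulation makes the shrinkage concrete. This is a sketch under stated assumptions: every test has the same small true lift, the estimate from a small sample is normally distributed around it, and you only ship results that clear z > 1.96. The numbers (a 1-point true lift, a 2-point standard error) are made up for illustration.

```python
import random
from statistics import mean

def winners_curse(true_lift=0.01, se=0.02, n_tests=20000, z_crit=1.96, seed=1):
    """Compare the average observed lift across all tests vs. only the 'winners'."""
    rng = random.Random(seed)
    # Each small test reports the true lift plus sampling noise
    observed = [rng.gauss(true_lift, se) for _ in range(n_tests)]
    # Keep only the tests that looked significant and would have been shipped
    winners = [x for x in observed if x / se > z_crit]
    return mean(observed), mean(winners)

avg_all, avg_winners = winners_curse()
print(f"average observed lift, all tests:    {avg_all:.4f}")
print(f"average observed lift, shipped only: {avg_winners:.4f}")
```

The shipped group has to average above the significance threshold (about 0.039 here), so its observed lift is several times the true 0.01. When those changes meet more data, the only direction the estimate can go is down.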

3. Novelty and primacy effects

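Some lifts are real but temporary. A new banner gets clicks because it is new; returning users poke at the change and the metric spikes for a week, then reverts once the novelty wears off. Primacy is the mirror image: users who are used to the old experience can respond worse to a change at first, so a short test understates a lift that would have appeared later. If your test ran for a few days but your users have a weekly or monthly cycle, you measured the transition, not the steady state.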
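One rough way to spot this is to compare the lift early in the test window with the lift later on. The sketch below assumes you already have one lift estimate per day; the `daily_lift` values are invented for illustration, and a real check would also put confidence intervals around each average, since daily estimates are noisy.

```python
from statistics import mean

def novelty_check(daily_lift, early_days=7):
    """Return (early average, late average, fraction of the early lift retained)."""
    early = mean(daily_lift[:early_days])
    late = mean(daily_lift[early_days:])
    retained = late / early if early else float("nan")
    return early, late, retained

# Hypothetical 14-day test: relative lift of variant over control, one value per day
daily_lift = [0.080, 0.072, 0.065, 0.051, 0.044, 0.038, 0.030,
              0.021, 0.018, 0.012, 0.015, 0.009, 0.011, 0.010]

early, late, retained = novelty_check(daily_lift)
print(f"week 1 lift: {early:.3f}, week 2 lift: {late:.3f}, retained: {retained:.0%}")
```

If most of the lift is gone by the second week, the headline number from the full window overstates what you will see in steady state. Splitting the same comparison by new versus returning users is a common follow-up, since novelty mostly affects people who knew the old version.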

4. Multiple comparisons

Finally, if you declared victory on whichever of ten metrics or five segments crossed the line, you did not run one test. You ran one for every metric and segment you looked at. With no correction, the chance that at least one shows a false "win" is far higher than 5%: about 40% for ten independent metrics. The lift you shipped on was simply the metric that got lucky.
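The arithmetic is short enough to check directly. The sketch below computes the family-wise error rate under the simplifying assumption that the comparisons are independent, then applies a Holm step-down correction as one standard remedy. Holm is a choice made for this example, not the only option, and the p-values are made up.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

def holm_reject(p_values, alpha=0.05):
    """Holm step-down correction; returns one reject/keep flag per p-value, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p-value first
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1)
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one fails, all larger p-values fail too
    return reject

print(f"FWER for 10 uncorrected metrics: {family_wise_error_rate(10):.1%}")

# Hypothetical p-values for four metrics read off the same test
p_values = [0.04, 0.001, 0.03, 0.2]
print(holm_reject(p_values))
```

In the example, three of the four metrics are below 0.05 on their own, but only the one at 0.001 survives the correction. That is the metric you can defend; the other two are the kind that "got lucky".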

How to tell which one it was

Quick triage:

Did you stop early or look repeatedly? Suspect peeking.
Was the early sample small and the effect huge? Regression to the mean.
Did the effect decay across the test window? Novelty.
Did you pick the best of many metrics or segments? Multiple comparisons.

Often it is more than one at once.

The fix: decide the rules before you start

None of this means slow down. It means set the rules before the test runs, not after you see the result:

Pre-register the primary metric, the minimum effect worth shipping, and the sample size. One primary metric, chosen up front.
Use sequential testing if you want to stop early. Proper sequential boundaries (for example O'Brien-Fleming, calibrated with Monte Carlo on your real metric distributions) let you check the test at planned interim looks and stop as soon as the evidence is genuinely there, without inflating false positives. This is the honest version of "stop early."
Correct for multiple comparisons whenever you read more than one metric or segment.
Add guardrail metrics and a holdback so you can confirm the lift persists in production.
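Two of these rules can be sketched in a few lines. The first function approximates the per-arm sample size for a two-sided test on a conversion rate using the usual normal approximation. The second calibrates O'Brien-Fleming-shaped boundaries by Monte Carlo for equally spaced looks. Both are illustrations under stated assumptions: the function names are hypothetical, and the calibration simulates an idealized normal z-statistic rather than your real metric distributions, which is where a production version would differ.

```python
import math
import random
from statistics import NormalDist

def sample_size_per_arm(baseline_rate, min_lift_abs, alpha=0.05, power=0.8):
    """Approximate users per arm to detect an absolute lift in a conversion rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = baseline_rate, baseline_rate + min_lift_abs
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / min_lift_abs ** 2)

def obrien_fleming_bounds(n_looks=5, alpha=0.05, n_sims=20000, seed=0):
    """Monte Carlo z-boundaries of the form c * sqrt(n_looks / k) for look k."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sims):
        s, worst = 0.0, 0.0
        for k in range(1, n_looks + 1):
            s += rng.gauss(0.0, 1.0)  # one more equal-sized batch of data under the null
            z = s / math.sqrt(k)      # z-statistic at look k
            # Rescale so a single constant c bounds every look
            worst = max(worst, abs(z) * math.sqrt(k / n_looks))
        maxima.append(worst)
    maxima.sort()
    # c is the (1 - alpha) quantile of the rescaled maximum across looks
    c = maxima[min(n_sims, math.ceil((1 - alpha) * n_sims)) - 1]
    return [c * math.sqrt(n_looks / k) for k in range(1, n_looks + 1)]

print("users per arm:", sample_size_per_arm(baseline_rate=0.10, min_lift_abs=0.01))
print("z thresholds by look:", [round(b, 2) for b in obrien_fleming_bounds()])
```

The shape of the thresholds is the point: very demanding at the first look, then relaxing toward a final threshold only slightly above the usual 1.96. You can stop early on overwhelming evidence, but a modest early blip no longer counts as a win.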

That is the whole difference between optimizing toward signal and optimizing toward noise.

The uncomfortable part: if your pipeline has any of these gaps, you have almost certainly shipped at least one "win" that did nothing, and killed at least one change that would have worked. The good news is that it is fixable, and fast.

Want to know which of your wins are real?

Take the free 90-second pipeline diagnostic, or book a call and we will look at a recent test together.

