
When your sequential test says "keep running" at Z = 3.997

By David Arzumanian

You ran a test. Z-score is 3.997. The lift is +36%. The fixed-sample p-value would have you popping champagne. Your sequential boundary says: keep running.

This is not the framework being cautious. It is doing exactly what you paid it to do.

The mental model

When you set up a sequential test with O'Brien-Fleming boundaries, you are not running "a test with peeks." You are committing to spend your false-positive budget across the entire planned duration. Early on, the boundary is deliberately brutal, with Z thresholds north of 4, because every peek is a chance to fool yourself. The budget gets cheaper to spend as you accumulate information. By the planned end, the boundary relaxes to something close to the familiar 1.96.

So Z = 3.997 at 30% of your target sample is telling you something precise: this result, this early, is not strong enough to rule out a lucky peek. A +36% lift sounds enormous. Early-stage variance is enormous too. The smaller the sample, the easier it is for a real-but-small effect, or pure noise, to throw a giant relative number.
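To make the shape of that budget concrete, here is a minimal sketch of one common O'Brien-Fleming-type rule, the Lan-DeMets spending function. It assumes a two-sided alpha of 0.05 and a first look at information fraction t. The function names are illustrative, not taken from any platform, and exact thresholds in a real tool depend on the chosen alpha and on how many looks came before, so the numbers it prints need not match a given platform's boundary.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def obf_alpha_spent(t: float, alpha: float = 0.05) -> float:
    """Cumulative two-sided alpha spent by information fraction t (0 < t <= 1),
    using the Lan-DeMets O'Brien-Fleming-type spending function."""
    if not 0.0 < t <= 1.0:
        raise ValueError("t must be in (0, 1]")
    z_final = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(z_final / sqrt(t)))


def first_look_boundary(t: float, alpha: float = 0.05) -> float:
    """Z threshold if the very FIRST look happens at information fraction t.

    At a first look, spending obf_alpha_spent(t) two-sided is the same as
    requiring |Z| > z_final / sqrt(t). Later looks need numerical integration
    over the earlier ones, which this sketch deliberately does not attempt.
    """
    if not 0.0 < t <= 1.0:
        raise ValueError("t must be in (0, 1]")
    z_final = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return z_final / sqrt(t)


if __name__ == "__main__":
    for t in (0.1, 0.3, 0.5, 1.0):
        print(f"t={t:.1f}  alpha spent={obf_alpha_spent(t):.6f}  "
              f"first-look Z boundary={first_look_boundary(t):.3f}")
```

The thing to notice is the direction, not the decimals: almost none of the budget is available early, and the threshold falls toward the familiar 1.96 only as t approaches 1.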

Three things that trip people up

A fixed-sample p-value at an interim peek is not a valid stopping rule. Stop the moment p < 0.05 on any peek and, with roughly ten or more peeks over the life of the test, your real false-positive rate is closer to 20 to 30%, not 5%.
"Almost crossed" is not "crossed." Z = 3.997 against a boundary of 4.10 is a miss. The point of pre-committed boundaries is that you do not get to relitigate them when you do not like the answer.
The boundary changing over time is a feature. It encodes that early evidence has to clear a higher bar than late evidence. Treating it as a static threshold misses the entire design.
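The first of those three points is easy to check for yourself. The sketch below simulates A/A tests (no true effect) and stops at the first peek where |Z| exceeds 1.96. It assumes equally spaced peeks and a normal approximation for the cumulative Z. The function name and defaults are illustrative choices, not part of any tool.

```python
import random
from math import sqrt


def peeking_false_positive_rate(n_looks: int, n_sims: int = 4000,
                                z_crit: float = 1.96, seed: int = 0) -> float:
    """Share of A/A experiments (no true effect) that cross |Z| > z_crit
    at any of n_looks equally spaced peeks."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            # Each new block of data adds an independent N(0, 1) increment;
            # the cumulative Z at this look is the sum rescaled by sqrt(look).
            running_sum += rng.gauss(0.0, 1.0)
            if abs(running_sum / sqrt(look)) > z_crit:
                false_positives += 1
                break  # the experimenter stops at the first "significant" peek
    return false_positives / n_sims


if __name__ == "__main__":
    for looks in (1, 5, 10, 20):
        rate = peeking_false_positive_rate(looks)
        print(f"{looks:>2} peeks: false-positive rate ~ {rate:.3f}")
```

With a single look the rate sits near the nominal 5%. It climbs steadily as peeks are added, which is the inflation a pre-committed boundary is designed to absorb.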

The right call at 3.997

Keep running. Trust the design you committed to before you saw any data. If the effect is real, cumulative Z will keep rising and you will cross cleanly. If it was early-sample noise dressed as a win, it will regress, and you will be glad you did not ship it.
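Both outcomes can be sketched with the standard Brownian-motion approximation for a cumulative Z. It assumes that Z(t) times sqrt(t) has independent increments, with a drift equal to the Z you would expect at the planned end. The function and parameter names here are hypothetical, chosen only for illustration.

```python
from math import sqrt


def expected_future_z(z_now: float, t_now: float, t_future: float,
                      drift: float) -> float:
    """Expected Z at information fraction t_future, given Z = z_now at t_now.

    drift is the expected Z at t = 1 under the assumed true effect
    (0.0 means no effect at all).
    """
    if not 0.0 < t_now <= t_future <= 1.0:
        raise ValueError("need 0 < t_now <= t_future <= 1")
    b_now = z_now * sqrt(t_now)                # score-scale statistic so far
    b_future = b_now + drift * (t_future - t_now)  # data to come adds only drift on average
    return b_future / sqrt(t_future)


if __name__ == "__main__":
    z, t = 3.997, 0.30
    no_effect = expected_future_z(z, t, 1.0, drift=0.0)
    # Assumption: the true effect equals the currently observed one.
    as_observed = expected_future_z(z, t, 1.0, drift=z / sqrt(t))
    print(f"no true effect:       expected final Z ~ {no_effect:.2f}")
    print(f"effect as observed:   expected final Z ~ {as_observed:.2f}")
```

With the figures from this example, the no-effect case drifts down toward roughly 2.19 by the planned end, while an effect as large as the one currently observed climbs toward roughly 7.30. These are averages under an idealized model, not predictions for any single test.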

The discipline is not in the math. It is in not flinching when a 3.997 stares back at you.

This is also the gap most teams do not know they have. Optimizely, Statsig, Eppo, and GrowthBook will all send a significance alert the moment a threshold trips, and depending on how the test is configured, that threshold may be a fixed-sample one. Very few give you a real spending function, futility bounds, and an honest ETA-to-decision. The "stop early, safely" capability is exactly the layer the default tool leaves out, and it is usually where a third to forty percent of a test's runtime is hiding.

Are your tests being stopped at the right moment?

Take the free 90-second pipeline diagnostic, or book a call and we will pressure-test how your team decides to stop.
