Notes on making experiments trustworthy
Field notes from the unglamorous half of experimentation: where wins turn out to be noise, metrics quietly disagree, and rigor is the difference between shipping on signal and shipping on luck.
Why your A/B test winner didn't hold up
You shipped a winning variant and the lift vanished. The four reasons winners fade after launch — peeking, regression to the mean, novelty effects, and multiple comparisons — and how to stop shipping on noise.
Read the essay →When your sequential test says "keep running" at Z = 3.997
A +36% lift and a 3.997 Z-score, and the boundary still says keep running. Why that's the O'Brien-Fleming design working as paid for, and the discipline of not flinching at an interim peek.
Read the essay →Your A/B test revenue metric was inflating 59x. The bug was one line.
Every variant showed one purchaser and zero revenue, except one cell. The culprit was a single ifNull in the revenue CTE, and it survived every aggregate sanity check.
Read the essay →When churn and charges both rise in the same A/B test
A retention feature lifted engagement and both ends of the funnel at once. Why polarization hides in the net, and the action-vs-outcome diagnostic that catches it.
Read the essay →More essays coming, on guardrail metrics, attribution drift across channels, and measuring launches you can't A/B test.