In this post, I intend to prove what some consider obvious, and what others consider preposterous: Pursuing statistical significance in a business test is the wrong approach (except in cases with a luxurious surplus of time, data, and resources).
I also explain how statistical significance, confidence, and progress are best achieved with high-quality experiments that provide variance and not by waiting for more data to come in.
While there are other critical factors in business experimentation, like rooting out biased user behaviors or heightening the quality of new ideas, correcting this common misuse of statistical significance will expedite what's really desired: real improvement.
We're running an A/B test with the objective of improving a key performance metric in our business, such as customer engagement or purchase rates:
'Variation A' versus 'Variation B'
So far, Variation A is performing better, with an 85% significance level. Results are not yet statistically significant, and not even close by convention, which calls for a 95–99% significance level.
What should we do?
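For readers who want to see where a figure like "85% significance" might come from, here is a minimal sketch. It assumes the significance level is one minus the one-sided p-value of a pooled two-proportion z-test; testing tools differ in how they compute it, and the visitor and conversion counts below are hypothetical, not data from this experiment.

```python
from statistics import NormalDist


def significance_level(conv_a, n_a, conv_b, n_b):
    """One-sided confidence that A's true rate exceeds B's.

    Assumption: pooled two-proportion z-test; the result is
    1 minus the one-sided p-value.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_a - p_b) / std_err
    return NormalDist().cdf(z)


# Hypothetical counts: 1,000 visitors per variation.
level = significance_level(conv_a=60, n_a=1000, conv_b=48, n_b=1000)
print(f"Significance level for A over B: {level:.1%}")
```

With counts like these, the result lands in the high 80s: suggestive, but short of the conventional 95% threshold, which is the situation described above.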
A Simplified 'Game Theory' Analysis
Laying out our options in the form of sequential decisions and effects
Option 1 — We extend the experiment in pursuit of statistical significance.
We don't know what will happen next, but here are the possibilities:
Case 1a — Significance level increases. If significance level increases, we'll choose Variation A as our 'winner' with the benefit of confidence. But that confidence comes at a cost of time and budget, which could otherwise be used to test for next improvements.
➥ (1 benefit, 1 cost)
Case 1b — Significance level stays the same. In this case, results have hardly changed, meaning nothing new is achieved, and we are back at the same juncture as before, while time and budget have been wasted.
➥ (0 benefits, 1 cost)
Case 1c — Significance level decreases. In this case, results have leveled out, and we cannot determine which variation is better. Our options will be to extend the experiment further, or just choose one variation (or none) and move on. We have lost time and resources. (Hopefully we don't stare at results for a month before moving on.)
➥ (0 benefits, 1 cost)
Case 1d — Significance level swings wildly to show confidence in Variation B. This case is unlikely, and if it ever materialized, it would take a very long time, given the amount of data piled up so far in favor of Variation A. It's the most trying and costly case, forcing us to repeatedly extend the experiment as results oscillate across the full spectrum of significance, without any assurance that statistical significance will ever be achieved. It could take several months before significance is achieved or the experiment is finally abandoned.
➥ (1 benefit, 1 big cost)
Perhaps the most important thing to consider with Case 1d is that as time goes on, it becomes increasingly likely that any 'winner' will provide a diminished performance improvement over other variations in the experiment, thus diminishing potential benefit to the business, as costs increase.
All four case outcomes (1a, 1b, 1c, and 1d) have major downsides. Option 1 is not attractive.
Option 2 — We conclude the experiment, declare Variation A the winner, and move on to new efforts.
Benefit #1 — We save time and budget and move on to a new experiment, or put resources elsewhere.
Benefit #2 — We are very likely to enjoy improved results by choosing Variation A.
Drawback #1 — We lack certainty that Variation A will always outperform Variation B. It is possible, but very unlikely, that results will actually suffer due to this choice.
➥ (2 benefits, 1 cost)
Option 2 is all positive, except for a very small risk. That risk, though, is mitigated by the opportunity to continue making improvements.
Option 3 — We conclude the experiment, declare Variation B the winner, and move on to new efforts.
Benefit #1 — Like with Option 2, we save time and resources.
Drawback #1 — This is a nonsensical choice, given the information we have. Results may not be significant, but you could say it's highly unlikely (< 5% chance) that Variation B is significantly better than Variation A. Future results will probably suffer due to this choice.
➥ (1 benefit, 1 big cost)
Option 3 doesn't make sense.
Option 2 provides us the most value. It is a mild gamble in itself, but it provides the best opportunity to help the business moving forward.
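The benefit and cost tallies listed under each option can be put side by side in a few lines. This is only a sketch of the bookkeeping: the counts come from the tallies above, while the weights (1 for an ordinary benefit or cost, 2 for a "big" cost) are an assumption made purely for illustration.

```python
# Assumed weight (not from the analysis itself): a "big" cost counts double.
BIG = 2

# (benefits, costs) for each outcome, taken from the tallies above.
outcomes = {
    "1a: extend, significance increases": (1, 1),
    "1b: extend, significance unchanged": (0, 1),
    "1c: extend, significance decreases": (0, 1),
    "1d: extend, swings to Variation B": (1, BIG),
    "2: declare A the winner, move on": (2, 1),
    "3: declare B the winner, move on": (1, BIG),
}


def net_score(benefits, costs):
    """Net value of an outcome: benefits minus weighted costs."""
    return benefits - costs


scores = {name: net_score(b, c) for name, (b, c) in outcomes.items()}
best = max(scores, key=scores.get)

# Print outcomes from best to worst net score.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:+d}  {name}")
print("Best:", best)
```

Under these assumed weights, even the best case of extending the experiment (1a) only breaks even, while Option 2 comes out ahead.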
Even in the event of Case 1a, where we reach statistical significance with Variation A, looking back we would always prefer to have declared Variation A the winner earlier, in order to seize more time for our business and continue furthering progress.
People often proclaim (with misplaced confidence) that it's improper to conclude the experiment early, but they have failed to recognize that Cases 1a, 1b, and 1c are never optimal, and that holding out for Case 1d is a moonshot gamble with paralyzing cost and low reward. Further, to stick with the current experiment is to neglect the potential reward of yet newer, better variations.
Using Common Sense
In some scenarios, all you need is practical thinking conducive to actually making the improvements you're looking for. For example, consider the scenario where you’re comparing the results of two variations and trying to determine confidence in a winner.
Ask this: Are both variations significantly worse than your goal results?
If yes, then it's irrelevant which one is significantly better or not. Go make Variations C & D.
Additional Key Suggestion: In order to best harness the cost/benefit effects discussed, set your initial experiment objectives to make variations dramatically different, so you're more likely to find a real winner, faster.
The classic conversion optimization example is to isolate factors, like the color of a button, and see if one performs significantly better. That is, of course, a royal waste of time and resources. Even isolating more interesting factors can still often be a waste of time, and your optimization efforts are less likely to bear fruit without more inspiration for significant improvement. This applies to small or big businesses with small or big data sets. Do some real work and create something new and different.
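To illustrate why dramatically different variations reach a conclusion faster, here is a rough sketch using the standard normal-approximation formula for the sample size of a two-proportion test. The 5% baseline, the two alternative conversion rates, and the 95% confidence and 80% power defaults are all hypothetical choices for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(p_a, p_b, confidence=0.95, power=0.80):
    """Approximate visitors needed per variant to tell p_a from p_b.

    Normal-approximation formula for a two-sided two-proportion test.
    """
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2)


baseline = 0.05  # hypothetical conversion rate of the current design

# A subtle tweak (5.0% -> 5.5%) versus a dramatic change (5% -> 10%).
subtle = sample_size_per_variant(baseline, 0.055)
dramatic = sample_size_per_variant(baseline, 0.10)

print(f"Subtle change:   {subtle:,} visitors per variant")
print(f"Dramatic change: {dramatic:,} visitors per variant")
```

Because the required sample size scales with the inverse square of the difference between the two rates, the subtle tweak needs tens of thousands of visitors per variant, while the dramatic change needs only a few hundred.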
You could also use backward induction to assume hypothetical truths about your variants and user behavior, and work backwards to trace optimal decisions, and reach the same conclusions. This analysis was removed to avoid redundancy.
Appendix I: Short & Sweet Summary
By shortening experiment duration, you're effectively cutting your business's time and cost of learning by 50-90%, while sacrificing no more than 10% statistical confidence, which in the end is a bit superfluous, anyway.
We all agree confidence is important, but we best acquire it by prioritizing quality and variance before the experiment starts, and not by paralyzing the business to await statistical significance.
Appendix II: Optimization & Experimentation Tips
- Seek, firstly, to improve the user's experience. (i) Brainstorm transformative improvements; and (ii) Don't gloss over little detractions, which can have profound impact on bottom line results.
- Seek, secondly, to maximize the variance between variations. If successful, you'll need less data to reach a conclusion, and reward potential is greater, should your experiment mark an improvement.
- Be careful making experiments that isolate factors or seek to learn tidbits about user preferences. Learnt tidbits might not carry over successfully in the future anyway, and these optimizations aren't nearly as impactful as when you're inspired to create something game-changing. Experimenting with isolated factors is often a sign of laziness: a failure to apply good decision-making criteria that need no experiment (with questions such as, "which options provide the most ease or information or transparency for users"). It also subjects users to inferior experiences while the experiment is running. If experimenting with isolated factors, make sure you expect the change to cause a drastic divergence in performance.
- If seeking growth, an 85% significance level is a good place to conclude an experiment. It represents at least a 70% chance that you've got a real winner, and < 5% chance you're making a mistake. Don't feel obligated to extend an experiment beyond 20-50 conversions per variant, though, regardless of confidence. Lastly—and this should be obvious—halt any test where results measure significantly worse than your short-term goal results.
- If you're strapped for resources and creating new, high-variance, high-quality variations is too tall an order, then get creative and test small, easy changes that actually will create big variance (like pricing).
Editor's Note: Throughout the 2010s, even scientists increasingly sought to abandon the use of 'statistical significance' in scientific settings. The latest push, in 2019, was a call published in Nature and signed by over 800 statisticians and scientists, which appeared alongside a special issue of the American Statistical Association's journal on the same theme. The main objection is that it's an arbitrary convention, generally bastardized for ulterior motives or brownie points. It only represents a momentary state or cross section of data, which is partly why I (someone who was on the Math Team in high school, went to a top engineering university, and loves building object oriented databases) generally choose not to overemphasize precise measurements and calculations. In my opinion, to get hung up on exact measurements in business is to not really get it. Everyone can and should measure, but calibrating work based on brownie-points-level specificity often precludes the critical, practical progress you need.