A/B testing is a popular way to continually improve a website's conversion rate through experiments. Often, seemingly insignificant changes have a big impact on the results. In a website A/B test, the tester creates a variant of a page. When a visitor arrives, the website randomly picks one of the two versions to show. This gives us 2 groups of visitors: those who see the original page (group A) and those who see the variant page (group B). After a period of time, if the conversion rate of group B is higher, we claim that the variant is the better version of the page.
It all seems to make perfect sense. However, the biggest concern is hidden right in that last sentence: how long is the “period of time”? 2 days? 2 weeks? Or are 2 months enough? Some blogs answer in other terms: 1,000 visitors, 10K visitors, or 100 conversions. Unfortunately, none of them gives a scientific justification. Other blogs promote premium products that decide when you can stop the experiment. But why should you pay a premium fee to answer such a simple question?
There is really just one thing you need to do: calculate the required sample size and wait until it is reached. So the question is how to calculate it.
Calculate the required number of samples
An A/B test is essentially a hypothesis test. Each visitor either converts or does not, so the number of conversions in each group follows a binomial distribution. That means there is a formula for the required sample size. You don’t really have to apply the formula yourself. Just go to the website here:
- Enter the original conversion rate in (1) “Baseline conversion rate” (say 8%).
- Enter the expected effect of the change in (3) “Minimum Detectable Effect”. For example, if you think the variant could bring a 10% increase, enter 10%. This seems tricky at first, because most of the time you do not know how large the increase will be, so you can just go with your hunch. Some companies consider any change smaller than 10% not worth acting on; in that case, just enter 10%. Select “Relative” in (3), because we assume the rate increases by 10% of its current value (from 8% to 8.8%), not by 10 percentage points. We will come back to this later.
- Leave “Statistical power and significance level” (4) at the default values for now. These defaults are the conventional thresholds at which an experiment result is considered acceptable.
You will then see that the calculated sample size is (5) 18,296 per variation, which means you should wait until each variant has received that many visitors.
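If you are curious what such a calculator does behind the scenes, here is a minimal Python sketch of one common normal-approximation formula for comparing two proportions. The function name `required_sample_size`, the 5% significance level, and the 80% power are assumptions for illustration. Whether the website uses exactly this formula is also an assumption, so small differences from its output are possible.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_sample_size(baseline, relative_mde, alpha=0.05, power=0.8):
    """Visitors needed per variation for a two-sided test of two proportions.

    alpha=0.05 and power=0.8 are assumed defaults, not values from the article.
    """
    delta = baseline * relative_mde   # absolute difference we want to detect
    variant = baseline + delta        # conversion rate we hope the variant has
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Spread of the difference if the change has no real effect ...
    sd_null = sqrt(2 * baseline * (1 - baseline))
    # ... and if the variant really converts at the higher rate.
    sd_alt = sqrt(baseline * (1 - baseline) + variant * (1 - variant))
    return ceil(((z_alpha * sd_null + z_beta * sd_alt) / delta) ** 2)


n = required_sample_size(baseline=0.08, relative_mde=0.10)
print(f"Visitors needed per variation: {n:,}")
```

With the 8% baseline and the 10% relative effect from the example, this should land at or very close to the 18,296 quoted above.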
Coming back to the second step: what if the 10% guess is wrong? For example, after 2 days the current results suggest that the variant could increase conversions by 30%! Or it looks like it can only win by 5%. In this case, you may adjust your expectation to these numbers and calculate again. This is not very scientific, but in practice it helps you adjust your goals quickly.
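To get a feel for how much the guess matters, the sketch below recalculates the sample size for relative effects of 5%, 10%, and 30% on an 8% baseline. It uses a common normal-approximation formula. The helper name `visitors_per_variation`, the 5% significance level, and the 80% power are assumptions, so the numbers may differ slightly from what an online calculator reports.

```python
from math import ceil, sqrt
from statistics import NormalDist


def visitors_per_variation(baseline, relative_mde, alpha=0.05, power=0.8):
    # Normal approximation for two proportions; alpha and power are assumed.
    delta = baseline * relative_mde
    variant = baseline + delta
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    sd_null = sqrt(2 * baseline * (1 - baseline))
    sd_alt = sqrt(baseline * (1 - baseline) + variant * (1 - variant))
    return ceil(((z_alpha * sd_null + z_beta * sd_alt) / delta) ** 2)


# Recalculate for the three guesses discussed in the text: 5%, 10%, 30%.
sizes = {mde: visitors_per_variation(0.08, mde) for mde in (0.05, 0.10, 0.30)}
for mde, n in sizes.items():
    print(f"relative effect {mde:.0%}: {n:,} visitors per variation")
```

The required sample size grows roughly with the inverse square of the effect: halving the expected effect roughly quadruples the number of visitors you need.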
What if I have some existing data?
Sometimes you already have some data, e.g. as below:
How should you interpret this result? Does it mean group2 is better than group1? Or is the difference due to random factors (just chance)?
You should first check whether the difference is statistically significant. Since the data are counts (frequencies) rather than quantitative values (e.g. the heights of 2 groups of students), a Chi-Squared test is used instead of a t-test. Again, the calculation is simple. Go to the website here:
Enter your data in the blank fields, and it will show the result.
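If you prefer to run the test yourself, here is a minimal sketch of a Pearson Chi-Squared test on a 2x2 table, using only the Python standard library. The function name `chi_squared_2x2` and the counts in the example are hypothetical, not the data from this post. The sketch applies no continuity correction, which is an assumption; an online tool may make a different choice and report a slightly different p-value.

```python
from math import erfc, sqrt


def chi_squared_2x2(visitors_a, conversions_a, visitors_b, conversions_b):
    """Pearson chi-squared test (no continuity correction) on a 2x2 table."""
    observed = [
        [conversions_a, visitors_a - conversions_a],
        [conversions_b, visitors_b - conversions_b],
    ]
    total = visitors_a + visitors_b
    row_totals = [visitors_a, visitors_b]
    col_totals = [conversions_a + conversions_b,
                  total - conversions_a - conversions_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count we would expect if both groups converted at the same rate.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the p-value equals erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, NOT the data from this post: 200 visitors per group.
chi2, p_value = chi_squared_2x2(200, 20, 200, 35)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
print("significant at 5%" if p_value < 0.05 else "not significant at 5%")
```

A p-value below 0.05 is the usual threshold for calling the difference statistically significant.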
However, don’t stop here! Being statistically significant doesn’t mean you have enough samples to support the conclusion. Go back to the first page and calculate the required sample size again!
Now we see that, to validate this result, we need 390 samples for each variation instead of 200. So don’t stop the experiment yet!
Before starting your A/B test, make sure you have calculated the required sample size, and stick to it! You don’t need a premium tool for that. A free online tool will give you what you need.