Running a basic A/B test is easy, but you know what else is easy? – Misinterpreting the results.
In this post, we are covering the common statistics terminology and models, and taking a closer look at the different methods of calculating A/B testing related metrics.
After going through everything, you will have a good understanding of what your testing tool does and how to calculate the metrics yourself.
Three main topics of this post:
- Frequentist statistics
- Bayesian statistics
- Revenue and Other Non-Binomial Metrics
Frequentist vs Bayesian
No, we are not going to debate which one you should prefer. There are plenty of articles and threads on the pros and cons of each. Long story short – there is no right and wrong, go with whatever your A/B testing tool of choice is using and get to know how it works.
In the broadest sense, A/B testing is detecting which variation of your website has the highest probability of performing the best based on set criteria (actually it’s a bit more complicated but more on that later in the post). We can distinguish four interpretations of probability:
- Long-term frequencies
- Physical tendencies/propensities
- Degrees of belief
- Degrees of logical support
Frequentist inference is based on the first definition. Bayesian inference, on the other hand, is rooted in definitions 3 and 4.
Therefore, based on the frequentist definition of probability, only repeatable random events (like the flipping of a coin) have probabilities. Furthermore, these probabilities are equal to the long-term frequency of occurrence of the events in question. It is important to understand that Frequentists don’t attach probabilities to hypotheses or to any fixed but unknown values in general. Ignoring this fact is what often leads to misinterpretations of frequentist analyses.
On the other hand, the Bayesian approach views probability as a more general concept. Following the Bayesian technique, you can use probabilities to represent the uncertainty in any event or hypothesis. Hence, it’s perfectly acceptable to assign probabilities to non-repeatable events, like the result of your new product launch campaign. However, many frequentists would say that such probabilities don’t actually make sense because the event is not repeatable. You can’t run your product launch campaign an infinite number of times.
Frequentist approach
Tools using frequentist-type statistics
- Optimizely (leveraging some Bayesian wisdom)
- Convert
You can export data from your testing tool or analytics platform and perform Frequentist tests, such as a Z-test, yourself in tools like Excel or in programming languages like Python and R.
To interpret the results from a tool that is using Frequentist statistics, there are a few concepts that we need to understand.
Null hypothesis
The default position that the conversion rate for the control is equal to the conversion rate for a variation, i.e. that there is no real difference between them. Usually written as H0.
Alternative hypothesis
An alternative hypothesis is the statement that is being tested against the null hypothesis. It is often written as H1 or Ha.
In A/B testing, an alternative hypothesis generally claims that a certain variation is performing significantly better than the control.
P-value
One of the most commonly misunderstood concepts of A/B testing and in statistics in general. Generally, people new to A/B testing tend to believe it describes the probability of a variation being a winner or a loser, compared to the control.
That, of course, is not true. The p-value describes the probability of observing a difference between the control and the variation at least as large as the one you observed, assuming that the null hypothesis is true and there is actually no difference between the groups.
Another way to put this is that p-value describes the probability of seeing the observed difference randomly, say, in an A/A test.
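To make the definition concrete, here is a minimal sketch of a two-proportion Z-test written with only the Python standard library. The function name and the visitor and conversion counts are made up for illustration, and the normal approximation it relies on assumes reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one conversion rate."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate: the best estimate of the common rate if H0 is true.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a difference at least this extreme (in either direction) under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 500 of 10,000 control visitors convert, 560 of 10,000 in the variation.
z, p = two_proportion_z_test(500, 10_000, 560, 10_000)
print(f"z = {z:.3f}, p-value = {p:.4f}")
```

The p-value printed here answers only one question: how often would a gap this large show up if the two pages really converted at the same rate?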
The p-value is one of the most important metrics in Frequentist statistics. However, it does not tell you the probability that your null or alternative hypothesis is true or false, nor the probability that your variation is better than the control. It is about rejecting or not rejecting the null hypothesis. Here’s how it goes:
- You come up with some reasonable threshold for rejecting the null hypothesis. The notation used for this threshold is α (the Greek letter alpha). This threshold is a real number between 0 and 1 (usually very close to 0).
- You promise to yourself in advance that you will reject the null hypothesis if the calculated p-value happens to be below α (and not reject it otherwise).
In A/B testing, the most common threshold α is 0.05, although sites with more traffic that want to minimize the risk of falsely rejecting their null hypothesis often pick 0.01. And sites with little traffic that are looking for quick learnings might go with 0.1.
It is good practice to choose a suitable threshold carefully and stick to it until there’s a strong enough reason to change it. That way you know what share of true null hypotheses you should expect to reject falsely in the long run.
I recommend using a statistical significance calculator to see how the chosen threshold affects your needed sample size.
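If you would rather see the relationship in code, the sketch below uses a common textbook approximation for the sample size of a two-sided, two-proportion test. The function name, the 5% baseline conversion rate, the 10% relative lift and the 80% power are all assumptions chosen for illustration; your calculator of choice may use a slightly different formula.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per group for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the threshold
    z_power = NormalDist().inv_cdf(power)          # extra margin needed for the power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical scenario: 5% baseline conversion rate, hoping to detect a 10% relative lift.
for alpha in (0.1, 0.05, 0.01):
    n = sample_size_per_group(0.05, 0.10, alpha=alpha)
    print(f"alpha = {alpha}: about {n} visitors per group")
```

A stricter threshold always costs more traffic, which is why low-traffic sites are tempted by 0.1.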
Type I error
False positive – Rejecting a true null hypothesis.
In A/B testing this would mean that you call your variation a winner when in reality it performs no better than the control.
So, the p-value, along with the prespecified α, directly controls the type I (false positive) error rate.
Rejecting a null hypothesis (calling your variation a winner) is a result that triggers an action, most commonly ending in a change on your website (implementing your variation). Therefore, you must think carefully about what percentage of such false decisions you can live with!
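One way to convince yourself that α really is the false positive rate is to simulate many A/A tests, where both groups share the same true conversion rate, and count how often the test still comes out "significant". The sketch below does that with made-up parameters (a 10% true rate, 500 visitors per group, 1,000 simulated tests); the helper names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a two-proportion Z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no conversions (or only conversions) in both groups
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_false_positive_rate(true_rate=0.1, n=500, runs=1000, alpha=0.05, seed=42):
    """Share of simulated A/A tests in which the null hypothesis is (wrongly) rejected."""
    rng = random.Random(seed)

    def conversions():
        # Each visitor converts independently with the same true rate.
        return sum(rng.random() < true_rate for _ in range(n))

    rejections = sum(
        p_value(conversions(), n, conversions(), n) < alpha for _ in range(runs)
    )
    return rejections / runs


rate = aa_false_positive_rate()
print(f"share of A/A tests declared significant: {rate:.3f}")
```

The share should land close to the chosen α, even though there is never a real difference to find.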
Type II error
False negative – Not rejecting a false null hypothesis.
In A/B testing it usually describes a situation where your variation is a winner but your test shows it is not significantly better than the control.
Statistical power
Simply put, statistical power is the probability that you will reject a null hypothesis if it is false.
The false negative rate will depend on these 3 factors:
- The size of the actual difference between the groups (which, by definition, is nonzero when the null hypothesis is false)
- The variance of the data with which you’re testing the null hypothesis
- The number of data points (your sample size)
In the real world, you only have control over the last factor, so you see why controlling the type II error is much trickier.
Source: An Intuitive Explanation Of P-Values
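The three factors behind statistical power can be put into a rough formula. The sketch below approximates the power of a two-sided two-proportion test; it ignores the tiny chance of rejecting in the wrong direction, and the function name and example rates are assumptions for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_control, p_variation, n_per_group, alpha=0.05):
    """Approximate probability of rejecting H0 when the true rates really differ."""
    # Variance of the data and sample size both enter through the standard error.
    std_err = sqrt(
        p_control * (1 - p_control) / n_per_group
        + p_variation * (1 - p_variation) / n_per_group
    )
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Size of the true difference, measured in standard errors.
    effect = abs(p_variation - p_control) / std_err
    return NormalDist().cdf(effect - z_crit)


# Hypothetical true rates: 5% for the control, 5.5% for the variation.
for n in (5_000, 15_000, 30_000, 60_000):
    print(f"{n} visitors per group -> power of about {approx_power(0.05, 0.055, n):.2f}")
```

Since the true difference and the variance are given by reality, sample size is the only lever in this formula that you can actually pull.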
Confidence intervals
Confidence intervals are the frequentist way of doing parameter estimation. The technical details behind calculating and interpreting confidence intervals are beyond the scope of this post, but I’m going to give you the general overview.
Once you’ve calculated a confidence interval using 95% confidence level, it’s incorrect to say that it covers the true mean with a probability of 95% (this is a common misinterpretation). You can only say in advance that, in the long-run, 95% of the confidence intervals you’ve generated by following the same procedure will cover the true mean.
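In A/B testing, let’s say you’ve calculated a confidence interval using a 95% confidence level for the conversion rate of a given variation (e.g. 4.5% – 5.8%). Strictly speaking, this does not mean that the conversion rate will fall between 4.5% and 5.8% with 95% probability after you implement the variation. It means the interval was produced by a procedure that captures the true conversion rate in 95% of repeated experiments, so 4.5% to 5.8% is a plausible range for the true rate.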
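For a rough idea of where such an interval comes from, here is a sketch of the simplest method, the normal-approximation (Wald) interval for a single conversion rate. Testing tools may well use a different method, and the function name and counts below are made up.

```python
from math import sqrt
from statistics import NormalDist


def conversion_rate_ci(conversions, visitors, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a conversion rate."""
    rate = conversions / visitors
    std_err = sqrt(rate * (1 - rate) / visitors)
    # For 95% confidence this is the familiar 1.96.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return rate - z * std_err, rate + z * std_err


# Hypothetical data: 515 conversions from 10,000 visitors.
low, high = conversion_rate_ci(515, 10_000)
print(f"95% confidence interval: {low:.2%} to {high:.2%}")
```

Running the same experiment again would give a slightly different interval; the 95% refers to how often intervals built this way contain the true rate.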
Two-tailed test
If you are using a significance level of 0.05, a two-tailed test allots half of your alpha to testing the statistical significance in one direction and half of your alpha to testing statistical significance in the other direction. This means that .025 is in each tail of the distribution of your test statistic. When using a two-tailed test, regardless of the direction of the relationship you hypothesize, you are testing for the possibility of the relationship in both directions. A two-tailed test will test both if the mean is significantly greater than x and if the mean is significantly less than x. The mean is considered significantly different from x if the test statistic is in the top 2.5% or bottom 2.5% of its probability distribution, resulting in a p-value less than 0.05.
One-tailed test
If you are using a significance level of .05, a one-tailed test allots all of your alpha to testing the statistical significance in the one direction of interest. This means that .05 is in one tail of the distribution of your test statistic. When using a one-tailed test, you are testing for the possibility of the relationship in one direction and completely disregarding the possibility of a relationship in the other direction. A one-tailed test will test either if the mean is significantly greater than x OR if the mean is significantly less than x, but not both. Then, depending on the chosen tail, the mean is significantly greater than or less than x if the test statistic is in the top 5% of its probability distribution or bottom 5% of its probability distribution, resulting in a p-value less than 0.05. The one-tailed test provides more power to detect an effect in one direction by not testing the effect in the other direction.
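The difference between the two kinds of test is easy to see once you have a test statistic in hand. The sketch below assumes a normally distributed test statistic (as in a Z-test) and uses a hypothetical value of z = 1.8; the function name is made up.

```python
from statistics import NormalDist


def p_values_from_z(z):
    """One-tailed and two-tailed p-values for a standard normal test statistic."""
    upper = 1 - NormalDist().cdf(z)  # one-tailed, H1: variation is greater
    lower = NormalDist().cdf(z)      # one-tailed, H1: variation is less
    two_tailed = 2 * min(upper, lower)
    return {"greater": upper, "less": lower, "two-tailed": two_tailed}


# Hypothetical test statistic from an A/B test.
for name, p in p_values_from_z(1.8).items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name}: p = {p:.4f} ({verdict} at alpha = 0.05)")
```

With this z value the one-tailed test in the "greater" direction clears the 0.05 threshold while the two-tailed test does not, which is exactly the extra power the one-tailed test buys by ignoring the other direction.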
Here’s a good summary of Frequentist A/B testing, provided by Michael Frasco:
In frequentist A/B testing, we use p-values to choose between two hypotheses: the null hypothesis — that there is no difference between variants A and B — and the alternative hypothesis — that variant B is different. A p-value measures the probability of observing a difference between the two variants at least as extreme as what we actually observed, given that there is no difference between the variants. Once the p-value achieves statistical significance or we’ve seen enough data, the experiment is over.
Bayesian approach
Tools using Bayesian-type statistics
- Google Optimize
- VWO
- Adobe Target
- AB Tasty
- Dynamic Yield
Although the picture was a lot different just a few years ago, Bayesian has quickly overtaken Frequentist in terms of the number of A/B testing tools using it as the main logic behind their stats engines.
There are plenty of shorter and longer articles describing why Bayesian is a better choice for those running A/B tests, for example, “The Power of Bayesian A/B Testing”. They all seem to contain the following reasoning, and of course, Frequentists would dispute several of these points.
- Bayesian gets reliable results faster (with a smaller sample)
- Bayesian results are easier to understand for people without the background in statistics (Frequentist results are often misinterpreted)
- Bayesian is better at detecting small changes (Frequentist favoring the null hypothesis).
Chris Stucchio has written a comprehensive overview of Bayesian A/B testing and the following section is mostly based on his white-paper.
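Important variables of Bayesian testing (note that α and β here denote the variants’ true metrics, not the significance threshold α from the Frequentist section):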
α – underlying and unobserved true metric for variant A
β – underlying and unobserved true metric for variant B
If we choose variant A when α is less than β, our loss is β – α. If α is greater than β, we lose nothing. Our loss is the amount by which our metric decreases when we choose that variant.
ε – the threshold of expected loss for one of the variants, under which we stop the experiment
This stopping condition considers both the likelihood that β – α is greater than zero and also the magnitude of this difference. Consequently, it has two very important properties:
- It treats mistakes of different magnitudes differently. If we are uncertain about the values of α and β, there is a larger chance that we might make a big mistake. As a result, the expected loss would also be large.
- Even when we are unsure which variant is larger, we can still stop the test as soon as we are certain that the difference between the variants is small. In this case, if we make a mistake (i.e., we choose β when β < α), we can be confident that the magnitude of that mistake is very small (e.g. β = 10% and α = 10.1%). As a result, we can be confident that our decision will not lead to a large decrease in our metric.
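The expected loss described above can be estimated by sampling. The sketch below is a minimal Monte Carlo version, assuming a conversion-rate metric with a uniform Beta(1, 1) prior for both variants; the function name, the counts and the threshold ε = 0.0005 are hypothetical, and this is not the exact procedure from the white-paper.

```python
import random


def expected_losses(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of the expected loss from choosing A and from choosing B."""
    rng = random.Random(seed)
    loss_a = loss_b = 0.0
    for _ in range(draws):
        # Beta(1, 1) prior plus observed data gives a Beta posterior for each true rate.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        loss_a += max(rate_b - rate_a, 0.0)  # what we give up if we pick A but B is better
        loss_b += max(rate_a - rate_b, 0.0)  # what we give up if we pick B but A is better
    return loss_a / draws, loss_b / draws


# Hypothetical data: 500 of 10,000 convert on A, 560 of 10,000 convert on B.
loss_a, loss_b = expected_losses(500, 10_000, 560, 10_000)
epsilon = 0.0005  # assumed stopping threshold
print(f"expected loss if we choose A: {loss_a:.5f}")
print(f"expected loss if we choose B: {loss_b:.5f}")
print("stop and choose B" if loss_b < epsilon else "keep collecting data")
```

Because the loss is averaged over both how likely a mistake is and how big it would be, a variant can pass the threshold even when we are not fully sure it is the better one.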
Prior – one of the key differences between Frequentist and Bayesian is that the latter can take prior information into account. Hence, it doesn’t have to learn all the data points itself and can, therefore, reach the conclusions faster.
For example, let’s say we use a Beta(1, 1) distribution as the prior for a Bernoulli distribution. After observing 40 successes and 60 failures, our posterior distribution is a Beta(41, 61). However, if we had started with a Beta(8, 12) distribution as our prior, we would only need to observe 33 successes and 49 failures in order to obtain the same distribution as before (since 8 + 33 = 41 and 12 + 49 = 61).
In general, it is suggested to choose priors that are a bit weaker than what the historical data suggest.
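The update rule behind these numbers is simple addition: successes are added to the first Beta parameter and failures to the second. The sketch below applies the same hypothetical data (40 successes, 60 failures) to a flat prior and to a stronger one, so you can see how the prior behaves like extra observations. The function names are made up.

```python
from math import sqrt


def update_beta(prior, successes, failures):
    """Posterior Beta parameters after observing Bernoulli data under a Beta prior."""
    a, b = prior
    return a + successes, b + failures


def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd


for prior in ((1, 1), (8, 12)):
    posterior = update_beta(prior, successes=40, failures=60)
    mean, sd = beta_mean_sd(*posterior)
    print(f"prior Beta{prior} -> posterior Beta{posterior}, mean {mean:.3f}, sd {sd:.4f}")
```

The stronger prior yields a narrower posterior from the same data, which is the sense in which prior information lets a test conclude sooner.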
Most Bayesian-based A/B testing tools, like VWO, present their results using three key metrics
- Relative improvement VS control – a range by which the observed metric for the variation is better or worse than the same metric for the control. The range is calculated for a 99% probability. The more data the test collects, the smaller this range gets.
- Absolute potential loss – the potential loss is the lift you can lose out on if you deploy A as the winner when B is actually better.
- Chance to beat control/all – probability of the variation being better than the control/all other variations.
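To get a feel for how such metrics can be produced, here is a Monte Carlo sketch that estimates the chance to beat control and a 99% range for the relative improvement. It assumes a conversion-rate metric with Beta(1, 1) priors and uses made-up counts; it is an illustration of the general idea, not VWO’s actual implementation.

```python
import random


def bayesian_summary(conv_a, n_a, conv_b, n_b, draws=20_000, seed=1):
    """Chance that B beats A, plus a 99% range for B's relative improvement over A."""
    rng = random.Random(seed)
    lifts = []
    wins = 0
    for _ in range(draws):
        # Draw one plausible pair of true conversion rates from the posteriors.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
        lifts.append(rate_b / rate_a - 1)
    lifts.sort()
    # Cut off 0.5% of the draws on each side to keep the central 99%.
    low = lifts[int(0.005 * draws)]
    high = lifts[int(0.995 * draws) - 1]
    return wins / draws, (low, high)


# Hypothetical data: control 500 of 10,000, variation 560 of 10,000.
chance, (low, high) = bayesian_summary(500, 10_000, 560, 10_000)
print(f"chance to beat control: {chance:.1%}")
print(f"relative improvement (99% range): {low:+.1%} to {high:+.1%}")
```

Collecting more data tightens the posteriors, so the range shrinks and the chance to beat control moves towards 0% or 100%.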
More on How VWO Calculates a Winning Variation
Just like with many Frequentist-based A/B testing tools, several Bayesian-based tools will let you choose some version of significance for the results your test is going to generate. With Frequentist tools, this is usually the p-value or confidence level. With Bayesian tools, you are likely to see a set of modes to choose from. For example, this is what VWO gives you:
- Quick learning: For finding quick trends where tests don’t affect your revenue directly. You can choose this mode when testing non-revenue goals such as the bounce rate and time spent on a page or for quick headline tests. With this mode, you can reduce your testing time for non-critical tests when there isn’t a risk of hurting your revenue directly by deploying a false winner.
- Balanced: Ideal for most tests. This is the default mode and can be used for almost all tests. As the name suggests, it is the best balance between the testing time and minimizing the potential loss.
- High certainty: Best for revenue-critical tests when you want to absolutely minimize the potential loss. Usually takes the longest to conclude a test. Suppose you have an eCommerce website and you want to test changes to your checkout flow. You want to be as certain as possible to minimize the potential loss from deploying a false winner even if it takes a lot of time. This is the best mode for such critical tests which affect your revenue directly.
I think VWO’s approach is better for people without a background in statistics – it’s quite easy to mistakenly choose a confidence level that is too weak without realizing the consequences.
Working with Revenue and Other Non-Binomial Metrics
How you (or the machine) calculate the results for an A/B test depends heavily on whether you are testing a binomial or non-binomial metric.
Here are some common non-binomial metrics used in A/B testing:
- Average order value
- Average revenue per user
- Average sessions per user
- Average session duration
- Average pages per session
The key difference between binomial and non-binomial metrics is that the former can have only two possible values: conversion or no conversion, true or false etc. Non-binomial metrics, on the other hand, have a range of possible values, e.g. from zero to infinity when measuring revenue.
This is a big difference, and without going into too much detail, it is quite clear that you cannot use the same algorithms for calculating the results for both, mainly because non-binomial metrics often do not follow a normal distribution.
Further reading on whether or not you can use the same types of tests for both types of metrics:
- Testing Differences in Revenue? You’re Probably Not Using the Correct Statistics
- Your Average Revenue Per Customer is Meaningless
One of the most commonly used tests for testing non-binomial metrics is a Mann-Whitney-Wilcoxon rank-sum test. Unlike the t-test, it does not require the assumption of normal distributions.
BTW, if you have some experience with Python, setting one up for yourself is not too difficult. Here’s what you’ll need.
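As a rough sketch of what the test does under the hood, the version below uses only the standard library and the large-sample normal approximation with a tie correction (revenue data usually contains many zeros, so ties matter). It applies no continuity correction, and the function name and revenue figures are made up. In practice you would more likely reach for a ready-made implementation such as `scipy.stats.mannwhitneyu`.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation with tie correction)."""
    # Tag every value with its group (0 for x, 1 for y) and sort the pooled sample.
    combined = sorted((value, group) for group, sample in enumerate((x, y)) for value in sample)
    n_x, n_y = len(x), len(y)
    n = n_x + n_y
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and all receive the average of those ranks.
        avg_rank = (i + 1 + j) / 2
        ties = j - i
        tie_term += ties ** 3 - ties
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    mean_u = n_x * n_y / 2
    var_u = n_x * n_y / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # every value is identical, nothing to compare
    z = (u_x - mean_u) / sqrt(var_u)
    return u_x, 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical revenue per user (most users spend nothing).
control_revenue = [0, 0, 0, 12.5, 0, 40.0, 0, 0, 19.9, 0]
variant_revenue = [0, 25.0, 0, 0, 31.0, 0, 55.0, 0, 18.0, 0]
u, p = mann_whitney_u(control_revenue, variant_revenue)
print(f"U = {u}, p-value = {p:.3f}")
```

Note that the test compares ranks rather than averages, so it tells you whether one group tends to produce larger values, not by how much average revenue differs.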
Another option is to use the following process suggested by Georgi Georgiev in his blog post.
- Extract user-level data (orders, revenue) or session-level data (session duration, pages per session) or order-level data (revenue, number of items) for the control and the variant
- Calculate the sample standard deviation of each
- Calculate the pooled standard error of the mean
- Use the SEM in any significance calculator / software that supports the specification of SEM in calculations
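The steps above can be sketched in a few lines. This is one reading of the process rather than Georgiev’s exact method: it assumes the "pooled" standard error means the standard error of the difference between the two means, computed from each group’s own standard deviation, and it plugs that into a Z-test, which is only reasonable with large samples. The function name and the tiny data set are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def difference_in_means_test(control, variant):
    """Return (SEM of the difference, z, two-sided p-value) from row-level data."""
    # Step 2: sample standard deviation of each group.
    sd_control = stdev(control)
    sd_variant = stdev(variant)
    # Step 3: standard error of the difference between the two means (assumed form).
    sem = sqrt(sd_control ** 2 / len(control) + sd_variant ** 2 / len(variant))
    # Step 4: use the SEM in a significance calculation (here a Z-test).
    z = (mean(variant) - mean(control)) / sem
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return sem, z, p_value


# Hypothetical order values; real tests need far more rows than this.
control_orders = [0, 0, 10, 20]
variant_orders = [0, 10, 20, 30]
sem, z, p = difference_in_means_test(control_orders, variant_orders)
print(f"SEM = {sem:.3f}, z = {z:.3f}, p-value = {p:.3f}")
```

The function takes the individual rows as input because the standard deviations cannot be recovered from totals alone.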
The key difference from binomial metrics is that no matter which method you choose, you will be working with user/session/order-level data, that is, you must feed the algorithm all the rows instead of totals.
Statistics plays a huge role in A/B testing, and it is a must to know at least the basics of Frequentist and Bayesian statistics and of non-binomial metrics. That way you can choose the right tools and, hopefully, get to know them (and the stats they use) in depth.
I hope this post gave you a good starting point and hopefully you learned something new.
Did we miss something important? Suggestions are welcome in the comments below – so are the questions.