Sunday 8 April 2018

How (not) to do an A/B test? 5 common mistakes that ruin your experiment

In the previous post, we looked at the statistics behind A/B testing. Now that we know what the numbers mean, let's look at some common mistakes in A/B testing that ruin results and waste valuable time and resources, and how we can avoid them.

Stopping the test too early 


One major assumption in calculating the significance level is that the sample size is fixed and the experiment is stopped only once that sample size is reached. In the physical and natural sciences this assumption is usually not violated, but in digital analytics, since the data is so easily accessible, people have a tendency to 'peek' at the results and stop the test as soon as significance is reached. This is a major mistake and should always be avoided.

Let's look at our coin toss example again and see if the number of heads is significantly different from what a fair coin would produce. This time we are going to peek in the middle of the experiment (after 500 coin tosses) and measure significance. There are four possible outcomes and conclusions. In Scenario I, even though we look at the results in the middle, we stop the experiment only after 1000 coin tosses.

Scenario I
In the second scenario, we look at the results in between and stop the test once significance is reached. In this case, our chances of detecting a false positive have increased simply because we stopped the test as soon as we reached significance!

Scenario II

The probability of getting a false positive increases dramatically with how frequently you 'peek' at the data. If this were an actual A/B test, we would have declared the variant successful after five hundred visitors and stopped the test, even though there was no change at all! It is okay to monitor the results, but it is equally important to resist the temptation to stop the test before the planned sample size is reached.
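To convince yourself of this, here is a minimal simulation sketch in Python (using numpy and scipy; the number of simulated experiments, the checkpoints and the random seed are arbitrary choices for illustration). It tosses a fair coin, so any 'significant' result is by definition a false positive, and it compares the fixed-sample-size approach of Scenario I with the peeking approach of Scenario II.

# A minimal simulation of the 'peeking' problem, assuming a fair coin
# (i.e. no real effect) and a two-sided test at the 5% significance level.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(42)
n_experiments = 2000                  # number of simulated A/B tests
n_tosses = 1000                       # planned sample size per experiment
checkpoints = [250, 500, 750, 1000]   # how often we 'peek'

false_positives_fixed = 0    # evaluate only at the planned sample size
false_positives_peeking = 0  # stop as soon as any peek looks significant

for _ in range(n_experiments):
    tosses = rng.integers(0, 2, size=n_tosses)  # fair coin: p = 0.5

    # Scenario I: look only once, at the end
    if binomtest(int(tosses.sum()), n_tosses, 0.5).pvalue < 0.05:
        false_positives_fixed += 1

    # Scenario II: peek at every checkpoint, stop at the first 'significant' result
    for n in checkpoints:
        if binomtest(int(tosses[:n].sum()), n, 0.5).pvalue < 0.05:
            false_positives_peeking += 1
            break

print(f"False positive rate, fixed sample size: {false_positives_fixed / n_experiments:.1%}")
print(f"False positive rate, with peeking:      {false_positives_peeking / n_experiments:.1%}")

With a fair coin, the fixed-sample-size test flags roughly the nominal 5% of experiments as significant, while the peeking version flags noticeably more, which is exactly the effect described above.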

Confirmation Bias and pitfalls of testing without a solid hypothesis


'To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of' - Ronald Fisher

While it is a good idea to be thorough and look at the results holistically, oftentimes confirmation bias kicks in and people expect to see results where there are none. That is mostly the case when A/B tests do not have well-defined hypotheses. At the end of the test, people usually end up looking at multiple metrics and decide to go ahead even if only one of them shows a positive uplift. The problem with this approach is that the more metrics you look at, the higher the chance of a false positive result.

Let's take an extreme example where we look at three metrics. If we consider the results at the 95% confidence level, then the probability of detecting a false positive in a single metric is 5%. With just three metrics (assuming they are independent), the probability of getting a false positive in at least one of them rises to 1 - 0.95³ ≈ 14%! So, the more metrics you look at, the higher the chance that one of them will show a significant change, and you might be tempted to go ahead with launching to your full traffic.
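A quick back-of-the-envelope calculation of this effect, assuming every metric is independent and tested at the 5% level:

# Probability of at least one false positive when checking several
# independent metrics, each at the 5% significance level.
alpha = 0.05

for n_metrics in [1, 3, 5, 10]:
    p_any_false_positive = 1 - (1 - alpha) ** n_metrics
    print(f"{n_metrics:2d} metrics -> {p_any_false_positive:.0%} chance of at least one false positive")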

Even though it is good to keep track of all metrics, it is important to have a clear hypothesis and a goal in mind that you can validate your test against, rather than looking at the numbers first and then arriving at a hypothesis!

Novelty Effect 




Introducing a prominent change that users are not accustomed to can create more interest and drive up conversions, but the question to ask is whether the uplift comes from the change you made or from the novelty of the feature itself. If you have small green buttons all over your website and suddenly change them to big red ones, returning users who are not used to seeing big buttons are very likely to be intrigued. Testing the change with new traffic will tell you whether it produces a genuine uplift. With feature changes like these, it is always a good idea to test against new users to rule out the novelty effect.


Testing over incomplete cycles


Most business data has a seasonal component. Traffic and customer behavior usually differ by day of the week, with engagement varying across days. Depending on how strong this seasonality is, results measured on only a few days can look very different from those of a full week.


Let's say you run the test for three days and stop once you have the results. Since you have not tested across all days of the week, it is very likely the results will not be reproducible once you roll the change out fully. When testing, always run the test over a full cycle. Typically that is a week, though sometimes it is necessary to test over two cycles if the sample size is not sufficient.
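As a rough sketch, once you know the sample size your test needs (from a power calculation), you can round the duration up to whole weeks rather than stopping mid-cycle. The numbers below are made up for illustration:

# A rough sketch of rounding a test's duration up to whole weekly cycles,
# assuming a hypothetical required sample size and average daily traffic.
import math

required_sample_size = 30000   # from your power calculation (hypothetical)
daily_visitors = 3500          # average visitors entering the test per day (hypothetical)

days_needed = math.ceil(required_sample_size / daily_visitors)
full_weeks = max(1, math.ceil(days_needed / 7))

print(f"Raw estimate: {days_needed} days")
print(f"Run the test for {full_weeks} full week(s), i.e. {full_weeks * 7} days")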

Not Testing with all Customer Segments

Not all Customer Segments are the Same

The traffic on your site is not uniform. Some of it will come from search engines, some will be direct, some users will be repeat and loyal customers, while others will be interacting with your website for the first time. What you are testing may behave differently across customer segments, so it is important to keep this context in mind. A prime example is pricing: if you are A/B testing the price of a product, repeat customers are highly likely to behave differently from new users, so in this case it is important to A/B test against these two segments separately.
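As a sketch of what testing separately per segment could look like, here is a two-proportion z-test run on each segment using statsmodels; the conversion counts below are entirely made up for illustration.

# Evaluating the same A/B test separately per customer segment with a
# two-proportion z-test (hypothetical counts for illustration only).
from statsmodels.stats.proportion import proportions_ztest

# segment -> (conversions_control, visitors_control, conversions_variant, visitors_variant)
segments = {
    "new users":    (310, 10000, 380, 10000),
    "repeat users": (520, 10000, 505, 10000),
}

for name, (conv_a, n_a, conv_b, n_b) in segments.items():
    stat, p_value = proportions_ztest([conv_a, conv_b], [n_a, n_b])
    print(f"{name:12s}  control {conv_a/n_a:.2%}  variant {conv_b/n_b:.2%}  p-value {p_value:.3f}")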


If you are still unsure about what confidence intervals and significance levels mean and need a refresher on A/B testing statistics, have a look at the earlier blog post. Comment below about your experience with A/B testing. Happy experimenting!






