Guide to Statistical Tests for A/B Analysis

Statistical tests are fundamental tools for data analysis and informed decision-making. Choosing the appropriate test depends on the characteristics of the data, the hypotheses to be tested, and the underlying assumptions.

In this blog, I have separately covered each of the main statistical tests with dedicated articles. It is indeed crucial to understand the applicability conditions of each test to obtain reliable results and correct interpretations.

What I aim to do in this article is to provide an “overview,” placing side by side the most common tests that find daily applicability in a multitude of analyses related to the world of web marketing and to effective A/B testing. This is a first comparative look, which should ideally encourage deeper study of each individual topic; I have accompanied it with very simple practical examples in order to stimulate the reader’s curiosity.

The Z Test

The Z test is a statistical hypothesis test used to verify if the sample mean differs significantly from the population mean, when the population variance is known and the sample size is large (usually greater than 30).

The Z test applies when the following conditions are met:

  • The sample size is large (n > 30)
  • The population variance is known
  • The data is approximately normally distributed

The Z test is used to determine whether a proportion, such as a click-through rate, differs significantly from a reference value, or whether two proportions differ from each other. It can be used, for example, to verify whether the introduction of a new feature on a website has led to a significant increase in the conversion rate.

Example Case: An e-commerce site wants to test if a new version of the shopping cart has improved the conversion rate. The previous conversion rate is 5% with a known variance of 0.0025. After collecting a sample of 500 users, the new observed conversion rate is 6%. Let’s verify if the difference is statistically significant using the Z test.

# Original conversion rate
p0 <- 0.05
# Original variance
var0 <- 0.0025
# Sample size
n <- 500
# Observed conversion rate
p1 <- 0.06

# Z test calculation
z <- (p1 - p0) / sqrt(var0/n)
z
[1] 4.472136

The observed z value is 4.47. Assuming a significance level of 0.05 (two-tailed), the critical z value is 1.96. Since the observed value is greater than 1.96, we can reject the null hypothesis and conclude that the difference in the conversion rate is statistically significant.
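Equivalently, instead of comparing z with a critical value, we can compute the p-value directly in R; a minimal sketch, reusing the z statistic calculated above:

# Two-sided p-value for the observed z statistic
p_value <- 2 * (1 - pnorm(z))
p_value
# roughly 7.7e-06, well below 0.05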

Student’s t-Test

Student’s t-test is a statistical hypothesis test used to verify if the mean of a sample differs significantly from a hypothetical value or if two samples have significantly different means. This test applies when the population variance is unknown and the sample size is small (usually less than 30).

Student’s t-test applies when the following conditions are met:

  • The sample size is small (n < 30)
  • The population variance is unknown
  • The data is approximately normally distributed

Student’s t-test is used to compare the means of two distinct groups, such as the average time spent on the site for users who saw variant A compared to those who saw variant B.

Example Case: A company wants to test if a new landing page has an impact on the average time spent on the site. An A/B experiment is conducted with 20 users for each group. The average time spent on the site for the control group is 3 minutes, while for the test group it is 4 minutes. Let’s verify if the difference is statistically significant using Student’s t-test.

# Control group data
control <- c(2.5, 3.1, 2.8, 3.2, 2.9, 3.5, 3.0, 2.7, 3.3, 2.6, 3.4, 3.1, 2.8, 2.9, 3.2, 3.0, 3.1, 2.7, 3.3, 2.8)

# Test group data
test <- c(3.8, 4.2, 3.9, 4.1, 4.3, 3.7, 4.5, 4.0, 3.6, 4.2, 4.1, 3.9, 4.3, 3.8, 4.0, 4.2, 3.7, 4.4, 4.1, 3.9)

# Student's t-test
t.test(test, control, alternative = "greater")

	Welch Two Sample t-test

data:  test and control
t = 12.585, df = 37.611, p-value = 2.354e-15
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.900641      Inf
sample estimates:
mean of x mean of y 
    4.035     2.995 

Student’s t-test provides a p-value less than the significance level of 0.05; therefore, we can reject the null hypothesis and conclude that the difference in average time spent on the site between the two groups is statistically significant.
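A note on the output above: R’s t.test() defaults to the Welch variant (hence the “Welch Two Sample t-test” header and the fractional degrees of freedom; see the next section). To run the classic pooled-variance Student’s test, set var.equal = TRUE; with equal group sizes, as here, the t statistic is identical and only the degrees of freedom (and hence the p-value) change slightly:

# Classic Student's t-test (pooled variance)
t.test(test, control, alternative = "greater", var.equal = TRUE)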

Welch’s t-Test

Welch’s t-test is a variant of Student’s t-test that does not require the assumption of equal variances between the two samples. It is the appropriate choice when the two groups have different variances and, often, different sample sizes.

Welch’s t-test applies when the following conditions are met:

  • The sample sizes are different
  • The sample variances are different
  • The data is approximately normally distributed

Welch’s t-test is used to compare the means of two distinct groups, such as the average income of users who made a purchase on an e-commerce site compared to those who did not make purchases.

Example Case: A company wants to test if the average income of users who made a purchase differs from that of users who did not make purchases. An experiment is conducted with 30 users who made a purchase and 20 users who did not make purchases. The average income of users who made a purchase is $50,000, while that of users who did not make purchases is $40,000. Let’s verify if the difference is statistically significant using Welch’s t-test.

# Buyers group data
buyers <- c(48000, 52000, 49000, 51000, 47000, 55000, 53000, 50000, 46000, 54000,
49000, 52000, 51000, 48000, 53000, 47000, 54000, 50000, 49000, 52000,
48000, 51000, 53000, 47000, 52000, 49000, 50000, 51000, 48000, 53000)

# Non-buyers group data
non_buyers <- c(38000, 42000, 39000, 41000, 37000, 43000, 40000, 39000, 42000, 38000,
41000, 40000, 39000, 42000, 37000, 41000, 38000, 39000, 40000, 41000)

# Welch's t-test
t.test(buyers, non_buyers, alternative = "greater", var.equal = FALSE)

	Welch Two Sample t-test

data:  buyers and non_buyers
t = 17.811, df = 47.626, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 9556.368      Inf
sample estimates:
mean of x mean of y 
    50400     39850 

Welch’s t-test provides a p-value smaller than 2.2e-16. Since this value is far below the significance level of 0.05, we can reject the null hypothesis and conclude that the difference in average income between users who made a purchase and those who did not make purchases is statistically significant.
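For readers who want to see what t.test() computes under the hood, here is a minimal sketch of the Welch statistic and its Welch–Satterthwaite degrees of freedom calculated by hand; it should reproduce the t and df reported above:

# Group means, variances, and sizes
m1 <- mean(buyers);   m2 <- mean(non_buyers)
v1 <- var(buyers);    v2 <- var(non_buyers)
n1 <- length(buyers); n2 <- length(non_buyers)

# Welch t statistic: the standard error does not pool the variances
se <- sqrt(v1/n1 + v2/n2)
t_stat <- (m1 - m2) / se

# Welch-Satterthwaite approximation for the degrees of freedom
df <- (v1/n1 + v2/n2)^2 / ((v1/n1)^2/(n1 - 1) + (v2/n2)^2/(n2 - 1))

# One-sided p-value, matching alternative = "greater"
p_value <- pt(t_stat, df, lower.tail = FALSE)
c(t = t_stat, df = df, p = p_value)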

The Chi-square Test

The chi-square test is a non-parametric statistical test used to verify if there is a significant relationship between two categorical variables or if the observed distribution of a categorical variable differs from the expected distribution.

The chi-square test applies when the following conditions are met:

  • The variables are categorical
  • The samples are independent
  • The expected frequencies in each cell of the contingency table are greater than 5

The chi-square test is used to analyze the association between two categorical variables, such as the relationship between users’ gender and preference for a particular product.

Example Case: A clothing store wants to understand if there is a relationship between users’ gender and preference for a particular product line. A survey is conducted on 200 users, of which 100 are men and 100 are women. The results show that 60 men and 40 women prefer product line A, while 40 men and 60 women prefer product line B. Let’s verify if there is a significant relationship between gender and preference using the chi-square test.

# Observed data
observed <- matrix(c(60, 40, 40, 60), nrow = 2, byrow = TRUE)
rownames(observed) <- c("Men", "Women")
colnames(observed) <- c("Line A", "Line B")
observed
##       Line A Line B
## Men       60     40
## Women     40     60
# Chi-square test
chisq.test(observed)
## Pearson's Chi-squared test with Yates' continuity correction

## data:  observed
## X-squared = 7.22, df = 1, p-value = 0.00721

The chi-square test provides a p-value of 0.00721. Since this value is less than the significance level of 0.05, we can reject the null hypothesis and conclude that there is a significant relationship between users’ gender and preference for a particular product line.
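As a quick check of the applicability condition on expected frequencies, the expected counts under independence can be extracted from the test object; a minimal sketch:

# Expected frequencies under the independence hypothesis
# (the chi-square approximation is reliable when all exceed 5)
chisq.test(observed)$expected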

Analysis of Variance (ANOVA)

Analysis of variance (ANOVA) is a statistical test used to compare the means of three or more groups and determine if there are significant differences between them.

Analysis of variance applies when the following conditions are met:

  • The data is approximately normally distributed
  • The variances of the groups are equal (homoscedasticity)
  • The samples are independent

Analysis of variance is used to compare the means of different versions of a product, different marketing strategies, or different sales techniques.

Example Case: A company wants to test the effectiveness of three different marketing strategies (A, B, and C) on average monthly revenue. 15 stores are selected for each strategy and the average monthly revenue is recorded for a period of 6 months. Let’s verify if there is a significant difference between the marketing strategies using analysis of variance.

# Data
revenue_A <- c(120000, 115000, 130000, 125000, 110000, 135000, 118000, 122000, 127000, 115000, 128000, 120000, 124000, 117000, 121000)
revenue_B <- c(112000, 118000, 110000, 115000, 122000, 108000, 120000, 114000, 116000, 119000, 111000, 117000, 113000, 121000, 109000)
revenue_C <- c(105000, 110000, 108000, 112000, 107000, 115000, 111000, 109000, 113000, 106000, 108000, 114000, 110000, 112000, 107000)

# Analysis of variance: build a grouping factor and fit the model
revenue <- c(revenue_A, revenue_B, revenue_C)
strategy <- factor(rep(c("A", "B", "C"), each = 15))
anova_result <- aov(revenue ~ strategy)
summary(anova_result)
            Df    Sum Sq   Mean Sq F value   Pr(>F)    
strategy     2 1.086e+09 543200000    22.7 2.07e-07 ***
Residuals   42 1.005e+09  23923810                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The analysis of variance provides a p-value of 2.07e-07. Since this value is less than the significance level of 0.05, we can reject the null hypothesis and conclude that there is a significant difference in average monthly revenue between the three marketing strategies.
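ANOVA tells us that at least one strategy differs from the others, but not which one. A common follow-up is Tukey’s HSD post-hoc test, which compares all pairs of groups while controlling the family-wise error rate; a quick sketch using the fitted object above:

# Pairwise comparisons between the three strategies
TukeyHSD(anova_result)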

Mann-Whitney U Test

The Mann-Whitney U test is a non-parametric test used to compare two independent groups when the data does not meet the normality or equal-variance requirements of Student’s t-test. Rather than comparing means directly, it works on the ranks of the observations and tests for a shift in location between the two distributions.

The Mann-Whitney U test applies when the following conditions are met:

  • The data is not normally distributed
  • The variances of the groups are not equal
  • The samples are independent

The Mann-Whitney U test is used to compare two distinct groups, such as the revenues of two different advertising campaigns.

Example Case: A company wants to compare the average revenues of two different advertising campaigns, A and B. Revenue data is collected from 15 stores for each campaign. Let’s verify if there is a significant difference between the two campaigns using the Mann-Whitney U test.

# Campaign A data
revenue_A <- c(12000, 15000, 10000, 13000, 11000, 14000, 12500, 13500, 11500, 14500, 12200, 13800, 11800, 12700, 13200)

# Campaign B data
revenue_B <- c(11000, 14000, 13000, 12000, 15000, 11500, 13500, 12500, 14500, 11800, 13200, 12700, 14200, 11600, 13800)

# Mann-Whitney U test
wilcox.test(revenue_A, revenue_B, alternative = "two.sided", correct = FALSE)

	Wilcoxon rank sum test

data: revenue_A and revenue_B
W = 102.5, p-value = 0.6779
alternative hypothesis: true location shift is not equal to 0

The Mann-Whitney U test provides a p-value of 0.6779. Since this value is greater than the significance level of 0.05, we cannot reject the null hypothesis: there is no evidence of a significant difference in revenues between the two advertising campaigns.
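In practice, the choice between Student’s t-test and the Mann-Whitney U test is often guided by a preliminary normality check on each group, for example with the Shapiro-Wilk test; a minimal sketch:

# Shapiro-Wilk normality test per group:
# a small p-value (< 0.05) casts doubt on the normality assumption
shapiro.test(revenue_A)
shapiro.test(revenue_B)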

Fisher’s Exact Test

Fisher’s exact test is a non-parametric statistical test used to analyze the association between two categorical variables in 2×2 contingency tables, especially when sample sizes are small.

Fisher’s exact test applies when the following conditions are met:

  • The variables are categorical
  • The samples are independent
  • The sample sizes are small (one or more cells in the contingency table have expected values less than 5)

Fisher’s exact test is used to analyze the association between two categorical variables, such as the relationship between the use of a particular drug and the occurrence of a side effect.

Example Case: In a clinical study on a new drug for the treatment of hypertension, 15 patients who took the drug and 10 patients who took a placebo are observed. Of the 15 patients who took the drug, 3 experienced a side effect, while of the 10 patients who took the placebo, 1 experienced the side effect. Let’s verify if there is a significant association between taking the drug and the occurrence of the side effect using Fisher’s exact test.

# Data: rows = treatment groups, columns = outcome
side_effect <- matrix(c(3, 12, 1, 9), nrow = 2, byrow = TRUE)
rownames(side_effect) <- c("Drug", "Placebo")
colnames(side_effect) <- c("Side effect", "No side effect")
side_effect
        Side effect No side effect
Drug              3             12
Placebo           1              9
# Fisher's exact test
fisher.test(side_effect)
	Fisher's Exact Test for Count Data

data: side_effect
p-value = 0.6265
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.145667 130.928066
sample estimates:
odds ratio
2.183137

Fisher’s exact test provides a p-value of 0.6265. Since this value is greater than the significance level of 0.05, we cannot reject the null hypothesis and cannot conclude that there is a significant association between taking the drug and the occurrence of the side effect.
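The reason for preferring Fisher’s exact test to the chi-square test here is visible in the expected frequencies; a quick sketch (R would also warn that the chi-square approximation may be incorrect on this table):

# Expected counts under independence; values below 5 argue for Fisher's test
chisq.test(side_effect)$expected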

Regression Analysis

Regression analysis is a set of statistical techniques used to model the relationship between a dependent variable (or response variable) and one or more independent variables (or explanatory variables).

Regression analysis applies when the following conditions are met:

  • There is a linear relationship between the dependent variable and the independent variables
  • The residuals are normally distributed and homoscedastic (i.e., have constant variance)
  • The observations are independent

Regression analysis is used to understand the impact of different independent variables on a dependent variable, such as the effect of age, income, and education level on the consumption of a particular product category.

Example Case: A clothing company wants to analyze the impact of age, income, and education level on annual clothing consumption. Data is collected on a sample of 20 individuals. We use multiple linear regression analysis to model the relationship between annual clothing consumption (dependent variable) and age, income, and education level (independent variables).

# Data
consumption <- c(1200, 1500, 2000, 1800, 2200, 1700, 2100, 1900, 1600, 2300, 1400, 1800, 2100, 1700, 2000, 1600, 1900, 2200, 1500, 1800)
age <- c(25, 35, 42, 30, 38, 28, 45, 33, 27, 40, 22, 31, 39, 26, 37, 24, 32, 41, 29, 36)
income <- c(35000, 45000, 60000, 50000, 55000, 40000, 65000, 48000, 38000, 70000, 32000, 46000, 58000, 42000, 52000, 37000, 49000, 62000, 40000, 51000)
education <- c(2, 3, 4, 3, 4, 2, 4, 3, 2, 4, 2, 3, 4, 2, 3, 2, 3, 4, 3, 3)

# Multiple linear regression model
model <- lm(consumption ~ age + income + education)
summary(model)
Call:
lm(formula = consumption ~ age + income + education)

Residuals:
    Min      1Q  Median      3Q     Max 
-261.06  -93.14   39.80   66.26  223.24 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) 639.775078 165.734340   3.860  0.00139 **
age         -13.127175  14.699870  -0.893  0.38509   
income        0.030875   0.008645   3.571  0.00255 **
education    34.426950 107.969978   0.319  0.75396   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 128.1 on 16 degrees of freedom
Multiple R-squared: 0.8404, Adjusted R-squared: 0.8105
F-statistic: 28.08 on 3 and 16 DF, p-value: 1.302e-06

Coefficients: The output shows the coefficients for each independent variable in the model. In this case, the independent variables are ‘age’, ‘income’, and ‘education’. The intercept coefficient is 639.775078.

Significance: The ‘income’ variable is statistically significant at the 5% significance level (since the p-value is less than 0.05), while the ‘age’ and ‘education’ variables are not. This suggests that only ‘income’ has a significant impact on ‘consumption’.

R-squared: The R-squared value is 0.8404, which indicates that about 84% of the variation in ‘consumption’ can be explained by the variables ‘age’, ‘income’, and ‘education’. However, the adjusted R-squared value is 0.8105, which suggests that when accounting for the number of independent variables in the model, about 81% of the variation in ‘consumption’ can be explained by these variables.

F-statistic: The F-statistic value is 28.08 with a p-value of 1.302e-06, which indicates that the overall model is statistically significant.

The model suggests that ‘income’ is the only significant predictor of ‘consumption’. However, the model as a whole is significant and explains a large portion of the variation in ‘consumption’.
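Since the validity of these conclusions rests on the assumptions listed earlier (linearity, normally distributed and homoscedastic residuals), it is good practice to inspect the residuals of the fitted model; a minimal sketch:

# Standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(model)

# Formal normality check on the residuals
shapiro.test(residuals(model))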

An Overview: a Table

Z Test
  • Conditions: large sample size (n > 30); known population variance; normally distributed data.
  • Advantages: simple to calculate and interpret; suitable for large samples.
  • Disadvantages: requires knowledge of the population variance; not suitable for small samples.

Student’s t-Test
  • Conditions: small sample size (n < 30); unknown population variance; normally distributed data.
  • Advantages: suitable for small samples; does not require knowledge of the population variance.
  • Disadvantages: assumes normality of the data.

Welch’s t-Test
  • Conditions: different sample sizes; different variances; normally distributed data.
  • Advantages: does not require the assumption of equal variances.
  • Disadvantages: assumes normality of the data.

Chi-Square Test
  • Conditions: categorical variables; independent samples; expected frequencies > 5 per cell.
  • Advantages: suitable for categorical variables; does not require assumptions about the distribution.
  • Disadvantages: can be inaccurate if expected frequencies are too low.

ANOVA
  • Conditions: normally distributed data; homoscedasticity (equal variances); independent samples.
  • Advantages: allows comparison of more than two groups simultaneously.
  • Disadvantages: requires assumptions of normality and homoscedasticity.

Mann-Whitney U Test
  • Conditions: non-normally distributed data; different variances; independent samples.
  • Advantages: does not require assumptions about the distribution or equality of variances.
  • Disadvantages: less powerful than parametric tests when their assumptions are met.

Fisher’s Exact Test
  • Conditions: categorical variables; independent samples; small sample sizes.
  • Advantages: accurate for small samples; suitable for 2×2 contingency tables.
  • Disadvantages: not suitable for large samples or larger contingency tables.

Regression Analysis
  • Conditions: linear relationship between the variables; normally distributed and homoscedastic residuals; independent observations.
  • Advantages: allows modeling of the relationship between variables; identifies significant predictors.
  • Disadvantages: requires assumptions about the residuals and linearity.