
The Chi-Square Test: Goodness of Fit and Test of Independence

In previous posts, we have seen different types of tests that we can use to analyze our data and test hypotheses.

The chi-square test was proposed by Karl Pearson in 1900, and it is widely used to assess how well the observed distribution of a categorical variable matches an expected distribution (in this case, we talk about the “Goodness of Fit Test”) or to assess whether two categorical variables are independent of each other (and then we talk about the “Test of Independence”).

Such is the importance and widespread use of this test that it was listed by the magazine Scientific American among the 20 most important scientific discoveries of the 20th century.


The Goodness of Fit Test

This is a very useful test, concerning the distribution of a categorical variable. It allows us to verify if the observed frequencies differ significantly from the expected frequencies when there are more than two possible outcomes.

The prerequisites for carrying out the test are very simple:

  1. The sample must be random;
  2. Observations must be independent for the sample (one observation per subject);
  3. No expected frequency in any class should be less than 5.
    This last point sounds rather cryptic and deserves a few more words. When the variable is continuous, or the characters are not nominal and individual sample observations are available, an important issue is determining the number of classes (also called “cells”) into which the distribution is divided. In practice, the theoretical (expected) frequency in each class must be at least 5; if a class falls below this threshold, it is usually merged with an adjacent one. A quick check of this rule is sketched just after this list.
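
As a minimal sketch in R (the proportions and sample size here anticipate the example in the next section), the rule can be checked before running the test:

# hypothesized proportions and planned sample size (values from the example below)
p <- c(0.5, 0.3, 0.2)
n <- 150
# expected counts under the null hypothesis
expected <- n * p
expected            # 75 45 30
# the rule of thumb requires every expected count to be at least 5
all(expected >= 5)  # TRUE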

Understanding Through a Simple Example

As usual, to better understand what we are talking about, we will explain it with a super-simplified (and, I apologize, quite ridiculous…) example.

Suppose a study was conducted on electronics hobbyists who use Arduino boards. It was found that 50% own only one Arduino board, 30% have 2 to 4 boards, and 20% own 5 or more.

Let’s imagine that I conducted my own independent study: out of 150 hobbyists, 90 owned only one Arduino, 30 had 2 to 4 boards, and 30 had 5 or more boards.

The null hypothesis is that the proportions I found are in line with those of the official study.
The alternative hypothesis is obviously that the collected data do not confirm the proportions of the official study.

I prepare my table by entering the data:

                One Arduino       2 to 4 boards     5 or more boards   Total
Observed Data   90                30                30                 150
Expected Data   0.50 x 150 = 75   0.30 x 150 = 45   0.20 x 150 = 30    150

To accept the null hypothesis, the difference between the expected and observed frequencies must be attributable to sampling variability at the designated level of significance.

The χ2 statistic calculated from the sample data is given by:

\( \chi^2=\Sigma\frac{(f_o-f_e)^2}{f_e} \)

f_o = observed frequencies
f_e = expected frequencies

The degrees of freedom for the goodness of fit test are:

\( df=k-1 \)

k = number of classes (categories) of the variable

Let’s use our example as a guide. We start from the hypotheses:

\( H_0 \): the proportions are 0.5, 0.3, 0.2
\( H_a \): at least one proportion differs from these values

We have:

\( n=150,\ \ df=3-1=2 \)

We find the critical χ2 value in the tables (df=2, α=0.05)
The value is: 5.99
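
If you prefer R to printed tables, the same critical value can be obtained with the chi-square quantile function:

# upper critical value for alpha = 0.05 with 2 degrees of freedom
qchisq(0.95, df = 2)   # about 5.99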

Now I calculate the χ2 value for my data:

\( \chi^2=\frac{(90-75)^2}{75}+\frac{(30-45)^2}{45}+\frac{(30-30)^2}{30}=\frac{225}{75}+\frac{225}{45}+\frac{0}{30}=3+5+0=8 \)

Since the calculated value is higher than the critical value, we reject the null hypothesis at the 5% significance level. That is, we reject the assertion that the frequencies are distributed according to the proportions 50%, 30%, 20%.
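
Equivalently, the decision can be based on the p-value of the observed statistic, which in R is:

# p-value for chi-square = 8 with 2 degrees of freedom
1 - pchisq(8, df = 2)   # about 0.0183, below alpha = 0.05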

Making Life Easier with a Casio Scientific Calculator

With my fx calculator, I just need to choose “STAT” from the menu and enter the observed values in list L1 and the expected values in L2 in my table editor.

Then I will choose:

[TEST]
[CHI]
[GoF]
Observed:List1
Expected:List2
df:2
[CALC]

and I will get both the chi-square value and the p-value (in this case, 0.01832, which is less than the alpha value of 0.05 I chose, confirming the conclusion that I can reject the null hypothesis and accept the alternative one).

Using R for the Goodness of Fit Test

In R, the example given is even easier to set up:

# observed counts and hypothesized proportions
observed <- c(90, 30, 30)
expected_proportion <- c(0.5, 0.3, 0.2)
# goodness of fit test (correct = FALSE only matters for 2x2 tables)
chisq.test(observed, p = expected_proportion, correct = FALSE)

and the result will be:

Chi-squared test for given probabilities
data: observed
X-squared = 8, df = 2, p-value = 0.01832

The Test of Independence

The test of independence is commonly used to determine whether two factors are related to each other.

Generally, what we want to know is: “Is variable X independent of variable Y?”

Note: the answer we get from our test is only this, not how the variables are related.

In the case of the goodness of fit test, there is only one variable at play: the observed frequencies can therefore be listed in a single row, or column, of values in a table.

Tests of independence, on the other hand, involve two variables, and the object of the test is precisely the assumption that the two variables are statistically independent.

Since two variables are involved in the test, the observed frequencies are entered into a rows × columns contingency table.
For example, I represent the data relating to the age and gender of enthusiasts of a given commercial brand:

Age     Male   Female   Total
<35     66     54       120
>=35    78     12       90
Total   144    66       210

We want to test the null hypothesis that the two qualitative variables, gender and age, are independent. Therefore, the alternative hypothesis predicts that there is a relationship between the two variables.

If the hypothesis of independence is true, the expected frequency of each cell must stand in the same proportion to its row and column totals as those totals stand to the overall sample size; in other words, each expected frequency is the product of its row total and column total divided by the sample size n:

\( f_e=\frac{\Sigma_{row}\ \Sigma_{column}}{n}\ \ \ df=(r-1)(c-1)\ \ \ \)

At this point, I proceed with my example, starting with the expected frequency of the cell (<35, Male):

\( f_e=\frac{\Sigma_{row}\ \Sigma_{column}}{n}=\frac{120\times 144}{210}\approx 82.3 \)

The 3 remaining frequencies can be easily obtained by subtraction from the row and column totals. In fact, a 2×2 table has df=1, meaning that the frequency of only one cell is free to vary.

I will get (expected frequencies, rounded to whole numbers):

Age     Male   Female   Total
<35     82     38       120
>=35    62     28       90
Total   144    66       210
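
As a cross-check, the unrounded expected frequencies can be reproduced in R directly from the formula above (the object names are only illustrative):

# expected frequencies: outer product of row and column totals, divided by the grand total
row_totals <- c(120, 90)
col_totals <- c(144, 66)
outer(row_totals, col_totals) / 210
# approximately: 82.29  37.71
#                61.71  28.29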

\( H_0 \): gender and age are independent
\( H_a \): there is a relationship between gender and age

\( df=(2-1)(2-1)=1 \)

I choose a significance level of α=0.01

\( \chi^2_{critical}=6.63\ \)
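
As before, the tabulated value can be verified in R:

# upper critical value for alpha = 0.01 with 1 degree of freedom
qchisq(0.99, df = 1)   # about 6.63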

I calculate the chi-square value and find:

\( \chi^2=23.9\ \)
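
Expanding the sum over the four cells, with the unrounded expected frequencies, gives:

\( \chi^2=\frac{(66-82.29)^2}{82.29}+\frac{(54-37.71)^2}{37.71}+\frac{(78-61.71)^2}{61.71}+\frac{(12-28.29)^2}{28.29}\approx 3.22+7.03+4.30+9.38\approx 23.9 \)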

Therefore, the null hypothesis of independence is rejected at the 1% significance level. The variables age and gender are dependent.

The Test of Independence with Casio

To solve my example very easily with my Casio, I could have done this:

I load my table data into a matrix, which I call A:

[[66,54][78,12]]→[OPTN][MAT][MAT][ALPHA][A]

At this point, I move to the statistical functions:

[MENU][STAT]

[TEST][CHI][2WAY]

Observed:Mat A

Expected:Mat B

[CALC]
The result will be:

χ2=23.9299242
p=9.9907e-07
df=1

As can be seen from the very low p-value, I reject the null hypothesis and accept the alternative one.

The Test of Independence with R

I build my contingency table:

# 2x2 contingency table of observed frequencies
enthusiasts <- matrix(c(66, 54, 78, 12), ncol = 2, byrow = TRUE)
rownames(enthusiasts) <- c("less than 35", "35 or more")
colnames(enthusiasts) <- c("male", "female")
enthusiasts <- as.table(enthusiasts)
enthusiasts

I can calculate the row totals:
margin.table(enthusiasts,1)

and the column totals:
margin.table(enthusiasts,2)

the grand total is:
margin.table(enthusiasts)

I look at the expected values:
chisq.test(enthusiasts)$expected

and test the hypothesis with:
chisq.test(enthusiasts, correct = FALSE)

For 2×2 tables, chisq.test applies the Yates continuity correction by default; correct = FALSE disables it, so the result matches the value of about 23.9 obtained by hand and with the calculator.

The resulting very low p-value indicates that I can reject the null hypothesis of independence of the two variables.

paolo
