In previous posts, we have seen different types of tests that we can use to analyze our data and test hypotheses.
The chi-square test was proposed by Karl Pearson in 1900, and it is widely used to assess how well the distribution of a categorical variable matches an expected distribution (in this case, we talk about the “Goodness of Fit Test”) or to assess whether two categorical variables are independent of each other (and then we talk about the “Test of Independence”).
Such is the importance and widespread use of this test that it was listed by the magazine Scientific American among the 20 most important scientific discoveries of the 20th century.
The Goodness of Fit Test
This is a very useful test, concerning the distribution of a categorical variable. It allows us to verify if the observed frequencies differ significantly from the expected frequencies when there are more than two possible outcomes.
The prerequisites for carrying out the test are very simple:
- The sample must be random;
- Observations must be independent for the sample (one observation per subject);
- The expected frequency in each class should not be less than 5.
This last point sounds rather cryptic and deserves a few more words. When the variable is continuous, or the characters are not nominal and individual sample observations are available, an important issue is determining the number of classes (also called “cells”) into which the distribution is divided. In practice, it is required that the theoretical (expected) frequencies are at least equal to 5; that is, it is necessary to verify that the expected number of elements in each class does not fall below this minimum threshold.
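To make the requirement concrete, here is a minimal R sketch (with hypothetical numbers, not the data used later in this post) of how one might check the expected counts before running the test:

```r
# Hypothetical example: check that every expected (theoretical) frequency is at least 5
n <- 40                                  # hypothetical sample size
expected_proportion <- c(0.7, 0.2, 0.1)  # hypothetical class proportions
expected <- n * expected_proportion      # expected frequencies: 28, 8, 4
all(expected >= 5)                       # FALSE: the last class is below the threshold,
                                         # so classes should be merged or more data collected
```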
Understanding Through a Simple Example
As usual, to better understand what we are talking about, we will explain it with a super-simplified (and, I apologize, quite ridiculous…) example.
Suppose a study was conducted on electronics hobbyists who use Arduino boards. It was found that 50% own only one Arduino board, 30% have 2 to 4 boards, and 20% own 5 or more.
Let’s imagine that I conducted my own independent study and found these data: out of 150 hobbyists, I found that 90 owned only one Arduino, 30 had 2 to 4 boards, and 30 had 5 or more boards.
The null hypothesis is that the proportions I found are in line with those of the official study.
The alternative hypothesis is obviously that the collected data do not confirm the proportions of the official study.
I prepare my table by entering the data:
| One Arduino | 2 to 4 boards | 5 or more boards | Total |
Observed Data | 90 | 30 | 30 | 150 |
Expected Data | 0.50 x 150 = 75 | 0.30 x 150 = 45 | 0.20 x 150 = 30 | 150 |
To accept the null hypothesis, the difference between the expected and observed frequencies must be attributable to sampling variability at the designated level of significance.
The χ2 statistic calculated from the sample data is given by:
\( \chi^2=\sum\frac{(f_o-f_e)^2}{f_e} \)
f_o = observed frequencies
f_e = expected frequencies
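This formula translates directly into R. Using the observed and expected counts from the table above (a small sketch that anticipates the hand calculation carried out below):

```r
f_o <- c(90, 30, 30)      # observed frequencies
f_e <- c(75, 45, 30)      # expected frequencies (0.50, 0.30, 0.20 of 150)
sum((f_o - f_e)^2 / f_e)  # chi-square statistic: 8
```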
The degrees of freedom for the goodness of fit test are:
\( df=k-1 \)
k = number of classes (categories) of the variable
Let’s use our example as a guide. We start from the hypotheses:
\( H_0: \text{the frequencies are } 0.5,\ 0.3,\ 0.2 \qquad H_a: \text{the frequencies are not } 0.5,\ 0.3,\ 0.2 \)
We have:
\( n=150 \qquad df=3-1=2 \)
We find the critical χ2 value in the tables (df=2, α=0.05).
The value is: 5.99
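Instead of looking it up in the tables, the same critical value can be obtained in R (a quick check):

```r
qchisq(0.95, df = 2)  # 5.991465, critical chi-square value for alpha = 0.05 and df = 2
```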
Now I calculate the χ2 value for my data:
\( \chi^2=\frac{(90-75)^2}{75}+\frac{(30-45)^2}{45}+\frac{(30-30)^2}{30}=\frac{225}{75}+\frac{225}{45}+\frac{0}{30}=3+5+0=8 \)
We conclude then (since the calculated value is higher than the critical value) that we can reject the null hypothesis at the 5% significance level. That is, we can reject the assertion that the frequencies are distributed according to the proportions 50%, 30%, 20%.
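Equivalently, we can compute the p-value for our statistic and compare it with α = 0.05 (a small R check; it matches the value reported by the calculator in the next section):

```r
pchisq(8, df = 2, lower.tail = FALSE)  # 0.01831564, below 0.05, so we reject H0
```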
Making Life Easier with a Casio Scientific Calculator
With my fx calculator, I just need to choose “STAT” from the menu and enter the observed values in list L1 and the expected values in L2 in my table editor.
Then I will choose:
[TEST]
[CHI]
[GoF]
Observed:List1
Expected:List2
df:2
[CALC]
and I will get both the chi-square value and the p-value (in this case, 0.01832, which is less than the alpha value of 0.05 I chose, confirming the conclusion that I can reject the null hypothesis and accept the alternative one).
Using R for the Goodness of Fit Test
In R, the example given is even easier to set up:
```r
observed <- c(90, 30, 30)
expected_proportion <- c(0.5, 0.3, 0.2)
chisq.test(observed, p = expected_proportion, correct = FALSE)
```

and the result will be:

```
Chi-squared test for given probabilities

data:  observed
X-squared = 8, df = 2, p-value = 0.01832
```
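If needed, the individual pieces of the result can also be extracted from the object returned by chisq.test (a small usage sketch, reusing the observed and expected_proportion vectors defined above):

```r
fit <- chisq.test(observed, p = expected_proportion, correct = FALSE)
fit$expected   # 75 45 30, the expected counts
fit$statistic  # X-squared = 8
fit$p.value    # 0.01831564
```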
The Test of Independence
It is commonly used to determine if two factors are related to each other.
Generally, what we want to know is: “Is variable X independent of variable Y?”
Note: the answer we get from our test is only this, not how the variables are related.
In the case of the goodness of fit test, there is only one variable at play: the observed frequencies can therefore be listed in a single row, or column, of values in a table.
Tests of independence, on the other hand, involve two variables, and the object of the test is precisely the assumption that the two variables are statistically independent.
Since two variables are involved in the test, the observed frequencies are entered into a contingency table of the row x column type.
For example, I represent the data relating to the age and gender of enthusiasts of a given commercial brand:
Age | Male | Female | Total |
<35 | 66 | 54 | 120 |
>=35 | 78 | 12 | 90 |
Total | 144 | 66 | 210 |
We want to test the null hypothesis that the two qualitative variables, gender and age, are independent. Therefore, the alternative hypothesis predicts that there is a relationship between the two variables.
If the hypothesis of independence is true, the expected frequency of each cell must stand in the same proportion to the totals of its row and column as those totals do to the overall sample size. In other words, the expected frequency of a cell is obtained by multiplying its row total by its column total and dividing by the grand total:
\( f_e=\frac{\Sigma_{row}\ \Sigma_{column}}{n} \qquad df=(r-1)(c-1) \)
where Σrow and Σcolumn are the totals of the row and column containing the cell, n is the total sample size, and r and c are the numbers of rows and columns in the contingency table.
At this point, I proceed with my example:
\( f_e=\frac{\Sigma_{row}\ \Sigma_{column}}{n}=\frac{120\times 144}{210}=82.3 \)
The 3 remaining frequencies can be easily obtained by subtraction from the row and column totals. In fact, a 2×2 table has df=1, meaning that the frequency of only one cell is free to vary.
I will get:
Age | Male | Female | Total |
<35 | 82 | 38 | 120 |
>=35 | 62 | 28 | 90 |
Total | 144 | 66 | 210 |
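The same table of expected frequencies can be reproduced in R from the row and column totals (a minimal sketch using the observed data above):

```r
observed <- matrix(c(66, 54, 78, 12), ncol = 2, byrow = TRUE)  # observed contingency table
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected)  # 82 38
                 # 62 28
```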
\( H_0: \text{gender and age are independent} \qquad H_a: \text{there is a relationship between gender and age} \)
\( df=(2-1)(2-1)=1 \)
I choose a significance level of α=0.01
\( \chi^2_{critical}=6.63 \)
I calculate the chi-square value and find:
\( \chi^2=23.9 \)
Therefore, the null hypothesis of independence is rejected at the 1% significance level. The variables age and gender are dependent.
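As before, both the critical value and the p-value can be verified in R (a quick check of the numbers used here):

```r
qchisq(0.99, df = 1)                       # 6.634897, critical value for alpha = 0.01
pchisq(23.93, df = 1, lower.tail = FALSE)  # about 1e-06, far below 0.01
```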
The Test of Independence with Casio
To solve my example very easily with my Casio, I could have done this:
I load my table data into a matrix, which I call A:
[[66,54][78,12]]→[OPTN][MAT][MAT][ALPHA][A]
At this point, I move to the statistical functions:
[MENU][STAT]
[TEST]
[CHI]
[2WAY]
Observed:Mat A
Expected:Mat B
[CALC]
The result will be:
χ2=23.9299242
p=9.9907e-07
df=1
As can be seen from the very low p-value, I accept the alternative hypothesis and reject the null hypothesis.
The Test of Independence with R
I build my contingency table:

```r
enthusiasts <- matrix(c(66, 54, 78, 12), ncol = 2, byrow = TRUE)
rownames(enthusiasts) <- c("less than 35", "35 or more")
colnames(enthusiasts) <- c("male", "female")
enthusiasts <- as.table(enthusiasts)
enthusiasts
```

I can calculate the row totals:

```r
margin.table(enthusiasts, 1)
```

and the column totals:

```r
margin.table(enthusiasts, 2)
```

the grand total is:

```r
margin.table(enthusiasts)
```

I look at the expected values:

```r
chisq.test(enthusiasts)$expected
```

and test the hypothesis with:

```r
# correct = FALSE disables Yates' continuity correction on the 2x2 table,
# so the statistic matches the value computed by hand and by the calculator (chi-square ≈ 23.93)
chisq.test(enthusiasts, correct = FALSE)
```
The resulting very low p-value indicates that I can reject the null hypothesis of independence of the two variables.