Chi-square Correlation Test for Nominal Data

In this article, we will discuss the chi-square correlation test for detecting correlations between two series.

Steps

  1. Find out all the possible values of the two nominal series A and B;
  2. Count the co-occurrences of the combinations (A, B);
  3. Calculate the expected co-occurrences of the combinations (A, B);
  4. Calculate chi-square;
  5. Determine whether the hypothesis that A and B are independent can be rejected.

Define the Series

Suppose we are analyzing two series A and B. Series A can take values $a_1$ and $a_2$, while series B can take values $b_1$, $b_2$ and $b_3$.

$$ \begin{align} A &:= \{a_1, a_2\} \\ B &:= \{b_1, b_2, b_3\} \end{align} $$

As an example, we will use the following A and B series for our calculations in this article.

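| index | A | B |
|---|---|---|
| 1 | a1 | b2 |
| 2 | a1 | b2 |
| 3 | a1 | b1 |
| 4 | a2 | b1 |
| 5 | a2 | b3 |
| 6 | a2 | b2 |
| 7 | a1 | b2 |
| 8 | a2 | b2 |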

Count Co-occurrences

To analyze correlations between the two series, we need to look at whether particular values of series A and of series B tend to occur together. For example, we would like to know which values of B are likely when $a_1$ has occurred.

One of the extreme examples is that A and B are exactly the same. In this case, we would know that the value for B is always the same as A for each row. Then we would know that all the possible combinations of (A, B) are

  1. (a1, a1)
  2. (a2, a2)

We could construct the occurrence table.

| | a1 | a2 |
|---|---|---|
| a1 | number of occurrences | 0 |
| a2 | 0 | number of occurrences |

Now we construct a contingency table for our example series to record the occurrences of each combination (A, B).

| | a1 | a2 |
|---|---|---|
| b1 | 1 | 1 |
| b2 | 3 | 2 |
| b3 | 0 | 1 |

where the cells are filled with the number of occurrences of the corresponding combinations. For example, the combination (a1, b1) occurred once, so the cell in the first row and first column is 1.

This table records the observed frequencies, which we denote as table O and each cell is denoted as $o_{ij}$.
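The counting step can be sketched in a few lines of Python. This is only an illustration using the standard library; the variable names (`A`, `B`, `pair_counts`, `observed`) are chosen here and do not come from the original article.

```python
from collections import Counter

# Example series from the article (rows with index 1 to 8).
A = ["a1", "a1", "a1", "a2", "a2", "a2", "a1", "a2"]
B = ["b2", "b2", "b1", "b1", "b3", "b2", "b2", "b2"]

# Count how often each (A, B) pair occurs, row by row.
pair_counts = Counter(zip(A, B))

a_values = sorted(set(A))
b_values = sorted(set(B))

# observed[b][a] mirrors the table layout: rows are B values, columns are A values.
# A Counter returns 0 for pairs that never occur, such as (a1, b3).
observed = {b: {a: pair_counts[(a, b)] for a in a_values} for b in b_values}

for b in b_values:
    print(b, [observed[b][a] for a in a_values])
```

Each printed row should match the corresponding row of the observed table O.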

This table already tells us something about possible correlations. If the two columns were exactly the same, the table would have large numbers of occurrences on the diagonal and zeros elsewhere.
However, those numbers in the table depend on the number of rows that we have in our original table. To find the actual correlation, we need to normalize it. We could simply divide everything by the total number of rows in the original table. But Pearson had a better idea.

Pearson’s chi-square correlation is a smart idea.

First of all, we define an expectation table E. Each element of E is calculated as

$$ e_{ij} = \frac{ \text{number of } a_i \times \text{ number of } b_j }{ \text{ total number of rows in original table } } $$

This $e_{ij}$ serves as the expected number of occurrences of each combination of $a_i$ and $b_j$ if the two series were unrelated. If $a_i$ appears in every row but $b_j$ occurs only once, the expected count is 1: the single row containing $b_j$ must also contain $a_i$.

When $a_i$ and $b_j$ each occur several times, this average still works. Suppose $a_1$ occurs 4 times in a total of 8 rows. Assuming that $a_1$ appears randomly in the rows, what is the probability of seeing $a_1$ if we choose a random row? It is $4/8=0.5$. We then expect $1\times 0.5=0.5$ occurrences of $a_1$ for one occurrence of $b_2$. If we have 3 occurrences of $b_2$, we would expect to see $3\times 0.5=1.5$ occurrences of $a_1$.

This is why it is treated as the expected frequency of each combination.
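As an illustration, the expectation table for the example series can be computed directly from the marginal counts. The sketch below uses only the standard library; the names `count_a`, `count_b` and `expected` are introduced here for illustration.

```python
from collections import Counter

# Example series from the article (rows with index 1 to 8).
A = ["a1", "a1", "a1", "a2", "a2", "a2", "a1", "a2"]
B = ["b2", "b2", "b1", "b1", "b3", "b2", "b2", "b2"]

n = len(A)            # total number of rows
count_a = Counter(A)  # how often each value of A occurs
count_b = Counter(B)  # how often each value of B occurs

# e_ij = (number of a_i) * (number of b_j) / (total number of rows)
# expected[b][a] uses the same layout as the observed table: rows B, columns A.
expected = {
    b: {a: count_a[a] * count_b[b] / n for a in sorted(count_a)}
    for b in sorted(count_b)
}

for b in sorted(expected):
    print(b, [expected[b][a] for a in sorted(expected[b])])
```

For this data both values of A occur 4 times, so each row of E splits the count of $b_j$ evenly between $a_1$ and $a_2$.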

As a side note, suppose each series takes only a single value, so that A is always $a_1$ and B is always $b_1$.

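| index | A | B |
|---|---|---|
| 1 | a1 | b1 |
| 2 | a1 | b1 |
| 3 | a1 | b1 |
| 4 | a1 | b1 |
| 5 | a1 | b1 |
| 6 | a1 | b1 |
| 7 | a1 | b1 |
| 8 | a1 | b1 |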

Then we expect $e_{11} = n \times n / n = n$, where $n=8$ is the number of rows.

Now if we compare the observed table O with the expected table E,

$$ o_{ij} - e_{ij} $$

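we get the deviation from the expected table. With a few small tweaks (squaring each deviation so that positive and negative deviations do not cancel, and dividing by the expected count so that cells are on a comparable scale), we define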

$$ \chi^2 = \sum_{i,j} \frac{ (o_{ij} - e_{ij})^2 } { e_{ij} } $$
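To make the formula concrete, here is a minimal sketch that evaluates it for the example series. The function name `chi_square` is hypothetical and the code uses only the standard library; it assumes every value that appears in a series occurs at least once, so no expected count is zero.

```python
from collections import Counter

def chi_square(series_a, series_b):
    """Pearson's chi-square statistic for two nominal series of equal length."""
    n = len(series_a)
    count_a = Counter(series_a)
    count_b = Counter(series_b)
    observed = Counter(zip(series_a, series_b))

    total = 0.0
    for a in count_a:
        for b in count_b:
            e = count_a[a] * count_b[b] / n   # expected count e_ij
            o = observed[(a, b)]              # observed count o_ij (0 if absent)
            total += (o - e) ** 2 / e
    return total

# Example series from the article (rows with index 1 to 8).
A = ["a1", "a1", "a1", "a2", "a2", "a2", "a1", "a2"]
B = ["b2", "b2", "b1", "b1", "b3", "b2", "b2", "b2"]

chi2 = chi_square(A, B)
print(round(chi2, 6))  # 1.2
```

Only the $b_2$ and $b_3$ rows contribute here, since the observed and expected counts agree in the $b_1$ row.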

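If A and B are independent, the observed counts match the expected counts,

$$ o_{ij} = e_{ij}, $$

and we get $$ \chi^2 = 0 $$

The opposite extreme is when A and B are exactly the same. If there are $k$ possible values and each occurs $m$ times, then $n = km$, $o_{ij} = \delta_{ij} m$ and $e_{ij} = m \times m / (km) = m/k$, which gives

$$ \chi^2 = k \frac{ (m - m/k)^2 }{ m/k } + k(k-1) \frac{ (m/k)^2 }{ m/k } = m k (k-1) = n (k-1). $$

So a chi-square value near zero does not reject the hypothesis that the two columns are independent, while a large value is evidence that they are correlated.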

How to Use the Number Chi-square

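The final question is how to use the result. The hypothesis being tested is that the two columns are independent. We choose a significance level (for example 0.05) and look up the corresponding threshold $\chi_0^2$ for $(r-1)(c-1)$ degrees of freedom, where $r$ and $c$ are the numbers of rows and columns of the contingency table. Whenever our calculated value is larger than $\chi_0^2$, we reject the hypothesis of independence and conclude that the two columns are correlated. Values of $\chi_0^2$ are tabulated in statistics textbooks.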
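For the example contingency table (3 rows, 2 columns) there are $(3-1)(2-1) = 2$ degrees of freedom. In that special case the chi-square tail probability has the closed form $e^{-x/2}$, so the threshold can be computed without a lookup table. The sketch below relies on that fact and is therefore only valid for 2 degrees of freedom; the significance level of 0.05 is an assumption made here, not a value from the article.

```python
import math

# Observed contingency table from the example: rows b1..b3, columns a1, a2.
observed = [[1, 1], [3, 2], [0, 1]]

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Pearson's chi-square: sum over cells of (o - e)^2 / e.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        chi2 += (o - e) ** 2 / e

dof = (len(row_totals) - 1) * (len(col_totals) - 1)
alpha = 0.05  # assumed significance level

# Closed forms below hold ONLY for dof == 2, where P(X > x) = exp(-x / 2).
if dof != 2:
    raise ValueError("this sketch only handles 2 degrees of freedom")
threshold = -2 * math.log(alpha)   # critical value chi_0^2
p_value = math.exp(-chi2 / 2)      # probability of a value at least this large

reject_independence = chi2 > threshold
print(round(threshold, 3), round(p_value, 3), reject_independence)
```

With only 8 rows the calculated value stays well below the threshold, so this data does not let us reject independence. For other degrees of freedom, a table or a statistics library would be needed instead of the closed form.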

Other Methods

  1. Kendall rank correlation coefficient
  2. Spearman’s rank correlation coefficient


L Ma (2018). 'Chi-square Correlation Test for Nominal Data', Datumorphism, 11 April. Available at: https://datumorphism.leima.is/wiki/statistics/correlation-analysis-chi-square/.