Correlation Coefficient and Covariance for Numeric Data


The correlation coefficient is also known as Pearson’s product-moment correlation coefficient.

Review of Standard Deviation

For a series of data A, the standard deviation is

$$ \sigma_A = \sqrt{ \frac{ \sum (a_i - \bar A)^2 }{ n } }, $$

where $n$ is the number of elements in series A.
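The formula above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `population_std` and the data in `A` are assumptions made for the example, not part of the original article.

```python
import math
import statistics


def population_std(series):
    """Population standard deviation: squared deviations divided by n."""
    n = len(series)
    mean = sum(series) / n
    return math.sqrt(sum((a - mean) ** 2 for a in series) / n)


# Illustrative data (assumed for this sketch).
A = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sigma_A = population_std(A)
print(sigma_A)               # 2.0
print(statistics.pstdev(A))  # cross-check against the standard library
```

Here the mean is 5 and the squared deviations sum to 32, so the result is $\sqrt{32/8} = 2$.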

The standard deviation is easy to interpret: it is the root-mean-square distance between the data points and their mean. In this article, we will take another point of view.

Now consider the two series of deviations $(a_i - \bar A)$ and $(a_j - \bar A)$. The geometric mean of a pair of elements is $\sqrt{(a_i - \bar A)(a_j - \bar A)}$, so for $i=j$ its square is

$$ M_i^2 = (a_i - \bar A)^2. $$

From this point of view, the variance $\sigma_A^2$ is the mean of these squared geometric means of the deviations, and the standard deviation is its square root.

Standard Deviation of the Sample

If we are dealing with a sample instead of the whole population, the standard deviation should be defined as

$$ \sigma_A = \sqrt{ \frac{ \sum (a_i - \bar A)^2 }{ n - 1 } }. $$

Why $n-1$? We can understand this by taking an extreme case. Suppose the sample contains only one data point. That point equals the sample mean, so the numerator is zero, and with $n-1=0$ the expression becomes $0/0$, which is undefined. This is the sensible answer, since a single point tells us nothing about the spread.
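The sample version differs only in the denominator. The sketch below is an illustration under assumed names (`sample_std`, `sample`); it raises an error for fewer than two points, which is one reasonable way to handle the one-point extreme case, and mirrors what `statistics.stdev` does.

```python
import math
import statistics


def sample_std(series):
    """Sample standard deviation: squared deviations divided by n - 1."""
    n = len(series)
    if n < 2:
        # With one point both numerator and denominator are zero.
        raise ValueError("sample standard deviation needs at least two points")
    mean = sum(series) / n
    return math.sqrt(sum((a - mean) ** 2 for a in series) / (n - 1))


# Illustrative data (assumed for this sketch).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

s = sample_std(sample)
print(s)                         # slightly larger than the population value
print(statistics.stdev(sample))  # cross-check against the standard library
```

For the same data, dividing by $n-1$ always gives a larger value than dividing by $n$, and the gap shrinks as $n$ grows.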

Generalize Standard Deviation to Covariances

Knowledge card: Covariance matrix.

Similarly, for two series A and B of the same length, we can define a quantity that averages the squared geometric mean of the corresponding deviations in the two series,

$$ \sigma_{A,B}^2 = \frac{ \sum (a_i - \bar A) (b_i - \bar B) }{ n }, $$

which is named the covariance of A and B, i.e., $\text{Cov} ({A,B})$.
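As a rough sketch of this definition, the following Python function computes the population covariance of two equal-length series. The name `covariance` and the sample data are assumptions for illustration only.

```python
def covariance(A, B):
    """Population covariance: mean of the products of paired deviations."""
    if len(A) != len(B):
        raise ValueError("series must have the same length")
    n = len(A)
    mean_a = sum(A) / n
    mean_b = sum(B) / n
    return sum((a - mean_a) * (b - mean_b) for a, b in zip(A, B)) / n


# Illustrative data (assumed for this sketch): B is exactly twice A.
A = [1.0, 2.0, 3.0, 4.0]
B = [2.0, 4.0, 6.0, 8.0]

cov_ab = covariance(A, B)
var_a = covariance(A, A)  # the covariance of a series with itself is its variance
print(cov_ab)  # 2.5
print(var_a)   # 1.25
```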

It is easy to show that

$$ \mathrm{Cov}({A,B}) = E(AB) - \bar A \bar B. $$
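For completeness, here is one way to see this, writing $E(AB) = \frac{1}{n}\sum_i a_i b_i$ for the mean of the elementwise products (this notation is an assumption of the sketch). Expanding the product inside the sum gives

$$ \begin{aligned} \mathrm{Cov}(A,B) &= \frac{1}{n} \sum_i (a_i - \bar A)(b_i - \bar B) \\ &= \frac{1}{n} \sum_i a_i b_i - \bar B \, \frac{1}{n} \sum_i a_i - \bar A \, \frac{1}{n} \sum_i b_i + \bar A \bar B \\ &= E(AB) - \bar A \bar B - \bar A \bar B + \bar A \bar B \\ &= E(AB) - \bar A \bar B. \end{aligned} $$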

At first glance, the square in $\sigma_{A,B}^2$ seems to be purely notational at this point.

Meanwhile, using this idea of the mean of geometric mean, we could easily generalize it to the covariance of three series,

$$ \sigma_{A,B,C}^3 = \frac{ \sum (a_i - \bar A) (b_i - \bar B)(c_i - \bar C) }{ n }, $$

or even arbitrary N series,

$$ \sigma_{A_1, A_2, \ldots, A_N }^N = \frac{ \sum_{i=1}^{n} \text{ geometric mean of the $i$th elements to the $N$th power } }{ n } = \frac{ \sum (a_{1,i} - \bar A_1) \cdots (a_{N,i} - \bar A_{N})}{ n }, $$

which should be called the covariance of all the N series, $\mathrm{Cov} ({A_1, A_2,\cdots, A_N })$.
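The N-series formula can be sketched the same way. The function name `joint_central_moment` and the data below are hypothetical, chosen only to illustrate the definition; for two series the function reduces to the ordinary covariance.

```python
import math


def joint_central_moment(*series):
    """Mean over i of the product of the i-th deviations of every series."""
    n = len(series[0])
    if any(len(s) != n for s in series):
        raise ValueError("all series must have the same length")
    means = [sum(s) / n for s in series]
    total = sum(
        math.prod(s[i] - m for s, m in zip(series, means))
        for i in range(n)
    )
    return total / n


# Illustrative data (assumed for this sketch).
A = [1.0, 2.0, 3.0, 4.0]
B = [2.0, 4.0, 6.0, 8.0]
C = [1.0, 0.0, 0.0, 1.0]

two_way = joint_central_moment(A, B)       # same as the covariance of A and B
three_way = joint_central_moment(A, B, C)  # the three-series generalization
print(two_way, three_way)
```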

Of course, we do not use these since we could easily build a covariance matrix to indicate all the possible covariances between any two variables, for example,

$$ \mathbf{C} = \begin{pmatrix} \mathrm{Cov} (A_1, A_1) & \mathrm{Cov} (A_1, A_2) \\ \mathrm{Cov} (A_2, A_1) & \mathrm{Cov} (A_2, A_2) \end{pmatrix} $$
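A minimal sketch of building such a matrix from a list of series follows. The names `covariance` and `covariance_matrix` and the two series are assumptions for illustration; the population ($1/n$) convention from above is used.

```python
def covariance(A, B):
    """Population covariance of two equal-length series."""
    n = len(A)
    mean_a = sum(A) / n
    mean_b = sum(B) / n
    return sum((a - mean_a) * (b - mean_b) for a, b in zip(A, B)) / n


def covariance_matrix(series_list):
    """Entry (j, k) is the covariance of series j with series k."""
    return [[covariance(x, y) for y in series_list] for x in series_list]


# Illustrative data (assumed for this sketch).
A1 = [1.0, 2.0, 3.0, 4.0]
A2 = [4.0, 3.0, 1.0, 2.0]

C = covariance_matrix([A1, A2])
for row in C:
    print(row)
# [1.25, -1.0]
# [-1.0, 1.25]
```

The matrix is symmetric, and its diagonal holds the variances of the individual series.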

Covariance measures how the two series vary together. To see this, take two identical series, A = B, which leads to $\sigma_{A,B} = \sigma_{A}$. Suppose instead that we have two series in completely opposite phase,


we have $\sigma_{A,B} = -1 $. The negative sign tells us that our series are anti-correlated.
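To make the sign behaviour concrete, the sketch below uses an assumed stand-in for opposite-phase series: one period of a sine wave and its negation. These series and the helper `covariance` are illustrative choices, not the article's own example, so the numerical value differs from the one quoted above; only the negative sign is the point.

```python
import math


def covariance(A, B):
    """Population covariance of two equal-length series."""
    n = len(A)
    mean_a = sum(A) / n
    mean_b = sum(B) / n
    return sum((a - mean_a) * (b - mean_b) for a, b in zip(A, B)) / n


# Assumed example: one full period of a sine wave and the same wave flipped.
N = 100
A = [math.sin(2 * math.pi * k / N) for k in range(N)]
B = [-a for a in A]

var_a = covariance(A, A)   # close to 0.5 for a unit-amplitude sine
cov_ab = covariance(A, B)  # equals -var_a, so it is negative
print(var_a, cov_ab)
```

With $B = -A$ the covariance is exactly the negative of the variance of A, so its size still depends on the amplitude of the series.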

Correlation Coefficient

However, the value of the covariance scales with the standard deviations of the individual series, which makes it hard to judge from the covariance alone how strong the correlation is.

The obvious normalization factor is the product of the standard deviations of the two series, $\sigma_A$ and $\sigma_B$, i.e.,

$$ r_{A,B} = \frac{ \text{Cov}(A,B) } { \sigma_{A} \sigma_{B} } = \frac{ \sum_i (a_i - \bar A) (b_i - \bar B) }{ n\sigma_{A} \sigma_{B} } $$

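Writing $M_i^a = |a_i - \bar A|$ and $M_i^b = |b_i - \bar B|$ for the magnitudes of the deviations, this becomes

$$ r_{A,B} = \frac{ \sum_{i} ( \text{Sign}(a_i - \bar A) M_{i}^a ) ( \text{Sign}(b_i - \bar B) M_i^b ) }{ \sqrt{ \sum_i (M_i^a)^2 } \sqrt{ \sum_j (M_j^b)^2 } }, $$

which can be read as a signed sum of the squared geometric means of corresponding deviations, normalized by the overall magnitudes of the two series.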
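The definition $r_{A,B} = \mathrm{Cov}(A,B) / (\sigma_A \sigma_B)$ can be sketched as follows. The function name `pearson_r` and the data are assumptions for illustration; the sketch does not guard against a constant series, for which a standard deviation is zero and the ratio is undefined.

```python
import math


def pearson_r(A, B):
    """Pearson correlation: covariance divided by both standard deviations."""
    n = len(A)
    mean_a = sum(A) / n
    mean_b = sum(B) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(A, B)) / n
    sigma_a = math.sqrt(sum((a - mean_a) ** 2 for a in A) / n)
    sigma_b = math.sqrt(sum((b - mean_b) ** 2 for b in B) / n)
    # Division fails if either series is constant (zero standard deviation).
    return cov / (sigma_a * sigma_b)


# Illustrative data (assumed for this sketch).
A = [1.0, 2.0, 3.0, 4.0]
B_linear = [2.0 * a + 3.0 for a in A]  # perfectly correlated with A
B_flipped = [-a for a in A]            # perfectly anti-correlated with A
A_scaled = [10.0 * a for a in A]       # same shape as A, larger spread

r_linear = pearson_r(A, B_linear)      # close to +1
r_flipped = pearson_r(A, B_flipped)    # close to -1
r_scaled = pearson_r(A_scaled, B_linear)  # unchanged by rescaling A
print(r_linear, r_flipped, r_scaled)
```

Unlike the covariance, the result does not change when either series is multiplied by a positive constant, which is what makes it usable as a measure of correlation strength.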


L Ma (2018). 'Correlation Coefficient and Covariance for Numeric Data', Datumorphism, 11 April. Available at: