# Correlation Coefficient and Covariance for Numeric Data

## Covariances

### Review of Standard Deviation

For a series of data A, we have the standard deviation

$$ \sigma_A = \sqrt{ \frac{ \sum (a_i - \bar A)^2 }{ n } }, $$

where $n$ is the number of elements in series A.
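The formula above can be sketched in a few lines of Python. This is only an illustration: the helper name `population_std` and the sample series are made up for this example and do not come from the original text.

```python
import math


def population_std(values):
    """Population standard deviation: root of the mean squared deviation."""
    n = len(values)
    mean = sum(values) / n
    # Average the squared deviations over all n elements, then take the root.
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)


# Hypothetical series chosen so that the numbers are easy to follow:
# the mean is 5 and the squared deviations sum to 32.
series_a = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sigma_a = population_std(series_a)
print(sigma_a)  # 2.0
```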

The standard deviation is easy to interpret: it is the root-mean-square distance between the data points and their mean. In this article, we will take another point of view.

Now imagine we have two copies of the deviation series, $(a_i - \bar A)$ and $(a_j - \bar A)$. The geometric mean of two numbers $x$ and $y$ is $\sqrt{xy}$, so the squared geometric mean of the elements with $i=j$ is

$$ M_i^2 = (a_i - \bar A)^2. $$

From this point of view, the standard deviation is the square root of the mean of the squared **geometric means of the deviation of each element**.

### Standard Deviation of the Sample

If we are dealing with a sample instead of the whole population, the standard deviation is usually defined as

$$ \sigma_A = \sqrt{ \frac{ \sum (a_i - \bar A)^2 }{ n - 1 } }. $$

Why the $n-1$? We can understand this by taking an extreme case. Suppose the sample contains only one data point. Then both the numerator and the denominator vanish and the expression becomes the undefined $0/0$, which reflects the fact that a single point tells us nothing about the spread. The population formula would instead report a standard deviation of exactly zero.
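The single-point extreme case can be made concrete with a small sketch. The function name `sample_std`, the choice to raise an error for fewer than two points, and the sample values are assumptions made for this illustration only.

```python
import math


def sample_std(values):
    """Sample standard deviation with the n - 1 denominator."""
    n = len(values)
    if n < 2:
        # With one point the formula reads 0 / 0, so refuse to answer.
        raise ValueError("sample standard deviation needs at least two points")
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


# Hypothetical sample: squared deviations sum to 32, so s = sqrt(32 / 7).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s = sample_std(sample)

# A single data point carries no information about the spread.
try:
    sample_std([3.0])
    single_point_defined = True
except ValueError:
    single_point_defined = False
```

Raising an error is one possible design choice; returning `float("nan")` would express the same idea.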

### Generalize Standard Deviation to Covariances


Similarly, for two series A and B of the same length, we can define a quantity that averages the element-wise product of the deviations of the two series, i.e., the squared geometric mean in the sense used above,

$$ \sigma_{A,B}^2 = \frac{ \sum (a_i - \bar A) (b_i - \bar B) }{ n }, $$

which is named the covariance of A and B, i.e., $\text{Cov} ({A,B})$.

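Expanding the product and using $\sum a_i = n \bar A$ and $\sum b_i = n \bar B$, it is easy to show that

$$ \mathrm{Cov}({A,B}) = E( AB ) - \bar A \bar B, $$

where $E(AB) = \frac{1}{n} \sum a_i b_i$ is the mean of the element-wise product.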
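As a quick numerical check, the definition of the covariance and the shortcut identity can be compared directly. The helper names `mean` and `covariance` and the two short series are hypothetical, chosen only for this sketch.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Population covariance: mean product of the paired deviations."""
    if len(a) != len(b):
        raise ValueError("both series must have the same length")
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


# Hypothetical series for illustration.
series_a = [1.0, 2.0, 3.0, 4.0]
series_b = [2.0, 4.0, 6.0, 9.0]

# Direct definition.
cov_direct = covariance(series_a, series_b)

# Shortcut: mean of the element-wise product minus the product of the means.
mean_of_product = mean([x * y for x, y in zip(series_a, series_b)])
cov_identity = mean_of_product - mean(series_a) * mean(series_b)

print(cov_direct, cov_identity)  # both 2.875
```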

Covariance measures how the two series vary together. To see this, first take two identical series, A = B, which leads to $\sigma_{A,B}^2 = \sigma_{A}^2$: the covariance reduces to the variance. Next, suppose we have two series in completely opposite phase,

| index | A | B |
|---|---|---|
| 1 | 1 | -1 |
| 2 | -1 | 1 |
| 3 | 1 | -1 |
| 4 | -1 | 1 |
| 5 | 1 | -1 |
| 6 | -1 | 1 |
| 7 | 1 | -1 |

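we have $\sigma_{A,B}^2 = -48/49 \approx -0.98$. The value is not exactly $-1$ because the series has seven elements, so the means are $\bar A = 1/7$ and $\bar B = -1/7$ rather than zero. The negative sign tells us that our series are anti-correlated.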
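The covariance of the two series in the table can be computed directly. The sketch below uses exact fractions to avoid rounding; the variable names are chosen for this illustration only.

```python
from fractions import Fraction

# The two series from the table, in exact arithmetic.
series_a = [Fraction(v) for v in (1, -1, 1, -1, 1, -1, 1)]
series_b = [-v for v in series_a]

n = len(series_a)
mean_a = sum(series_a) / n  # 1/7, not 0, because n is odd
mean_b = sum(series_b) / n  # -1/7

# Population covariance and, for comparison, the variance of A.
cov_ab = sum((x - mean_a) * (y - mean_b)
             for x, y in zip(series_a, series_b)) / n
var_a = sum((x - mean_a) ** 2 for x in series_a) / n

print(cov_ab)  # -48/49
print(var_a)   # 48/49
```

Since B is the negative of A, the covariance is exactly the negative of the variance of A.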

## Correlation Coefficient

However, the value of the covariance scales with the standard deviations of the two series, so the raw number alone does not tell us how strong the correlation is.

The natural normalization factor is the product of the standard deviations of the two series, $\sigma_A$ and $\sigma_B$, i.e.,

$$ r_{A,B} = \frac{ \text{Cov}(A,B) } { \sigma_{A} \sigma_{B} } = \frac{ \sum_i (a_i - \bar A) (b_i - \bar B) }{ n\sigma_{A} \sigma_{B} } $$

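Writing $M_i^a = |a_i - \bar A|$ and $M_i^b = |b_i - \bar B|$ for the magnitudes of the deviations, this becomes

$$ r_{A,B} = \frac{ \sum_{i} ( \text{Sign}(a_i - \bar A) M_{i}^a ) ( \text{Sign}(b_i - \bar B) M_i^b ) }{ \sqrt{ \sum_i (M_i^a)^2 \sum_j (M_j^b)^2 } }, $$

which is a sum of signed products of the deviation magnitudes, normalized by the geometric mean of the two sums of squared deviations.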
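A minimal sketch of the correlation coefficient shows why the normalization helps: rescaling one series changes the covariance but leaves $r_{A,B}$ untouched. The function name `pearson_r`, the factor of 10, and the error raised for constant series are assumptions of this example.

```python
import math


def pearson_r(a, b):
    """Correlation coefficient: covariance divided by both standard deviations."""
    if len(a) != len(b):
        raise ValueError("both series must have the same length")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    dev_a = [x - mean_a for x in a]
    dev_b = [y - mean_b for y in b]
    numerator = sum(x * y for x, y in zip(dev_a, dev_b))
    # n * sigma_A * sigma_B written without the factors of n that cancel.
    denominator = math.sqrt(sum(x * x for x in dev_a) * sum(y * y for y in dev_b))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant series")
    return numerator / denominator


# Two series in opposite phase.
series_a = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0]
series_b = [-x for x in series_a]
r_anti = pearson_r(series_a, series_b)

# Rescaling B multiplies the covariance by 10 but does not change r.
scaled_b = [10.0 * y for y in series_b]
r_scaled = pearson_r(series_a, scaled_b)

print(r_anti, r_scaled)  # both approximately -1
```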

L Ma (2018). 'Correlation Coefficient and Covariance for Numeric Data', Datumorphism, 11 April. Available at: https://datumorphism.leima.is/wiki/statistics/correlation-coefficient/.