Cross Entropy

The cross entropy of a distribution $q$ relative to a distribution $p$ is defined as¹

$$ H(p, q) = \mathbb E_{p} \left[ -\log q \right]. $$

Cross entropy $H(p, q)$ can also be decomposed as

$$ H(p, q) = H(p) + \operatorname{D}_{\mathrm{KL}} \left( p \parallel q \right), $$

where $H(p) = \mathbb E_{p}\left[ -\log p \right]$ is the [[entropy of $P$]], i.e., the expectation of the information content $-\log p$, and $\operatorname{D}_{\mathrm{KL}}$ is the [[KL Divergence]], which measures the difference between two distributions.
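The decomposition can be checked numerically for discrete distributions, where the expectation becomes a sum, $H(p, q) = -\sum_x p(x) \log q(x)$. The following is a minimal sketch using NumPy; the function names and the example distributions `p` and `q` are illustrative choices, and it assumes $q(x) > 0$ wherever $p(x) > 0$.

```python
import numpy as np


def entropy(p):
    """Shannon entropy H(p) = E_p[-log p] of a discrete distribution (in nats)."""
    p = np.asarray(p, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return -np.sum(p[mask] * np.log(p[mask]))


def cross_entropy(p, q):
    """Cross entropy H(p, q) = E_p[-log q]; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))


def kl_divergence(p, q):
    """KL divergence D_KL(p || q) = E_p[log(p / q)]; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))


# Hypothetical example distributions over three outcomes.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)  # the two values agree up to floating-point error
```

Since the KL divergence is non-negative, the sketch also illustrates that $H(p, q) \geq H(p)$, with equality when $q = p$.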

Cross entropy is widely used in classification problems, e.g., in [[logistic regression]], a simple model for classification².

Binary Cross Entropy

For a dataset with two classes ($0$ and $1$) in the target, we denote the true label probability by $p$ and the predicted probability by $q$. For example, $q_{\hat y=1}$ denotes the predicted probability of the label being $1$.

$$ \begin{align*} H(p, q) =& - p_{y=0} \log (q_{\hat y=0}) - p_{y=1} \log (q_{\hat y=1}) \\ =& - p_{y=0} \log (q_{\hat y=0}) - (1 - p_{y=0}) \log ( 1 - q_{\hat y=0} ) \end{align*} $$

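For a hard label $y\in \{0,1\}$, the true distribution puts all its mass on the observed class, i.e., $p_{y=0} = 1 - y$ and $p_{y=1} = y$, so we have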

$$ H(p, q) = \begin{cases} - \log (q_{\hat y=0}) , & \text{for } y=0 \\ - \log ( 1 - q_{\hat y=0} ) , & \text{for } y=1. \end{cases} $$

Combining the two cases, we can simply use the following formula,

$$ H(p, q) = - (1 - y) \log (q_{\hat y=0}) - y \log ( 1 - q_{\hat y=0} ). $$

The two probabilities $q_{\hat y=0}$ and $q_{\hat y=1} = 1 - q_{\hat y=0}$ are predicted by a model.
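As an illustration, the per-sample loss for hard labels, $-\log (q_{\hat y=0})$ for $y=0$ and $-\log (1 - q_{\hat y=0})$ for $y=1$, can be sketched with NumPy. The function name `binary_cross_entropy`, the clipping constant `eps`, and the example labels and predictions are assumptions made for this sketch. The function takes the predicted probability of label $1$, $q_{\hat y=1} = 1 - q_{\hat y=0}$, as input, which is a common convention but not the only possible one.

```python
import numpy as np


def binary_cross_entropy(y, q1, eps=1e-12):
    """Per-sample binary cross entropy for hard labels.

    y   : true labels, each 0 or 1
    q1  : predicted probability that the label is 1 (q_{y_hat=1})
    eps : clipping constant (an assumption of this sketch) to avoid log(0)
    """
    y = np.asarray(y, dtype=float)
    q1 = np.clip(np.asarray(q1, dtype=float), eps, 1.0 - eps)
    q0 = 1.0 - q1  # predicted probability that the label is 0
    # y = 0 selects -log(q0); y = 1 selects -log(1 - q0) = -log(q1).
    return -(1.0 - y) * np.log(q0) - y * np.log(q1)


# Hypothetical labels and model predictions.
y = np.array([0, 1, 1, 0])
q1 = np.array([0.1, 0.9, 0.4, 0.5])

losses = binary_cross_entropy(y, q1)
print(losses)         # per-sample losses
print(losses.mean())  # average loss over the four samples
```

Confident correct predictions (the first two samples) give a small loss, while the less certain or wrong predictions are penalized more heavily.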