Logistic Regression

In a classification problem, given a list of feature values $x$ and their corresponding classes $\{c_i\}$, the posterior of the classes, i.e., the conditional probability of the classes given the features, is

$$ p(C=c_i\mid X=x). $$


The likelihood of the data is

$$ p(X=x\mid C=c_i). $$

Logistic Regression for Two Classes

For two classes, the simplest model for the posterior is a linear model of the log posterior ratio,

$$ \log \frac{p(C=c_1\mid X=x) }{p(C=c_2\mid X=x)} = \beta_0 + \beta_1 \cdot x, $$

which is equivalent to

$$ p(C=c_1\mid X=x) = \exp\left(\beta_0 + \beta_1 \cdot x\right) p(C=c_2\mid X=x) . $$


The reason that we propose a linear model for the quantity

$$ \log \frac{p(C=c_1\mid X=x) }{p(C=c_2\mid X=x)}, $$

is that it has a range from $-\infty$ to $\infty$, which matches the range of the linear model $\beta_0 + \beta_1 \cdot x$.

We can also see in the following results that such a relation guarantees that the conditional probabilities are restricted to the range 0 to 1 after applying the normalization constraint.

Using the normalization condition

$$ p(C=c_1\mid X=x) + p(C=c_2\mid X=x) = 1, $$

we can derive the posterior for each class,

$$ \begin{align} p(C=c_2\mid X=x) &= \frac{1}{1 + \exp\left(\beta_0 + \beta_1 \cdot x\right)} \\ p(C=c_1\mid X=x) &= \frac{\exp\left(\beta_0 + \beta_1 \cdot x\right)}{1 + \exp\left(\beta_0 + \beta_1 \cdot x\right)}. \end{align} $$
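As a quick numerical sketch (the function names are my own, and the direct $\exp$ form is chosen to mirror the formulas above), the two posteriors can be evaluated with NumPy:

```python
import numpy as np

def posterior_c1(x, beta0, beta1):
    """p(C=c1 | X=x) = exp(beta0 + beta1*x) / (1 + exp(beta0 + beta1*x))."""
    z = beta0 + beta1 * x
    # In practice the equivalent form 1 / (1 + exp(-z)) is preferred,
    # since exp(z) overflows for large z.
    return np.exp(z) / (1.0 + np.exp(z))

def posterior_c2(x, beta0, beta1):
    """p(C=c2 | X=x) = 1 / (1 + exp(beta0 + beta1*x))."""
    z = beta0 + beta1 * x
    return 1.0 / (1.0 + np.exp(z))

x = np.array([-2.0, 0.0, 2.0])
p1 = posterior_c1(x, beta0=0.5, beta1=1.0)
p2 = posterior_c2(x, beta0=0.5, beta1=1.0)
assert np.allclose(p1 + p2, 1.0)  # normalization holds by construction
```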

Figure: the two conditional probabilities as functions of $x' = \beta_0 + \beta_1 \cdot x$.

This is the [[sigmoid function]].

Limiting behavior

  1. As $\beta_0 + \beta_1 \cdot x \to \infty$, we have $p(C=c_2\mid X=x) \to 0$ and $p(C=c_1\mid X=x)\to 1$.
  2. As $\beta_0 + \beta_1 \cdot x \to 0$, we have $p(C=c_2\mid X=x) \to 0.5$ and $p(C=c_1\mid X=x)\to 0.5$.
  3. As $\beta_0 + \beta_1 \cdot x \to -\infty$, we have $p(C=c_2\mid X=x) \to 1$ and $p(C=c_1\mid X=x)\to 0$.
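These limits are easy to verify numerically; a minimal self-contained sketch (the `sigmoid` helper is my own naming for $p(C=c_1\mid X=x)$ as a function of $x' = \beta_0 + \beta_1 \cdot x$):

```python
import numpy as np

def sigmoid(z):
    """p(C=c1 | X=x) with z = beta0 + beta1 * x; p(C=c2 | X=x) is 1 - sigmoid(z)."""
    return 1.0 / (1.0 + np.exp(-z))

for z in (30.0, 0.0, -30.0):
    print(f"z={z:+.0f}  p(c1)={sigmoid(z):.3g}  p(c2)={1.0 - sigmoid(z):.3g}")
# z=+30 -> p(c1) ~ 1,   p(c2) ~ 0
# z=+0  -> p(c1) = 0.5, p(c2) = 0.5
# z=-30 -> p(c1) ~ 0,   p(c2) ~ 1
```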

Relation to Cross Entropy

For two classes, we can write down the likelihood as

$$ \prod_{i=1}^{N} p_i^{y_i} (1-p_i)^{1-y_i}, $$

where $y_i=1$ if the $i$th example belongs to class $c_1$ and $y_i=0$ if it belongs to class $c_2$, $p_i$ is the probability of class $c_1$, and $1-p_i$ is the probability of class $c_2$.

Taking the negative log-likelihood, we find that

$$ -\ell = \sum_{i=1}^N \left( -y_i \log p_i - (1-y_i)\log (1-p_i) \right). $$

This is the [[cross entropy]], $H(p, q) = \mathbb E_{p} \left[ -\log q \right]$, which decomposes as $H(p, q) = H(p) + \operatorname{D}_{\mathrm{KL}} \left( p \parallel q \right)$, where $H(p)$ is the entropy of $p$.
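As a sketch, this loss can be computed directly; `binary_cross_entropy` is my own naming, and the `eps` clipping is a standard guard against $\log 0$, not something from the text:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Negative log-likelihood: sum_i ( -y_i log p_i - (1 - y_i) log(1 - p_i) )."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0) at p = 0 or p = 1
    return np.sum(-y * np.log(p) - (1.0 - y) * np.log(1.0 - p))

y = np.array([1, 0, 1, 1])          # labels: 1 for class c1, 0 for class c2
p = np.array([0.9, 0.2, 0.7, 0.6])  # the model's p(C=c1 | x_i)
print(binary_cross_entropy(y, p))
```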

Logistic Regression for $K$ Classes

The two-class model is easily generalized to problems with $K$ classes, using class $c_K$ as the reference:

$$ \begin{align} p(C=c_K\mid X=x) &= \frac{1}{1 + \sum_{l=1}^{K-1}\exp\left(\beta_{l0} + \beta_l \cdot x\right)} \\ p(C=c_k\mid X=x) &= \frac{\exp\left(\beta_{k0} + \beta_k \cdot x\right)}{1 + \sum_{l=1}^{K-1}\exp\left(\beta_{l0} + \beta_l \cdot x\right)}, \quad k = 1, \dots, K-1. \end{align} $$
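A hedged sketch of these $K$-class posteriors (the function and argument names are mine; class $c_K$ is the reference, with coefficients implicitly fixed to zero):

```python
import numpy as np

def k_class_posteriors(x, beta0, beta):
    """Posteriors for K classes given a scalar feature x.

    beta0 and beta hold the K-1 coefficient sets for the non-reference
    classes. Returns an array of length K; the last entry is class c_K.
    """
    z = beta0 + beta * x                       # shape (K-1,): beta_{k0} + beta_k * x
    denom = 1.0 + np.sum(np.exp(z))
    return np.append(np.exp(z), 1.0) / denom   # class c_K contributes exp(0) = 1

p = k_class_posteriors(x=0.3, beta0=np.array([0.1, -0.2]), beta=np.array([1.0, 0.5]))
print(p, p.sum())  # three classes; probabilities sum to 1
```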

Why Not Nonlinear Models

The log of the posterior ratio can be more complex than a linear model. In general, we have[^1]

$$ \log \frac{p(C=c_1\mid X=x) }{p(C=c_2\mid X=x)} = f(x), $$

so that

$$ p(C=c_1\mid X=x) = \frac{\exp(f(x))}{ 1 + \exp(f(x)) }. $$

The logistic regression model we discussed in the previous sections requires

$$ f(x) = \beta_0 + \beta_1 \cdot x. $$

A more general additive model is

$$ f(x) = \sum_i f_i(x), $$

where we can apply algorithms such as local scoring to fit such models[^1].
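As an illustration (the component functions below are arbitrary choices of mine, not functions fitted by local scoring), any additive $f$ plugs into the same sigmoid form:

```python
import numpy as np

def f_additive(x):
    """An arbitrary additive model f(x) = f_1(x) + f_2(x), for illustration only."""
    return np.sin(x) + 0.5 * x**2

def posterior_c1_general(x, f):
    """p(C=c1 | X=x) = exp(f(x)) / (1 + exp(f(x))), in numerically stable form."""
    return 1.0 / (1.0 + np.exp(-f(x)))

print(posterior_c1_general(np.array([-1.0, 0.0, 1.0]), f_additive))
```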

[^1]: Friedman J, Hastie T, Tibshirani R. Additive Logistic Regression. The Annals of Statistics. 2000. pp. 337–374. doi:10.1214/aos/1016218223
