## Linear Regression and Likelihood

The linear estimator $y$ is

$$y^n = \beta^m X_m^{\phantom{m}n}. \label{eq-linear-model}$$

As usual, we have redefined our data to get rid of the intercept $\beta^0$.
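The text does not say how the data is redefined. One common choice, assumed here, is to append a constant feature equal to 1 so that the intercept becomes just another coefficient. A minimal NumPy sketch of that idea follows; the helper name `add_constant_feature` and the toy data are hypothetical, and samples are stored as rows (the transpose of the $X_m^{\phantom{m}n}$ index convention).

```python
import numpy as np

def add_constant_feature(X):
    """Append a column of ones so the intercept becomes an ordinary coefficient."""
    X = np.asarray(X, dtype=float)
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])

# Toy, noise-free data generated from y = 2 * x + 3 (samples stored as rows).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_target = 2.0 * X[:, 0] + 3.0

# After augmentation the model is purely linear: y = X_aug @ beta.
X_aug = add_constant_feature(X)
beta, *_ = np.linalg.lstsq(X_aug, y_target, rcond=None)
```

With this redefinition, the last entry of `beta` plays the role of the intercept $\beta^0$, and the model keeps the form $y^n = \beta^m X_m^{\phantom{m}n}$.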

In ordinary linear models, the error is the difference between the target $\hat y$ and the estimator $y$,

$$\epsilon = \hat y - y,$$

and we look for the coefficients that make its magnitude as small as possible.

In linear regression, we solve this problem with least squares, i.e., by minimizing the sum of squared errors. In Bayesian linear regression, instead of using a deterministic estimator $\beta^m X_m^{\phantom{m}n}$, we assume a Gaussian random estimator

$$\mathcal{N}(\mu, \sigma^2) = \mathcal{N}(\beta^m X_m^{\phantom{m}n}, \sigma^2),$$

where we have borrowed from linear regression the idea that the mean of the estimator should be the linear model $\beta^m X_m^{\phantom{m}n}$. The likelihood becomes

$$P(\hat y^n \mid [X_m^{\phantom{m}n}, \beta^m] ) = \frac{1}{\sqrt{2 \sigma^2 \pi}} \exp \left( -\frac{(\hat y^n - \beta^m X_m^{\phantom{m}n})^2}{2 \sigma^2} \right).$$

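It is not surprising that maximizing this likelihood leads to the same result as least squares: taking the logarithm removes the exponential and leaves $-(\hat y^n - \beta^m X_m^{\phantom{m}n})^2 / (2 \sigma^2)$ plus a constant, so maximizing the log-likelihood is the same as minimizing the squared error.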
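The equivalence between maximum likelihood and least squares can be checked numerically. The sketch below assumes independent samples, so the per-sample log-likelihoods add up; the function name `neg_log_likelihood`, the synthetic data, and the noise level are all hypothetical choices for illustration, and samples are stored as rows.

```python
import numpy as np

def neg_log_likelihood(beta, X, y_target, sigma):
    """Negative Gaussian log-likelihood, summed over independent samples."""
    residual = y_target - X @ beta
    n = y_target.shape[0]
    normalization = 0.5 * n * np.log(2.0 * np.pi * sigma**2)
    return normalization + np.sum(residual**2) / (2.0 * sigma**2)

# Synthetic data from a known linear model with Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3
y_target = X @ beta_true + sigma * rng.normal(size=50)

# Ordinary least-squares solution.
beta_ols, *_ = np.linalg.lstsq(X, y_target, rcond=None)

# Gradient of the negative log-likelihood: -X^T (y - X beta) / sigma^2.
# It vanishes at the least-squares solution.
grad_at_ols = -X.T @ (y_target - X @ beta_ols) / sigma**2

# Moving away from beta_ols increases the negative log-likelihood.
nll_ols = neg_log_likelihood(beta_ols, X, y_target, sigma)
nll_perturbed = neg_log_likelihood(beta_ols + 0.01, X, y_target, sigma)
```

Note that $\sigma$ only rescales and shifts the objective, so it does not affect where the maximum sits.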

## Bayesian Linear Model

Applying Bayes’ theorem to this problem,

$$P( [X_m^{\phantom{m}n}, \beta^m] \mid \hat y^n ) {\color{red}P(\hat y^n)} = P(\hat y^n \mid [X_m^{\phantom{m}n}, \beta^m] ) P([X_m^{\phantom{m}n}, \beta^m]).$$

Since ${\color{red}P(\hat y^n)}$ does not depend on the parameters, it is a constant for the purposes of optimization and we can drop it:

$$P( [X_m^{\phantom{m}n}, \beta^m] \mid \hat y^n ) \propto P(\hat y^n \mid [X_m^{\phantom{m}n}, \beta^m] ) P([X_m^{\phantom{m}n}, \beta^m]).$$

Fall back to Maximum Likelihood

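If the prior is flat,

$$P([X_m^{\phantom{m}n}, \beta^m]) = 1,$$

the posterior is proportional to the likelihood, and maximizing it reduces to maximum likelihood estimation.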

We will assume a least-information model for $P([X_m^{\phantom{m}n}, \beta^m])$, namely a zero-mean Gaussian in the coefficients with variance $\sigma_\beta^2$, that is

$$P([X_m^{\phantom{m}n}, \beta^m]) = \mathcal{N} (0, \sigma_\beta^2) = \frac{1}{\sqrt{2 \sigma_\beta^2 \pi}} \exp \left( -\frac{(\beta^m )^2}{2 \sigma_\beta^2} \right).$$

The log posterior then becomes

$$\log P( [X_m^{\phantom{m}n}, \beta^m] \mid \hat y^n ) = -\frac{(\hat y^n - \beta^m X_m^{\phantom{m}n})^2}{2 \sigma^2} -\frac{(\beta^m )^2}{2 \sigma_\beta^2} + \mathrm{Const.}$$

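Up to the constant and an overall factor of $-1/(2\sigma^2)$, this is nothing but the ridge loss $(\hat y^n - \beta^m X_m^{\phantom{m}n})^2 + \lambda (\beta^m)^2$ with coefficient $\lambda$, where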

$$\frac{\sigma^2}{\sigma_\beta^2} = \lambda.$$
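As a sanity check, the ridge solution with $\lambda = \sigma^2 / \sigma_\beta^2$ should be a stationary point of the log posterior. The sketch below assumes independent samples and an independent Gaussian prior on each coefficient; the function names `ridge_closed_form` and `log_posterior_gradient`, the synthetic data, and the values of `sigma` and `sigma_beta` are hypothetical.

```python
import numpy as np

def ridge_closed_form(X, y_target, lam):
    """Minimize ||y - X beta||^2 + lam * ||beta||^2 via the normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y_target)

def log_posterior_gradient(beta, X, y_target, sigma, sigma_beta):
    """Gradient of the log posterior: Gaussian likelihood plus Gaussian prior."""
    likelihood_term = X.T @ (y_target - X @ beta) / sigma**2
    prior_term = -beta / sigma_beta**2
    return likelihood_term + prior_term

# Synthetic data (samples stored as rows).
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
sigma, sigma_beta = 0.5, 2.0
y_target = X @ np.array([1.5, -0.7, 0.2]) + sigma * rng.normal(size=40)

# Ridge coefficient implied by the noise and prior variances.
lam = sigma**2 / sigma_beta**2

# The ridge solution is the maximum a posteriori estimate,
# so the log-posterior gradient vanishes there.
beta_map = ridge_closed_form(X, y_target, lam)
grad = log_posterior_gradient(beta_map, X, y_target, sigma, sigma_beta)
```

A wide prior (large $\sigma_\beta$) gives a small $\lambda$ and weak shrinkage, while a narrow prior gives a large $\lambda$ and pulls the coefficients toward zero.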
