# Naive Bayes

Naive Bayes is a classifier based on [[Bayes' Theorem]].

Bayes' theorem is stated as

$$ P(A\mid B) = \frac{P(B \mid A) P(A)}{P(B)}, $$

where

- $P(A\mid B)$ is the conditional probability of $A$ given $B$,
- $P(B\mid A)$ is the conditional probability of $B$ given $A$,
- $P(A)$ is the marginal probability of $A$, and likewise $P(B)$ for $B$.

There is a nice tree diagram for Bayes' theorem on Wikipedia.

Figure: Tree diagram of Bayes' theorem with ‘naive’ assumptions.
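As a quick numerical illustration of Bayes' theorem, here is a minimal sketch in Python. The probabilities are made-up values chosen only for the example, and the helper name `bayes_posterior` is hypothetical.

```python
def bayes_posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b


# Hypothetical inputs, not taken from any dataset.
p_a = 0.2              # P(A)
p_b_given_a = 0.9      # P(B|A)
p_b_given_not_a = 0.1  # P(B|not A)

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = bayes_posterior(p_b_given_a, p_a, p_b)
print(f"P(A|B) = {posterior:.4f}")
```

With these numbers, observing $B$ raises the probability of $A$ from the prior 0.2 to 0.18 / 0.26, roughly 0.69.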

## Problems with Conditional Probability Calculation

By definition, the conditional probability of event $\mathbf Y$ given features $\mathbf X$ is

$$ \begin{equation} P(\mathbf Y\mid \mathbf X) = \frac{P(\mathbf Y, \mathbf X)}{ P(\mathbf X) }, \label{def-cp-y-given-x} \end{equation} $$

where

- $P(\mathbf X)$ is the probability of an event having the features $\mathbf X$,
- $P(\mathbf Y, \mathbf X)$ is the probability of the event $\mathbf Y$ with features $\mathbf X$.

In equation $\eqref{def-cp-y-given-x}$, estimating $P(\mathbf X)$ is not easy. Imagine the space spanned by 10 features: it is 10-dimensional, and the number of possible feature combinations grows exponentially with the number of features. With a limited dataset, most combinations are observed rarely or not at all, so the estimated probabilities are usually inaccurate. An accurate estimate of the probability of one specific combination requires a large dataset in which every kind of feature combination occurs many times.
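To get a feeling for how sparse such a space is, the following sketch counts how many feature combinations a dataset actually covers. The setup is an assumption made for illustration: 10 categorical features with 3 possible values each, and 1000 rows drawn uniformly at random.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

n_features = 10   # as in the text
n_values = 3      # hypothetical: each feature takes one of 3 values
n_samples = 1000  # hypothetical dataset size

# Number of distinct feature combinations: 3**10 = 59049.
n_combinations = n_values ** n_features

# A synthetic dataset; each row is one combination of feature values.
data = [
    tuple(random.randrange(n_values) for _ in range(n_features))
    for _ in range(n_samples)
]

observed = set(data)
coverage = len(observed) / n_combinations

print(f"possible combinations: {n_combinations}")
print(f"observed combinations: {len(observed)}")
print(f"coverage: {coverage:.2%}")
```

Even in the best case where every row is a distinct combination, 1000 rows cover less than 2% of the 59049 possibilities, so most of the entries in a table of $P(\mathbf X)$ would be estimated from zero observations.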

It is the same situation for the estimation of $P(\mathbf X \mid \mathbf Y)$ and $P(\mathbf Y, \mathbf X)$.

On the other hand, the conditional probability of event $\mathbf Y$ given a set of features $\mathbf X$ can also be calculated using Bayes' theorem,

$$ \begin{equation} P(\mathbf Y\mid \mathbf X) = \frac{P(\mathbf Y) P(\mathbf X \mid \mathbf Y)}{ P(\mathbf X) }, \label{cp-by-bayes-theorem-init} \end{equation} $$

where

- $P(\mathbf Y)$ is the probability of event $\mathbf Y$.

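In equation $\eqref{cp-by-bayes-theorem-init}$, we calculate $P(\mathbf X\mid \mathbf Y)$ instead of $P( \mathbf Y\mid \mathbf X)$. Estimated jointly over all features, it suffers from the same sparsity, but once it is factorized feature by feature under the naive assumption below, it is better defined in a small dataset.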

## Naive Bayes

Suppose we are solving a classification problem, with features denoted as $\mathbf X$, and class results as $\mathbf Y$. We would like to train a classifier for the class results given some feature values. Bayes' theorem tells us the probability

$$ \begin{equation} P(\mathbf Y \mid \mathbf X) = \frac{ P(\mathbf X \mid \mathbf Y) P(\mathbf Y) }{ P(\mathbf X) }. \end{equation} $$

Being “naive”, we will assume that the features are independent of each other, both marginally and conditioned on the class $\mathbf Y$, i.e., they don’t interact with each other in terms of predictions. In this case, we simply write down the theorem as

$$ \begin{equation} P(\mathbf Y \mid \mathbf X) = \frac{ P(\mathbf Y) \prod_i P(X_i \mid \mathbf Y) }{ \prod_i P(X_i) } \propto P(\mathbf Y) \prod_i P(X_i \mid \mathbf Y) . \label{eq-naive-approximation} \end{equation} $$

We do not care about $\prod_i P(X_i)$ because it does not depend on $\mathbf Y$ and only serves as a normalization factor. Besides, it could be hard to calculate in some cases.
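The unnormalized score $P(\mathbf Y) \prod_i P(X_i \mid \mathbf Y)$ can be estimated by simple counting. Below is a minimal sketch for categorical features. The function names `fit` and `predict` and the toy weather-like data are hypothetical, and no smoothing is applied, so a feature value never seen with a class gets probability zero.

```python
from collections import Counter, defaultdict


def fit(rows, labels):
    """Estimate P(Y) and P(X_i | Y) from counts."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {y: c / n for y, c in class_counts.items()}

    # cond_counts[(i, y)][value] = number of rows of class y
    # whose i-th feature equals value.
    cond_counts = defaultdict(Counter)
    for row, y in zip(rows, labels):
        for i, value in enumerate(row):
            cond_counts[(i, y)][value] += 1

    likelihoods = {
        (i, y): {v: c / class_counts[y] for v, c in counter.items()}
        for (i, y), counter in cond_counts.items()
    }
    return priors, likelihoods


def predict(row, priors, likelihoods):
    """Return the class with the largest P(Y) * prod_i P(X_i | Y)."""
    scores = {}
    for y, prior in priors.items():
        score = prior
        for i, value in enumerate(row):
            # Unseen (value, class) pairs get probability 0 (no smoothing).
            score *= likelihoods[(i, y)].get(value, 0.0)
        scores[y] = score
    return max(scores, key=scores.get), scores


# Hypothetical toy data: (outlook, temperature) -> label.
rows = [
    ("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
    ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot"),
]
labels = ["no", "yes", "yes", "no", "yes", "no"]

priors, likelihoods = fit(rows, labels)
label, scores = predict(("sunny", "mild"), priors, likelihoods)
print(label, scores)
```

Each $P(X_i \mid \mathbf Y)$ is a one-dimensional table, so it needs far fewer observations than the joint table over all features. The scores are not normalized; dividing each by their sum would recover $P(\mathbf Y \mid \mathbf X)$ under the naive assumption.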

## Log-Likelihood

In Eq. $\eqref{eq-naive-approximation}$, we have a bunch of probabilities multiplied together. Probabilities are no larger than 1, so this product is usually tiny and can underflow to zero in floating-point arithmetic. So we simply take the log of both sides of the equation in order to work with numbers of ordinary magnitude. After taking the log, the product becomes a sum, which also makes the terms easier to deal with.

$$ \log \left( P(\mathbf Y \mid \mathbf X) \right) = \log \left( P(\mathbf Y) \right) + \sum_i \log \left( P(X_i \mid \mathbf Y) \right) + \mathrm{Const.}. $$
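The numerical benefit of the log form can be seen in a short sketch. The inputs are hypothetical: 1000 features, each with conditional probability 0.01, and a prior of 0.5.

```python
import math

# Hypothetical values for illustration only.
prior = 0.5
cond_probs = [0.01] * 1000  # P(X_i | Y) for 1000 features

# Direct product: 0.5 * 0.01**1000 is far below the smallest
# positive double, so the result underflows to exactly 0.0.
direct = prior
for p in cond_probs:
    direct *= p

# Log form: log P(Y) + sum_i log P(X_i | Y), up to the constant term.
log_score = math.log(prior) + sum(math.log(p) for p in cond_probs)

print(direct)     # the product has underflowed
print(log_score)  # a perfectly ordinary negative number
```

Because the log is monotonic, comparing `log_score` between classes selects the same class as comparing the original products would, without the underflow.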

## Other Topics

- Laplace Correction
- Continuous Values for $\mathbf X$: Gaussian Naive Bayes, etc.
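As a pointer for the first topic, Laplace correction (additive smoothing) is commonly written as adding a pseudo-count $\alpha$ to every count, so that a feature value never seen with a class no longer zeroes out the whole product. A minimal sketch, assuming $\alpha = 1$ and a hypothetical helper name `smoothed_likelihood`:

```python
def smoothed_likelihood(
    count: int, class_total: int, n_values: int, alpha: float = 1.0
) -> float:
    """Estimate P(X_i = v | Y) as (count + alpha) / (class_total + alpha * n_values).

    count:       rows of class Y with X_i == v
    class_total: rows of class Y
    n_values:    number of distinct values X_i can take
    """
    return (count + alpha) / (class_total + alpha * n_values)


# Hypothetical counts: a class with 3 rows, a feature with 3 possible
# values seen 2, 1, and 0 times within that class.
counts = [2, 1, 0]
probs = [smoothed_likelihood(c, class_total=3, n_values=3) for c in counts]
print(probs)  # the unseen value now gets a small nonzero probability
```

The smoothed estimates still sum to one over the possible values of the feature.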

L Ma (2019). 'Naive Bayes', Datumorphism, 06 April. Available at: https://datumorphism.leima.is/wiki/machine-learning/bayesian/naive-bayes/.