Naive Bayes
Suppose our training set consists of data samples \(D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}\), \(\mathbf{x}_i \in \mathbb{R}^d\), where the pairs \((\mathbf{x}_i, y_i)\) are realizations of a random sample drawn from an unknown joint distribution \(P(\mathbf{X}, Y)\).
Assumptions:
Features are conditionally independent given the class (Naive Bayes assumption): \[P(\mathbf{X} | Y) = \prod^{d}_{j=1} P(X_j | Y)\]
MLE assumption: the samples are independent and identically distributed (i.i.d.).
Positional independence: the position of a feature does not matter (used in the multinomial case).
By applying Bayes' rule (stated for a generic distribution \(P(\cdot)\) to keep things general), we have:
\[P(Y | \mathbf{X}) = \frac{P(\mathbf{X}, Y)}{P(\mathbf{X})} = \frac{P(\mathbf{X} | Y) P(Y)}{P(\mathbf{X})}\]
Substituting the conditional independence assumption:
\[P(Y | \mathbf{X}) = \frac{\prod^{d}_{j=1} P(X_j | Y)P(Y)}{P(\mathbf{X})}\]
Since the denominator \(P(\mathbf{X})\) does not depend on \(Y\), it only scales \(P(Y | \mathbf{X})\) by a constant for a given \(\mathbf{x}\), so we can drop it:
\[P(Y | \mathbf{X}) \propto P(Y) \prod^{d}_{j=1} P(X_j | Y)\]
Our goal is to find the class \(\hat{y}\) that maximizes this probability for a given input \(\mathbf{X} = \mathbf{x}\). Since the logarithm is monotonic, maximizing the product is equivalent to maximizing the sum of logs:
\[\begin{aligned} \hat{y} = \underset{y}{\arg\max} \sum^{d}_{j=1} \log P_{X_j|Y}(x_j | y) + \log P_{Y}(y) \end{aligned}\]
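A minimal sketch of this decision rule in NumPy, assuming the class log-priors and a per-class feature log-likelihood function have already been estimated; the names predict, log_prior, and log_likelihood are placeholders introduced here, not from the source.
```python
import numpy as np

def predict(x, log_prior, log_likelihood):
    """Return the class maximizing log P(Y=y) + sum_j log P(X_j = x_j | Y=y).

    log_prior      : array of shape (K,), log pi_y for each class y
    log_likelihood : callable (x, y) -> array of shape (d,), log P(X_j = x_j | Y=y)
    """
    K = len(log_prior)
    scores = np.array([log_prior[y] + log_likelihood(x, y).sum() for y in range(K)])
    return int(np.argmax(scores))
```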
Bernoulli Naive Bayes
In order to solve for \(\hat{y}\), we first need to find \(\log P(X_j | Y)\) and \(\log P(Y)\). Assume that our features are binary, \(X_{j} \in \{0, 1\}\), and that we have \(K\) classes, \(Y \in \{1, \ldots, K\}\). It is then natural to assume that the conditional distribution of a feature \(X_{j}\) given \(Y=y\) follows a Bernoulli distribution with unknown parameter \(\mu_{jy}\). That is, for the \(i\)th input \((\mathbf{x}_i, y_i)\), the pmf of the \(j\)th feature given \(y_i\) can be written as:
\[p_{X_j | Y} (x_{ij} | y_i; \mu_{j y_{i}}) = \mu_{j y_{i}}^{x_{ij}} (1 - \mu_{j y_{i}})^{1 - x_{ij}}\]
Besides the parameters \(\mu_{jy_i}\), we have the parameters \(\pi_{y_i}\), defined as the prior probability of class \(y_i\) (these appear in the likelihood because they are part of the joint distribution):
\[\pi_{y_i} = P(Y=y_i)\]
Then, writing \(\boldsymbol{\theta} = (\boldsymbol{\mu}, \boldsymbol{\pi})\), the likelihood function is:
\[L(\boldsymbol{\theta}; D) = \prod^{N}_{i=1} P(\mathbf{x}_i, y_i; \; \boldsymbol{\theta}) = \prod^{N}_{i=1} \pi_{y_i} \prod^{d}_{j=1} \mu_{j y_{i}}^{x_{ij}} (1 - \mu_{j y_{i}})^{1 - x_{ij}}\]
subject to constraint \(\sum^{K}_{k=1} \pi_{k} = 1\)
Then the log likelihood function:
\[l(\boldsymbol{\theta}; D) = \sum^{N}_{i=1} \left[\log (\pi_{y_i}) + \sum^{d}_{j=1} \left( x_{ij} \log (\mu_{j y_{i}}) + (1 - x_{ij})\log (1 - \mu_{j y_{i}}) \right) \right]\]
subject to constraint \(\sum^{K}_{k=1} \pi_{k} = 1\)
Taking the partial derivative w.r.t. \(\mu_{jk}\) and setting it to zero:
\[\begin{aligned} \frac{\partial l(\boldsymbol{\theta}; D)}{\partial \mu_{jk}} &= 0\\ \implies \sum^{N}_{i=1} I[y_i = k] \left(\frac{x_{ij}}{\mu_{jk}} - \frac{1 - x_{ij}}{1 - \mu_{jk}}\right) &= 0\\ \implies \sum^{N}_{i=1} I[y_i = k] \frac{x_{ij}}{\mu_{jk}} &= \sum^{N}_{i=1} I[y_i = k] \frac{1 - x_{ij}}{1 - \mu_{jk}}\\ \implies \sum^{N}_{i=1} I[y_i = k] (1 - \mu_{jk}) x_{ij} &= \sum^{N}_{i=1} I[y_i = k] (1 - x_{ij}) \mu_{jk}\\ \implies \hat{\mu}_{jk} &= \frac{\sum^{N}_{i=1} I[x_{ij} = 1 \text{ and } y_i = k]}{\sum^{N}_{i=1} I[y_i = k]} \end{aligned}\]
Taking the partial derivative w.r.t. \(\pi_{k}\) subject to the constraint \(\sum^{K}_{c=1} \pi_{c} = 1\) (using a Lagrange multiplier):
\[\frac{\partial l(\boldsymbol{\theta}; D)}{\partial \pi_{k}} + \lambda \frac{\partial}{\partial \pi_{k}} \left( \sum^{K}_{c=1} \pi_{c} - 1 \right) = 0\]
\[\implies \lambda = - \sum^{N}_{i=1} \frac{I[y_i = k]}{\pi_k}\]
\[\implies \pi_k = - \sum^{N}_{i=1} \frac{I[y_i = k]}{\lambda}\]
Summing this expression over \(k\) and using the constraint \(\sum^{K}_{k=1} \pi_k = 1\) gives \(\lambda = -N\), so:
\[\hat{\pi}_k = \sum^{N}_{i=1} \frac{I[y_i = k]}{N}\]
The result is the same if we instead assume \(Y\) follows a multinomial distribution with \(n = 1\) and the usual constraint on its parameter vector \(p\).
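A minimal NumPy sketch of these closed-form estimates, assuming a binary feature matrix X of shape (N, d) and zero-based integer labels y in {0, ..., K-1}; the function name fit_bernoulli_nb is introduced here for illustration.
```python
import numpy as np

def fit_bernoulli_nb(X, y, K):
    """MLE for Bernoulli Naive Bayes.

    mu_hat[j, k] = #{i : x_ij = 1 and y_i = k} / #{i : y_i = k}
    pi_hat[k]    = #{i : y_i = k} / N
    Assumes every class 0..K-1 appears at least once in y.
    """
    N, d = X.shape
    mu_hat = np.zeros((d, K))
    pi_hat = np.zeros(K)
    for k in range(K):
        mask = (y == k)
        n_k = mask.sum()
        mu_hat[:, k] = X[mask].sum(axis=0) / n_k
        pi_hat[k] = n_k / N
    return mu_hat, pi_hat
```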
Multinomial Naive Bayes
In the multinomial case, our features are counts, \(X_j \in \{0, \ldots, M\}\).
The derivation for Multinomial NB is similar to the Bernoulli case; the only difference is that we now assume the conditional distribution of \(\mathbf{X}\) given \(Y=y\) is multinomial. That is, the conditional pmf for the \(i\)th input \((\mathbf{x}_i, y_i)\) is defined as:
\[p_{\mathbf{X} | Y} (\mathbf{x}_i | y_i) = \frac{M_i!}{x_{i1}! \cdots x_{id}!} \prod^{d}_{j=1} \mu^{x_{ij}}_{jy_i}\]
where \(M_{i}\) denotes the total count for sample \(i\).
By positional independence, the multinomial coefficient can be dropped (it does not depend on the parameters, so it does not affect the maximization):
\[p_{\mathbf{X} | Y} (\mathbf{x}_i | y_i) = \prod^{d}_{j=1} \mu^{x_{ij}}_{jy_i}\]
Then, we can write the likelihood function as:
\[L(\boldsymbol{\theta}; D) = \prod^{N}_{i=1} P(\mathbf{x}_i, y_i; \; \boldsymbol{\theta}) = \prod^{N}_{i=1} \pi_{y_i} \prod^{d}_{j=1} \mu^{x_{ij}}_{jy_i}\]
subject to constraints:
\[\sum^{K}_{k=1} \pi_k = 1\]
\[\sum^{d}_{j=1} \mu_{jk} = 1\]
\[\sum^{d}_{j=1} x_{ij} = M_i\]
By solving the constrained maximization problem, and adding Laplace (additive) smoothing with parameter \(l\) to avoid zero probabilities, we have the estimates:
\[\hat{\mu}_{jk} = \frac{\sum^{N}_{i=1} I[y_i = k]x_{ij} + l}{\sum^{N}_{i=1} I[y_i = k] \sum^{d}_{\beta=1} x_{i\beta} + ld}\]
\[\hat{\pi}_k = \sum^{N}_{i=1} \frac{I[y_i = k]}{N}\]
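A minimal NumPy sketch of these smoothed estimates, assuming a count matrix X of shape (N, d), zero-based labels y, and smoothing parameter l; the function name fit_multinomial_nb is introduced here for illustration.
```python
import numpy as np

def fit_multinomial_nb(X, y, K, l=1.0):
    """Smoothed estimates for Multinomial Naive Bayes."""
    N, d = X.shape
    mu_hat = np.zeros((d, K))
    pi_hat = np.zeros(K)
    for k in range(K):
        mask = (y == k)
        counts = X[mask].sum(axis=0)           # sum_i I[y_i = k] x_ij, for each feature j
        mu_hat[:, k] = (counts + l) / (counts.sum() + l * d)
        pi_hat[k] = mask.sum() / N
    return mu_hat, pi_hat
```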
Categorical Naive Bayes
In some cases, the features \(X_j \in \{1, \ldots, K_{j}\}\) are nominal: their numerical values carry no ordinal meaning. In this case, we can model the pmf of the \(j\)th feature of the \(i\)th example given \(Y=y_i\) with a categorical pmf:
\[p_{X_j|Y} (x_{ij} | y_i) = \prod^{K_{j}}_{k=1} \mu_{jy_i k}^{I[x_{ij} = k]}\]
The likelihood function is therefore:
\[L(\boldsymbol{\theta}; D) = \prod^{N}_{i=1} P(\mathbf{x}_i, y_i; \; \boldsymbol{\theta}) = \prod^{N}_{i=1} \pi_{y_i} \prod^{d}_{j=1}\prod^{K_{j}}_{k=1} \mu_{jy_i k}^{I[x_{ij} = k]}\]
subject to constraints:
\[\sum^{K}_{k=1} \pi_k = 1\]
\[\sum^{K_{j}}_{k=1} \mu_{jck} = 1 \quad \text{for each feature } j \text{ and class } c\]
The estimates (again with smoothing parameter \(l\)) are:
\[\hat{\mu}_{jck} = \frac{\sum^{N}_{i=1} I[y_i = c] I[x_{ij} = k] + l}{\sum^{N}_{i=1} I[y_i = c] + lK_{j}}\]
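A minimal NumPy sketch of this estimate for a single nominal feature column xj taking values in {0, ..., K_j - 1} (zero-based), with labels y and smoothing l; the function name fit_categorical_feature is introduced here for illustration.
```python
import numpy as np

def fit_categorical_feature(xj, y, K, K_j, l=1.0):
    """Smoothed estimates mu_hat[c, k] = P(X_j = k | Y = c) for one nominal feature."""
    mu_hat = np.zeros((K, K_j))
    for c in range(K):
        mask = (y == c)
        for k in range(K_j):
            mu_hat[c, k] = ((xj[mask] == k).sum() + l) / (mask.sum() + l * K_j)
    return mu_hat
```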
Gaussian Naive Bayes
Suppose the features are continuous, \(X_j \in \mathbb{R}\). Then we can use a Gaussian distribution to model the conditional distribution of each feature:
\[X_{j} \mid Y = y_i \sim \mathcal{N}(\mu_{jy_i}, \sigma^2_{jy_i}), \qquad f_{X_{j} | Y} (x_{ij} \mid y_i ;\; \mu_{jy_i}, \sigma^2_{jy_i}) = \frac{1}{\sqrt{2\pi \sigma^2_{jy_i}}} \exp\!\left( -\frac{(x_{ij} - \mu_{jy_i})^2}{2\sigma^2_{jy_i}} \right)\]
Then, following a similar procedure, we obtain the ML estimates:
\[\hat{\mu}_{jk} = \frac{1}{n_k} \sum^{N}_{i=1} I[y_i = k] x_{ij}\]
\[\hat{\sigma}^2_{jk} = \frac{1}{n_k} \sum^{N}_{i=1} I[y_i = k] (x_{ij} - \hat{\mu}_{jk})^2\]
where \(n_k = \sum^{N}_{i=1} I[y_i = k]\).
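A minimal NumPy sketch of these per-class estimates, assuming a real-valued matrix X of shape (N, d) and zero-based labels y; the function name fit_gaussian_nb is introduced here for illustration.
```python
import numpy as np

def fit_gaussian_nb(X, y, K):
    """Per-class feature means, (biased, MLE) variances, and class priors."""
    N, d = X.shape
    mu_hat = np.zeros((K, d))
    var_hat = np.zeros((K, d))
    pi_hat = np.zeros(K)
    for k in range(K):
        Xk = X[y == k]
        mu_hat[k] = Xk.mean(axis=0)
        var_hat[k] = Xk.var(axis=0)            # divides by n_k, matching the ML estimate
        pi_hat[k] = len(Xk) / N
    return mu_hat, var_hat, pi_hat
```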
Naive Bayes as Linear Classifier
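As a brief sketch of why this holds (the weight and bias symbols \(\mathbf{w}_y, b_y\) are introduced here for illustration): in the Bernoulli case, substituting the Bernoulli pmf into the log-posterior decision rule gives an expression that is affine in \(\mathbf{x}\),
\[\begin{aligned} \log \pi_y + \sum^{d}_{j=1} \left( x_j \log \mu_{jy} + (1 - x_j) \log (1 - \mu_{jy}) \right) &= \sum^{d}_{j=1} x_j \log \frac{\mu_{jy}}{1 - \mu_{jy}} + \log \pi_y + \sum^{d}_{j=1} \log (1 - \mu_{jy}) \\ &= \mathbf{w}_y^{\top} \mathbf{x} + b_y, \qquad w_{yj} = \log \frac{\mu_{jy}}{1 - \mu_{jy}}, \quad b_y = \log \pi_y + \sum^{d}_{j=1} \log (1 - \mu_{jy}) \end{aligned}\]
so the decision boundary between any two classes is linear in \(\mathbf{x}\).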
Implementation
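A hedged end-to-end sketch for the Bernoulli case, combining the closed-form estimates with the log-space decision rule derived above; the class and method names are my own, not from the source, and the tiny smoothing constant is an assumption added only to keep \(\log 0\) out of the prediction step.
```python
import numpy as np

class BernoulliNaiveBayes:
    """Bernoulli Naive Bayes using the closed-form ML estimates derived above."""

    def __init__(self, eps=1e-9):
        self.eps = eps  # keeps mu away from exactly 0 or 1 so logs stay finite

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        N, d = X.shape
        K = len(self.classes_)
        self.mu_ = np.zeros((K, d))       # mu_[k, j] = P(X_j = 1 | Y = class k)
        self.log_pi_ = np.zeros(K)        # log prior of each class
        for idx, k in enumerate(self.classes_):
            Xk = X[y == k]
            self.mu_[idx] = np.clip(Xk.mean(axis=0), self.eps, 1 - self.eps)
            self.log_pi_[idx] = np.log(len(Xk) / N)
        return self

    def predict(self, X):
        # Unnormalized log-posterior:
        # log pi_y + sum_j [ x_j log mu_jy + (1 - x_j) log(1 - mu_jy) ]
        scores = X @ np.log(self.mu_).T + (1 - X) @ np.log(1 - self.mu_).T + self.log_pi_
        return self.classes_[np.argmax(scores, axis=1)]

# Usage on a toy binary dataset
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y = np.array([0, 0, 1, 1])
print(BernoulliNaiveBayes().fit(X, y).predict(X))
```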