Naive Bayes
Suppose our training set consists of data samples \(D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}\), \(\mathbf{x}_i \in \mathbb{R}^d\), where the pairs \((\mathbf{x}_i, y_i)\) are realizations of a random sample drawn from an unknown joint distribution \(P(\mathbf{X}, Y)\).
Assumptions:
Features are conditionally independent given the class (Naive Bayes assumption): \[P(\mathbf{X} | Y) = \prod^{d}_{j=1} P(X_j | Y)\]
MLE assumption: the samples are independent and identically distributed (i.i.d.).
Positional independence: the position of a feature does not matter (used in the multinomial case).
By applying Bayes' rule (stated for a generic distribution \(P(\cdot)\) to keep things general), we have:
\[P(Y | \mathbf{X}) = \frac{P(\mathbf{X}, Y)}{P(\mathbf{X})} = \frac{P(\mathbf{X} | Y) P(Y)}{P(\mathbf{X})}\]
Substituting the conditional independence assumption:
\[P(Y | \mathbf{X}) = \frac{\prod^{d}_{j=1} P(X_j | Y)P(Y)}{P(\mathbf{X})}\]
Since the denominator \(P(\mathbf{X})\) does not depend on \(Y\), it only scales \(P(Y | \mathbf{X})\) by a constant for a given \(\mathbf{x}\), so we can drop it:
\[P(Y | \mathbf{X}) \propto P(Y) \prod^{d}_{j=1} P(X_j | Y)\]
Our goal is to find the class \(\hat{y}\) that maximizes this probability for a given input \(\mathbf{X} = \mathbf{x}\). Since the logarithm is monotonic, maximizing the product is equivalent to maximizing the sum of logs:
\[\begin{aligned} \hat{y} = \underset{y}{\arg\max} \sum^{d}_{j=1} \log P_{X_j|Y}(x_j | y) + \log P_{Y}(y) \end{aligned}\]
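A minimal sketch of this decision rule in NumPy, assuming the class log-priors and a per-class feature log-likelihood function have already been estimated; the names predict, log_prior, and log_likelihood are placeholders introduced here, not from the source.
```python
import numpy as np

def predict(x, log_prior, log_likelihood):
    """Return the class maximizing log P(Y=y) + sum_j log P(X_j = x_j | Y=y).

    log_prior      : array of shape (K,), log pi_y for each class y
    log_likelihood : callable (x, y) -> array of shape (d,), log P(X_j = x_j | Y=y)
    """
    K = len(log_prior)
    scores = np.array([log_prior[y] + log_likelihood(x, y).sum() for y in range(K)])
    return int(np.argmax(scores))
```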
Bernoulli Naive Bayes
In order to solve for \(\hat{y}\), we first need to find \(\log P(X_j | Y)\) and \(\log P(Y)\). Assume that our features are binary, \(X_{j} \in \{0, 1\}\), and that we have \(K\) classes, \(Y \in \{1, \ldots, K\}\). It is then natural to assume that the conditional distribution of a feature \(X_{j}\) given \(Y=y\) follows a Bernoulli distribution with unknown parameter \(\mu_{jy}\). That is, for the \(i\)th input \((\mathbf{x}_i, y_i)\), the pmf of the \(j\)th feature given \(y_i\) can be written as:
\[p_{X_j | Y} (x_{ij} | y_i; \mu_{j y_{i}}) = \mu_{j y_{i}}^{x_{ij}} (1 - \mu_{j y_{i}})^{1 - x_{ij}}\]
Besides the parameters \(\mu_{jy_i}\), we have the parameters \(\pi_{y_i}\), defined as the prior probability of class \(y_i\) (these appear in the likelihood because they are part of the joint distribution):
\[\pi_{y_i} = P(Y=y_i)\]
Then, writing \(\boldsymbol{\theta} = (\boldsymbol{\mu}, \boldsymbol{\pi})\), the likelihood function is:
\[L(\boldsymbol{\theta}; D) = \prod^{N}_{i=1} P(\mathbf{x}_i, y_i; \; \boldsymbol{\theta}) = \prod^{N}_{i=1} \pi_{y_i} \prod^{d}_{j=1} \mu_{j y_{i}}^{x_{ij}} (1 - \mu_{j y_{i}})^{1 - x_{ij}}\]
subject to constraint \(\sum^{K}_{k=1} \pi_{k} = 1\)
Then the log likelihood function:
\[l(\boldsymbol{\theta}; D) = \sum^{N}_{i=1} \left[\log (\pi_{y_i}) + \sum^{d}_{j=1} \left( x_{ij} \log (\mu_{j y_{i}}) + (1 - x_{ij})\log (1 - \mu_{j y_{i}}) \right) \right]\]
subject to constraint \(\sum^{K}_{k=1} \pi_{k} = 1\)
Taking the partial derivative w.r.t. \(\mu_{jk}\) and setting it to zero:
\[\begin{aligned} \frac{\partial l(\boldsymbol{\theta}; D)}{\partial \mu_{jk}} &= 0\\ \implies \sum^{N}_{i=1} I[y_i = k] \left(\frac{x_{ij}}{\mu_{jk}} - \frac{1 - x_{ij}}{1 - \mu_{jk}}\right) &= 0\\ \implies \sum^{N}_{i=1} I[y_i = k] \frac{x_{ij}}{\mu_{jk}} &= \sum^{N}_{i=1} I[y_i = k] \frac{1 - x_{ij}}{1 - \mu_{jk}}\\ \implies \sum^{N}_{i=1} I[y_i = k] (1 - \mu_{jk}) x_{ij} &= \sum^{N}_{i=1} I[y_i = k] (1 - x_{ij}) \mu_{jk}\\ \implies \hat{\mu}_{jk} &= \frac{\sum^{N}_{i=1} I[x_{ij} = 1 \text{ and } y_i = k]}{\sum^{N}_{i=1} I[y_i = k]} \end{aligned}\]
Taking the partial derivative w.r.t. \(\pi_{k}\) subject to the constraint \(\sum^{K}_{c=1} \pi_{c} = 1\) (using a Lagrange multiplier):
\[\frac{\partial l(\boldsymbol{\theta}; D)}{\partial \pi_{k}} + \lambda \frac{\partial}{\partial \pi_{k}} \left( \sum^{K}_{c=1} \pi_{c} - 1 \right) = 0\]
\[\implies \lambda = - \sum^{N}_{i=1} \frac{I[y_i = k]}{\pi_k}\]
\[\implies \pi_k = - \sum^{N}_{i=1} \frac{I[y_i = k]}{\lambda}\]
Summing this expression over \(k\) and using the constraint \(\sum^{K}_{k=1} \pi_k = 1\) gives \(\lambda = -N\), so:
\[\hat{\pi}_k = \sum^{N}_{i=1} \frac{I[y_i = k]}{N}\]
The result is the same if we instead assume \(Y\) follows a multinomial distribution with \(n = 1\) and the usual constraint on its parameter vector \(p\).
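A minimal NumPy sketch of these closed-form estimates, assuming a binary feature matrix X of shape (N, d) and zero-based integer labels y in {0, ..., K-1}; the function name fit_bernoulli_nb is introduced here for illustration.
```python
import numpy as np

def fit_bernoulli_nb(X, y, K):
    """MLE for Bernoulli Naive Bayes.

    mu_hat[j, k] = #{i : x_ij = 1 and y_i = k} / #{i : y_i = k}
    pi_hat[k]    = #{i : y_i = k} / N
    Assumes every class 0..K-1 appears at least once in y.
    """
    N, d = X.shape
    mu_hat = np.zeros((d, K))
    pi_hat = np.zeros(K)
    for k in range(K):
        mask = (y == k)
        n_k = mask.sum()
        mu_hat[:, k] = X[mask].sum(axis=0) / n_k
        pi_hat[k] = n_k / N
    return mu_hat, pi_hat
```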
Multinomial Naive Bayes
In the multinomial case, our features are counts, \(X_j \in \{0, \ldots, M\}\).
The derivation for Multinomial NB is similar to the Bernoulli case; the only difference is that we now assume the conditional distribution of \(\mathbf{X}\) given \(Y=y\) is multinomial. That is, the conditional pmf for the \(i\)th input \((\mathbf{x}_i, y_i)\) is defined as:
\[p_{\mathbf{X} | Y} (\mathbf{x}_i | y_i) = \frac{M_i!}{x_{i1}! \cdots x_{id}!} \prod^{d}_{j=1} \mu^{x_{ij}}_{jy_i}\]
where \(M_{i}\) denotes the total count for sample \(i\).
By positional independence, the multinomial coefficient can be dropped (it does not depend on the parameters, so it does not affect the maximization):
\[p_{\mathbf{X} | Y} (\mathbf{x}_i | y_i) = \prod^{d}_{j=1} \mu^{x_{ij}}_{jy_i}\]
Then, we can write the likelihood function as:
\[L(\boldsymbol{\theta}; D) = \prod^{N}_{i=1} P(\mathbf{x}_i, y_i; \; \boldsymbol{\theta}) = \prod^{N}_{i=1} \pi_{y_i} \prod^{d}_{j=1} \mu^{x_{ij}}_{jy_i}\]
subject to constraints:
\[\sum^{K}_{k=1} \pi_k = 1\]
\[\sum^{d}_{j=1} \mu_{jk} = 1\]
\[\sum^{d}_{j=1} x_{ij} = M_i\]
By solving the constrained maximization problem, and adding Laplace (additive) smoothing with parameter \(l\) to avoid zero probabilities, we have the estimates:
\[\hat{\mu}_{jk} = \frac{\sum^{N}_{i=1} I[y_i = k]x_{ij} + l}{\sum^{N}_{i=1} I[y_i = k] \sum^{d}_{\beta=1} x_{i\beta} + ld}\]
\[\hat{\pi}_k = \sum^{N}_{i=1} \frac{I[y_i = k]}{N}\]
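A minimal NumPy sketch of these smoothed estimates, assuming a count matrix X of shape (N, d), zero-based labels y, and smoothing parameter l; the function name fit_multinomial_nb is introduced here for illustration.
```python
import numpy as np

def fit_multinomial_nb(X, y, K, l=1.0):
    """Smoothed estimates for Multinomial Naive Bayes."""
    N, d = X.shape
    mu_hat = np.zeros((d, K))
    pi_hat = np.zeros(K)
    for k in range(K):
        mask = (y == k)
        counts = X[mask].sum(axis=0)           # sum_i I[y_i = k] x_ij, for each feature j
        mu_hat[:, k] = (counts + l) / (counts.sum() + l * d)
        pi_hat[k] = mask.sum() / N
    return mu_hat, pi_hat
```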
Categorical Naive Bayes
In some cases, the features \(X_j \in \{1, \ldots, K_{j}\}\) are nominal: their numerical values carry no ordinal meaning. In this case, we can model the pmf of the \(j\)th feature of the \(i\)th example given \(Y=y_i\) with a categorical pmf:
\[p_{X_j|Y} (x_{ij} | y_i) = \prod^{K_{j}}_{k=1} \mu_{jy_i k}^{I[x_{ij} = k]}\]
The likelihood function is therefore:
\[L(\boldsymbol{\theta}; D) = \prod^{N}_{i=1} P(\mathbf{x}_i, y_i; \; \boldsymbol{\theta}) = \prod^{N}_{i=1} \pi_{y_i} \prod^{d}_{j=1}\prod^{K_{j}}_{k=1} \mu_{jy_i k}^{I[x_{ij} = k]}\]
subject to constraints:
\[\sum^{K}_{k=1} \pi_k = 1\]
\[\sum^{K_{j}}_{k=1} \mu_{jck} = 1 \quad \text{for each feature } j \text{ and class } c\]
The estimates (again with smoothing parameter \(l\)) are:
\[\hat{\mu}_{jck} = \frac{\sum^{N}_{i=1} I[y_i = c] I[x_{ij} = k] + l}{\sum^{N}_{i=1} I[y_i = c] + lK_{j}}\]
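A minimal NumPy sketch of this estimate for a single nominal feature column xj taking values in {0, ..., K_j - 1} (zero-based), with labels y and smoothing l; the function name fit_categorical_feature is introduced here for illustration.
```python
import numpy as np

def fit_categorical_feature(xj, y, K, K_j, l=1.0):
    """Smoothed estimates mu_hat[c, k] = P(X_j = k | Y = c) for one nominal feature."""
    mu_hat = np.zeros((K, K_j))
    for c in range(K):
        mask = (y == c)
        for k in range(K_j):
            mu_hat[c, k] = ((xj[mask] == k).sum() + l) / (mask.sum() + l * K_j)
    return mu_hat
```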
Gaussian Naive Bayes
Suppose the features are continuous, \(X_j \in \mathbb{R}\). Then we can use a Gaussian distribution to model the conditional distribution of each feature:
\[X_{j} \mid Y = y_i \sim \mathcal{N}(\mu_{jy_i}, \sigma^2_{jy_i}), \qquad f_{X_{j} | Y} (x_{ij} \mid y_i ;\; \mu_{jy_i}, \sigma^2_{jy_i}) = \frac{1}{\sqrt{2\pi \sigma^2_{jy_i}}} \exp\!\left( -\frac{(x_{ij} - \mu_{jy_i})^2}{2\sigma^2_{jy_i}} \right)\]
Then, following a similar procedure, we obtain the ML estimates:
\[\hat{\mu}_{jk} = \frac{1}{n_k} \sum^{N}_{i=1} I[y_i = k] x_{ij}\]
\[\hat{\sigma}^2_{jk} = \frac{1}{n_k} \sum^{N}_{i=1} I[y_i = k] (x_{ij} - \hat{\mu}_{jk})^2\]
where \(n_k = \sum^{N}_{i=1} I[y_i = k]\).
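A minimal NumPy sketch of these per-class estimates, assuming a real-valued matrix X of shape (N, d) and zero-based labels y; the function name fit_gaussian_nb is introduced here for illustration.
```python
import numpy as np

def fit_gaussian_nb(X, y, K):
    """Per-class feature means, (biased, MLE) variances, and class priors."""
    N, d = X.shape
    mu_hat = np.zeros((K, d))
    var_hat = np.zeros((K, d))
    pi_hat = np.zeros(K)
    for k in range(K):
        Xk = X[y == k]
        mu_hat[k] = Xk.mean(axis=0)
        var_hat[k] = Xk.var(axis=0)            # divides by n_k, matching the ML estimate
        pi_hat[k] = len(Xk) / N
    return mu_hat, var_hat, pi_hat
```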
Naive Bayes as Linear Classifier
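As a brief sketch of why this holds (the weight and bias symbols \(\mathbf{w}_y, b_y\) are introduced here for illustration): in the Bernoulli case, substituting the Bernoulli pmf into the log-posterior decision rule gives an expression that is affine in \(\mathbf{x}\),
\[\begin{aligned} \log \pi_y + \sum^{d}_{j=1} \left( x_j \log \mu_{jy} + (1 - x_j) \log (1 - \mu_{jy}) \right) &= \sum^{d}_{j=1} x_j \log \frac{\mu_{jy}}{1 - \mu_{jy}} + \log \pi_y + \sum^{d}_{j=1} \log (1 - \mu_{jy}) \\ &= \mathbf{w}_y^{\top} \mathbf{x} + b_y, \qquad w_{yj} = \log \frac{\mu_{jy}}{1 - \mu_{jy}}, \quad b_y = \log \pi_y + \sum^{d}_{j=1} \log (1 - \mu_{jy}) \end{aligned}\]
so the decision boundary between any two classes is linear in \(\mathbf{x}\).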
Implementation
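A hedged end-to-end sketch for the Bernoulli case, combining the closed-form estimates with the log-space decision rule derived above; the class and method names are my own, not from the source, and the tiny smoothing constant is an assumption added only to keep \(\log 0\) out of the prediction step.
```python
import numpy as np

class BernoulliNaiveBayes:
    """Bernoulli Naive Bayes using the closed-form ML estimates derived above."""

    def __init__(self, eps=1e-9):
        self.eps = eps  # keeps mu away from exactly 0 or 1 so logs stay finite

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        N, d = X.shape
        K = len(self.classes_)
        self.mu_ = np.zeros((K, d))       # mu_[k, j] = P(X_j = 1 | Y = class k)
        self.log_pi_ = np.zeros(K)        # log prior of each class
        for idx, k in enumerate(self.classes_):
            Xk = X[y == k]
            self.mu_[idx] = np.clip(Xk.mean(axis=0), self.eps, 1 - self.eps)
            self.log_pi_[idx] = np.log(len(Xk) / N)
        return self

    def predict(self, X):
        # Unnormalized log-posterior:
        # log pi_y + sum_j [ x_j log mu_jy + (1 - x_j) log(1 - mu_jy) ]
        scores = X @ np.log(self.mu_).T + (1 - X) @ np.log(1 - self.mu_).T + self.log_pi_
        return self.classes_[np.argmax(scores, axis=1)]

# Usage on a toy binary dataset
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y = np.array([0, 0, 1, 1])
print(BernoulliNaiveBayes().fit(X, y).predict(X))
```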