Naive Bayes

Suppose our training set consists of data samples $D=\{(x_1,y_1),\dots,(x_N,y_N)\}$, $x_i\in\mathbb{R}^d$, which are realizations of a random sample that follows an unknown joint distribution $P(X,Y)$.

Assumptions:

  1. Features are conditionally independent (Naive Bayes assumption): $P(X|Y)=\prod_{j=1}^{d}P(X_j|Y)$; a small concrete instance is shown after this list.

  2. MLE assumption: the samples are realizations of an i.i.d. (independent and identically distributed) random sample.

  3. Positional independence: the position of a feature does not matter (used in the Multinomial case).
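
For instance, with $d=2$ binary features, the first assumption says the class-conditional joint probability factorizes as

$$P(X_1=1,X_2=0\mid Y=y)=P(X_1=1\mid Y=y)\,P(X_2=0\mid Y=y).$$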

By applying Bayes' rule (stated on the distribution $P(\cdot)$ to keep things general), we have:

$$P(Y|X)=\frac{P(X,Y)}{P(X)}=\frac{P(X|Y)\,P(Y)}{P(X)}$$

By substituting the assumption:

$$P(Y|X)=\frac{P(Y)\prod_{j=1}^{d}P(X_j|Y)}{P(X)}$$

Since $P(X)$ (characterized by $F_X(x)$) is constant for any given $x$, we can drop it from the equation: it only scales $P(Y|X)$ by a constant factor and does not affect which class maximizes it:

$$P(Y|X)\propto P(Y)\prod_{j=1}^{d}P(X_j|Y)$$

Our goal is to find the class $\hat{y}$ that maximizes this probability given the input $X=x$ (equivalently, its logarithm, since $\log$ is monotone):

$$\hat{y}=\arg\max_{y}\ \sum_{j=1}^{d}\log P_{X_j|Y}(x_j|y)+\log P_Y(y)$$
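
As a minimal numpy sketch of this decision rule, assuming hypothetical pre-computed arrays `log_prior` (with `log_prior[k]` $=\log P_Y(k)$) and `log_lik` (with `log_lik[k, j]` $=\log P_{X_j|Y}(x_j|k)$ evaluated at a test point); the numbers below are illustrative only:

import numpy as np

# Hypothetical pre-computed quantities for K = 3 classes and d = 4 features
log_prior = np.log(np.array([0.5, 0.3, 0.2]))
log_lik = np.log(np.random.default_rng(0).uniform(0.1, 0.9, size=(3, 4)))

# Per-class score = log prior + sum over features of log-likelihoods; predict the argmax
scores = log_prior + log_lik.sum(axis=1)
y_hat = int(np.argmax(scores))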

Bernoulli Naive Bayes

In order to solve for $\hat{y}$, we first need to find $\log P(X_j|Y)$ and $\log P(Y)$. Assume that our features are discrete and take values $X_j\in\{0,1\}$, and that we have $K$ classes, $Y\in\{1,\dots,K\}$. Then it is natural to also assume that the conditional distribution of one feature $X_j$ given $Y=y$ follows a Bernoulli distribution with unknown parameter $\mu_{jy}$. That is, for the $i$th input $(x_i,y_i)$, the pmf of the $j$th feature given $y_i$ can be written as:

$$p_{X_j|Y}(x_{ij}\mid y_i;\mu_{j y_i})=\mu_{j y_i}^{x_{ij}}(1-\mu_{j y_i})^{1-x_{ij}}$$

Besides the parameters $\mu_{j y_i}$, we have the parameters $\pi_{y_i}$, defined as the prior probability of class $y_i$ (this is valid because the prior is part of the joint distribution):

$$\pi_{y_i}=P(Y=y_i)$$

Then, writing the parameters as $\theta=\langle\mu,\pi\rangle$, the likelihood function is:

$$L(\theta;D)=\prod_{i=1}^{N}P(x_i,y_i;\theta)=\prod_{i=1}^{N}\pi_{y_i}\prod_{j=1}^{d}\mu_{j y_i}^{x_{ij}}(1-\mu_{j y_i})^{1-x_{ij}}$$

subject to the constraint $\sum_{k=1}^{K}\pi_k=1$.

Then the log-likelihood function is:

$$\ell(\theta;D)=\sum_{i=1}^{N}\Big[\log(\pi_{y_i})+\sum_{j=1}^{d}x_{ij}\log(\mu_{j y_i})+(1-x_{ij})\log(1-\mu_{j y_i})\Big]$$

subject to the constraint $\sum_{k=1}^{K}\pi_k=1$.
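
As a small sketch, this log-likelihood can be evaluated directly with numpy for candidate parameters; the matrix `X`, labels `y`, and parameters `mu` (indexed as `mu[k, j]` $=\mu_{jk}$) and `pi` below are toy values, not fitted ones:

import numpy as np

X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]])              # toy binary features
y = np.array([0, 1, 1])
mu = np.array([[0.7, 0.2, 0.6],
               [0.3, 0.6, 0.5]])       # candidate mu_{jk}, rows indexed by class
pi = np.array([0.4, 0.6])              # candidate class priors

# l(theta; D) = sum_i [ log pi_{y_i} + sum_j x_ij log mu_{j y_i} + (1 - x_ij) log(1 - mu_{j y_i}) ]
ll = np.sum(np.log(pi[y])
            + np.sum(X * np.log(mu[y]) + (1 - X) * np.log(1 - mu[y]), axis=1))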

Taking the partial derivative w.r.t. $\mu_{jk}$:

$$\frac{\partial \ell(\theta;D)}{\partial \mu_{jk}}=0\;\Rightarrow\;\sum_{i=1}^{N}\mathbb{I}[y_i=k]\left(\frac{x_{ij}}{\mu_{jk}}-\frac{1-x_{ij}}{1-\mu_{jk}}\right)=0$$

$$\Rightarrow\;\sum_{i=1}^{N}\mathbb{I}[y_i=k]\frac{x_{ij}}{\mu_{jk}}=\sum_{i=1}^{N}\mathbb{I}[y_i=k]\frac{1-x_{ij}}{1-\mu_{jk}}$$

$$\Rightarrow\;\sum_{i=1}^{N}\mathbb{I}[y_i=k](1-\mu_{jk})\,x_{ij}=\sum_{i=1}^{N}\mathbb{I}[y_i=k](1-x_{ij})\,\mu_{jk}$$

$$\Rightarrow\;\hat{\mu}_{jk}=\frac{\sum_{i=1}^{N}\mathbb{I}[x_{ij}=1 \text{ and } y_i=k]}{\sum_{i=1}^{N}\mathbb{I}[y_i=k]}$$
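
In code, this estimator is just counting; a sketch assuming a binary feature matrix `X` of shape `(N, d)` and a label vector `y` (both toy values here):

import numpy as np

X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [1, 1, 1]])              # toy binary features, N = 6, d = 3
y = np.array([0, 0, 0, 1, 1, 1])

k = 1                                   # class of interest
mask = (y == k)
# mu_hat[j] = (# samples in class k with x_ij = 1) / (# samples in class k)
mu_hat = X[mask].sum(axis=0) / mask.sum()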

Taking the partial derivative w.r.t. $\pi_k$ under the constraint $\sum_{k=1}^{K}\pi_k=1$ (via a Lagrange multiplier $\lambda$):

$$\frac{\partial}{\partial \pi_k}\Big[\ell(\theta;D)+\lambda\Big(1-\sum_{i=1}^{K}\pi_i\Big)\Big]=0$$

$$\lambda=\frac{\sum_{i=1}^{N}\mathbb{I}[y_i=k]}{\pi_k}$$

$$\pi_k=\frac{\sum_{i=1}^{N}\mathbb{I}[y_i=k]}{\lambda}$$

Summing the last equation over $k$ and using $\sum_{k=1}^{K}\pi_k=1$ gives $\lambda=N$, so:

$$\hat{\pi}_k=\frac{\sum_{i=1}^{N}\mathbb{I}[y_i=k]}{N}$$
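
A minimal numpy sketch of this prior estimate (the toy label vector `y` below is illustrative only):

import numpy as np

y = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])         # toy class labels
classes, counts = np.unique(y, return_counts=True)
pi_hat = counts / y.size                              # pi_hat[k] = (# samples with y_i = k) / N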

The same result is obtained if we instead assume $Y$ follows a multinomial distribution with $n=1$ (i.e., a categorical distribution) and the usual constraint on its probability vector.

Multinomial Naive Bayes

In the multinomial case, our features are counts, $X_j\in\{0,\dots,M\}$.

The derivation of Multinomial NB is similar to the Bernoulli case; the only difference is that we assume the conditional distribution of $X$ given $Y=y$ follows a multinomial distribution. That is, the conditional pmf for the $i$th input $(x_i,y_i)$ is defined as:

$$p_{X|Y}(x_i\mid y_i)=\frac{M_i!}{x_{i1}!\cdots x_{id}!}\prod_{j=1}^{d}\mu_{j y_i}^{x_{ij}}$$

where $M_i$ represents the total count for sample $i$.

By positional independence (and since the multinomial coefficient does not depend on the class or the parameters), we can drop the coefficient:

$$p_{X|Y}(x_i\mid y_i)\propto\prod_{j=1}^{d}\mu_{j y_i}^{x_{ij}}$$

Then, we can write the likelihood function as:

$$L(\theta;D)=\prod_{i=1}^{N}P(x_i,y_i;\theta)=\prod_{i=1}^{N}\pi_{y_i}\prod_{j=1}^{d}\mu_{j y_i}^{x_{ij}}$$

subject to constraints:

$$\sum_{k=1}^{K}\pi_k=1$$

$$\sum_{j=1}^{d}\mu_{jk}=1\quad\text{for each class }k$$

$$\sum_{j=1}^{d}x_{ij}=M_i$$

By solving the constrained maximization problem (and adding Laplace smoothing with constant $l$), we have the estimates:

$$\hat{\mu}_{jk}=\frac{\sum_{i=1}^{N}\mathbb{I}[y_i=k]\,x_{ij}+l}{\sum_{i=1}^{N}\mathbb{I}[y_i=k]\sum_{\beta=1}^{d}x_{i\beta}+l\,d}$$

$$\hat{\pi}_k=\frac{\sum_{i=1}^{N}\mathbb{I}[y_i=k]}{N}$$
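
A vectorized sketch of these two estimates, assuming a count matrix `X` of shape `(N, d)`, labels `y`, and smoothing constant `l` (toy values below); the implementation at the end of these notes computes the same quantities with explicit loops:

import numpy as np

X = np.array([[2, 0, 1],
              [3, 1, 0],
              [0, 2, 4],
              [1, 0, 5]])              # toy count features
y = np.array([0, 0, 1, 1])
l = 1.0                                 # Laplace smoothing constant
classes = np.unique(y)
d = X.shape[1]

mu_hat = np.zeros((classes.size, d))
pi_hat = np.zeros(classes.size)
for idx, k in enumerate(classes):
    Xk = X[y == k]
    # numerator: per-feature counts in class k plus l; denominator: total counts in class k plus l * d
    mu_hat[idx] = (Xk.sum(axis=0) + l) / (Xk.sum() + l * d)
    pi_hat[idx] = (y == k).sum() / y.size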

Categorical Naive Bayes

In some cases, features $X_j\in\{1,\dots,K_j\}$ are nominal and their numerical values have no explicit meaning. In this case, we can model the pmf of the $j$th feature of the $i$th example given $Y=y_i$ with a categorical pmf:

$$p_{X_j|Y}(x_{ij}\mid y_i)=\prod_{k=1}^{K_j}\mu_{j y_i k}^{\mathbb{I}[x_{ij}=k]}$$

The likelihood function is therefore:

$$L(\theta;D)=\prod_{i=1}^{N}P(x_i,y_i;\theta)=\prod_{i=1}^{N}\pi_{y_i}\prod_{j=1}^{d}\prod_{k=1}^{K_j}\mu_{j y_i k}^{\mathbb{I}[x_{ij}=k]}$$

subject to constraints:

$$\sum_{k=1}^{K}\pi_k=1$$

$$\sum_{k=1}^{K_j}\mu_{jck}=1\quad\text{for each feature }j\text{ and class }c$$

The estimates are:

$$\hat{\mu}_{jck}=\frac{\sum_{i=1}^{N}\mathbb{I}[y_i=c]\,\mathbb{I}[x_{ij}=k]+l}{\sum_{i=1}^{N}\mathbb{I}[y_i=c]+l\,K_j}$$
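
A sketch for a single nominal feature $j$ with $K_j$ levels, assuming integer-coded values `Xj` and labels `y` (toy values):

import numpy as np

Xj = np.array([0, 2, 1, 0, 2, 2, 1, 0])    # one nominal feature, levels {0, 1, 2}
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
Kj = 3
l = 1.0                                     # smoothing constant

c = 1                                       # class of interest
mask = (y == c)
counts = np.bincount(Xj[mask], minlength=Kj)      # counts[k] = # samples in class c with x_ij = k
mu_hat = (counts + l) / (mask.sum() + l * Kj)     # smoothed categorical estimate, sums to 1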

Gaussian Naive Bayes

Suppose the features are continuous, $X_j\in\mathbb{R}$. Then we can use a Gaussian distribution to model the conditional distribution of each feature:

$$f_{X_j|Y}(x_{ij}\mid Y=y_i;\mu_{j y_i},\sigma_{j y_i})=\mathcal{N}(\mu_{j y_i},\sigma_{j y_i}^2)$$

Then, following a similar procedure, we have the ML estimates:

$$\hat{\mu}_{jk}=\frac{1}{n_k}\sum_{i=1}^{N}\mathbb{I}[y_i=k]\,x_{ij}$$

$$\hat{\sigma}_{jk}^2=\frac{1}{n_k}\sum_{i=1}^{N}\mathbb{I}[y_i=k]\,(x_{ij}-\hat{\mu}_{jk})^2$$

where $n_k=\sum_{i=1}^{N}\mathbb{I}[y_i=k]$ is the number of samples in class $k$.
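
A short numpy sketch of these two estimates (the continuous matrix `X` and labels `y` are toy values):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                  # toy continuous features
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

k = 0                                        # class of interest
mask = (y == k)
n_k = mask.sum()
mu_hat = X[mask].sum(axis=0) / n_k                            # per-feature class mean
sigma2_hat = ((X[mask] - mu_hat) ** 2).sum(axis=0) / n_k      # ML (biased) variance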

Naive Bayes as Linear Classifier
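
For the Multinomial model above, taking the log of the unnormalized posterior shows that each class score is an affine function of the feature vector, which is why Naive Bayes acts as a linear classifier:

$$\log P(Y=k\mid x)=\log\pi_k+\sum_{j=1}^{d}x_j\log\mu_{jk}+\text{const}=w_k^\top x+b_k,\qquad w_{kj}=\log\mu_{jk},\; b_k=\log\pi_k$$

The decision boundary between any two classes $k$ and $k'$ is therefore the hyperplane $(w_k-w_{k'})^\top x+(b_k-b_{k'})=0$. A similar argument applies to the Bernoulli case, and to the Gaussian case when the variances are shared across classes.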

Implementation


import pandas as pd
import numpy as np


class MultinomialNB:

    def __init__(self, alpha=1):
        self.alpha = alpha            # Laplace smoothing constant (l in the derivation above)
        self.theta = None             # log mu_hat_{jk}, shape (n_classes, n_features)
        self.class_prior = None       # rows of (class label, pi_hat_k)
        self.class_map = None         # row index in theta -> original class label

    def fit(self, X, y):
        # Accept pandas inputs but work with plain numpy arrays internally
        if isinstance(X, pd.DataFrame):
            X = X.values
        if isinstance(y, pd.Series):
            y = y.values

        c_ = np.unique(y)
        n_c = c_.size
        NT, N_d = X.shape

        self.theta = np.zeros((n_c, N_d))
        # class_prior[:, 0] holds the class label, class_prior[:, 1] holds pi_hat_k
        self.class_prior = np.array(np.unique(y, return_counts=True), dtype=np.float64).T
        self.class_prior[:, 1] = self.class_prior[:, 1] / NT
        self.class_map = {}

        for index, c in enumerate(c_):
            N_c = X[y == c].sum()             # total count over all features in class c
            self.class_map[index] = c
            for i in range(N_d):
                N_ci = X[y == c, i].sum()     # count of feature i in class c
                # smoothed estimate: log((N_ci + alpha) / (N_c + alpha * d))
                self.theta[index, i] = np.log((N_ci + self.alpha) / (N_c + self.alpha * N_d))

        return self

    def predict(self, X):
        if isinstance(X, pd.DataFrame):
            X = X.values

        pred_results = np.zeros(X.shape[0])
        class_size = len(self.class_map)

        for index, i in enumerate(X):
            # log-posterior (up to a constant) per class: log pi_k + sum_j x_j * log mu_jk
            temp_lst = np.zeros(class_size)
            for c in range(class_size):
                temp_lst[c] = np.log(self.class_prior[:, 1][c])
                for d in range(X.shape[1]):
                    temp_lst[c] = temp_lst[c] + self.theta[c][d] * i[d]
            pred_results[index] = self.class_map[np.argmax(temp_lst)]

        return pred_results
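
A quick usage sketch on toy count data, assuming the class above is in scope (note that predict stores labels in a float array, so numeric class labels are expected):

import numpy as np

X_train = np.array([[2, 1, 0],
                    [3, 0, 1],
                    [0, 2, 3],
                    [1, 1, 4]])
y_train = np.array([0, 0, 1, 1])

model = MultinomialNB(alpha=1).fit(X_train, y_train)
print(model.predict(np.array([[2, 0, 0], [0, 1, 5]])))   # should print [0. 1.]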
