Logistic Regression

Suppose we have training examples \(D = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)\}\) with \(\mathbf{x}_i \in \mathbb{R}^d\). Our goal is to make a decision about the class of a new input \(\mathbf{x}\). Logistic regression does this by learning a bias term and a set of weights from the training set.

Binary-Class

In the binary-class problem, our target \(Y\) takes values in \(\{0, 1\}\). To model the distribution \(P(Y | \mathbf{X}; \; \mathbf{w}, b)\), we apply the sigmoid function to the dot product of the weights and the input, which transforms the output into a value in \([0, 1]\) (one criterion for a probability):

\[z = \mathbf{x}^T \mathbf{w} + b\]

\[y = \sigma(z)\]

To make sure that the conditional pmf of the class random variable \(Y\) sums to 1:

\[P(Y=1 | X=\mathbf{x} ;\; \mathbf{w}, b) = \frac{1}{1 + e^{-z}} = p\]

\[P(Y=0 | X=\mathbf{x} ;\; \mathbf{w}, b) = 1 - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}} = 1 - p\]

Then, we can equivalently express this conditional pmf as a Bernoulli pmf:

\[p_{Y|\mathbf{X}} (y | \mathbf{x}; \; \mathbf{w}, b) = p^y (1 - p)^{1 - y}\]

If we have the conditional pmf of \(Y\) given \(X = \mathbf{x}\), then we can use a simple decision rule to make decisions:

\[ \hat{y} = \begin{cases} 1, \quad P(Y=1 | X=\mathbf{x}) > 0.5\\ 0, \quad P(Y=1 | X=\mathbf{x}) \leq 0.5 \end{cases} \]
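As a small numerical sketch (the weight, bias, and input values below are made up for illustration, not taken from the text), the probability and the decision rule look like this in numpy:

import numpy as np

w = np.array([0.8, -0.4])           # made-up weight vector
b = 0.1                             # made-up bias
x = np.array([1.0, 2.0])            # a new input

z = x @ w + b                       # z = x^T w + b
p = 1.0 / (1.0 + np.exp(-z))        # P(Y=1 | x), the sigmoid of z
print(p, 1.0 - p)                   # the two probabilities sum to 1
y_hat = 1 if p > 0.5 else 0         # decision rule: threshold at 0.5
print(y_hat)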

Learning Parameters

Given the dataset \(D\), the conditional likelihood function can be written as:

\[L(\boldsymbol{\theta}; \; D) = \prod^{N}_{i=1} P( y_i | \mathbf{x}_i ; \; \boldsymbol{\theta})\]

where \(\boldsymbol{\theta} = \langle \mathbf{w}, b \rangle\). In general, we can absorb \(b\) into \(\mathbf{w}\) by adding an extra \(w_0\) term and prepending a constant feature \(1\) to each feature vector \(\mathbf{x}_i\). Our log likelihood then becomes:

\[\begin{aligned} l(\boldsymbol{\theta}; \; D) &= \sum^{N}_{i=1} y_i\log (p_i) + (1 - y_i) \log(1 - p_i)\\ &= \sum^{N}_{i=1} y_i\log \left(\frac{1}{1 + e^{-z_i}}\right) + (1 - y_i) \log\left(\frac{e^{-z_i}}{1 + e^{-z_i}}\right)\\ &= \sum^{N}_{i=1} -y_i\log (1 + e^{-z_i}) - (1 - y_i)z_i - \log(1 + e^{-z_i}) + y_i\log (1 + e^{-z_i})\\ &= \sum^{N}_{i=1} y_i z_i - z_i - \log(1 + e^{-z_i})\\ &= \sum^{N}_{i=1} y_i \mathbf{w}^T\mathbf{x}_i - \log (1 + e^{\mathbf{w}^T\mathbf{x}_i}) \end{aligned}\]

By taking the negative of this log-likelihood, we obtain the cross-entropy loss of logistic regression.
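As a quick sanity check on this simplification (a minimal sketch with randomly generated values, not part of the original notes), the simplified form \(\sum_i y_i z_i - \log(1 + e^{z_i})\) matches the cross-entropy form numerically:

import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5)              # z_i = w^T x_i for 5 hypothetical examples
y = rng.integers(0, 2, size=5)      # binary labels in {0, 1}

p = 1.0 / (1.0 + np.exp(-z))        # sigmoid probabilities

# Cross-entropy form of the log-likelihood
ll_ce = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Simplified form: sum_i y_i z_i - log(1 + e^{z_i})
ll_simplified = np.sum(y * z - np.log(1 + np.exp(z)))

assert np.allclose(ll_ce, ll_simplified)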

The partial derivative is then:

\[\frac{\partial l(\boldsymbol{\theta}; \; D)}{\partial w_j} = \sum^{N}_{i=1}(y_i - p_i) x_{ij}\]

We are not going to solve for the parameters by setting the gradient to zero, because the resulting equations are non-linear in \(\mathbf{w}\) and have no closed-form solution. Instead, one way to learn the parameters is to apply gradient descent to the negative log-likelihood (the cross-entropy loss).
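A minimal gradient-descent sketch for the binary case (the function name, learning rate, and synthetic data below are my own choices for illustration; the bias is absorbed via a constant feature as described above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary_logreg(X, y, lr=0.1, n_iter=1000):
    # Gradient descent on the mean cross-entropy loss (negative log-likelihood).
    # X: (N, d) design matrix, y: (N,) labels in {0, 1}.
    X = np.column_stack([np.ones(len(X)), X])   # absorb the bias as w_0
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)                      # predicted P(Y=1 | x_i)
        grad = X.T @ (p - y) / len(y)           # gradient of the mean loss
        w -= lr * grad                          # descent step
    return w

# Tiny usage example on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w = fit_binary_logreg(X, y)
p_train = sigmoid(np.column_stack([np.ones(len(X)), X]) @ w)
print('training accuracy:', ((p_train > 0.5) == y).mean())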

Multi-Class

In the multi-class case, we have \(Y \in \{1, \dots, K\}\). Instead of a single parameter vector \(\boldsymbol \theta = \mathbf{w}\), we now have one set of parameters for each class (only \(K-1\) sets are strictly necessary, but it is common to keep all \(K\)), collected into \(\boldsymbol \theta = W_{[K \times d]}\). Then we can define the conditional pmf for the \(i\)th sample using the softmax function as:

\[P(Y=y_i | \mathbf{X} = \mathbf{x}_i) = \frac{e^{z_{iy_i}}}{\sum^{K}_{j=1} e^{z_{ij}}} = p_{iy_i}, \quad 1 \leq y_i \leq K\]

where \(z_{ij} = \mathbf{w}_j^T \mathbf{x}_i\) and \(\mathbf{w}_j\) is the \(j\)th row of \(W\).
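A small sketch of the softmax computation for a single example (the helper name and weight values are my own; subtracting the maximum score is a standard numerical-stability trick and leaves the probabilities unchanged):

import numpy as np

def softmax_probs(W, x):
    # W: (K, d) weight matrix, x: (d,) feature vector.
    z = W @ x                        # z_ij = w_j^T x_i for each class j
    e = np.exp(z - z.max())          # stability shift
    return e / e.sum()               # p_ij, summing to 1 over the K classes

W = np.array([[0.5, -1.0], [0.0, 0.3], [-0.2, 0.8]])   # K=3 classes, d=2 features
x = np.array([1.0, 2.0])
p = softmax_probs(W, x)
print(p, p.sum())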

This is equivalent to the pmf of the multinomial distribution with \(n=1\), which is the categorical distribution:

\[p_{Y | \mathbf{X}}(y_i | \mathbf{x}_i) = \prod^{K}_{j=1} p_{ij}^{I[y_i = j]}\]

Then the log likelihood is:

\[l(\boldsymbol{\theta}; \; D) = \sum^{N}_{i=1} \sum^{K}_{j=1} I[y_i = j] \log (p_{ij})\]

By taking the negative of this log-likelihood, we obtain the multi-class cross-entropy loss of logistic regression.

If we use a one-hot encoding \(\mathbf{y}_i\) for the class \(y_i\), then we can write the log likelihood equivalently as:

\[l(\boldsymbol{\theta}; \; D) = \sum^{N}_{i=1} \mathbf{y}_i^T \log (\mathbf{p}_i)\]

where \(\mathbf{p}_i\) is the vector of softmax probabilities for the \(i\)th sample and the \(\log\) is applied elementwise.
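A self-contained sketch of this one-hot form, negated to give the multi-class cross-entropy loss (the helper names and toy values are my own):

import numpy as np

def softmax_probs(W, x):
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

def multiclass_cross_entropy(W, X, Y_onehot):
    # Mean negative log-likelihood with one-hot labels Y_onehot of shape (N, K).
    loss = 0.0
    for x_i, y_i in zip(X, Y_onehot):
        p_i = softmax_probs(W, x_i)
        loss -= y_i @ np.log(p_i)     # picks out -log p_{i, y_i}
    return loss / len(X)

W = np.zeros((3, 2))                      # K=3, d=2
X = np.array([[1.0, 2.0], [0.5, -1.0]])
Y = np.array([[1, 0, 0], [0, 0, 1]])      # one-hot labels
print(multiclass_cross_entropy(W, X, Y))  # log(3) for uniform probabilities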

The partial derivative is then:

\[\frac{\partial l(\boldsymbol{\theta}; \; D)}{\partial w_{jd}} = \sum^{N}_{i=1} (I[y_i = j] - p_{ij}) x_{id}\]
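In matrix form, the gradient of the negative log-likelihood (the loss that gradient descent minimizes) can be written as \((P - Y)^T X\), where \(P\) holds the softmax probabilities row-wise and \(Y\) is the one-hot label matrix; this is the negative of the derivative above. A brief sketch (my own notation and toy data):

import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def multiclass_grad(W, X, Y_onehot):
    # Gradient of the summed negative log-likelihood w.r.t. W of shape (K, d).
    P = softmax_rows(X @ W.T)          # (N, K) predicted probabilities
    return (P - Y_onehot).T @ X        # entry (j, d) is sum_i (p_ij - I[y_i = j]) x_id

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
Y = np.eye(3)[rng.integers(0, 3, size=4)]   # random one-hot labels, K=3
W = rng.normal(size=(3, 2))
print(multiclass_grad(W, X, Y))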

Implementation

import numpy as np

class LogisticRegression:

    def __init__(self, optimizer, c=1, max_iter=2000, lr=0.01, tor=1e-7, verbose=False):
        self.lr = lr                    # learning rate passed to the optimizer
        self.c = c                      # L2 regularization strength
        self.max_iter = max_iter        # maximum number of gradient steps
        self.optimizer = optimizer      # optimizer factory: optimizer(lr=..., model_vars=[...])
        self.tor = tor                  # tolerance: stop when the weight update is smaller than this
        self.verbose = verbose          # print the loss every `verbose` iterations (False = silent)

        self.weights = None             # (d, k) weight matrix, bias weights in the first row
        self.k = None                   # number of classes
        self.d = None                   # number of features (including the constant bias feature)

    def fit(self, x_train, y_train):
        # y_train is expected to be one-hot encoded with shape (N, K).
        n_y, self.k = y_train.shape
        # Absorb the bias by prepending a constant feature of 1 to each example.
        x_train = np.column_stack([np.ones(n_y), x_train])
        n_x, self.d = x_train.shape
        i = 0

        if self.weights is None:
            self.weights = np.random.randn(self.d, self.k)

        opt = self.optimizer(lr=self.lr, model_vars=[self.weights])

        prev_matrix = np.zeros_like(self.weights)
        dif = np.linalg.norm(self.weights - prev_matrix)

        # Only the multi-class (one-hot, K >= 2) path is implemented.
        if self.k > 1:
            while (i <= self.max_iter) and (dif >= self.tor):
                prev_matrix = self.weights.copy()
                if self.verbose and (i % self.verbose == 0):
                    print(f'iteration: {i}, loss: {self._cal_train_loss(x_train, y_train)}')

                # The optimizer maps a list of gradients to a list of updated variables.
                self.weights = opt([self._mc_ce_grad(x_train, y_train)])[0]
                dif = np.linalg.norm(self.weights - prev_matrix)
                i += 1

    def predict(self, x_test):
        x_test = np.column_stack([np.ones(x_test.shape[0]), x_test])
        # Predict the class with the largest softmax probability.
        return [np.argmax(self._cal_softmax(x_i)) for x_i in x_test]

    def _mc_ce_grad(self, x, labels):
        # Gradient of the regularized multi-class cross-entropy:
        # sum_i outer(x_i, p_i - y_i) plus the L2 penalty, averaged over the batch.
        n = x.shape[0]
        output_g = np.zeros((self.d, self.k))

        for i, x_i in enumerate(x):
            soft_max = self._cal_softmax(x_i)
            output_g += np.outer(x_i, soft_max - labels[i])

        return (output_g + self.c * self.weights) / n

    def _cal_softmax(self, x):
        # Softmax over the K class scores w_k^T x, shifted by the max for numerical stability.
        z = self.weights.T @ x
        e = np.exp(z - z.max())
        return e / e.sum()

    @staticmethod
    def cross_entropy(y_pred, y_true):
        # Mean cross-entropy between one-hot targets and predicted probabilities.
        loss = 0.0
        for i, v in enumerate(y_true):
            loss += np.dot(np.log(y_pred[i]), v)
        return -loss / len(y_true)

    def _cal_train_loss(self, x_train, y_train):
        y_pred = np.vstack([self._cal_softmax(x_i) for x_i in x_train])
        return self.cross_entropy(y_pred, y_train)
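
The class above expects an optimizer factory that is constructed with lr and model_vars and, when called with a list of gradients, returns the list of updated variables. A minimal sketch of such an optimizer and of how the class might be used (the VanillaSGD name and the synthetic data are my own assumptions, not part of the original code):

import numpy as np

class VanillaSGD:
    # Minimal optimizer matching the interface assumed by LogisticRegression:
    # built with lr and model_vars, called with a list of gradients,
    # returns the list of updated variables.
    def __init__(self, lr, model_vars):
        self.lr = lr
        self.model_vars = model_vars

    def __call__(self, grads):
        self.model_vars = [v - self.lr * g for v, g in zip(self.model_vars, grads)]
        return self.model_vars

# Synthetic 3-class data with one-hot labels, just to exercise the class.
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 2))
labels = (x[:, 0] + x[:, 1] > 0).astype(int) + (x[:, 0] > 1).astype(int)   # values in {0, 1, 2}
y_onehot = np.eye(3)[labels]

model = LogisticRegression(optimizer=VanillaSGD, lr=0.1, max_iter=500, verbose=100)
model.fit(x, y_onehot)
pred = model.predict(x)
print('training accuracy:', np.mean(np.array(pred) == labels))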
