Logistic Regression
Suppose we have training examples \(D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N); \; \mathbf{x}_i \in \mathbb{R}^d\}\); our goal is to make a decision about the class of a new input \(\mathbf{x}\). Logistic regression does this by learning a set of weights and biases from the training set (a weight vector and a scalar bias in the binary case, a weight matrix and a bias vector in the multi-class case).
Binary-Class
In the binary-class problem, our target \(Y\) takes values in \(\{0, 1\}\). To model the distribution \(P(Y | \mathbf{X}; \; \mathbf{w}, b)\), we apply the sigmoid function to the dot product of the weights and the input, which squashes the output to a value in \([0, 1]\) (one requirement for a probability):
\[z = \mathbf{x}^T \mathbf{w} + b\]
\[y = \sigma(z)\]
To make sure that the conditional pmf of the class random variable \(Y\) sums to 1:
\[P(Y=1 | X=\mathbf{x} ;\; \mathbf{w}, b) = \frac{1}{1 + e^{-z}} = p\]
\[P(Y=0 | X=\mathbf{x} ;\; \mathbf{w}, b) = 1 - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}} = 1 - p\]
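As a quick illustration, here is a minimal NumPy sketch of this conditional pmf; the toy values of `x`, `w`, and `b` are assumptions for the example only:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

# toy input x with assumed weights w and bias b
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.05

z = x @ w + b        # z = x^T w + b
p = sigmoid(z)       # P(Y=1 | x)
print(p, 1.0 - p)    # the two probabilities sum to 1
```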
Equivalently, we can express this conditional pmf as a Bernoulli pmf:
\[p_{Y|\mathbf{X}} (y | \mathbf{x}; \; \mathbf{w}, b) = p^y (1 - p)^{1 - y}\]
Given the conditional pmf of \(Y\) given \(X = \mathbf{x}\), we can use a simple decision rule to make predictions:
\[ \hat{y} = \begin{cases} 1, & P(Y=1 | X=\mathbf{x}) > 0.5\\ 0, & P(Y=1 | X=\mathbf{x}) \leq 0.5 \end{cases} \]
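A minimal sketch of this thresholding rule in NumPy; the function name and the batched input `X` are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    # X: (N, d) array of inputs; returns hard 0/1 labels using the 0.5 cutoff
    p = sigmoid(X @ w + b)              # P(Y=1 | x_i) for each row of X
    return (p > threshold).astype(int)
```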
Learning Parameters
Given the dataset \(D\), the conditional likelihood function can be written as:
\[L(\boldsymbol{\theta}; \; D) = \prod^{N}_{i=1} P( y_i | \mathbf{x}_i ; \; \boldsymbol{\theta})\]
where \(\boldsymbol{\theta} = \langle \mathbf{w}, b \rangle\). In general, we can absorb \(b\) into \(\mathbf{w}\) by adding an extra \(w_0\) term and prepending a constant feature \(1\) to each feature vector \(\mathbf{x}_i\). Our log likelihood becomes:
\[\begin{aligned} l(\boldsymbol{\theta}; \; D) &= \sum^{N}_{i=1} y_i\log (p_i) + (1 - y_i) \log(1 - p_i)\\ &= \sum^{N}_{i=1} y_i\log \left(\frac{1}{1 + e^{-z_i}}\right) + (1 - y_i) \log\left(\frac{e^{-z_i}}{1 + e^{-z_i}}\right)\\ &= \sum^{N}_{i=1} -y_i\log (1 + e^{-z_i}) - (1 - y_i)z_i - \log(1 + e^{-z_i}) + y_i\log (1 + e^{-z_i})\\ &= \sum^{N}_{i=1} y_i \mathbf{w}^T\mathbf{x}_i - \log (1 + e^{\mathbf{w}^T\mathbf{x}_i}) \end{aligned}\]Negating this log-likelihood gives the cross-entropy loss of logistic regression.
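As a sketch, this loss can be computed directly from the predicted probabilities; averaging over \(N\) and the clipping constant are assumptions added for convenience and numerical safety, not part of the derivation:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    # y: (N,) 0/1 labels, p: (N,) predicted P(Y=1 | x_i); clip to avoid log(0)
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```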
The partial derivative is then:
\[\frac{\partial l(\boldsymbol{\theta}; \; D)}{\partial w_j} = \sum^{N}_{i=1}(y_i - p_i) x_{ij}\]
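In vectorized form this gradient is simply \(X^T(\mathbf{y} - \mathbf{p})\); a minimal sketch:

```python
import numpy as np

def log_likelihood_grad(X, y, p):
    # X: (N, d), y: (N,) 0/1 labels, p: (N,) predicted P(Y=1 | x_i)
    # component j is sum_i (y_i - p_i) x_ij
    return X.T @ (y - p)
```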
We cannot solve for the parameters in closed form by setting this gradient to zero, because the resulting equations are non-linear in \(\mathbf{w}\). Instead, one common approach is to apply gradient descent to the negative log-likelihood (the cross-entropy loss).
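A minimal gradient-descent sketch for the binary case, tying the pieces above together; the learning rate and iteration count are arbitrary choices, not values from the text:

```python
import numpy as np

def fit_binary_logreg(X, y, lr=0.1, n_iters=1000):
    # gradient descent on the cross-entropy loss (negative log-likelihood)
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # P(Y=1 | x_i)
        grad_w = -(X.T @ (y - p)) / N            # gradient of the loss w.r.t. w
        grad_b = -np.sum(y - p) / N              # gradient of the loss w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```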
Multi-Class
In the multi-class case, we have \(Y \in \{1, \ldots, K\}\). Instead of a single parameter vector \(\boldsymbol \theta = \mathbf{w}\), we now have one parameter vector per class, collected into a matrix \(\boldsymbol \theta = W_{[K \times d]}\) (strictly, \(K-1\) of them suffice, since the probabilities must sum to 1). We can then define the conditional pmf for the \(i\)th sample using the softmax function as:
\[P(Y=y_i | \mathbf{X} = \mathbf{x}_i) = \frac{e^{z_{iy_i}}}{\sum^{K}_{j=1} e^{z_{ij}}} = p_{iy_i}, \quad 1 \leq y_i \leq K\]
where \(z_{ij} = \mathbf{w}_j^T \mathbf{x}_i\) and \(\mathbf{w}_j\) is the \(j\)th row of \(W\).
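A minimal softmax sketch; subtracting the row-wise maximum before exponentiating is a numerical-stability assumption that leaves the result unchanged, since the softmax is invariant to adding a constant to every score:

```python
import numpy as np

def softmax(Z):
    # Z: (N, K) matrix of scores z_ij = w_j^T x_i; returns (N, K) probabilities
    Z = Z - Z.max(axis=1, keepdims=True)   # shift each row for stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)
```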
This is equivalent to the pmf of the multinomial distribution with \(n=1\), i.e. the categorical distribution:
\[p_{Y | \mathbf{x}}(y_i | \mathbf{x}_i) = \prod^{K}_{j=1} p_{ij}^{I[y_i = j]}\]
Then the log likelihood is:
\[l(\boldsymbol{\theta}; \; D) = \sum^{N}_{i=1} \sum^{K}_{j=1} I[y_i = j] \log (p_{ij})\]
Negating this log-likelihood gives the multi-class cross-entropy loss of logistic regression.
If we use a one-hot encoding \(\mathbf{y}_i\) for the class label (so \(y_{ij} = I[y_i = j]\)), then we can write the log likelihood as:
\[l(\boldsymbol{\theta}; \; D) = \sum^{N}_{i=1} \sum^{K}_{j=1} y_{ij} \log (p_{ij})\]
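With a one-hot label matrix \(Y\) (shape \(N \times K\)) and predicted probabilities \(P\) (shape \(N \times K\)), the negated log-likelihood becomes an elementwise product and sum; a sketch, with averaging and clipping added as convenience assumptions:

```python
import numpy as np

def multiclass_cross_entropy(Y, P, eps=1e-12):
    # Y: (N, K) one-hot labels, P: (N, K) softmax probabilities
    return -np.mean(np.sum(Y * np.log(np.clip(P, eps, 1.0)), axis=1))
```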
The partial derivative is then:
\[\frac{\partial l(\boldsymbol{\theta}; \; D)}{\partial w_{jd}} = \sum^{N}_{i=1} (I[y_i = j] - p_{ij}) x_{id}\]
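In matrix form this gradient is \((Y - P)^T X\), a \(K \times d\) matrix; a minimal sketch:

```python
import numpy as np

def multiclass_log_likelihood_grad(X, Y, P):
    # X: (N, d), Y: (N, K) one-hot labels, P: (N, K) predicted probabilities
    # entry (j, d) is sum_i (I[y_i = j] - p_ij) x_id
    return (Y - P).T @ X
```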
Implementation
import numpy as np
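The block above only contains the import; below is one possible end-to-end sketch of multi-class (softmax) logistic regression trained by batch gradient descent, assuming integer labels in \(\{0, \ldots, K-1\}\). The class name, method names, and hyperparameters are my own choices, not prescribed by the text.

```python
import numpy as np

class SoftmaxRegression:
    """Multi-class logistic regression trained by batch gradient descent."""

    def __init__(self, lr=0.1, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters

    @staticmethod
    def _softmax(Z):
        Z = Z - Z.max(axis=1, keepdims=True)      # numerical stability
        expZ = np.exp(Z)
        return expZ / expZ.sum(axis=1, keepdims=True)

    def fit(self, X, y):
        # X: (N, d) features, y: (N,) integer labels in {0, ..., K-1}
        N, d = X.shape
        K = int(y.max()) + 1
        Y = np.eye(K)[y]                          # one-hot encode the labels
        self.W = np.zeros((K, d))
        self.b = np.zeros(K)
        for _ in range(self.n_iters):
            P = self._softmax(X @ self.W.T + self.b)   # (N, K) probabilities
            grad_W = -(Y - P).T @ X / N                # gradient of the loss w.r.t. W
            grad_b = -(Y - P).sum(axis=0) / N          # gradient of the loss w.r.t. b
            self.W -= self.lr * grad_W
            self.b -= self.lr * grad_b
        return self

    def predict(self, X):
        # most probable class for each row of X
        return np.argmax(X @ self.W.T + self.b, axis=1)
```

For example, on a toy dataset `X = np.random.randn(200, 2)` with labels `y = (X[:, 0] + X[:, 1] > 0).astype(int)`, `SoftmaxRegression().fit(X, y).predict(X)` should recover most of the labels after the default number of iterations.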