Logistic Regression
Suppose we have training examples \(D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N); \; \mathbf{x}_i \in \mathbb{R}^d\}\); our goal is to make a decision about the class of a new input \(\mathbf{x}\). Logistic regression does this by learning a set of weights and biases from the training set (a weight vector and a scalar bias in the binary case, a weight matrix and a bias vector in the multi-class case).
Binary-Class
In the binary-class problem, our target \(Y\) takes values in \(\{0, 1\}\). To model the distribution \(P(Y | \mathbf{X}; \; \mathbf{w}, b)\), we apply the sigmoid function to the dot product of the weights and the input, which squashes the output to a value in \([0, 1]\) (one requirement for a probability):
\[z = \mathbf{x}^T \mathbf{w} + b\]
\[y = \sigma(z)\]
To make sure that the conditional pmf of the class random variable \(Y\) sums to 1:
\[P(Y=1 | X=\mathbf{x} ;\; \mathbf{w}, b) = \frac{1}{1 + e^{-z}} = p\]
\[P(Y=0 | X=\mathbf{x} ;\; \mathbf{w}, b) = 1 - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}} = 1 - p\]
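As a quick illustration, here is a minimal NumPy sketch of this conditional pmf; the toy values of `x`, `w`, and `b` are assumptions for the example only:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

# toy input x with assumed weights w and bias b
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.05

z = x @ w + b        # z = x^T w + b
p = sigmoid(z)       # P(Y=1 | x)
print(p, 1.0 - p)    # the two probabilities sum to 1
```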
Equivalently, we can express this conditional pmf as a Bernoulli pmf:
\[p_{Y|\mathbf{X}} (y | \mathbf{x}; \; \mathbf{w}, b) = p^y (1 - p)^{1 - y}\]
Given the conditional pmf of \(Y\) given \(X = \mathbf{x}\), we can use a simple decision rule to make predictions:
\[ \hat{y} = \begin{cases} 1, & P(Y=1 | X=\mathbf{x}) > 0.5\\ 0, & P(Y=1 | X=\mathbf{x}) \leq 0.5 \end{cases} \]
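A minimal sketch of this thresholding rule in NumPy; the function name and the batched input `X` are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    # X: (N, d) array of inputs; returns hard 0/1 labels using the 0.5 cutoff
    p = sigmoid(X @ w + b)              # P(Y=1 | x_i) for each row of X
    return (p > threshold).astype(int)
```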
Learning Parameters
Given the dataset \(D\), the conditional likelihood function can be written as:
\[L(\boldsymbol{\theta}; \; D) = \prod^{N}_{i=1} P( y_i | \mathbf{x}_i ; \; \boldsymbol{\theta})\]
where \(\boldsymbol{\theta} = \langle \mathbf{w}, b \rangle\). In general, we can absorb \(b\) into \(\mathbf{w}\) by adding an extra \(w_0\) term and prepending a constant feature \(1\) to each feature vector \(\mathbf{x}_i\). Our log likelihood becomes:
\[\begin{aligned} l(\boldsymbol{\theta}; \; D) &= \sum^{N}_{i=1} y_i\log (p_i) + (1 - y_i) \log(1 - p_i)\\ &= \sum^{N}_{i=1} y_i\log \left(\frac{1}{1 + e^{-z_i}}\right) + (1 - y_i) \log\left(\frac{e^{-z_i}}{1 + e^{-z_i}}\right)\\ &= \sum^{N}_{i=1} -y_i\log (1 + e^{-z_i}) - (1 - y_i)z_i - \log(1 + e^{-z_i}) + y_i\log (1 + e^{-z_i})\\ &= \sum^{N}_{i=1} y_i \mathbf{w}^T\mathbf{x}_i - \log (1 + e^{\mathbf{w}^T\mathbf{x}_i}) \end{aligned}\]Negating this log-likelihood gives the cross-entropy loss of logistic regression.
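As a sketch, this loss can be computed directly from the predicted probabilities; averaging over \(N\) and the clipping constant are assumptions added for convenience and numerical safety, not part of the derivation:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    # y: (N,) 0/1 labels, p: (N,) predicted P(Y=1 | x_i); clip to avoid log(0)
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```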
The partial derivative is then:
\[\frac{\partial l(\boldsymbol{\theta}; \; D)}{\partial w_j} = \sum^{N}_{i=1}(y_i - p_i) x_{ij}\]
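In vectorized form this gradient is simply \(X^T(\mathbf{y} - \mathbf{p})\); a minimal sketch:

```python
import numpy as np

def log_likelihood_grad(X, y, p):
    # X: (N, d), y: (N,) 0/1 labels, p: (N,) predicted P(Y=1 | x_i)
    # component j is sum_i (y_i - p_i) x_ij
    return X.T @ (y - p)
```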
We cannot solve for the parameters in closed form by setting this gradient to zero, because the resulting equations are non-linear in \(\mathbf{w}\). Instead, one common approach is to apply gradient descent to the negative log-likelihood (the cross-entropy loss).
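A minimal gradient-descent sketch for the binary case, tying the pieces above together; the learning rate and iteration count are arbitrary choices, not values from the text:

```python
import numpy as np

def fit_binary_logreg(X, y, lr=0.1, n_iters=1000):
    # gradient descent on the cross-entropy loss (negative log-likelihood)
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # P(Y=1 | x_i)
        grad_w = -(X.T @ (y - p)) / N            # gradient of the loss w.r.t. w
        grad_b = -np.sum(y - p) / N              # gradient of the loss w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```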
Multi-Class
In the multi-class case, we have \(Y \in \{1, \ldots, K\}\). Instead of a single parameter vector \(\boldsymbol \theta = \mathbf{w}\), we now have one parameter vector per class, collected into a matrix \(\boldsymbol \theta = W_{[K \times d]}\) (strictly, \(K-1\) of them suffice, since the probabilities must sum to 1). We can then define the conditional pmf for the \(i\)th sample using the softmax function as:
\[P(Y=y_i | \mathbf{X} = \mathbf{x}_i) = \frac{e^{z_{iy_i}}}{\sum^{K}_{j=1} e^{z_{ij}}} = p_{iy_i}, \quad 1 \leq y_i \leq K\]
where \(z_{ij} = \mathbf{w}_j^T \mathbf{x}_i\) and \(\mathbf{w}_j\) is the \(j\)th row of \(W\).
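A minimal softmax sketch; subtracting the row-wise maximum before exponentiating is a numerical-stability assumption that leaves the result unchanged, since the softmax is invariant to adding a constant to every score:

```python
import numpy as np

def softmax(Z):
    # Z: (N, K) matrix of scores z_ij = w_j^T x_i; returns (N, K) probabilities
    Z = Z - Z.max(axis=1, keepdims=True)   # shift each row for stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)
```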
This is equivalent to the pmf of the multinomial distribution with \(n=1\), i.e. the categorical distribution:
\[p_{Y | \mathbf{x}}(y_i | \mathbf{x}_i) = \prod^{K}_{j=1} p_{ij}^{I[y_i = j]}\]
Then the log likelihood is:
\[l(\boldsymbol{\theta}; \; D) = \sum^{N}_{i=1} \sum^{K}_{j=1} I[y_i = j] \log (p_{ij})\]
Negating this log-likelihood gives the multi-class cross-entropy loss of logistic regression.
If we use a one-hot encoding \(\mathbf{y}_i\) for the class label (so \(y_{ij} = I[y_i = j]\)), then we can write the log likelihood as:
\[l(\boldsymbol{\theta}; \; D) = \sum^{N}_{i=1} \sum^{K}_{j=1} y_{ij} \log (p_{ij})\]
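With a one-hot label matrix \(Y\) (shape \(N \times K\)) and predicted probabilities \(P\) (shape \(N \times K\)), the negated log-likelihood becomes an elementwise product and sum; a sketch, with averaging and clipping added as convenience assumptions:

```python
import numpy as np

def multiclass_cross_entropy(Y, P, eps=1e-12):
    # Y: (N, K) one-hot labels, P: (N, K) softmax probabilities
    return -np.mean(np.sum(Y * np.log(np.clip(P, eps, 1.0)), axis=1))
```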
The partial derivative is then:
\[\frac{\partial l(\boldsymbol{\theta}; \; D)}{\partial w_{jd}} = \sum^{N}_{i=1} (I[y_i = j] - p_{ij}) x_{id}\]
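In matrix form this gradient is \((Y - P)^T X\), a \(K \times d\) matrix; a minimal sketch:

```python
import numpy as np

def multiclass_log_likelihood_grad(X, Y, P):
    # X: (N, d), Y: (N, K) one-hot labels, P: (N, K) predicted probabilities
    # entry (j, d) is sum_i (I[y_i = j] - p_ij) x_id
    return (Y - P).T @ X
```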
Implementation
import numpy as np
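The block above only contains the import; below is one possible end-to-end sketch of multi-class (softmax) logistic regression trained by batch gradient descent, assuming integer labels in \(\{0, \ldots, K-1\}\). The class name, method names, and hyperparameters are my own choices, not prescribed by the text.

```python
import numpy as np

class SoftmaxRegression:
    """Multi-class logistic regression trained by batch gradient descent."""

    def __init__(self, lr=0.1, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters

    @staticmethod
    def _softmax(Z):
        Z = Z - Z.max(axis=1, keepdims=True)      # numerical stability
        expZ = np.exp(Z)
        return expZ / expZ.sum(axis=1, keepdims=True)

    def fit(self, X, y):
        # X: (N, d) features, y: (N,) integer labels in {0, ..., K-1}
        N, d = X.shape
        K = int(y.max()) + 1
        Y = np.eye(K)[y]                          # one-hot encode the labels
        self.W = np.zeros((K, d))
        self.b = np.zeros(K)
        for _ in range(self.n_iters):
            P = self._softmax(X @ self.W.T + self.b)   # (N, K) probabilities
            grad_W = -(Y - P).T @ X / N                # gradient of the loss w.r.t. W
            grad_b = -(Y - P).sum(axis=0) / N          # gradient of the loss w.r.t. b
            self.W -= self.lr * grad_W
            self.b -= self.lr * grad_b
        return self

    def predict(self, X):
        # most probable class for each row of X
        return np.argmax(X @ self.W.T + self.b, axis=1)
```

For example, on a toy dataset `X = np.random.randn(200, 2)` with labels `y = (X[:, 0] + X[:, 1] > 0).astype(int)`, `SoftmaxRegression().fit(X, y).predict(X)` should recover most of the labels after the default number of iterations.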