
Support Vector Machine

Given a dataset \(D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}\) with \(\mathbf{x}_i \in \mathbb{R}^m\) and \(y_i \in \{-1, 1\}\), let \(\mathbf{w}^T \mathbf{x} + b = 0\) be a hyperplane. We classify a new point \(\mathbf{x}_i\) by

\[ \hat{y}_i (\mathbf{x}_i) = \begin{cases} 1, \quad &\mathbf{w}^T \mathbf{x}_i + b \geq 0\\ -1, \quad &\mathbf{w}^T \mathbf{x}_i + b < 0 \end{cases} \]

Suppose that our data is linearly separable. Then there exists at least one combination of parameters \(\{\mathbf{w}, b\}\) such that:

\[\hat{\gamma}_i = y_i (\mathbf{w}^T \mathbf{x}_i + b) > 0, \quad \forall i=1, \ldots, N\]

When \(y_i = 1\), a confident classifier would have \(\mathbf{w}^T \mathbf{x}_i + b\) as large as possible. On the other hand, when \(y_i = -1\), a confident classifier would have \(\mathbf{w}^T \mathbf{x}_i + b\) as negative as possible. Thus, we want \(\hat{\gamma}_i\) to be as large as possible; this \(\hat{\gamma}_i\) is called the functional margin of the training example \((\mathbf{x}_i, y_i)\) for a specific set of parameters \(\{\mathbf{w}, b\}\). The functional margin of \(\{\mathbf{w}, b\}\) is then defined as the minimum of these functional margins:

\[\hat{\gamma} = \min_{i=1, ..., N} \hat{\gamma}_i\]

In support vector machines the decision boundary is chosen to be the one for which the functional margin (confidence) is maximized.

Distance to Plane

Let \(\mathbf{x}_i\) be a sample with label \(y_i = 1\); thus, it is on the positive side of the hyperplane \(\mathbf{w}^T \mathbf{x} + b = 0\). Define \(r\) to be the shortest distance between the point \(\mathbf{x}_i\) and the hyperplane. Then \(r\) is the distance between \(\mathbf{x}_i\) and its projection \(\mathbf{x}^{\prime}_i\) onto the hyperplane.

Since \(\mathbf{w}\) is the normal vector orthogonal to the plane, \(\frac{\mathbf{w}}{\|\mathbf{w}\|_2}\) is the unit vector in its direction, and we can write:

\[\mathbf{x}_i - \mathbf{x}^{\prime}_i = r \frac{\mathbf{w}}{\|\mathbf{w}\|_2}\]

Taking the inner product with \(\mathbf{w}\) on both sides and dividing by \(\|\mathbf{w}\|_2\):

\[\implies r = \frac{\mathbf{w}^T(\mathbf{x}_i - \mathbf{x}^{\prime}_i)}{\|\mathbf{w}\|_2}\]

Goal

The concept of the margin is intuitively simple: it is the distance of the separating hyperplane to the closest examples in the dataset, assuming that our dataset is linearly separable. That is:

\[\begin{aligned} &\max \quad && \hat{\gamma}\\ &\;\text{s.t} \quad &&y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq \hat{\gamma} \quad \quad \forall i=1, ...., N \end{aligned}\]

This optimization problem is unbounded, because one can make the functional margin arbitrarily large by simply scaling the parameters by a constant \(c > 1\), i.e. \(\{c\mathbf{w}, cb\}\):

\[y_i (c\mathbf{w}^T \mathbf{x}_i + cb) > y_i (\mathbf{w}^T \mathbf{x}_i + b) = \hat{\gamma}\]

This has no effect on the decision plane because:

\[c\mathbf{w}^T \mathbf{x} + cb = c(\mathbf{w}^T \mathbf{x} + b) = 0 \iff \mathbf{w}^T \mathbf{x} + b = 0\]

Thus, we need to transform the optimization problem so that we maximize the geometric distance between the samples and the decision boundary instead of the functional margin. Suppose we require all functional margins to be at least 1 (which can always be achieved by scaling the parameters by a constant):

\[y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1\]

\[\implies \mathbf{w}^T \mathbf{x}_i + b \geq 1 \quad \text{for } y_i = 1\]

\[\implies \mathbf{w}^T \mathbf{x}_i + b \leq -1 \quad \text{for } y_i = -1\]

Then, for a point \(\mathbf{x}_i\) on \(\mathbf{w}^T \mathbf{x}_i + b = 1\), its projection \(\mathbf{x}^{\prime}_i = \mathbf{x}_i - r\frac{\mathbf{w}}{\|\mathbf{w}\|_2}\) lies on the decision plane; substituting it into the plane equation gives the distance to the decision plane:

\[\mathbf{w}^T \left(\mathbf{x}_i - r \frac{\mathbf{w}}{\|\mathbf{w}\|_2}\right) + b = \mathbf{w}^T \mathbf{x}_i + b - r \frac{\|\mathbf{w}\|^2_2}{\|\mathbf{w}\|_2} = 1 - r\|\mathbf{w}\|_2 = 0\]

\[\implies r = \frac{1}{\|\mathbf{w}\|_2}\]

Then, we can formulate our objective as:

\[\begin{aligned} &\max_{\mathbf{w}, b} \quad && \frac{1}{\|\mathbf{w}\|_2}\\ &\;\text{s.t} \quad &&y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \quad \quad \forall i=1, ...., N \end{aligned}\]

Which is equivalent to:

\[\begin{aligned} &\min_{\mathbf{w}, b} \quad && \frac{1}{2}\|\mathbf{w}\|^2_2\\ &\;\text{s.t} \quad &&y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \quad \quad \forall i=1, ...., N \end{aligned}\]
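
As a quick sanity check, this hard-margin primal can be handed directly to a QP solver. Below is a minimal sketch using cvxopt; the toy data, the variable packing \(z = [\mathbf{w}; b]\), and the tiny ridge on \(b\) are illustrative choices, not part of the derivation.

import numpy as np
from cvxopt import matrix, solvers

# toy data: two separable clusters, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

# variables z = [w_1, ..., w_d, b]
# objective: 1/2 ||w||^2  ->  1/2 z^T P z with P = diag(1, ..., 1, 0)
P = np.zeros((d + 1, d + 1))
P[:d, :d] = np.eye(d)
P[d, d] = 1e-8                      # tiny ridge on b so P stays numerically PSD
q = np.zeros((d + 1, 1))

# constraints y_i (w^T x_i + b) >= 1  ->  -y_i [x_i, 1] z <= -1
G = -y[:, None] * np.hstack([X, np.ones((n, 1))])
h = -np.ones((n, 1))

sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
z = np.array(sol['x']).flatten()
w, b = z[:d], z[d]
print("w =", w, "b =", b, "margin =", 1 / np.linalg.norm(w))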


Soft Margin SVM

Slack Variables

Notice that in the above formulation we have hard constraints on the margins, which do not allow misclassification of points. However, real-world data is rarely linearly separable and there will be outliers in the dataset, so we may wish to allow some examples to be on the wrong side of the hyperplane or to have a margin less than 1.

To resolve this problem, we can introduce slack variables, one for each data point, to relax the hard constraints:

\[y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i\] \[\quad\xi_i \geq 0, \; \; \forall i=1, ...., N\]

To encourage correct classification of the samples, we add the slack variables, weighted by \(C\), to the objective:

\[\begin{aligned} &\min_{\mathbf{w}, b, \boldsymbol{\xi}} \quad && \frac{1}{2}\|\mathbf{w}\|^2_2 + C\sum^{N}_{i=1} \xi_i\\ &\;\text{s.t} \quad &&y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i \quad &&&\forall i=1, \ldots, N\\ & && \xi_i \geq 0, &&&\forall i=1, \ldots, N \end{aligned}\]

Thus, sample points are now permitted to have a margin less than 1, and if an example \(\mathbf{x}_i\) has a slack variable greater than 0, we incur a penalty \(C\xi_i\) in the objective. The parameter \(C\) controls the relative weighting between the twin goals of making \(\|\mathbf{w}\|\) small and of ensuring that most examples have functional margin at least 1.


Dual Problem

Using Lagrange multipliers, we can transform the constrained problem into an unconstrained min-max problem:

\[\max_{\boldsymbol{\alpha} \geq 0, \boldsymbol{\eta} \geq 0}\;\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2_2 + C\sum^{N}_{i=1} \xi_i - \sum^N_{i=1} \alpha_i [y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 + \xi_i] - \sum^{N}_{i=1} \eta_i \xi_i\]

where the inner minimization defines the dual function and the maximization w.r.t. \(\boldsymbol{\alpha}, \boldsymbol{\eta}\) is called the dual problem.

KKT

For an unconstrained convex optimization problem, we know we are at global minimum if the gradient is zero. The KKT conditions are the equivalent conditions for the global minimum of a constrained convex optimization problem. \(\forall i=1, ...., N\):

  1. Stationarity: if strong duality holds and \((\mathbf{w}^*, \boldsymbol{\alpha}^*)\) is optimal, then \(\mathbf{w}^*\) minimizes \(L(\mathbf{w}, \boldsymbol{\alpha}^*)\) (similarly for \(b^*, \boldsymbol{\xi}^*\), whose stationarity conditions appear as constraints in the dual problem):

    \[\nabla_{\mathbf{w}} L(\mathbf{w}^*, \boldsymbol{\alpha}^*) = 0\] \[\implies \mathbf{w}^* = \sum_{i=1}^{N} \alpha^*_i y_i \mathbf{x}_i\]

  2. Complementary Slackness: \[\alpha_i [y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 + \xi_i] = 0\] \[\eta_i\xi_i = 0\]

  3. Primal Feasibility: \[y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 + \xi_i \geq 0\] \[\xi_i \geq 0\]

  4. Dual Feasibility: \[\alpha_i \geq 0, \quad \eta_i \geq 0\]

Solving the Dual Problem

We now solve for the dual function by fixing \(\{\alpha_i, \eta_i\}\) (and using the stationarity conditions):

\[\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2_2 + C\sum^{N}_{i=1} \xi_i - \sum^N_{i=1} \alpha_i [y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 + \xi_i] - \sum^{N}_{i=1} \eta_i \xi_i\]

\[\begin{aligned} & \frac{\partial L(\mathbf{w}, b, \boldsymbol{\xi})}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i = 0 \\ & \implies \mathbf{w}^* = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i \\ & \frac{\partial L(\mathbf{w}, b, \boldsymbol{\xi})}{\partial b} = \sum_{i=1}^{N} \alpha_i y_i = 0 \\ & \implies \sum_{i=1}^{N} \alpha_i y_i = 0 \\ & \frac{\partial L(\mathbf{w}, b, \boldsymbol{\xi})}{\partial \xi_n} = C - \alpha_n - \eta_n = 0 \\ & \implies \alpha_n = C - \eta_n \end{aligned}\]

Substituting back into the Lagrangian, we obtain the dual function:

\[g(\boldsymbol{\alpha}) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{N}\alpha_i \alpha_k {\mathbf{x}_i}^T \mathbf{x}_k y_i y_k + \sum_{i=1}^{N} \alpha_i\]

Then, we have the dual problem:

\[\begin{aligned} & \underset{\boldsymbol{\alpha}}{\text{max}} & & g(\boldsymbol{\alpha}) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{N}\alpha_i \alpha_k {\mathbf{x}_i}^T \mathbf{x}_k y_i y_k + \sum_{i=1}^{N} \alpha_i\\ & \text{subject to} & & 0 \leq \alpha_i \leq C \\ & & & \sum_{i=1}^{N} \alpha_i y_i = 0 \\ \end{aligned}\]

This is a quadratic program, which we can solve with any standard QP solver.

Interpretation

We can conclude:

  1. If \(0 < \alpha_i < C \implies y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1 - \xi_i\). Since \(\alpha_i = C - \eta_i\) with \(\eta_i \geq 0\), we have \(\eta_i > 0\) and thus \(\xi_i = 0 \implies\) the points with \(0 < \alpha_i < C\) lie exactly on the margin.

  2. If \(\alpha_i = C\):

    • \(0 < \xi_i < 1\): the point is inside the margin but on the correct side of the decision boundary
    • \(\xi_i = 1\): the point is on the decision boundary
    • \(\xi_i > 1\): the point is on the wrong side of the decision boundary and is misclassified
  3. If \(\alpha_i = 0\), the point is not a support vector and has no effect on the weights.

After finding the optimal values for \(\boldsymbol{\alpha}\), we obtain optimal \(\mathbf{w}^*\) by solving:

\[\mathbf{w}^* = \sum_{i=1}^{N} \alpha^*_i y_i \mathbf{x}_i\]

We obtain the optimal \(b^*\) by noting that points exactly on the margin have \(0 < \alpha_i < C\). Let \(\mathbf{x}_i\) be one of those points; then \(y_i({\mathbf{w^*}}^T \mathbf{x}_i + b^*) = 1\), and since \(y_i \in \{-1, 1\}\):

\[{\mathbf{w^*}}^T \mathbf{x}_i + b^* = y_i\]

Let \(M\) be the set of all points that lie exactly on the margin and \(N_m = |M|\). A more stable solution is obtained by averaging over all such points:

\[b^* = \frac{1}{N_m} \sum_{i \in M} (y_i - {\mathbf{w^*}}^T\mathbf{x}_i)\]

Kernel Tricks

Implementation

from cvxopt import matrix, solvers
import numpy as np


class SVM:

    def __init__(self, c=1.0, kernel='linear'):
        self.c = c
        self.kernel = kernel
        self.b = None
        self.dual_coef_ = None        # alpha_i * y_i for the support vectors
        self.decision_matrix = None   # the support vectors themselves

    def fit(self, X, y):
        n, d = X.shape
        y = y.reshape(-1, 1).astype(float)
        yyt = np.matmul(y, y.T)

        # Gram matrix K_ij = k(x_i, x_j)
        K = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                K[i, j] = self.apply_kernel(X[i], X[j])

        # dual QP in cvxopt form: min 1/2 a^T P a + q^T a
        #   s.t. G a <= h  (0 <= alpha_i <= C)  and  A a = b  (sum_i alpha_i y_i = 0)
        P = matrix(K * yyt)
        q = matrix(-np.ones((n, 1)))
        G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
        h = matrix(np.vstack([np.zeros((n, 1)), np.full((n, 1), float(self.c))]))
        A = matrix(y.T)
        b = matrix(0.0)
        alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).flatten()
        alpha[alpha < 1e-6] = 0.0     # treat numerically tiny multipliers as zero

        # support vectors have alpha_i > 0; points with 0 < alpha_i < C lie on the margin
        support = alpha > 0
        margin = (alpha > 0) & (alpha < self.c)
        self.decision_matrix = X[support]
        self.dual_coef_ = alpha[support] * y[support, 0]

        # b* = average over margin points of (y_i - sum_j alpha_j y_j k(x_j, x_i))
        self.b = np.mean([
            y[i, 0] - sum(self.dual_coef_[j] * self.apply_kernel(sv, X[i])
                          for j, sv in enumerate(self.decision_matrix))
            for i in np.where(margin)[0]
        ])

        return self

    def apply_kernel(self, x_1, x_2):
        if self.kernel == 'linear':
            return np.dot(x_1, x_2)

    def decision_function(self, X):
        pred_results = []
        for x in X:
            pred = self.b
            for j in range(len(self.decision_matrix)):
                pred += self.dual_coef_[j] * self.apply_kernel(x, self.decision_matrix[j])
            pred_results.append(pred)
        return np.array(pred_results)

    def predict(self, X):
        return np.where(self.decision_function(X) >= 0, 1, -1)

Ref

https://www.ccs.neu.edu/home/vip/teach/MLcourse/6_SVM_kernels/lecture_notes/svm/svm.pdf

http://www.cs.cmu.edu/~guestrin/Class/10701-S06/Slides/svms-s06.pdf

MML book

Lagrangian Duality for Dummies, David Knowles

PRML Chapter 7

Naive Bayes

Suppose our training set consists of data samples \(D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}\), \(\mathbf{x}_i \in \mathbb{R}^d\), which are realizations of a random sample that follows an unknown joint distribution \(P(\mathbf{X}, Y)\).

Assumptions:

  1. Features are conditionally independent given the class (Naive Bayes assumption): \[P(\mathbf{X} | Y) = \prod^{d}_{j=1} P(X_j | Y)\]

  2. MLE assumption: the random sample is independent and identically distributed.

  3. Positional independence: The position of features does not matter (used in Multinomial case).

By applying Bayes' rule (stated in terms of the distribution \(P(\cdot)\) to keep things general), we have:

\[P(Y | \mathbf{X}) = \frac{P(\mathbf{X}, Y)}{P(\mathbf{X})} = \frac{P(\mathbf{X} | Y) P(Y)}{P(\mathbf{X})}\]

By substituting the assumption:

\[P(Y | \mathbf{X}) = \frac{\prod^{d}_{j=1} P(X_j | Y)P(Y)}{P(\mathbf{X})}\]

Since \(P(\mathbf{X} = \mathbf{x})\) is a constant for any given \(\mathbf{x}\) (it does not depend on the class), we can drop it from the equation: it only scales \(P(Y | \mathbf{X})\) by a constant factor:

\[P(Y | \mathbf{X}) \propto P(Y) \prod^{d}_{j=1} P(X_j | Y)\]

Our goal is to find the class \(\hat{y}\) that maximizes this probability given the input \(\mathbf{X} = \mathbf{x}\); taking logarithms for numerical stability:

\[\begin{aligned} \hat{y} = \underset{y}{\arg\max} \sum^{d}_{j=1} \log P_{X_j|Y}(x_j | y) + \log P_{Y}(y) \end{aligned}\]
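
As an illustration, here is a minimal sketch of this arg-max rule assuming Gaussian class-conditional densities for \(P(X_j \mid Y)\); the helper names and the variance-smoothing constant are my own choices, not from the post.

import numpy as np

def fit_gaussian_nb(X, y):
    classes = np.unique(y)
    params = {}
    for c in classes:
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),        # P(Y = c)
            "mean": Xc.mean(axis=0),          # per-feature mean
            "var": Xc.var(axis=0) + 1e-9,     # per-feature variance (smoothed)
        }
    return params

def nb_predict(params, x):
    best_c, best_score = None, -np.inf
    for c, p in params.items():
        # sum_j log P(x_j | y = c) + log P(y = c), with Gaussian P(x_j | y)
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * p["var"])
                                + (x - p["mean"]) ** 2 / p["var"])
        score = log_lik + np.log(p["prior"])
        if score > best_score:
            best_c, best_score = c, score
    return best_c

X = np.array([[1.0, 2.0], [1.2, 1.8], [4.0, 5.0], [4.2, 4.8]])
y = np.array([0, 0, 1, 1])
print(nb_predict(fit_gaussian_nb(X, y), np.array([4.1, 5.1])))   # -> 1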

Logistic Regression

Suppose we have training examples \(D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}\), \(\mathbf{x}_i \in \mathbb{R}^d\); our goal is to make a decision about the class of a new input \(\mathbf{x}\). Logistic regression does this by learning, from the training set, a vector of weights and a bias term (a matrix of weights and a vector of biases in the multiclass case).

Binary-Class

In the binary-class problem, our target \(Y\) takes values in \(\{0, 1\}\). To model the distribution \(P(Y | \mathbf{X}; \; \mathbf{w}, b)\), we apply the sigmoid function to the dot product of the weights and the input, which squashes the output to a value in \([0, 1]\) (one requirement for a probability):

\[z = \mathbf{x}^T \mathbf{w} + b\]

\[y = \sigma(z)\]

To make sure that class random variable \(Y\)'s conditional pmf sums to 1:

\[P(Y=1 | X=\mathbf{x} ;\; \mathbf{w}, b) = \frac{1}{1 + e^{-z}} = p\]

\[P(Y=0 | X=\mathbf{x} ;\; \mathbf{w}, b) = 1 - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}} = 1 - p\]

Then this conditional pmf can be written equivalently as a Bernoulli pmf:

\[p_{Y|\mathbf{X}} (y | \mathbf{x}; \; \mathbf{w}, b) = p^y (1 - p)^{1 - y}\]

If we have the conditional pmf of \(Y\) given \(X= \mathbf{x}\), then we can use a simple decision rule to make decisions:

\[ \hat{y} = \begin{cases} 1, \quad &P(Y=1 | X=\mathbf{x}) > 0.5\\ 0, \quad &P(Y=1 | X=\mathbf{x}) \leq 0.5 \end{cases} \]
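
A minimal numpy sketch of this decision rule, assuming \(\mathbf{w}\) and \(b\) have already been learned (the values below are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # p = P(Y = 1 | X = x; w, b) for each row of X
    return sigmoid(X @ w + b)

def predict(X, w, b):
    # threshold the probability at 0.5
    return (predict_proba(X, w, b) > 0.5).astype(int)

w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[3.0, 0.5], [-1.0, 2.0]])
print(predict_proba(X, w, b), predict(X, w, b))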


Gradient Boosting Decision Trees

Boosting Trees

Regression and classification trees partition the space of all joint predictor variable values into disjoint regions \(R_j, \; j=1, 2, ...., J\) as represented by the terminal nodes of the tree. A constant \(c_j\) is assigned to each such region and the prediction rule is:

\[\mathbf{x} \in R_j \implies f(\mathbf{x}) = c_j\]

Thus, a tree can be formally expressed as:

\[T(\mathbf{x}; \boldsymbol{\theta}) = \sum^{J}_{j=1} c_j I[\mathbf{x} \in R_j]\]

with parameters \(\boldsymbol{\theta} = \{R_j, c_j\}^J_1\). The parameters are found by iteratively solving a minimization problem:

  1. Given \(R_j\), we solve for \(\hat{c}_j\) by simply taking the average response (or the majority class) within the region.
  2. \(R_j\) is found by iterating over all possible pairs of feature and splitting point.

The boosted tree model is a sum of such trees induced in a forward stagewise manner:

\[f_M (\mathbf{x}) = \sum^{M}_{m=1} T(\mathbf{x}; \; \boldsymbol{\theta}_m)\]

At each step, one must solve:

\[\hat{\boldsymbol{\theta}}_m = \underset{\boldsymbol{\theta}_m}{\arg\min} \sum^{N}_{i=1} L(y_i, f_{m-1} (\mathbf{x}_i) + T(\mathbf{x}_i; \; \boldsymbol{\theta}_m))\]

Gradient Boosting

In general, it is hard to take partial derivatives directly w.r.t. the tree's parameters; thus, we take partial derivatives w.r.t. the tree predictions \(f(\mathbf{x}_i)\) instead.

Fast approximate algorithms for solving the above problem with any differentiable loss criterion can be derived by analogy to numerical optimization. We first start with general case. The loss in using \(\mathbf{f}\) to predict \(y\) on the training data is:

\[L(\mathbf{f}) = \sum^{N}_{i=1} L(y_i, f(\mathbf{x}_i))\]

The goal is to minimize \(L(\mathbf{f})\) w.r.t. \(f\), where here \(f(\mathbf{x})\) is constrained to be a sum of trees \(f_M (\mathbf{x})\).

We first start with general case where \(f\) can be any parameters or numbers. In this case, we have \(N\) samples, thus, \(\mathbf{f} = \{f(\mathbf{x}_1), ...., f(\mathbf{x}_N)\} \in \mathbb{R}^N\).

Then the gradient of the objective w.r.t. \(\mathbf{f}\), evaluated at the current model \(\mathbf{f} = \mathbf{f}_{m-1}\), is:

\[\mathbf{g}_m = \nabla_{\mathbf{f}} L(\mathbf{f}) = \left\langle\frac{\partial L(\mathbf{f})}{\partial f(\mathbf{x}_1)}, \ldots, \frac{\partial L(\mathbf{f})}{\partial f(\mathbf{x}_N)}\right\rangle\]

This gradient points in the direction of steepest increase; it tells us how we can change our current predictions to increase the loss. Since we want to minimize the objective, we adjust our current predictions in the direction of steepest decrease:

\[\mathbf{h}_m = - \rho_m \mathbf{g}_m\]

where \(\rho_m\) is the step size for the current model, chosen as the minimizer of:

\[\rho_m = \underset{\rho}{\arg\min} \; L(\mathbf{f}_{m-1} - \rho\mathbf{g}_m)\]

The current solution is then updated as

\[\mathbf{f}_{m} = \mathbf{f}_{m-1} - \rho_m\mathbf{g}_m\]

If fitting the training data (minimizing the loss) were our ultimate goal, the above update rule would solve our problem by adding the negative gradient at each iteration. However, our ultimate goal is to generalize to new data, and memorizing the training data exactly is not what we want. One possible solution is to learn the update \(- \rho_m\mathbf{g}_m\) by fitting a simple decision tree:

\[\hat{\boldsymbol{\theta}}_m = \underset{\boldsymbol{\theta}}{\arg\min} \sum^{N}_{i=1} (-g_{im} - T(x_i; \; \boldsymbol{\theta}))^2\]

That is, we fit a regression tree \(T\) to the negative gradient values.

Algorithm

  1. We first start with a constant model (a model that predicts a constant), which is a single-terminal-node tree.
  2. For all samples, new targets are generated as the negative gradient of the loss function w.r.t. the current model prediction \(\mathbf{f}_{m-1}\).
  3. Fit a regression tree that minimizes the MSE between the new targets (negative gradients) and the tree's predictions.
  4. Add the fitted tree to the current model and repeat from step 2 (a sketch of these steps is given below).
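
Here is a small sketch of these steps for the squared loss, where the negative gradient is simply the residual \(y - f_{m-1}(\mathbf{x})\); the use of sklearn's DecisionTreeRegressor as the base learner and the specific hyperparameters are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=100, learning_rate=0.1, max_depth=2):
    f0 = np.mean(y)                                  # step 1: constant model
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        residual = y - pred                          # step 2: negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)   # step 3
        pred += learning_rate * tree.predict(X)      # step 4: shrunken update
        trees.append(tree)
    return f0, trees

def boosted_predict(X, f0, trees, learning_rate=0.1):
    return f0 + learning_rate * sum(t.predict(X) for t in trees)

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)
f0, trees = gradient_boost(X, y)
print(np.mean((boosted_predict(X, f0, trees) - y) ** 2))   # training MSE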

Discussions

Regularization

For a sufficiently large number of gradient boosting rounds \(M\), the training loss can be made arbitrarily small. However, fitting the data too well can lead to overfitting, which degrades the risk on future predictions.

Shrinkage

Controlling the value of \(M\) is not the only possible regularization strategy, we can add penalty terms to the loss function that penalize large \(M\) or we can weight subsequent trees. The simplest implementation of shrinkage in the context of boosting is to scale the contribution of each tree by a factor of \(0 < v < 1\), when it is added to the current approximation:

\[f_m (\mathbf{x}) = f_{m-1}(\mathbf{x}) + v \sum^{J}_{j=1} c_{jm} I[\mathbf{x} \in R_{jm}]\]

The parameter \(v\) can be regarded as controlling the learning rate of the boosting procedure. Smaller values of \(v\) result in larger training error for the same number of iterations \(M\) but might have better generalization. Thus, both \(v\) and \(M\) control prediction risk on the training data. \(v\) and \(M\) tradeoff each other, therefore in practice, it is best to set \(v\) small and control \(M\) by early stopping.

Subsampling

We know that bootstrap averaging (bagging) improves the performance of a noisy classifier through averaging (reduce variance). We can exploit the same idea in gradient boosting.

At each iteration, we sample a fraction \(\eta\) of the training observations without replacement, and grow the next tree using that subsample. A typical value of \(\eta\) is 0.5, although for large sample size \(N\), we can have smaller \(\eta\).

Ref

https://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf

Decision Trees (CART)

Tree-based methods partition the feature space into a set of rectangles, and then fit a simple model (e.g. a constant) in each one. We focus on CART in this post.

Suppose we have dataset \(D = \{(\mathbf{x}_1, y_1) , ...., (\mathbf{x}_N, y_N) ;\; \mathbf{x}_i \in \mathbb{R}^d\}\). The algorithm needs to automatically decide on the splitting variables and splitting points and also what shape the tree should have.

Regression Trees

In this scenario, our response variable \(Y\) is continuous. Suppose first that we have a partition into \(M\) regions \(R_1, ...., R_M\) and we define the model prediction as:

\[\hat{y}(\mathbf{x}) = \sum^{M}_{m=1} c_m I[\mathbf{x} \in R_m]\]

By minimizing the mean squared loss \(L = \frac{1}{2} \frac{1}{N} \sum^{N}_{i=1} (y_i - \hat{y}_i)^2\), we have:

\[\begin{aligned} \frac{\partial L}{\partial c_m} &= -\frac{1}{N}\sum^{N}_{i=1} \Big(y_i - \sum^{M}_{m^{\prime}=1} c_{m^{\prime}} I[\mathbf{x}_i \in R_{m^{\prime}}]\Big) I[\mathbf{x}_i \in R_m]\\ &= -\frac{1}{N}\Big(\sum^{N}_{i=1} y_i I[\mathbf{x}_i \in R_m] - c_m N_m\Big) = 0\\ \implies \hat{c}_m &= \frac{1}{N_m}\sum^{N}_{i=1} y_i I[\mathbf{x}_i \in R_m] \end{aligned}\]

Thus, the best estimate \(\hat{c}_m\) for each region, w.r.t. mean squared error, is the average of the training responses in that region:

\[\hat{c}_m = \frac{1}{N_m} \sum^{N}_{i=1} y_i I[\mathbf{x}_i \in R_m]\]

where \(N_m = \sum^{N}_{i=1} I[\mathbf{x}_i \in R_m]\) is the total number of training examples in region \(R_m\).
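
A small sketch of the greedy split search this implies for a single feature: each candidate threshold is scored by the summed squared error around the region means. The helper name and toy data are illustrative.

import numpy as np

def best_split(x, y):
    """x: 1-D feature, y: targets. Returns (threshold, sse) of the best split."""
    best = (None, np.inf)
    for s in np.unique(x):
        left, right = y[x <= s], y[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        # c_hat in each region is the region's mean response
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best[1]:
            best = (s, sse)
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))   # splits around x = 3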


Heaps

The binary heap data structure is an array object that we can view as a nearly complete binary tree. Each node of the tree corresponds to an element of the array. The tree is completely filled on all levels except possibly the lowest, which is filled from the left up to a point; as a result, the elements in the subarray \(A[\lfloor \frac{n}{2} \rfloor + 1 .. n]\) are all leaves.

An array \(A\) that represents a heap is an object with two attributes:

  1. \(A.length\): the number of elements in the array, i.e. the valid entries are \(A[1 .. A.length]\).
  2. \(A.heap\_size\): the number of array elements that belong to the heap; only \(A[1 .. A.heap\_size]\), with \(A.heap\_size \leq A.length\), hold valid heap elements.

Given the index \(i\) of a node, we can easily compute the indices of its parent, left child and right child by:

  1. parent: \(\lfloor \frac{i}{2} \rfloor\) (shift right by 1 bit, i >> 1)
  2. left child: \(2i\) (shift left by 1 bit, i << 1)
  3. right child: \(2i + 1\) (shift left by 1 bit and add 1, (i << 1) + 1; the parentheses matter, since + binds tighter than << in most languages)

There are two types of binary heap:

  1. Max-heap: satisfies the max-heap property that for every node \(i\) other than the root, \(A[parent(i)] \geq A[i]\); that is, the maximum value in the array is stored at the root, and the subtree rooted at a node contains values no larger than the value at that node itself (used in heapsort).
  2. Min-heap: satisfies the min-heap property, organized in the opposite way: for every node \(i\) other than the root, \(A[parent(i)] \leq A[i]\) (used in priority queues).


class Heap:
    def __init__(self, A, heap_size):
        # A may not be a heap; call build_min_heap or build_max_heap to
        # convert this instance into a heap
        assert isinstance(A, list)
        self.heap_size = heap_size
        self.length = len(A)
        self._A = A

    def left(self, i):
        return i << 1

    def right(self, i):
        # parentheses are required: + binds tighter than <<
        return (i << 1) + 1

    def parent(self, i):
        return i >> 1

    def __getitem__(self, index):
        return self._A[index]

    def append(self, val):
        self._A.append(val)
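
The build_max_heap / build_min_heap routines referenced in the constructor comment are not shown in this snippet; below is a sketch of max_heapify and build_max_heap under the CLRS convention assumed here (1-based indexing, with A[0] as an unused placeholder). The function names and the direct use of heap._A are my own choices.

def max_heapify(heap, i):
    # float the value at index i down until the max-heap property holds
    l, r = heap.left(i), heap.right(i)
    largest = i
    if l <= heap.heap_size and heap[l] > heap[largest]:
        largest = l
    if r <= heap.heap_size and heap[r] > heap[largest]:
        largest = r
    if largest != i:
        heap._A[i], heap._A[largest] = heap._A[largest], heap._A[i]
        max_heapify(heap, largest)

def build_max_heap(heap):
    # nodes heap_size//2 + 1 .. heap_size are leaves, so start from heap_size//2
    for i in range(heap.heap_size // 2, 0, -1):
        max_heapify(heap, i)

A = [None, 4, 1, 3, 2, 16, 9, 10, 14, 8, 7]   # index 0 unused
h = Heap(A, heap_size=10)
build_max_heap(h)
print(h._A[1:])   # [16, 14, 10, 8, 7, 9, 3, 2, 4, 1]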


Backpropagation In Convolutional Neural Networks

Cross Correlation

Given an input image \(I\) and a filter \(K\) of dimensions \(k_1 \times k_2\), then the cross correlation operation is defined as:

\[(I \otimes K)_{ij} = \sum^{k_2 - 1}_{m=0}\sum^{k_1 - 1}_{n=0} I(i + m, j + n)K(m, n)\]

Convolution

\[(I * K)_{ij} = \sum^{k_2 - 1}_{m=0}\sum^{k_1 - 1}_{n=0} I(i - m, j - n)K(m, n)\]

which is equivalent to cross-correlation with a flipped kernel (i.e. the kernel rotated by 180 degrees):

\[(I * K)_{ij} = \sum^{k_2 - 1}_{m=0}\sum^{k_1 - 1}_{n=0} I(i + m, j + n) \text{ rot}_{180}(K(m, n))\]
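
A quick numpy check of this identity on a toy image, comparing a hand-rolled "valid" cross-correlation against scipy's convolve2d with the kernel rotated by 180 degrees; the helper and toy arrays are illustrative.

import numpy as np
from scipy.signal import convolve2d

def cross_correlate(I, K):
    # "valid" cross-correlation: slide K over I without padding
    k1, k2 = K.shape
    H, W = I.shape[0] - k1 + 1, I.shape[1] - k2 + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(I[i:i + k1, j:j + k2] * K)
    return out

I = np.arange(25, dtype=float).reshape(5, 5)
K = np.array([[1.0, 2.0], [3.0, 4.0]])

# convolution == cross-correlation with the kernel rotated by 180 degrees
conv = convolve2d(I, K, mode='valid')
xcorr_flipped = cross_correlate(I, np.rot90(K, 2))
print(np.allclose(conv, xcorr_flipped))   # True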


Statistics Inference

Sampling and Statistics

In a typical statistical problem, the pmf \(p_X(x)\) or pdf \(f_X(x)\) is unknown. Our ignorance about \(p_X(x)\) or \(f_X(x)\) can roughly be classified in one of two ways:

  1. \(f_X(x)\) or \(p_X(x)\) is completely unknown.
  2. The form of \(f_X(x)\) or \(p_X(x)\) is known down to an unknown parameter vector \(\boldsymbol{\theta}\).

We will focus on the second case. We often describe this problem by saying that the random variable \(X\) has density or mass function of the form \(f_{X}(x; \theta)\) or \(p_X(x;\theta)\), where \(\theta \in \Omega\) for a specified parameter space \(\Omega\). Since \(\theta\) is unknown, we want to estimate it.

In the process, our information about case 1 or 2 comes from a sample on \(X\). The sample observations have the same distribution as \(X\), and we denote them as \(X_1, ...., X_n\) where \(n\) is the sample size. When the sample is actually drawn, we use lower case letters to represent the realization \(x_1, ..., x_n\) of random samples.

Let \(T = T(X_1, \ldots, X_n)\) be a statistic, i.e. a function of the sample. Once the sample is drawn, \(t = T(x_1, \ldots, x_n)\) is called the realization of \(T\).
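
A tiny numerical illustration of this setup, assuming \(X \sim N(\theta, 1)\) and taking the sample mean as the statistic \(T\); the specific numbers are made up.

import numpy as np

theta = 2.5                                        # unknown in practice
rng = np.random.default_rng(0)
x = rng.normal(loc=theta, scale=1.0, size=100)     # realization x_1, ..., x_n of the sample

T = lambda sample: np.mean(sample)                 # a statistic (here, an estimator of theta)
t = T(x)                                           # realization of T on the drawn sample
print(t)                                           # close to theta for large n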


Value Function Approximation (Algorithms)

Approximate Value Iteration

Population Version

Recall that the procedure for VI is:

\[V_{k+1} \leftarrow T^{\pi}V_{k}\] \[V_{k+1} \leftarrow T^{*}V_{k}\]

One way to develop its approximate version is to perform each step only approximately, i.e. find \(V_{k+1} \in \mathbf{F}\) such that:

\[V_{k+1} \approx TV_{k}\]

Where \(T\) can be \(T^*\) or \(T^\pi\).

We start from a \(V_0 \in \mathbf{F}\), and then at each iteration \(k\) of AVI, we solve (\(p\) is often 2):

\[V_{k+1} \leftarrow \underset{V \in \mathbf{F}}{\arg\min} \|V - TV_{k}\|^p_{p, \mu}\]

At each iteration, \(TV_k\) (our target) may not be within \(\mathbf{F}\) anymore even though \(V_{k} \in \mathbf{F}\), so we may have some approximation error at each iteration of AVI. The amount of error depends on how expressive \(\mathbf{F}\) is and how much \(T\) can push a function within \(\mathbf{F}\) outside that space.

Batch Version

The objective of AVI cannot be computed because:

  • \(\mu\) is unknown
  • Environment \(R, P\) is often not available, thus \(TQ_k\) cannot be computed.

If we have samples in the form of \((X, A, R, X^{\prime})\), we can compute the unbiased sample of \(TQ_k\):

\[\hat{T}^{\pi}Q_k = R + \gamma Q_k(X^{\prime}, A^{\prime})\] \[\hat{T}^{*} Q_k = R + \gamma \max_{a^{\prime} \in A} Q_k(X^{\prime}, a^{\prime}) \]

Where \(A^{\prime} \sim \pi(\cdot | X^{\prime})\)


The question is: can we replace \(TQ_k\) with \(\hat{T}Q_k\)?

Given any \(Z = (X, A)\) \[\begin{aligned} E_{\hat{T}Q_k}[|Q(Z) - \hat{T}Q_k (Z)|^2 | Z] &= E_{\hat{T}Q_k}[|Q(Z) - TQ_k (Z) + TQ_k (Z) - \hat{T}Q_k (Z)|^2 | Z]\\ &= E_{\hat{T}Q_k}[|Q(Z) - TQ_k (Z)|^2 | Z] + E_{\hat{T}Q_k}[|TQ_k (Z) - \hat{T}Q_k(Z)|^2 | Z] + 2 E_{\hat{T}Q_k}[(Q(Z) - TQ_k (Z))(TQ_k (Z) - \hat{T}Q_k (Z)) | Z]\\ &= E_{\hat{T}Q_k}[|Q(Z) - TQ_k (Z)|^2 | Z] + E_{\hat{T}Q_k}[|TQ_k (Z) - \hat{T}Q_k(Z)|^2 | Z] \end{aligned}\]

Since \(Z \sim \mu\):

\[\begin{aligned} E_{\mu} [E_{\hat{T}Q_k}[|Q(Z) - TQ_k (Z)|^2 | Z] + E_{\hat{T}Q_k}[|TQ_k (Z) - \hat{T}Q_k(Z)|^2 | Z]] &= E_{\mu, \hat{T}Q_k} [|Q(Z) - TQ_k (Z)|^2] + E_{\mu, \hat{T}Q_k} [|TQ_k (Z) - \hat{T}Q_k(Z)|^2]\\ &= \|Q(Z) - TQ_k (Z)\|^2_{2, \mu} + E_{\mu} [Var[\hat{T}Q_k (Z) | Z]] \end{aligned}\]

Since the expectation of variance term does not depend on \(Q\), the solution to the surrogate objective is the same as the original objective:

\[\underset{Q \in \mathbf{F}}{\arg\min} \; E_{\mu, \hat{T}Q_k}[|Q(Z) - \hat{T}Q_k (Z)|^2] = \underset{Q \in \mathbf{F}}{\arg\min} \; \|Q(Z) - TQ_k (Z)\|^2_{2, \mu}\]

Similar to ERM, we do not know the environment dynamics \(R, P\) and distribution \(\mu\). We can use samples and estimate the expectation:

\[\frac{1}{N} \sum^{N}_{i=1} |Q(X_i, A_i) - \hat{T}Q_k (X_i, A_i)|^2 = \|Q - \hat{T}Q_k\|^2_{2, D_n}\]

This is the basis of DQN.
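
A rough sketch of one such batch AVI step: form the bootstrapped targets \(\hat{T}^{*}Q_k\) from the transitions and regress onto them. The use of a random-forest regressor to stand in for the function space \(\mathbf{F}\), and the helper names, are illustrative assumptions rather than anything prescribed above.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def avi_step(q_k, data, n_actions, gamma=0.99):
    """data: list of (x, a, r, x_next) with x a feature vector, a an int action.
    q_k: callable mapping a concatenated (state, action) vector to a value;
    e.g. q_0 = lambda za: 0.0 for the first iteration."""
    X = np.array([np.append(x, a) for x, a, _, _ in data])          # (state, action) inputs
    targets = np.array([
        r + gamma * max(q_k(np.append(x_next, a2)) for a2 in range(n_actions))
        for _, _, r, x_next in data                                  # T*_hat Q_k samples
    ])
    reg = RandomForestRegressor(n_estimators=50).fit(X, targets)     # argmin over F
    return lambda za: reg.predict(np.atleast_2d(za))[0]              # Q_{k+1}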

Bellman Residual Minimization

Population Version

Recall that:

\[V = T^{\pi}V \implies V = V^{\pi}\]

Under FA, we may not achieve this exact equality, instead:

\[V \approx V^{\pi}\]

Thus, we can formulate our objective:

\[V = \underset{V \in \mathbf{F}}{\arg\min} \|V - T^{\pi}V\|^p_{p, \mu} = \underset{V \in \mathbf{F}}{\arg\min} \|BR(V)\|^p_{p, \mu}\]

where \(BR(V) = V - T^{\pi}V\) is the Bellman Residual.

By minimizing this objective (\(p\) is usually 2), we have Bellman Residual Minimization.

This procedure is different from AVI in that we do not mimic the iterative process of VI (which is convergent in the exact case without any FA), but instead directly go for the solution of fixed-point equation (\(V - T^{\pi}V\) instead of \(V - T^{\pi}V_k\)).

If there exists a \(V \in \mathbf{F}\) that makes \(\|V - T^{\pi}V\|^2_{2, \mu} = 0\) and if we assume \(\mu(x) > 0, \; \forall x \in \chi\), we can conclude that \(V(x) = V^{\pi}(x), \; \forall x \in \chi\) so \(V = V^{\pi}\).

Batch Version

Similar to AVI, we may want to replace \(TV\) or \(TQ\) by \(\hat{T}Q\). Thus, our empirical objective is:

\[Q = \underset{Q \in \mathbf{F}}{\arg\min} \frac{1}{N} \sum^N_{i=1} |Q(X_i, A_i) - \hat{T}^{\pi} Q (X_i, A_i)|^2 = \|Q - \hat{T}^{\pi}Q\|^2_{2, D_n}\]

Using \(D_n = \{(X_i, A_i, R_i, X^{\prime}_i)\}^N_{i=1}\)

We can see that \(Q\) appears in both \(\hat{T}^{\pi}Q\) and \(Q\), which is different from AVI and ERM. This causes an issue: the minimizer of \(\|Q - T^{\pi}Q\|^2_{2, \mu}\) and \(\|Q - \hat{T}^{\pi}Q\|^2_{2, \mu}\) are not necessarily the same for stochastic dynamics.


To see this, for any \(Q\) and \(Z = (X, A)\) we compute:

\[E_{\hat{T}^{\pi}Q} [|Q(Z) - \hat{T}Q(Z)|^2 | Z]\]

Then:

\[\begin{aligned} E_{\hat{T}^{\pi}Q}[|Q(Z) - \hat{T}^{\pi}Q (Z)|^2 | Z] &= E_{\hat{T}^{\pi}Q}[|Q(Z) - T^{\pi}Q (Z) + T^{\pi}Q (Z) - \hat{T}^{\pi}Q (Z)|^2 | Z]\\ &= E_{\hat{T}^{\pi}Q}[|Q(Z) - T^{\pi}Q (Z)|^2 | Z] + E_{\hat{T}^{\pi}Q}[|T^{\pi}Q (Z) - \hat{T}^{\pi}Q(Z)|^2 | Z] + 2 E_{\hat{T}^{\pi}Q}[(Q(Z) - T^{\pi}Q (Z))(T^{\pi}Q (Z) - \hat{T}^{\pi}Q (Z)) | Z]\\ &= E_{\hat{T}^{\pi}Q}[|Q(Z) - T^{\pi}Q (Z)|^2 | Z] + E_{\hat{T}^{\pi}Q}[|T^{\pi}Q (Z) - \hat{T}^{\pi}Q(Z)|^2 | Z] \end{aligned}\]

Since \(Z \sim \mu\), the first term is:

\[E_{\hat{T}^{\pi}Q, \mu}[|Q(Z) - T^{\pi}Q (Z)|^2] = \|Q - T^{\pi}Q\|^2_{2, \mu}\]

The second term is:

\[E_{\mu}[E_{\hat{T}^{\pi}Q}[|T^{\pi}Q (Z) - \hat{T}^{\pi}Q(Z)|^2 | Z]] = E_{\mu}[Var[\hat{T}^{\pi}Q(Z) | Z]]\]

We can see that the variance term \(Var[\hat{T}^{\pi}Q(Z) \mid Z]\) depends on \(Q\); as we minimize the objective w.r.t. \(Q\) in stochastic dynamical systems (for deterministic ones, it is zero), we also minimize \(E_{\mu}[Var[\hat{T}^{\pi}Q(Z) \mid Z]]\), so the minimizer of the batch-version objective is not the same as that of the population version for BRM in stochastic dynamics.

Projected Bellman Error

From BRM, we know that even though \(V \in \mathbf{F}\), \(T^{\pi} V\) may not be in \(\mathbf{F}\). Thus, a good approximator \(V \in \mathbf{F}\) should have distance to \(T^{\pi}V\) small. Thus, we want to find \(V \in \mathbf{F}\) such that \(V\) is the projection of \(T^{\pi}V\) onto the space \(\mathbf{F}\).

We want to find a \(V \in \mathbf{F}\) such that:

\[V = \prod_{\mathbf{F}, \mu} T^{\pi}V\]

Where \(\prod_{\mathbf{F}, \mu}\) is the projection operator.

The projection operator \(\prod_{\mathbf{F}, \mu}\) is a linear operator (for a linear subspace \(\mathbf{F}\)) that takes \(V \in B(\chi)\) and maps it to the closest point in \(\mathbf{F}\), measured according to the \(L_{2}(\mu)\) norm:

\[\prod_{\mathbf{F}, \mu} V \triangleq \underset{V^{\prime} \in \mathbf{F}}{\arg\min} \|V^{\prime} - V\|^2_{2, \mu}\]

It has some properties:

  1. The projection belongs to function space \(\mathbf{F}\): \[\prod_{\mathbf{F}, \mu} V \in \mathbf{F}\]
  2. If \(V \in \mathbf{F}\), the projection is itself: \[V \in \mathbf{F} \implies \prod_{\mathbf{F}, \mu} V = V\]
  3. The projection operator onto a subspace is a non-expansion: \[\|\prod_{\mathbf{F}, \mu} V_1 - \prod_{\mathbf{F}, \mu} V_2\|^2_{2, \mu} \leq \|V_1 - V_2\|^2_{2, \mu}\]

Population Version

We can define a loss function based on \(V = \prod_{\mathbf{F}, \mu} T^{\pi}V\):

\[\|V - \prod_{\mathbf{F}, \mu} T^{\pi}V\|^2_{2, \mu}\]

This is called Projected Bellman Error or Mean Square Projected Bellman Error.

We find the value function by solving the following optimization problem:

\[V = \underset{V \in \mathbf{F}}{\arg\min} \|V - \prod_{\mathbf{F}, \mu} T^{\pi}V\|^2_{2, \mu}\]

Since \(V \in \mathbf{F}\), we have \(\prod_{\mathbf{F}, \mu} V = V\), and the projection operator is linear, so:

\[\begin{aligned} V - \prod_{\mathbf{F}, \mu} T^{\pi}V &= \prod_{\mathbf{F}, \mu} V - \prod_{\mathbf{F}, \mu} T^{\pi}V\\ &= \prod_{\mathbf{F}, \mu} (V - T^{\pi}V)\\ &= - \prod_{\mathbf{F}, \mu} (T^{\pi}V - V)\\ &= - \prod_{\mathbf{F}, \mu} (BR(V)) \end{aligned}\]

So the objective can be rewritten as:

\[V = \underset{V \in \mathbf{F}}{\arg\min} \|\prod_{\mathbf{F}, \mu} BR(V)\|^2_{2, \mu}\]

Which is the norm of the projection of the Bellman Residual onto \(\mathbf{F}\).


We can think of solving the projected bellman error objective as solving the two coupled optimization problems:

  1. Find the projection point given value function \(V^{\prime}\): \[V^{\prime\prime} = \underset{V^{\prime\prime} \in \mathbf{F}}{\arg\min} \|V^{\prime\prime} - T^{\pi}V^{\prime}\|^2_{2, \mu}\]
  2. Find the value function \(V^{\prime}\) on \(\mathbf{F}\) that is closest to the projection point \(V^{\prime\prime}\): \[V^{\prime} = \underset{V^{\prime} \in \mathbf{F}}{\arg\min} \|V^{\prime} - V^{\prime\prime}\|^2_{2, \mu}\]

Least Square Temporal Difference Learning (Population Version)

If \(\mathbf{F}\) is a linear FA with basis functions \(\phi_1, ...., \phi_p\):

\[\mathbf{F}: \{x \rightarrow \boldsymbol{\phi}(x)^T \mathbf{w}; \; \mathbf{w} \in \mathbb{R}^p\}\]

We can find a direct solution to the PBE objective, that is we want to find \(V \in \mathbf{F}\) s.t:

\[V = \prod_{\mathbf{F}, \mu} T^{\pi}V\]

We assume that:

  • \(\chi\) is finite with \(|\chi| = N\) and \(N \gg p\); each \(\boldsymbol{\phi}_i = [\phi_i(x_1), \ldots, \phi_i(x_N)]^T\) is an \(N\)-dimensional vector.

Then, we define \(\Phi_{N\times p}\) as the matrix of concatenating all features:

\[\Phi = [\boldsymbol{\phi}_1 .... \boldsymbol{\phi}_p]_{N\times p}\]

The value function \(V \in \mathbf{F}\) is then:

\[V_{N \times 1} = \Phi_{N\times p} \mathbf{w}_{p \times 1}\]

Our goal is to find a weight vector \(\mathbf{w} \in \mathbb{R}^p\) such that:

\[V = \prod_{\mathbf{F}, \mu} T^{\pi}V \implies \Phi\mathbf{w} = \prod_{\mathbf{F}, \mu} T^{\pi}(\Phi \mathbf{w})\]

Let \(M = \text{diag}(\mu)\), since:

\[\|V\|^2_{2, \mu} = <V, V>_{\mu} = \sum_{x \in \chi} |V(x)|^2\mu(x) = V^TMV\]

\[<V_1, V_2>_{\mu} = \sum_{x \in \chi} V_1(x)V_2(x)\mu(x) = V^T_1 M V_2\]

Then, the projection operator onto the linear \(\mathbf{F}\) would be:

\[\begin{aligned} \prod_{\mathbf{F}, \mu} V &= \underset{V^{\prime} \in \mathbf{F}}{\arg\min}\|V^{\prime} - V\|^2_{2, \mu}\\ &= \underset{\mathbf{w} \in \mathbb{R}^p}{\arg\min}\|\Phi\mathbf{w} - V\|^2_{2, \mu}\\ &= \underset{\mathbf{w} \in \mathbb{R}^p}{\arg\min} (\Phi\mathbf{w} - V)^T M (\Phi\mathbf{w} - V) \end{aligned}\]

By taking the derivative of the projection objective and setting it to zero (assuming that \(\Phi^TM\Phi\) is invertible):

\[\frac{\partial}{\partial \mathbf{w}} (\Phi\mathbf{w} - V)^T M (\Phi\mathbf{w} - V) = 2\Phi^TM(\Phi\mathbf{w} - V) = 0\] \[\implies \mathbf{w^{*}} = (\Phi^TM\Phi)^{-1}\Phi^TMV\]

Then the projection is:

\[\prod_{\mathbf{F}, \mu} V = \Phi \mathbf{w}^* = \Phi (\Phi^TM\Phi)^{-1}\Phi^TMV\]

Since \((T^{\pi} V)(x) = r^{\pi} (x) + \gamma \sum_{x^{\prime} \in \chi} P^{\pi}(x^{\prime} \mid x) V(x^{\prime})\), we can write it in vector form for all states:

\[(T^{\pi}V)_{N\times 1} = \mathbf{r}^{\pi}_{N \times 1} + \gamma P^{\pi}_{N\times N} V_{N \times 1}\] \[\implies (T^{\pi}\Phi\mathbf{w})_{N\times 1} = \mathbf{r}^{\pi}_{N \times 1} + \gamma P^{\pi}_{N\times N} \Phi_{N\times p}\mathbf{w}_{p \times 1}\]

Substitute the equation of \(T^{\pi}\Phi\mathbf{w}\) into \(V\) in the projection equation:

\[\prod_{\mathbf{F}, \mu} T^{\pi}\Phi\mathbf{w} = \Phi \mathbf{w} = \Phi (\Phi^TM\Phi)^{-1}\Phi^TM(T^{\pi}\Phi\mathbf{w})\] \[\implies \Phi \mathbf{w} = [\Phi (\Phi^TM\Phi)^{-1}\Phi^TM][\mathbf{r}^{\pi} + \gamma P^{\pi} \Phi\mathbf{w}]\]

Multiplying both sides by \(\Phi^T M\) and simplifying:

\[\Phi^T M \Phi \mathbf{w} = [\Phi^T M\Phi (\Phi^TM\Phi)^{-1}\Phi^TM][\mathbf{r}^{\pi} + \gamma P^{\pi} \Phi\mathbf{w}]\] \[\implies \Phi^T M \Phi \mathbf{w} = \Phi^TM[\mathbf{r}^{\pi} + \gamma P^{\pi} \Phi\mathbf{w}]\] \[\implies \Phi^T M \Phi \mathbf{w} - \Phi^TM[\mathbf{r}^{\pi} + \gamma P^{\pi} \Phi\mathbf{w}] = 0\] \[\implies \Phi^T M[\mathbf{r}^{\pi} + \gamma P^{\pi} \Phi\mathbf{w} - \Phi \mathbf{w}] = 0\] \[\implies \mathbf{w}_{TD} = [\Phi^T M (\Phi - \gamma P^{\pi}\Phi)]^{-1}\Phi^T M \mathbf{r}^{\pi}\]

The solution \(\mathbf{w}_{TD}\) is called Least Square Temporal Difference Method (LSTD).

Notice that the term \(\mathbf{r}^{\pi} + \gamma P^{\pi} \Phi\mathbf{w} - \Phi \mathbf{w} = T^{\pi} V - V = BR(V)\). Hence:

\[\Phi^TM (BR(V)) = 0\] \[\implies <\boldsymbol{\phi}_i, BR(V)>_{\mu} = 0, \quad \forall i = 1, \ldots, p\]

Thus, LSTD finds a \(\mathbf{w}\) such that the Bellman Residual is orthogonal to the basis of \(\mathbf{F}\), which is exactly what we want.

Batch Version

We have showed that the solution to \(V = \prod_{\mathbf{F}, \mu} T^{\pi}V\) with linear FA \(V = \Phi\mathbf{w}\) is:

\[\mathbf{w}_{TD} = A^{-1}_{p\times p} \mathbf{b}_{p \times 1}\]

Where:

  • \(A = \Phi^T M (\Phi - \gamma P^{\pi}\Phi)\)
  • \(\mathbf{b} = \Phi^T M \mathbf{r}^{\pi}\)

Expanding \(A\) and \(\mathbf{b}\) in terms of summations, we have:

\[A_{ij} = \sum^{N}_{n=1} \Phi^T_{in} \mu(n) (\Phi - \gamma P^{\pi}\Phi)_{nj}\] \[b_i = \sum^N_{n=1} \Phi^T_{in} \mu(n) r^{\pi}_n\]

Thus, in terms of current state \(x\) and next state \(x^{\prime}\):

\[A = \sum_{x \in \chi} \mu (x) \boldsymbol{\phi} (x) [\boldsymbol{\phi}(x) - \gamma \sum_{x^{\prime} \in \chi} P^{\pi}(x^{\prime} | x)\boldsymbol{\phi}(x^{\prime})]^T = E_{\mu}[\boldsymbol{\phi}(X) [\boldsymbol{\phi}(X) - \gamma E_{P^{\pi}}[\boldsymbol{\phi} (X^{\prime})]]^T]\]

\[\mathbf{b} = \sum_{x \in \chi} \mu(x) \boldsymbol{\phi} (x) r^{\pi}(x) = E_{\mu} [\boldsymbol{\phi}(X)r^{\pi}(X)]\]

Given a data set \(D_n = \{(X_i, R_i, X^{\prime}_i)\}^{M}_{i=1}\) with \(X_i \sim \mu\), \(X^{\prime}_i \sim P^{\pi}(\cdot | X_i)\) and \(R_i \sim R^{\pi}(\cdot | X_i)\), we define the empirical estimators \(\hat{A}_n, \hat{ \mathbf{b}}_n\) as:

\[\hat{A}_n = \frac{1}{M} \sum^{M}_{i=1} \boldsymbol{\phi}(X_i) [\boldsymbol{\phi}(X_i) - \gamma \boldsymbol{\phi} (X^{\prime}_i)]^T\] \[\hat{\mathbf{b}}_n = \frac{1}{M} \sum^{M}_{i=1} \boldsymbol{\phi}(X_i)R_i\]
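
A minimal numpy sketch of these empirical estimators and the resulting \(\mathbf{w}_{TD}\); the small ridge term guarding against a singular \(\hat{A}_n\) for finite samples is an extra safeguard, not part of the derivation.

import numpy as np

def lstd(phis, rewards, next_phis, gamma=0.99, ridge=1e-6):
    """phis, next_phis: (M, p) feature matrices; rewards: (M,)."""
    M, p = phis.shape
    A_hat = np.zeros((p, p))
    b_hat = np.zeros(p)
    for phi, r, phi_next in zip(phis, rewards, next_phis):
        A_hat += np.outer(phi, phi - gamma * phi_next) / M   # A_hat update
        b_hat += phi * r / M                                  # b_hat update
    # solve A_hat w = b_hat (with a small ridge in case A_hat is singular)
    return np.linalg.solve(A_hat + ridge * np.eye(p), b_hat)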


We can use LSTD to define an approximate policy iteration procedure to obtain a close-to-optimal policy (LSPI).

Semi-Gradient TD

Suppose that we know the true value function \(V^{\pi}\) and we want to find an approximation \(\hat{V}\), parameterized by \(\mathbf{w}\). The minimizer of the population loss is:

\[\hat{V} = \underset{\hat{V} \in \mathbf{F}}{\arg\min} \frac{1}{2}\|V^{\pi} - \hat{V}\|^2_{2, \mu}\]

Using samples \(X_t \sim \mu\), we can define an SGD procedure that updates \(\mathbf{w}_t\) as follows:

\[\begin{aligned} \mathbf{w}_{t+1} &\leftarrow \mathbf{w}_t - \alpha_t \nabla_{\mathbf{w}_t} [\frac{1}{2} |V^{\pi} (X_t) - \hat{V}(X_t; \mathbf{w}_t)|^2]\\ &= \mathbf{w}_t + \alpha_t (V^{\pi} (X_t) - \hat{V}(X_t; \mathbf{w}_t))\nabla_{\mathbf{w}_t} \hat{V}(X_t; \mathbf{w}_t) \end{aligned}\]

If we select a proper step size \(\alpha_t\), then SGD converges to a stationary point.

When we do not know \(V^{\pi}\), we can use bootstrapped estimate (TD estimate \(\hat{T}^{\pi}V_t\)) instead:

\[\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \alpha_t (\hat{T}^{\pi}V_t - \hat{V}(X_t; \mathbf{w}_t))\nabla_{\mathbf{w}_t} \hat{V}(X_t; \mathbf{w}_t)\] \[\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \alpha_t (R_t + \gamma\hat{V}(X^{\prime}_t; \mathbf{w}_t) - \hat{V}(X_t; \mathbf{w}_t))\nabla_{\mathbf{w}_t} \hat{V}(X_t; \mathbf{w}_t)\]

The substitution of \(V^{\pi}(X_t)\) with \(\hat{T}^{\pi}V_t (X_t)\) does not follow from the SGD of any loss function. The TD update is not a true SGD update; that is why we call it a semi-gradient update.
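
A small sketch of this semi-gradient TD(0) update with linear features \(\hat{V}(x; \mathbf{w}) = \boldsymbol{\phi}(x)^T \mathbf{w}\), so that \(\nabla_{\mathbf{w}} \hat{V} = \boldsymbol{\phi}(x)\); the random features and constant reward are placeholders.

import numpy as np

def td0_update(w, phi_x, r, phi_x_next, alpha=0.05, gamma=0.99):
    td_target = r + gamma * phi_x_next @ w        # bootstrapped estimate T_hat^pi V
    td_error = td_target - phi_x @ w
    return w + alpha * td_error * phi_x           # gradient taken only through V_hat(X_t; w)

w = np.zeros(4)
rng = np.random.default_rng(0)
for _ in range(1000):
    phi_x, phi_x_next = rng.random(4), rng.random(4)
    w = td0_update(w, phi_x, r=1.0, phi_x_next=phi_x_next)
print(w)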

Problem bank: [94, 144, 145, 226, 110, 105, 107, 543, 113, 116, 129, 108, 617, 404, 222, 'jz40', 912, 1046, 451, 703, 206, 2, 21, 19, 692, 24, 83, 82, 876, 328, 143, 86, 61, 92, 160]

Trees

94 Easy

# Definition for a binary tree node.
# class TreeNode:
#     def __init__(self, val=0, left=None, right=None):
#         self.val = val
#         self.left = left
#         self.right = right
class Solution:
    def inorderTraversal(self, root: TreeNode) -> List[int]:
        out = []

        def inorder(root):
            if not root:
                return

            inorder(root.left)
            out.append(root.val)
            inorder(root.right)

        inorder(root)
        return out

144 Easy

# Definition for a binary tree node.
# class TreeNode:
#     def __init__(self, val=0, left=None, right=None):
#         self.val = val
#         self.left = left
#         self.right = right
class Solution:
    def preorderTraversal(self, root: TreeNode) -> List[int]:
        out = []

        def preorder(node):
            if not node:
                return

            out.append(node.val)
            preorder(node.left)
            preorder(node.right)

        preorder(root)
        return out