
Support Vector Machine

Given a dataset $D=\{(x_1,y_1),\dots,(x_N,y_N)\}$ with $x_i\in\mathbb{R}^m$ and $y_i\in\{-1,1\}$, let $w^Tx+b=0$ be a hyperplane. We classify a new point $x_i$ by

$$\hat{y}(x_i)=\begin{cases}1, & w^Tx_i+b\ge 0\\ -1, & w^Tx_i+b<0\end{cases}$$

Suppose that our data is linearly separable. Then there exists at least one combination of parameters $\{w,b\}$ such that:

$$\hat{\gamma}_i=y_i(w^Tx_i+b)>0,\quad i=1,\dots,N$$

When $y_i=1$, a confident classifier would have $w^Tx_i+b$ as large as possible; on the other hand, when $y_i=-1$, a confident classifier would have $w^Tx_i+b$ as negative as possible. Thus, we want $\hat{\gamma}_i$ to be as large as possible. This $\hat{\gamma}_i$ is called the functional margin associated with training example $(x_i,y_i)$ for a specific set of parameters $\{w,b\}$, and the functional margin of $\{w,b\}$ is defined as the minimum of these functional margins:

$$\hat{\gamma}=\min_{i=1,\dots,N}\hat{\gamma}_i$$

In support vector machines the decision boundary is chosen to be the one for which the functional margin (confidence) is maximized.

Distance to Plane

Let $x_i$ be a sample with label $y_i=1$, so it lies on the positive side of the hyperplane $w^Tx+b=0$. Define $r$ as the shortest distance between the point $x_i$ and the hyperplane; that is, $r$ is the distance between $x_i$ and its projection $x_i^{\perp}$ onto the hyperplane.

Since $w$ is the normal vector orthogonal to the plane, $\frac{w}{\|w\|_2}$ is the unit vector in its direction, and we can characterize $r$ by:

$$x_i-x_i^{\perp}=r\frac{w}{\|w\|_2}$$

$$r=\frac{w^T(x_i-x_i^{\perp})}{\|w\|_2}$$

Goal

The concept of the margin is intuitively simple: it is the distance of the separating hyperplane to the closest examples in the dataset, assuming that our dataset is linearly separable. That is:

$$\max_{w,b}\ \hat{\gamma}\quad\text{s.t.}\quad y_i(w^Tx_i+b)\ge\hat{\gamma},\quad i=1,\dots,N$$

This optimization problem is unbounded, because one can make the functional margin arbitrarily large by simply scaling the parameters by a constant $c>1$, i.e. $\{cw,cb\}$:

$$y_i(cw^Tx_i+cb)>y_i(w^Tx_i+b)=\hat{\gamma}$$

This has no effect on the decision plane because:

$$cw^Tx_i+cb=c(w^Tx_i+b)=0\iff w^Tx_i+b=0$$

Thus, we need to transform the optimization problem to maximize the distance between the samples and the decision boundary instead of maximizing the functional margin. Suppose we require all functional margins to be at least 1 (easily achieved by scaling the parameters by a constant):

$$y_i(w^Tx_i+b)\ge 1$$

$$w^Tx_i+b\ge 1\quad\text{when } y_i=1$$

$$w^Tx_i+b\le -1\quad\text{when } y_i=-1$$

Then, for a point $x_i$ on the margin $w^Tx_i+b=1$, its distance $r$ to the decision plane satisfies:

$$w^Tx_i+b-r\frac{\|w\|_2^2}{\|w\|_2}=0$$

$$r=\frac{1}{\|w\|_2}$$

Then, we can formulate our objective as:

$$\max_{w,b}\ \frac{1}{\|w\|_2}\quad\text{s.t.}\quad y_i(w^Tx_i+b)\ge 1,\quad i=1,\dots,N$$

Which is equivalent to:

$$\min_{w,b}\ \frac{1}{2}\|w\|_2^2\quad\text{s.t.}\quad y_i(w^Tx_i+b)\ge 1,\quad i=1,\dots,N$$


Soft Margin SVM

Slack Variables

Notice that the above formulation places hard constraints on the margins, which do not allow misclassification of points. In the real world, however, data points are rarely linearly separable and datasets contain outliers, so we may wish to allow some examples to be on the wrong side of the hyperplane or to have margin less than 1.

To resolve this problem, we can introduce slack variables, one for each data point, to relax the hard constraints:

$$y_i(w^Tx_i+b)\ge 1-\xi_i,\qquad \xi_i\ge 0,\quad i=1,\dots,N$$

To still encourage correct classification of the samples, we add $\xi_i$ to the objective:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|_2^2+C\sum_{i=1}^N\xi_i\quad\text{s.t.}\quad y_i(w^Tx_i+b)\ge 1-\xi_i,\ \ \xi_i\ge 0,\quad i=1,\dots,N$$

Thus, sample points are now permitted to have margin less than 1, and if an example $x_i$ has a slack variable greater than 0, we pay a penalty $C\xi_i$ in the objective. The parameter $C$ controls the relative weighting between the twin goals of keeping $\|w\|$ small and ensuring that most examples have functional margin at least 1.


Dual Problem

Using Lagrange multipliers, we can transform the constrained problem into an unconstrained saddle-point problem (whose outer objective is concave):

$$\max_{\alpha\ge 0,\eta\ge 0}\ \min_{w,b,\xi}\ \frac{1}{2}\|w\|_2^2+C\sum_{i=1}^N\xi_i-\sum_{i=1}^N\alpha_i\big[y_i(w^Tx_i+b)-1+\xi_i\big]-\sum_{i=1}^N\eta_i\xi_i$$

where the inner minimization is the dual function and the outer maximization w.r.t. $\alpha,\eta$ is called the dual problem.

KKT

For an unconstrained convex optimization problem, we know we are at the global minimum if the gradient is zero. The KKT conditions are the equivalent conditions for the global minimum of a constrained convex optimization problem. For $i=1,\dots,N$:

  1. Stationarity: if strong duality holds and $(w,\alpha)$ is optimal, then $w$ minimizes $L(w,\alpha)$ (the same holds for $b,\xi$):

    $$\nabla_wL(w,\alpha)=0\ \Rightarrow\ w=\sum_{i=1}^N\alpha_iy_ix_i$$

  2. Complementary slackness: $\alpha_i\big[y_i(w^Tx_i+b)-1+\xi_i\big]=0$ and $\eta_i\xi_i=0$

  3. Primal feasibility: $y_i(w^Tx_i+b)-1+\xi_i\ge 0$ and $\xi_i\ge 0$

  4. Dual feasibility: $\alpha_i\ge 0$, $\eta_i\ge 0$

Solving Dual Problem

We now solve for the dual function for fixed $\{\alpha_i,\eta_i\}$ (using the stationarity conditions):

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|_2^2+C\sum_{i=1}^N\xi_i-\sum_{i=1}^N\alpha_i\big[y_i(w^Tx_i+b)-1+\xi_i\big]-\sum_{i=1}^N\eta_i\xi_i$$

$$\frac{\partial L(w,b,\xi)}{\partial w}=w-\sum_{i=1}^N\alpha_iy_ix_i=0\ \Rightarrow\ w=\sum_{i=1}^N\alpha_iy_ix_i$$

$$\frac{\partial L(w,b,\xi)}{\partial b}=-\sum_{i=1}^N\alpha_iy_i=0\ \Rightarrow\ \sum_{i=1}^N\alpha_iy_i=0$$

$$\frac{\partial L(w,b,\xi)}{\partial \xi_i}=C-\alpha_i-\eta_i=0\ \Rightarrow\ \alpha_i=C-\eta_i$$

Substituting back into the Lagrangian, we obtain the dual function:

$$g(\alpha)=-\frac{1}{2}\sum_{i=1}^N\sum_{k=1}^N\alpha_i\alpha_ky_iy_kx_i^Tx_k+\sum_{i=1}^N\alpha_i$$

Then, we have the dual problem:

$$\max_\alpha\ g(\alpha)=-\frac{1}{2}\sum_{i=1}^N\sum_{k=1}^N\alpha_i\alpha_ky_iy_kx_i^Tx_k+\sum_{i=1}^N\alpha_i\quad\text{subject to}\quad 0\le\alpha_i\le C,\ \ \sum_{i=1}^N\alpha_iy_i=0$$

This is a quadratic program, which we can solve with any standard QP solver.

Interpretation

We could conclude:

  1. If $0<\alpha_i<C$, then $y_i(w^Tx_i+b)=1-\xi_i$. Since $\alpha_i=C-\eta_i$ implies $\eta_i>0$, complementary slackness gives $\xi_i=0$, so the points with $0<\alpha_i<C$ lie exactly on the margin.

  2. If $\alpha_i=C$:

    • $0<\xi_i<1$: the point is inside the margin, on the correct side;
    • $\xi_i=1$: the point is on the decision boundary;
    • $\xi_i>1$: the point is on the wrong side of the margin and is misclassified.
  3. If $\alpha_i=0$, the point is not a support vector and has no effect on the weights.

After finding the optimal values of $\alpha$, we obtain the optimal $w$ from:

$$w=\sum_{i=1}^N\alpha_iy_ix_i$$

We obtain the optimal $b$ by noting that points on the margin have $0<\alpha_i<C$. Let $x_i$ be one of those points; then:

$$w^Tx_i+b=y_i$$

Let $M$ be the set of all points lying exactly on the margin and $N_m=|M|$; a more stable solution is obtained by averaging over all such points:

$$b=\frac{1}{N_m}\sum_{i\in M}(y_i-w^Tx_i)$$

Kernel Tricks

Implementation

from cvxopt import matrix, solvers
import numpy as np


class SVM:

    def __init__(self, c=1, kernel='linear'):
        self.c = c
        self.kernel = kernel
        self.b = None
        self.dual_coef_ = None
        self.decision_matrix = None

    def fit(self, X, y):
        n, d = X.shape
        y = y.reshape(-1, 1)
        yyt = np.matmul(y, y.T)

        # Dual QP in cvxopt standard form: min (1/2) a^T P a + q^T a
        # s.t. G a <= h (box constraints 0 <= a_i <= C) and A a = 0.
        P = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                P[i][j] = self.apply_kernel(X[i], X[j])
        P = matrix(P * yyt)
        q = matrix(-np.ones((n, 1)))
        G = matrix(np.vstack([-np.eye(n), np.eye(n)]), tc='d')
        h = matrix(np.vstack([np.zeros((n, 1)),
                              np.full((n, 1), float(self.c))]), tc='d')
        a = matrix(y.T, tc='d')
        b = matrix([0.0])

        alpha = np.array(solvers.qp(P, q, G, h, a, b)['x'])
        # Zero out numerically tiny multipliers.
        alpha[alpha < np.mean(alpha) * 0.1] = 0

        temp_x = np.column_stack([X, alpha, y])
        sv = temp_x[temp_x[:, -2] > 0]   # support vectors: alpha_i > 0
        m = sv[sv[:, -2] < self.c]       # on-margin points: 0 < alpha_i < C
        self.decision_matrix = sv[:, :-2]
        self.dual_coef_ = sv[:, -1] * sv[:, -2]

        # b averaged over on-margin points: y_i - sum_j alpha_j y_j K(x_i, x_j)
        self.b = 0
        for i in range(len(m)):
            self.b += m[i, -1]
            for j in range(len(sv)):
                self.b -= sv[j, -2] * sv[j, -1] * self.apply_kernel(m[i, :-2], sv[j, :-2])
        self.b = self.b / len(m)

        return self

    def apply_kernel(self, x_1, x_2):
        if self.kernel == 'linear':
            return np.dot(x_1, x_2)

    def decision_function(self, X):
        pred_results = np.array([])
        for i in range(len(X)):
            pred = self.b
            for j in range(len(self.decision_matrix)):
                pred += self.dual_coef_[j] * self.apply_kernel(X[i], self.decision_matrix[j])
            pred_results = np.append(pred_results, pred)
        return pred_results

    def predict(self, X):
        return np.where(self.decision_function(X) >= 0, 1, -1)
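As a quick sanity check (a sketch; the toy data is illustrative and the exact support-vector set depends on the solver), we can compare against scikit-learn's SVC on a small linearly separable problem:

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [1.5, 2.5], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

model = SVM(c=1).fit(X, y)
print(model.predict(X))        # expected: [ 1  1  1 -1 -1 -1]

reference = SVC(C=1, kernel='linear').fit(X, y)
print(reference.predict(X))    # should agree on this easy dataset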

Ref

https://www.ccs.neu.edu/home/vip/teach/MLcourse/6_SVM_kernels/lecture_notes/svm/svm.pdf

http://www.cs.cmu.edu/~guestrin/Class/10701-S06/Slides/svms-s06.pdf

Mathematics for Machine Learning (MML), Deisenroth, Faisal, and Ong

Lagrangian Duality for Dummies, David Knowles

Pattern Recognition and Machine Learning (PRML), Bishop, Chapter 7

Naive Bayes

Suppose our training set consists of data samples $D=\{(x_1,y_1),\dots,(x_N,y_N)\}$, $x_i\in\mathbb{R}^d$, where the $(x_i,y_i)$ are realizations of a random sample that follows an unknown joint distribution $P(X,Y)$.

Assumptions:

  1. Features are conditionally independent (the naive Bayes assumption): $P(X|Y)=\prod_{j=1}^dP(X_j|Y)$

  2. MLE assumption: the random sample is independent and identically distributed.

  3. Positional independence: The position of features does not matter (used in Multinomial case).

By applying Bayes' rule (on the distribution $P(\cdot)$ to keep things general), we have:

$$P(Y|X)=\frac{P(X,Y)}{P(X)}=\frac{P(X|Y)P(Y)}{P(X)}$$

By substituting the assumption:

$$P(Y|X)=\frac{P(Y)\prod_{j=1}^dP(X_j|Y)}{P(X)}$$

Since $P(X)$, characterized by $F_X(x)$, is constant for any given $x$, we can drop it from the equation: it only changes $P(Y|X)$ by a proportionality constant:

$$P(Y|X)\propto P(Y)\prod_{j=1}^dP(X_j|Y)$$

Our goal is to find the class $\hat{y}$ that maximizes this probability given the input $X=x$ (working in log space for numerical stability):

$$\hat{y}=\arg\max_y\ \sum_{j=1}^d\log P_{X_j|Y}(x_j|y)+\log P_Y(y)$$
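As an illustration of this decision rule, here is a minimal sketch for binary features, using Bernoulli likelihoods with Laplace smoothing (all names are illustrative, not from any particular library):

import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    # Estimate log P(Y=c) and theta[c, j] = P(X_j = 1 | Y = c) with Laplace smoothing.
    classes = np.unique(y)
    log_prior = np.array([np.log(np.mean(y == c)) for c in classes])
    theta = np.array([(X[y == c].sum(axis=0) + alpha) /
                      ((y == c).sum() + 2 * alpha) for c in classes])
    return classes, log_prior, theta

def predict_nb(x, classes, log_prior, theta):
    # argmax_y sum_j log P(x_j | y) + log P(y)
    log_lik = x @ np.log(theta).T + (1 - x) @ np.log(1 - theta).T
    return classes[np.argmax(log_lik + log_prior)]

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
params = fit_bernoulli_nb(X, y)
print(predict_nb(np.array([1, 0, 1]), *params))  # predicts 1 here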

Logistic Regression

Suppose we have training examples $D=\{(x_1,y_1),\dots,(x_N,y_N)\}$ with $x_i\in\mathbb{R}^d$; our goal is to decide the class of a new input $x$. Logistic regression does this by learning, from a training set, a vector of weights and a bias (or, in the multiclass case, a matrix of weights and a vector of biases).

Binary-Class

In the binary-class problem, our target $Y$ takes values in $\{0,1\}$. To model the distribution $P(Y|X;w,b)$, we apply the sigmoid function to the dot product of the weights and the input, which transforms the output to a value in $[0,1]$ (one criterion for a probability):

$$z=x^Tw+b$$

$$y=\sigma(z)$$

To make sure that the class random variable $Y$'s conditional pmf sums to 1:

$$P(Y=1|X=x;w,b)=\frac{1}{1+e^{-z}}=p$$

$$P(Y=0|X=x;w,b)=1-\frac{1}{1+e^{-z}}=\frac{e^{-z}}{1+e^{-z}}=1-p$$

Then it is equivalent to express this conditional pmf as a Bernoulli pmf:

$$p_{Y|X}(y|x;w,b)=p^y(1-p)^{1-y}$$

Given the conditional pmf of $Y$ given $X=x$, we can use a simple decision rule:

$$\hat{y}=\begin{cases}1, & P(Y=1|X=x)>0.5\\ 0, & P(Y=1|X=x)\le 0.5\end{cases}$$
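A minimal sketch of the forward pass and this decision rule (the weights here are arbitrary illustrative values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # P(Y = 1 | X = x) = sigma(x^T w + b), computed for each row of X.
    return sigmoid(X @ w + b)

def predict(X, w, b):
    # Decision rule: 1 if P(Y=1|x) > 0.5 (equivalently x^T w + b > 0), else 0.
    return (predict_proba(X, w, b) > 0.5).astype(int)

w, b = np.array([2.0, -1.0]), 0.5
X = np.array([[1.0, 0.0], [-1.0, 2.0]])
print(predict_proba(X, w, b))  # [sigma(2.5), sigma(-3.5)]
print(predict(X, w, b))        # [1, 0]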


Gradient Boosting Decision Trees

Boosting Trees

Regression and classification trees partition the space of all joint predictor variable values into disjoint regions $R_j$, $j=1,2,\dots,J$, represented by the terminal nodes of the tree. A constant $c_j$ is assigned to each region, and the prediction rule is:

$$x\in R_j\ \Rightarrow\ f(x)=c_j$$

Thus, a tree can be formally expressed as:

$$T(x;\theta)=\sum_{j=1}^Jc_jI[x\in R_j]$$

with parameters $\theta=\{R_j,c_j\}_1^J$. The parameters are found by iteratively solving a minimization problem:

  1. Given $R_j$, we find $\hat{c}_j$ by simply taking the average (regression) or majority class (classification) of the training targets in the region.
  2. $R_j$ is found by iterating over all possible pairs of splitting feature and splitting point.

The boosted tree model is a sum of such trees induced in a forward stagewise manner:

$$f_M(x)=\sum_{m=1}^MT(x;\theta_m)$$

At each step, one must solve:

$$\hat{\theta}_m=\arg\min_{\theta_m}\sum_{i=1}^NL\big(y_i,\ f_{m-1}(x_i)+T(x_i;\theta_m)\big)$$

Gradient Boosting

In general, it is hard to directly take partial derivatives w.r.t. the tree's parameters; instead, we take partial derivatives w.r.t. the tree predictions $f(x_i)$.

Fast approximate algorithms for solving the above problem with any differentiable loss criterion can be derived by analogy to numerical optimization. We first start with the general case. The loss in using $f$ to predict $y$ on the training data is:

$$L(f)=\sum_{i=1}^NL(y_i,f(x_i))$$

The goal is to minimize $L(f)$ w.r.t. $f$, where here $f(x)$ is constrained to be a sum of trees $f_M(x)$.

We first consider the general case where $f$ can be any vector of numbers. With $N$ samples, $f=\{f(x_1),\dots,f(x_N)\}\in\mathbb{R}^N$.

Then the gradient of the objective w.r.t. $f$, evaluated at the current model $f=f_{m-1}$, is:

$$g_m=\nabla_fL(f)=\left\langle\frac{\partial L(f)}{\partial f(x_1)},\dots,\frac{\partial L(f)}{\partial f(x_N)}\right\rangle$$

This gradient points in the direction of steepest increase: it tells us how we could change our current predictions to increase the loss. Since we want to minimize the objective, we adjust our current predictions in the direction of steepest decrease:

$$h_m=-\rho_mg_m$$

where $\rho_m$ is the step size for the current model, the minimizer of:

$$\rho_m=\arg\min_\rho L(f_{m-1}-\rho g_m)$$

The current solution is then updated as

$$f_m=f_{m-1}-\rho_mg_m$$

If fitting the training data (minimizing the loss) were our ultimate goal, the above update rule would solve our problem by adding the negative gradient at each iteration. However, our ultimate goal is to generalize to new data; memorizing the training data exactly is not what we want. One possible solution is to approximate the update $-\rho_mg_m$ by fitting a simple decision tree:

$$\hat{\theta}_m=\arg\min_\theta\sum_{i=1}^N\big(-g_{im}-T(x_i;\theta)\big)^2$$

That is, we fit a regression tree T to the negative gradient values.

Algorithm

  1. We first start with a constant model (a model that predicts a constant), which is a single-terminal-node tree.
  2. For all samples, new targets are generated as the negative gradient of the loss function w.r.t. the current model's predictions $f_{m-1}$.
  3. Fit a regression tree to minimize the MSE between the new targets (negative gradients) and the tree's predictions, as sketched below.
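Here is a minimal sketch of these three steps for the squared-error loss, where the negative gradient is just the residual $y_i-f_{m-1}(x_i)$; it uses scikit-learn's DecisionTreeRegressor as the base learner, and the shrinkage factor nu is discussed below:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=50, nu=0.1, max_depth=2):
    # Step 1: start from a constant model (a single terminal node).
    f = np.full(len(y), y.mean())
    trees = []
    for m in range(M):
        # Step 2: negative gradient of squared loss = residuals.
        residuals = y - f
        # Step 3: fit a regression tree to the negative gradient.
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        f += nu * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def gb_predict(X, init, trees, nu=0.1):
    return init + nu * sum(t.predict(X) for t in trees)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
init, trees = gradient_boost(X, y)
print(np.mean((gb_predict(X, init, trees) - y) ** 2))  # small training MSE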

Discussions

Regularization

As the number of gradient boosting rounds $M$ grows, the training loss can be made arbitrarily small. However, fitting the data too well can lead to overfitting, which degrades the risk on future predictions.

Shrinkage

Controlling the value of $M$ is not the only possible regularization strategy; we can add penalty terms to the loss function that penalize large $M$, or we can weight subsequent trees. The simplest implementation of shrinkage in the context of boosting is to scale the contribution of each tree by a factor $0<\nu<1$ when it is added to the current approximation:

$$f_m(x)=f_{m-1}(x)+\nu\sum_{j=1}^Jc_{jm}I[x\in R_{jm}]$$

The parameter $\nu$ can be regarded as controlling the learning rate of the boosting procedure. Smaller values of $\nu$ result in larger training error for the same number of iterations $M$ but may generalize better. Thus, both $\nu$ and $M$ control the prediction risk, and they trade off against each other; in practice, it is best to set $\nu$ small and control $M$ by early stopping.

Subsampling

We know that bootstrap averaging (bagging) improves the performance of a noisy classifier through averaging (variance reduction). We can exploit the same idea in gradient boosting.

At each iteration, we sample a fraction $\eta$ of the training observations without replacement and grow the next tree using that subsample. A typical value is $\eta=0.5$, although for large sample size $N$, $\eta$ can be smaller.

Ref

https://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf

Decision Trees (CART)

Tree-based methods partition the feature space into a set of rectangles and then fit a simple model (e.g., a constant) in each one. We focus on CART in this post.

Suppose we have a dataset $D=\{(x_1,y_1),\dots,(x_N,y_N)\}$ with $x_i\in\mathbb{R}^d$. The algorithm needs to automatically decide on the splitting variables and splitting points, and also what shape the tree should have.

Regression Trees

In this scenario, our response variable $Y$ is continuous. Suppose first that we have a partition into $M$ regions $R_1,\dots,R_M$, and we define the model prediction as:

$$\hat{y}=\sum_{m=1}^Mc_mI[x\in R_m]$$

By minimizing the mean squared loss $\frac{1}{2}\cdot\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y}_i)^2$, we have:

$$\frac{\partial L}{\partial c_m}=-\frac{1}{N}\sum_{i=1}^N\Big(y_i-\sum_{m'=1}^Mc_{m'}I[x_i\in R_{m'}]\Big)I[x_i\in R_m]=-\frac{1}{N}\Big(\sum_{i=1}^Ny_iI[x_i\in R_m]-N_mc_m\Big)=0$$

Thus, the best estimate $\hat{c}_m$ for each region w.r.t. mean squared error is the average of the training responses in that region:

$$\hat{c}_m=\frac{1}{N_m}\sum_{i=1}^Ny_iI[x_i\in R_m]$$

where $N_m=\sum_{i=1}^NI[x_i\in R_m]$ is the total number of training examples in region $R_m$.
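A quick numerical check of this estimate (a sketch with two hand-picked regions):

import numpy as np

# Toy 1-D example: two regions, R1 = {x < 0}, R2 = {x >= 0}.
x = np.array([-2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 10.0, 12.0, 14.0])

in_R1 = x < 0
c1 = y[in_R1].mean()     # (1 + 3) / 2 = 2.0
c2 = y[~in_R1].mean()    # (10 + 12 + 14) / 3 = 12.0
y_hat = np.where(in_R1, c1, c2)
print(c1, c2, y_hat)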


Heaps

The binary heap data structure is an array object that we can view as a nearly complete binary tree. Each node of the tree corresponds to an element of the array. The tree is completely filled on all levels except possibly the lowest, which is filled from the left up to a point; that is, the elements in the subarray $A[\lfloor n/2\rfloor+1\dots n]$ are all leaves.

An array A that represents a heap is an object with two attributes:

  1. A.length: the number of elements in the array, $A[1\ldots A.length]$.
  2. A.heap_size: how many elements of the heap are stored within array $A$, $A[1\ldots A.heap\_size]$ with $A.heap\_size\le A.length$.

Given the index i of a node, we can easily compute the indices of its parent, left child and right child by:

  1. parent: $\lfloor i/2\rfloor$ (shift right one bit: i >> 1)
  2. left child: $2i$ (shift left one bit: i << 1)
  3. right child: $2i+1$ (shift left one bit, then add 1: (i << 1) + 1)

There are two types of binary heap:

  1. Max-heap: satisfies the max-heap property that for every node $i$ other than the root, $A[parent(i)]\ge A[i]$; that is, the maximum value in the array is stored at the root, and the subtree rooted at a node contains values no larger than the node itself (used in heapsort).
  2. Min-heap: organized in the opposite way; for every node $i$ other than the root, $A[parent(i)]\le A[i]$ (used in priority queues).


class Heap:
    def __init__(self, A, heap_size):
        # A may not be a heap yet; call build_min_heap or build_max_heap
        # to convert this instance into a heap.
        assert isinstance(A, list)
        self.heap_size = heap_size
        self.length = len(A)
        self._A = A

    def left(self, i):
        return i << 1

    def right(self, i):
        # Note: << binds looser than +, so the parentheses are required.
        return (i << 1) + 1

    def parent(self, i):
        return i >> 1

    def __getitem__(self, index):
        return self._A[index]

    def append(self, val):
        self._A.append(val)
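The build_min_heap/build_max_heap routines mentioned in the comment are not shown here; as an illustration, a max_heapify sketch (assuming 1-based indexing with A[0] unused, and heap_size equal to the index of the last element) could look like:

def max_heapify(heap, i):
    # Float heap[i] down until the max-heap property holds for its subtree.
    l, r = heap.left(i), heap.right(i)
    largest = i
    if l <= heap.heap_size and heap[l] > heap[largest]:
        largest = l
    if r <= heap.heap_size and heap[r] > heap[largest]:
        largest = r
    if largest != i:
        heap._A[i], heap._A[largest] = heap._A[largest], heap._A[i]
        max_heapify(heap, largest)

A = [None, 4, 10, 3, 5, 1]        # index 0 unused (1-based layout)
h = Heap(A, heap_size=5)
for i in range(5 // 2, 0, -1):    # build a max-heap bottom-up
    max_heapify(h, i)
print(A[1:])                      # [10, 5, 3, 4, 1]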


Backpropagation In Convolutional Neural Networks

Cross Correlation

Given an input image $I$ and a filter $K$ of dimensions $k_1\times k_2$, the cross-correlation operation is defined as:

$$(I\otimes K)_{ij}=\sum_{m=0}^{k_1-1}\sum_{n=0}^{k_2-1}I(i+m,j+n)K(m,n)$$

Convolution

$$(I*K)_{ij}=\sum_{m=0}^{k_1-1}\sum_{n=0}^{k_2-1}I(i-m,j-n)K(m,n)$$

which is equivalent to cross-correlation with a flipped kernel (i.e., flipped 180 degrees):

$$(I*K)_{ij}=\sum_{m=0}^{k_1-1}\sum_{n=0}^{k_2-1}I(i+m,j+n)\,\mathrm{rot}_{180}(K)(m,n)$$
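A numerical check of this equivalence (a sketch using plain numpy loops and "valid" outputs of size $(H-k_1+1)\times(W-k_2+1)$):

import numpy as np

def cross_correlate(I, K):
    k1, k2 = K.shape
    H, W = I.shape
    out = np.zeros((H - k1 + 1, W - k2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + k1, j:j + k2] * K)
    return out

def convolve(I, K):
    # Convolution = cross-correlation with the kernel rotated 180 degrees.
    return cross_correlate(I, np.rot90(K, 2))

I = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 0.0], [0.0, -1.0]])
print(cross_correlate(I, K))
print(convolve(I, K))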


Statistics Inference

Sampling and Statistics

In a typical statistical problem, the density $f_X(x)$ or mass function $p_X(x)$ of a random variable $X$ is unknown. Our ignorance about $f_X(x)$ or $p_X(x)$ can roughly be classified in one of two ways:

  1. $f_X(x)$ or $p_X(x)$ is completely unknown.
  2. The form of $f_X(x)$ or $p_X(x)$ is known down to an unknown parameter vector $\theta$.

We will focus on the second case. We often denote this problem by saying that the random variable $X$ has density or mass function of the form $f_X(x;\theta)$ or $p_X(x;\theta)$, where $\theta\in\Omega$ for a specified parameter space $\Omega$. Since $\theta$ is unknown, we want to estimate it.

In either case, our information comes from a sample on $X$. The sample observations have the same distribution as $X$, and we denote them as $X_1,\dots,X_n$, where $n$ is the sample size. When the sample is actually drawn, we use lowercase letters $x_1,\dots,x_n$ to represent the realizations of the random sample.

Let $T=T(X_1,\dots,X_n)$ be a statistic, i.e., a function of the sample. Once the sample is drawn, $t=T(x_1,\dots,x_n)$ is called the realization of $T$.


Value Function Approximation (Algorithms)

Approximate Value Iteration

Population Version

Recall that the procedure for VI is:

$$V_{k+1}\leftarrow T^\pi V_k\qquad\text{or}\qquad V_{k+1}\leftarrow T^*V_k$$

One way to develop its approximate version is to perform each step only approximately, i.e., find $V_{k+1}\in\mathcal{F}$ such that:

$$V_{k+1}\approx TV_k$$

where $T$ can be $T^*$ or $T^\pi$.

We start from a $V_0\in\mathcal{F}$, and then at each iteration $k$ of AVI we solve ($p$ is often 2):

$$V_{k+1}\leftarrow\arg\min_{V\in\mathcal{F}}\|V-TV_k\|_{p,\mu}^p$$

At each iteration, $TV_k$ (our target) may not be within $\mathcal{F}$ anymore, even though $V_k\in\mathcal{F}$, so we may incur some approximation error at each iteration of AVI. The amount of error depends on how expressive $\mathcal{F}$ is and how far $T$ can push a function within $\mathcal{F}$ outside that space.

Batch Version

The objective of AVI cannot be computed because:

  • $\mu$ is unknown.
  • The environment dynamics $(\mathcal{R},\mathcal{P})$ are often not available, so $TQ_k$ cannot be computed.

If we have samples of the form $(X,A,R,X')$, we can compute an unbiased sample of $TQ_k$:

$$\hat{T}^\pi Q_k=R+\gamma Q_k(X',A'),\qquad \hat{T}^*Q_k=R+\gamma\max_{a'\in\mathcal{A}}Q_k(X',a')$$

where $A'\sim\pi(\cdot|X')$.


The question is: can we replace $TQ_k$ with $\hat{T}Q_k$?

Given any $Z=(X,A)$:

$$\begin{aligned}\mathbb{E}\big[|Q(Z)-\hat{T}Q_k(Z)|^2\,\big|\,Z\big]&=\mathbb{E}\big[|Q(Z)-TQ_k(Z)+TQ_k(Z)-\hat{T}Q_k(Z)|^2\,\big|\,Z\big]\\&=|Q(Z)-TQ_k(Z)|^2+\mathbb{E}\big[|TQ_k(Z)-\hat{T}Q_k(Z)|^2\,\big|\,Z\big]\\&\quad+2\,\mathbb{E}\big[(Q(Z)-TQ_k(Z))(TQ_k(Z)-\hat{T}Q_k(Z))\,\big|\,Z\big]\\&=|Q(Z)-TQ_k(Z)|^2+\mathbb{E}\big[|TQ_k(Z)-\hat{T}Q_k(Z)|^2\,\big|\,Z\big]\end{aligned}$$

where the cross term vanishes because $\mathbb{E}[\hat{T}Q_k(Z)\,|\,Z]=TQ_k(Z)$.

Taking the expectation over $Z\sim\mu$:

$$\mathbb{E}_\mu\Big[|Q(Z)-TQ_k(Z)|^2+\mathbb{E}\big[|TQ_k(Z)-\hat{T}Q_k(Z)|^2\,\big|\,Z\big]\Big]=\|Q-TQ_k\|_{2,\mu}^2+\mathbb{E}_\mu\big[\mathrm{Var}[\hat{T}Q_k(Z)\,|\,Z]\big]$$

Since the expected-variance term does not depend on $Q$, the minimizer of the surrogate objective is the same as that of the original objective:

$$\arg\min_{Q\in\mathcal{F}}\mathbb{E}\big[|Q(Z)-\hat{T}Q_k(Z)|^2\big]=\arg\min_{Q\in\mathcal{F}}\|Q-TQ_k\|_{2,\mu}^2$$

Similar to ERM, we do not know the environment dynamics $(\mathcal{R},\mathcal{P})$ or the distribution $\mu$, so we use samples to estimate the expectation:

$$\frac{1}{N}\sum_{i=1}^N|Q(X_i,A_i)-\hat{T}Q_k(X_i,A_i)|^2=\|Q-\hat{T}Q_k\|_{2,D_n}^2$$

This is the basis of DQN.
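As an illustration, here is a sketch of one batch AVI step with a tabular $Q$, indexed as Q[state, action] (the names and toy batch are illustrative; DQN replaces the table with a neural network and the per-entry update with a gradient step on the same squared objective):

import numpy as np

def avi_targets(Q_k, batch, gamma=0.99):
    # Unbiased samples of T* Q_k for a batch of (x, a, r, x') transitions.
    return np.array([r + gamma * np.max(Q_k[x_next])
                     for (x, a, r, x_next) in batch])

def avi_step(Q_k, batch, gamma=0.99, lr=0.5):
    # Move Q(x_i, a_i) toward the targets, reducing
    # (1/N) sum |Q(x_i, a_i) - T^hat Q_k(x_i, a_i)|^2.
    Q = Q_k.copy()
    targets = avi_targets(Q_k, batch, gamma)
    for (x, a, r, x_next), t in zip(batch, targets):
        Q[x, a] += lr * (t - Q[x, a])
    return Q

Q = np.zeros((3, 2))
batch = [(0, 1, 1.0, 1), (1, 0, 0.0, 2), (2, 1, 5.0, 0)]
print(avi_step(Q, batch))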

Bellman Residual Minimization

Population Version

Recall that:

$$V=T^\pi V\iff V=V^\pi$$

Under function approximation (FA), we may not achieve this exact equality; instead:

$$V\approx V^\pi$$

Thus, we can formulate our objective:

$$V=\arg\min_{V\in\mathcal{F}}\|V-T^\pi V\|_{p,\mu}^p=\arg\min_{V\in\mathcal{F}}\|BR(V)\|_{p,\mu}^p$$

Minimizing this objective ($p$ is usually 2) gives Bellman Residual Minimization (BRM).

This procedure differs from AVI in that we do not mimic the iterative process of VI (which is convergent in the exact case without any FA), but instead directly go for the solution of the fixed-point equation ($V\approx T^\pi V$ instead of $V\approx T^\pi V_k$).

If there exists a $V\in\mathcal{F}$ that makes $\|V-T^\pi V\|_{2,\mu}^2=0$, and if we assume $\mu(x)>0$ for all $x\in\mathcal{X}$, we can conclude that $V(x)=V^\pi(x)$ for all $x\in\mathcal{X}$, so $V=V^\pi$.

Batch Version

Similar to AVI, we may want to replace $T^\pi V$ or $T^\pi Q$ with $\hat{T}^\pi Q$. Thus, our empirical objective is:

$$Q=\arg\min_{Q\in\mathcal{F}}\frac{1}{N}\sum_{i=1}^N|Q(X_i,A_i)-\hat{T}^\pi Q(X_i,A_i)|^2=\arg\min_{Q\in\mathcal{F}}\|Q-\hat{T}^\pi Q\|_{2,D_n}^2$$

using $D_n=\{(X_i,A_i,R_i,X_i')\}_{i=1}^N$.

Note that $Q$ appears in both $\hat{T}^\pi Q$ and $Q$, which is different from AVI and ERM. This causes an issue: the minimizers of $\|Q-T^\pi Q\|_{2,\mu}^2$ and of the expected empirical objective are not necessarily the same for stochastic dynamics.


To see this, for any $Q$ and $Z=(X,A)$ we compute:

$$\mathbb{E}\big[|Q(Z)-\hat{T}^\pi Q(Z)|^2\,\big|\,Z\big]$$

Then:

$$\begin{aligned}\mathbb{E}\big[|Q(Z)-\hat{T}^\pi Q(Z)|^2\,\big|\,Z\big]&=\mathbb{E}\big[|Q(Z)-T^\pi Q(Z)+T^\pi Q(Z)-\hat{T}^\pi Q(Z)|^2\,\big|\,Z\big]\\&=|Q(Z)-T^\pi Q(Z)|^2+\mathbb{E}\big[|T^\pi Q(Z)-\hat{T}^\pi Q(Z)|^2\,\big|\,Z\big]\\&\quad+2\,\mathbb{E}\big[(Q(Z)-T^\pi Q(Z))(T^\pi Q(Z)-\hat{T}^\pi Q(Z))\,\big|\,Z\big]\\&=|Q(Z)-T^\pi Q(Z)|^2+\mathbb{E}\big[|T^\pi Q(Z)-\hat{T}^\pi Q(Z)|^2\,\big|\,Z\big]\end{aligned}$$

Since $Z\sim\mu$, the first term gives:

$$\mathbb{E}_\mu\big[|Q(Z)-T^\pi Q(Z)|^2\big]=\|Q-T^\pi Q\|_{2,\mu}^2$$

The second term is:

$$\mathbb{E}_\mu\Big[\mathbb{E}\big[|T^\pi Q(Z)-\hat{T}^\pi Q(Z)|^2\,\big|\,Z\big]\Big]=\mathbb{E}_\mu\big[\mathrm{Var}[\hat{T}^\pi Q(Z)\,|\,Z]\big]$$

The variance term $\mathrm{Var}[\hat{T}^\pi Q(Z)\,|\,Z]$ depends on $Q$, so as we minimize the objective w.r.t. $Q$ in a stochastic dynamical system (for deterministic ones it is zero), this extra term biases the solution: the minimizer of the batch-version objective is not the same as that of the population version of BRM under stochastic dynamics.

Projected Bellman Error

From BRM, we know that even though $V\in\mathcal{F}$, $T^\pi V$ may not be in $\mathcal{F}$. A good approximation $V\in\mathcal{F}$ should have a small distance to $T^\pi V$; thus, we want to find a $V\in\mathcal{F}$ such that $V$ is the projection of $T^\pi V$ onto the space $\mathcal{F}$.

We want to find a $V\in\mathcal{F}$ such that:

$$V=\Pi_{\mathcal{F},\mu}T^\pi V$$

where $\Pi_{\mathcal{F},\mu}$ is the projection operator.

The projection operator $\Pi_{\mathcal{F},\mu}$ is a linear operator that takes $V\in B(\mathcal{X})$ and maps it to the closest point on $\mathcal{F}$, measured according to the $L_2(\mu)$ norm:

$$\Pi_{\mathcal{F},\mu}V\triangleq\arg\min_{V'\in\mathcal{F}}\|V'-V\|_{2,\mu}^2$$

It has some properties:

  1. The projection belongs to the function space $\mathcal{F}$: $\Pi_{\mathcal{F},\mu}V\in\mathcal{F}$
  2. If $V\in\mathcal{F}$, the projection is the identity: $V\in\mathcal{F}\Rightarrow\Pi_{\mathcal{F},\mu}V=V$
  3. The projection onto a subspace is a non-expansion: $\|\Pi_{\mathcal{F},\mu}V_1-\Pi_{\mathcal{F},\mu}V_2\|_{2,\mu}^2\le\|V_1-V_2\|_{2,\mu}^2$

Population Version

We can define a loss function based on $V=\Pi_{\mathcal{F},\mu}T^\pi V$:

$$\|V-\Pi_{\mathcal{F},\mu}T^\pi V\|_{2,\mu}^2$$

This is called the Projected Bellman Error (PBE) or Mean Squared Projected Bellman Error.

We find the value function by solving the following optimization problem:

$$V=\arg\min_{V\in\mathcal{F}}\|V-\Pi_{\mathcal{F},\mu}T^\pi V\|_{2,\mu}^2$$

Since $V\in\mathcal{F}$ and the projection operator is linear:

$$V-\Pi_{\mathcal{F},\mu}T^\pi V=\Pi_{\mathcal{F},\mu}V-\Pi_{\mathcal{F},\mu}T^\pi V=\Pi_{\mathcal{F},\mu}(V-T^\pi V)=-\Pi_{\mathcal{F},\mu}(T^\pi V-V)=-\Pi_{\mathcal{F},\mu}(BR(V))$$

So the objective can be rewritten as:

$$V=\arg\min_{V\in\mathcal{F}}\|\Pi_{\mathcal{F},\mu}BR(V)\|_{2,\mu}^2$$

which is the norm of the projection of the Bellman residual onto $\mathcal{F}$.


We can think of solving the projected Bellman error objective as solving two coupled optimization problems:

  1. Given the value function $V$, find the projection point: $V'=\arg\min_{V''\in\mathcal{F}}\|V''-T^\pi V\|_{2,\mu}^2$
  2. Find the value function $V\in\mathcal{F}$ that is closest to the projection point $V'$: $V=\arg\min_{V\in\mathcal{F}}\|V-V'\|_{2,\mu}^2$

Least Square Temporal Difference Learning (Population Version)

If $\mathcal{F}$ is a linear FA with basis functions $\phi_1,\dots,\phi_p$:

$$\mathcal{F}=\{x\mapsto\phi(x)^Tw;\ w\in\mathbb{R}^p\}$$

we can find a direct solution to the PBE objective; that is, we want to find $V\in\mathcal{F}$ s.t.:

$$V=\Pi_{\mathcal{F},\mu}T^\pi V$$

We assume that:

  • $\mathcal{X}$ is finite with $|\mathcal{X}|=N$, $N\gg p$; each $\phi_i=[\phi_i(x_1),\dots,\phi_i(x_N)]^T$ is an $N$-dimensional vector.

Then we define $\Phi\in\mathbb{R}^{N\times p}$ as the matrix concatenating all features:

$$\Phi=[\phi_1,\dots,\phi_p]_{N\times p}$$

The value function $V\in\mathcal{F}$ is then:

$$V_{N\times1}=\Phi_{N\times p}\,w_{p\times1}$$

Our goal is to find a weight vector $w\in\mathbb{R}^p$ such that:

$$V=\Pi_{\mathcal{F},\mu}T^\pi V\iff\Phi w=\Pi_{\mathcal{F},\mu}T^\pi(\Phi w)$$

Let $M=\mathrm{diag}(\mu)$. Since:

$$\|V\|_{2,\mu}^2=\langle V,V\rangle_\mu=\sum_{x\in\mathcal{X}}|V(x)|^2\mu(x)=V^TMV$$

$$\langle V_1,V_2\rangle_\mu=\sum_{x\in\mathcal{X}}V_1(x)V_2(x)\mu(x)=V_1^TMV_2$$

the projection operator onto the linear $\mathcal{F}$ is:

$$\Pi_{\mathcal{F},\mu}V=\arg\min_{V'\in\mathcal{F}}\|V'-V\|_{2,\mu}^2=\Phi\,\Big(\arg\min_{w\in\mathbb{R}^p}(\Phi w-V)^TM(\Phi w-V)\Big)$$

Taking the derivative w.r.t. $w$ and setting it to zero (assuming $\Phi^TM\Phi$ is invertible):

$$\Phi^TM(\Phi w-V)=0\ \Rightarrow\ w=(\Phi^TM\Phi)^{-1}\Phi^TMV$$

Then the projection is:

$$\Pi_{\mathcal{F},\mu}V=\Phi w=\Phi(\Phi^TM\Phi)^{-1}\Phi^TMV$$

Since $(T^\pi V)(x)=r^\pi(x)+\gamma\sum_{x'\in\mathcal{X}}P^\pi(x'|x)V(x')$, we can write it in vector form for all states:

$$(T^\pi V)_{N\times1}=r^\pi_{N\times1}+\gamma P^\pi_{N\times N}V_{N\times1}\ \Rightarrow\ (T^\pi\Phi w)_{N\times1}=r^\pi_{N\times1}+\gamma P^\pi_{N\times N}\Phi_{N\times p}\,w_{p\times1}$$

Substituting $T^\pi\Phi w$ for $V$ in the projection equation:

$$\Phi w=\Pi_{\mathcal{F},\mu}T^\pi\Phi w=\Phi(\Phi^TM\Phi)^{-1}\Phi^TM(T^\pi\Phi w)\ \Rightarrow\ \Phi w=\big[\Phi(\Phi^TM\Phi)^{-1}\Phi^TM\big]\big[r^\pi+\gamma P^\pi\Phi w\big]$$

Multiplying both sides by $\Phi^TM$ and simplifying:

$$\Phi^TM\Phi w=\big[\Phi^TM\Phi(\Phi^TM\Phi)^{-1}\Phi^TM\big]\big[r^\pi+\gamma P^\pi\Phi w\big]$$

$$\Phi^TM\Phi w=\Phi^TM\big[r^\pi+\gamma P^\pi\Phi w\big]$$

$$\Phi^TM\big[r^\pi+\gamma P^\pi\Phi w-\Phi w\big]=0$$

$$w_{TD}=\big[\Phi^TM(\Phi-\gamma P^\pi\Phi)\big]^{-1}\Phi^TMr^\pi$$

The solution $w_{TD}$ is called the Least-Squares Temporal Difference (LSTD) solution.

Notice that $r^\pi+\gamma P^\pi\Phi w-\Phi w=T^\pi V-V=BR(V)$. Hence:

$$\Phi^TM\,BR(V)=0\iff\langle\phi_i,BR(V)\rangle_\mu=0,\quad i=1,\dots,p$$

Thus, LSTD finds a $w$ such that the Bellman residual is orthogonal to the basis of $\mathcal{F}$, which is exactly what we want.

Batch Version

We have shown that the solution to $V=\Pi_{\mathcal{F},\mu}T^\pi V$ with linear FA $V=\Phi w$ is:

$$w_{TD}=A_{p\times p}^{-1}\,b_{p\times1}$$

where:

  • $A=\Phi^TM(\Phi-\gamma P^\pi\Phi)$
  • $b=\Phi^TMr^\pi$

Expanding $A$ and $b$ as summations, we have:

$$A_{ij}=\sum_{n=1}^N\Phi_{ni}\,\mu(x_n)\,(\Phi-\gamma P^\pi\Phi)_{nj},\qquad b_i=\sum_{n=1}^N\Phi_{ni}\,\mu(x_n)\,r^\pi(x_n)$$

Thus, in terms of the current state $x$ and next state $x'$:

$$A=\sum_{x\in\mathcal{X}}\mu(x)\,\phi(x)\Big[\phi(x)-\gamma\sum_{x'\in\mathcal{X}}P^\pi(x'|x)\phi(x')\Big]^T=\mathbb{E}_\mu\Big[\phi(X)\big[\phi(X)-\gamma\mathbb{E}_{P^\pi}[\phi(X')]\big]^T\Big]$$

$$b=\sum_{x\in\mathcal{X}}\mu(x)\,\phi(x)\,r^\pi(x)=\mathbb{E}_\mu[\phi(X)r^\pi(X)]$$

Given a dataset $D_n=\{(X_i,R_i,X_i')\}_{i=1}^M$ with $X_i\sim\mu$, $X_i'\sim P^\pi(\cdot|X_i)$ and $R_i\sim\mathcal{R}^\pi(\cdot|X_i)$, we define the empirical estimators $\hat{A}_n,\hat{b}_n$ as:

$$\hat{A}_n=\frac{1}{M}\sum_{i=1}^M\phi(X_i)\big[\phi(X_i)-\gamma\phi(X_i')\big]^T,\qquad \hat{b}_n=\frac{1}{M}\sum_{i=1}^M\phi(X_i)R_i$$
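A direct translation of these estimators into numpy (a sketch; phi maps a state to its feature vector, and the small ridge term guards against a singular $\hat{A}_n$ on a finite sample):

import numpy as np

def lstd(transitions, phi, gamma=0.99, reg=1e-6):
    # Estimate w_TD = A^{-1} b from samples (x, r, x').
    p = len(phi(transitions[0][0]))
    A_hat = np.zeros((p, p))
    b_hat = np.zeros(p)
    for (x, r, x_next) in transitions:
        f, f_next = phi(x), phi(x_next)
        A_hat += np.outer(f, f - gamma * f_next)
        b_hat += f * r
    M = len(transitions)
    A_hat, b_hat = A_hat / M, b_hat / M
    return np.linalg.solve(A_hat + reg * np.eye(p), b_hat)

# Two-state chain with one-hot features; reward 1 in state 0.
phi = lambda x: np.eye(2)[x]
data = [(0, 1.0, 1), (1, 0.0, 0)] * 100
print(lstd(data, phi, gamma=0.9))  # approx [5.26, 4.74] = exact V for this chain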


We can use LSTD within an approximate policy iteration procedure to obtain a close-to-optimal policy (LSPI).

Semi-Gradient TD

Suppose that we know the true value function $V^\pi$ and we want to find an approximation $\hat{V}$, parameterized by $w$, minimizing the population loss:

$$\hat{V}=\arg\min_{\hat{V}\in\mathcal{F}}\frac{1}{2}\|V^\pi-\hat{V}\|_{2,\mu}^2$$

Using samples $X_t\sim\mu$, we can define an SGD procedure that updates $w_t$ as follows:

$$w_{t+1}\leftarrow w_t-\alpha_t\nabla_{w_t}\Big[\frac{1}{2}|V^\pi(X_t)-\hat{V}(X_t;w_t)|^2\Big]=w_t+\alpha_t\big(V^\pi(X_t)-\hat{V}(X_t;w_t)\big)\nabla_{w_t}\hat{V}(X_t;w_t)$$

If we select a proper step size $\alpha_t$, the SGD converges to a stationary point.

When we do not know $V^\pi$, we can use a bootstrapped estimate (the TD estimate $\hat{T}^\pi V_t$) instead:

$$w_{t+1}\leftarrow w_t+\alpha_t\big(\hat{T}^\pi V_t(X_t)-\hat{V}(X_t;w_t)\big)\nabla_{w_t}\hat{V}(X_t;w_t)$$

$$w_{t+1}\leftarrow w_t+\alpha_t\big(R_t+\gamma\hat{V}(X_t';w_t)-\hat{V}(X_t;w_t)\big)\nabla_{w_t}\hat{V}(X_t;w_t)$$

The substitution of $V^\pi(X_t)$ with $\hat{T}^\pi V_t(X_t)$ does not follow from the SGD of any loss function. The TD update is not a true SGD update; that is why we call it a semi-gradient update.
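A sketch of the semi-gradient TD(0) update with a linear approximator $\hat{V}(x;w)=\phi(x)^Tw$, for which $\nabla_w\hat{V}(x;w)=\phi(x)$ (the two-state chain here is illustrative):

import numpy as np

def semi_gradient_td0(transitions, phi, gamma=0.9, alpha=0.1, n_passes=200):
    p = len(phi(transitions[0][0]))
    w = np.zeros(p)
    for _ in range(n_passes):
        for (x, r, x_next) in transitions:
            v, v_next = phi(x) @ w, phi(x_next) @ w
            td_error = r + gamma * v_next - v  # bootstrapped target minus prediction
            # Semi-gradient: differentiate only through V(x), not the target.
            w += alpha * td_error * phi(x)
    return w

phi = lambda x: np.eye(2)[x]
data = [(0, 1.0, 1), (1, 0.0, 0)]
print(semi_gradient_td0(data, phi))  # approx [5.26, 4.74] for this chain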

Problem bank: [94, 144, 145, 226, 110, 105, 107, 543, 113, 116, 129, 108, 617, 404, 222, 'jz40', 912, 1046, 451, 703, 206, 2, 21, 19, 692, 24, 83, 82, 876, 328, 143, 86, 61, 92, 160]

Trees

94 Easy

# Definition for a binary tree node.
# class TreeNode:
#     def __init__(self, val=0, left=None, right=None):
#         self.val = val
#         self.left = left
#         self.right = right
class Solution:
    def inorderTraversal(self, root: TreeNode) -> List[int]:
        out = []

        def inorder(node):
            if not node:
                return

            inorder(node.left)
            out.append(node.val)
            inorder(node.right)

        inorder(root)
        return out

144 Easy

# Definition for a binary tree node.
# class TreeNode:
#     def __init__(self, val=0, left=None, right=None):
#         self.val = val
#         self.left = left
#         self.right = right
class Solution:
    def preorderTraversal(self, root: TreeNode) -> List[int]:
        out = []

        def preorder(node):
            if not node:
                return

            out.append(node.val)
            preorder(node.left)
            preorder(node.right)

        preorder(root)
        return out