Value Function Approximation (2)

Value Function Approximation (Algorithms)

Approximate Value Iteration

Population Version

Recall that the procedure for VI is:

$$V_{k+1} \leftarrow T^\pi V_k \qquad \text{or} \qquad V_{k+1} \leftarrow T^* V_k$$

One way to develop its approximate version is to perform each step only approximately, i.e., find a $V_{k+1} \in \mathcal{F}$ such that:

$$V_{k+1} \approx T V_k$$

where $T$ can be $T^*$ or $T^\pi$.

We start from some $V_0 \in \mathcal{F}$, and then at each iteration $k$ of AVI we solve the following ($p$ is often 2):

$$V_{k+1} \leftarrow \operatorname*{argmin}_{V \in \mathcal{F}} \|V - T V_k\|_{p,\mu}^p$$

Since $T V_k$ (our target) may not be within $\mathcal{F}$ anymore even though $V_k \in \mathcal{F}$, we may incur some approximation error at each iteration of AVI. The amount of error depends on how expressive $\mathcal{F}$ is and how much $T$ can push a function within $\mathcal{F}$ outside that space.
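As a concrete illustration, here is a minimal sketch of one population-level AVI step on a small, fully known MDP with a linear $\mathcal{F} = \{\Phi w\}$ and $p = 2$; the toy MDP, the feature matrix, and all names are illustrative assumptions, and the $\mu$-weighted least-squares fit plays the role of the argmin.

```python
import numpy as np

# Toy finite MDP (illustrative): N states, known rewards and a fixed-policy
# transition matrix, so the Bellman operator T^pi can be applied exactly.
N, p, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
P_pi = rng.dirichlet(np.ones(N), size=N)     # P_pi[x, x'] = P^pi(x' | x)
r_pi = rng.normal(size=N)                    # r^pi(x)
mu = np.ones(N) / N                          # state distribution mu
Phi = rng.normal(size=(N, p))                # features: F = {Phi w}

def T_pi(V):
    """Exact Bellman operator: (T^pi V)(x) = r^pi(x) + gamma * sum_x' P^pi(x'|x) V(x')."""
    return r_pi + gamma * P_pi @ V

def avi_step(V_k):
    """One AVI step: V_{k+1} = argmin_{V in F} ||V - T^pi V_k||_{2,mu}^2,
    solved as a mu-weighted least-squares fit of the target T^pi V_k."""
    target = T_pi(V_k)
    W = np.sqrt(mu)[:, None]                 # weight rows by sqrt(mu(x))
    w, *_ = np.linalg.lstsq(W * Phi, np.sqrt(mu) * target, rcond=None)
    return Phi @ w

V = np.zeros(N)
for k in range(50):
    V = avi_step(V)                          # iterate the approximate updates
```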

Batch Version

The objective of AVI cannot be computed because:

  • $\mu$ is unknown
  • The environment $(\mathcal{R}, \mathcal{P})$ is often not available, thus $TQ_k$ cannot be computed.

If we have samples in the form of $(X, A, R, X')$, we can compute an unbiased sample of $TQ_k$:

$$\hat{T}^\pi Q_k = R + \gamma Q_k(X', A') \qquad \hat{T}^* Q_k = R + \gamma \max_{a' \in \mathcal{A}} Q_k(X', a')$$

where $A' \sim \pi(\cdot \mid X')$.


The question is: can we replace $TQ_k$ with $\hat{T}Q_k$?

Given any $Z = (X, A)$:

$$
\begin{aligned}
\mathbb{E}\big[|Q(Z) - \hat{T}Q_k(Z)|^2 \mid Z\big]
&= \mathbb{E}\big[|Q(Z) - TQ_k(Z) + TQ_k(Z) - \hat{T}Q_k(Z)|^2 \mid Z\big] \\
&= \mathbb{E}\big[|Q(Z) - TQ_k(Z)|^2 \mid Z\big] + \mathbb{E}\big[|TQ_k(Z) - \hat{T}Q_k(Z)|^2 \mid Z\big] + 2\,\mathbb{E}\big[(Q(Z) - TQ_k(Z))(TQ_k(Z) - \hat{T}Q_k(Z)) \mid Z\big] \\
&= \mathbb{E}\big[|Q(Z) - TQ_k(Z)|^2 \mid Z\big] + \mathbb{E}\big[|TQ_k(Z) - \hat{T}Q_k(Z)|^2 \mid Z\big]
\end{aligned}
$$

where the cross term vanishes because $\mathbb{E}[\hat{T}Q_k(Z) \mid Z] = TQ_k(Z)$.

Since $Z \sim \mu$:

$$
\begin{aligned}
\mathbb{E}_\mu\Big[\mathbb{E}\big[|Q(Z) - TQ_k(Z)|^2 \mid Z\big] + \mathbb{E}\big[|TQ_k(Z) - \hat{T}Q_k(Z)|^2 \mid Z\big]\Big]
&= \mathbb{E}\big[|Q(Z) - TQ_k(Z)|^2\big] + \mathbb{E}\big[|TQ_k(Z) - \hat{T}Q_k(Z)|^2\big] \\
&= \|Q - TQ_k\|_{2,\mu}^2 + \mathbb{E}_\mu\big[\mathrm{Var}\big[\hat{T}Q_k(Z) \mid Z\big]\big]
\end{aligned}
$$

Since the expectation of the variance term does not depend on $Q$, the solution to the surrogate objective is the same as that of the original objective:

$$\operatorname*{argmin}_{Q \in \mathcal{F}} \mathbb{E}\big[|Q(Z) - \hat{T}Q_k(Z)|^2\big] = \operatorname*{argmin}_{Q \in \mathcal{F}} \|Q - TQ_k\|_{2,\mu}^2$$

Similar to ERM, we do not know the environment dynamics $(\mathcal{R}, \mathcal{P})$ or the distribution $\mu$, but we can use samples to estimate the expectation:

$$\frac{1}{N}\sum_{i=1}^N \big|Q(X_i, A_i) - \hat{T}Q_k(X_i, A_i)\big|^2 = \|Q - \hat{T}Q_k\|_{2,\mathcal{D}_n}^2$$

This is the basis of DQN.
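The following sketch illustrates the batch objective in the simplest possible setting: a tabular $Q$ (a special case of FA) and randomly generated transitions as assumed toy data. It forms the bootstrapped targets $\hat{T}^* Q_k$ and regresses a new $Q$ on them, in the spirit of fitted Q-iteration / DQN but without neural networks.

```python
import numpy as np

# Illustrative batch AVI (fitted Q-iteration style) with a tabular Q and toy data.
n_states, n_actions, gamma, N = 6, 3, 0.9, 500
rng = np.random.default_rng(1)
X = rng.integers(0, n_states, size=N)        # states X_i
A = rng.integers(0, n_actions, size=N)       # actions A_i
R = rng.normal(size=N)                       # rewards R_i
Xp = rng.integers(0, n_states, size=N)       # next states X'_i

Q_k = np.zeros((n_states, n_actions))
for _ in range(20):
    # Unbiased sample of the Bellman optimality operator: R + gamma * max_a Q_k(X', a)
    targets = R + gamma * Q_k[Xp].max(axis=1)
    # Minimize (1/N) sum |Q(X_i, A_i) - target_i|^2; for a tabular Q this is
    # simply the per-(state, action) mean of the targets.
    Q_next = Q_k.copy()
    for s in range(n_states):
        for a in range(n_actions):
            mask = (X == s) & (A == a)
            if mask.any():
                Q_next[s, a] = targets[mask].mean()
    Q_k = Q_next
```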

Bellman Residual Minimization

Population Version

Recall that:

$$V = T^\pi V \iff V = V^\pi$$

Under FA, we may not achieve this exact equality; instead:

$$V \approx V^\pi$$

Thus, we can formulate our objective:

$$V \leftarrow \operatorname*{argmin}_{V \in \mathcal{F}} \|V - T^\pi V\|_{p,\mu}^p = \operatorname*{argmin}_{V \in \mathcal{F}} \|BR(V)\|_{p,\mu}^p$$

By minimizing this objective ($p$ is usually 2), we have Bellman Residual Minimization.

This procedure differs from AVI in that we do not mimic the iterative process of VI (which is convergent in the exact case without any FA), but instead go directly for the solution of the fixed-point equation ($V \approx T^\pi V$ instead of $V_{k+1} \approx T^\pi V_k$).

If there exists a $V \in \mathcal{F}$ that makes $\|V - T^\pi V\|_{2,\mu}^2 = 0$ and if we assume $\mu(x) > 0$ for all $x \in \mathcal{X}$, we can conclude that $V(x) = V^\pi(x)$ for all $x \in \mathcal{X}$, so $V = V^\pi$.
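For a linear $\mathcal{F}$ and a fully known MDP, the population BRM objective with $p = 2$ is quadratic in $w$, since $BR(\Phi w) = r^\pi - (\Phi - \gamma P^\pi \Phi) w$. The sketch below (toy MDP and all names are assumptions) minimizes it as a $\mu$-weighted least-squares problem.

```python
import numpy as np

# Population BRM with V = Phi w on a small known MDP (illustrative setup).
N, p, gamma = 5, 2, 0.9
rng = np.random.default_rng(2)
P_pi = rng.dirichlet(np.ones(N), size=N)     # P^pi(x' | x)
r_pi = rng.normal(size=N)                    # r^pi(x)
mu = np.ones(N) / N                          # state distribution mu
Phi = rng.normal(size=(N, p))                # basis of the linear F

# BR(V) = T^pi V - V = r^pi - (Phi - gamma P^pi Phi) w  for V = Phi w,
# so minimizing ||BR(V)||_{2,mu}^2 is a mu-weighted least-squares problem in w.
Delta = Phi - gamma * P_pi @ Phi
W = np.sqrt(mu)
w_brm, *_ = np.linalg.lstsq(W[:, None] * Delta, W * r_pi, rcond=None)
V_brm = Phi @ w_brm                          # the BRM solution within F
```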

Batch Version

Similar to AVI, we may want to replace $T^\pi Q$ with its unbiased sample $\hat{T}^\pi Q$. Thus, our empirical objective is:

$$Q \leftarrow \operatorname*{argmin}_{Q \in \mathcal{F}} \frac{1}{N}\sum_{i=1}^N \big|Q(X_i, A_i) - \hat{T}^\pi Q(X_i, A_i)\big|^2 = \operatorname*{argmin}_{Q \in \mathcal{F}} \|Q - \hat{T}^\pi Q\|_{2,\mathcal{D}_n}^2$$

where $\mathcal{D}_n = \{(X_i, A_i, R_i, X_i')\}_{i=1}^N$.

We can see that $Q$ appears in both $\hat{T}^\pi Q$ and $Q$ itself, which is different from AVI and ERM. This causes an issue: the minimizers of $\|Q - T^\pi Q\|_{2,\mu}^2$ and $\|Q - \hat{T}^\pi Q\|_{2,\mu}^2$ (in expectation) are not necessarily the same for stochastic dynamics.


To see this, for any $Q$ and $Z = (X, A)$ we compute:

$$\mathbb{E}\big[|Q(Z) - \hat{T}^\pi Q(Z)|^2 \mid Z\big]$$

Then:

$$
\begin{aligned}
\mathbb{E}\big[|Q(Z) - \hat{T}^\pi Q(Z)|^2 \mid Z\big]
&= \mathbb{E}\big[|Q(Z) - T^\pi Q(Z) + T^\pi Q(Z) - \hat{T}^\pi Q(Z)|^2 \mid Z\big] \\
&= \mathbb{E}\big[|Q(Z) - T^\pi Q(Z)|^2 \mid Z\big] + \mathbb{E}\big[|T^\pi Q(Z) - \hat{T}^\pi Q(Z)|^2 \mid Z\big] + 2\,\mathbb{E}\big[(Q(Z) - T^\pi Q(Z))(T^\pi Q(Z) - \hat{T}^\pi Q(Z)) \mid Z\big] \\
&= \mathbb{E}\big[|Q(Z) - T^\pi Q(Z)|^2 \mid Z\big] + \mathbb{E}\big[|T^\pi Q(Z) - \hat{T}^\pi Q(Z)|^2 \mid Z\big]
\end{aligned}
$$

Since $Z \sim \mu$, the first term is:

$$\mathbb{E}_\mu\big[|Q(Z) - T^\pi Q(Z)|^2\big] = \|Q - T^\pi Q\|_{2,\mu}^2$$

The second term is:

$$\mathbb{E}_\mu\Big[\mathbb{E}\big[|T^\pi Q(Z) - \hat{T}^\pi Q(Z)|^2 \mid Z\big]\Big] = \mathbb{E}_\mu\big[\mathrm{Var}\big[\hat{T}^\pi Q(Z) \mid Z\big]\big]
$$

Unlike in AVI, the variance term $\mathrm{Var}[\hat{T}^\pi Q(Z) \mid Z] = \gamma^2 \mathrm{Var}[Q(X', A') \mid Z]$ depends on $Q$, the very function we minimize the objective over. For deterministic dynamics this term is zero, but in stochastic dynamical systems it biases the objective, so the minimizer of the batch version objective is not the same as that of the population version of BRM.
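A quick Monte Carlo check of this point, with all numbers chosen purely for illustration: the single-sample squared TD error overestimates $|Q(Z) - T^\pi Q(Z)|^2$ by $\gamma^2 \mathrm{Var}[Q(X', A') \mid Z]$, while the classical double-sampling fix (two independent next states drawn from the same $(x, a)$, assumed available here) is unbiased.

```python
import numpy as np

# Toy check: with stochastic next states, the single-sample squared TD error
# is inflated by the variance term; a double-sample estimator is not.
rng = np.random.default_rng(3)
gamma = 0.9
q_z = 1.0                                    # Q(z) at a fixed z = (x, a)
next_vals = np.array([0.0, 2.0])             # Q(X', A') for the two possible next states
probs = np.array([0.5, 0.5])                 # stochastic dynamics
r = 0.3                                      # deterministic reward for simplicity

true_target = r + gamma * probs @ next_vals                   # (T^pi Q)(z)
true_sq_residual = (q_z - true_target) ** 2                   # |Q(z) - T^pi Q(z)|^2

M = 200_000
s1 = rng.choice(next_vals, size=M, p=probs)                   # first next-state sample
s2 = rng.choice(next_vals, size=M, p=probs)                   # independent second sample
single = np.mean((q_z - (r + gamma * s1)) ** 2)               # biased by gamma^2 Var[Q(X')]
double = np.mean((q_z - (r + gamma * s1)) * (q_z - (r + gamma * s2)))  # unbiased

print(true_sq_residual, single, double)
# single ~= true_sq_residual + gamma^2 * Var[Q(X')];  double ~= true_sq_residual
```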

Projected Bellman Error

From BRM, we know that even though $V \in \mathcal{F}$, $T^\pi V$ may not be in $\mathcal{F}$. A good approximation $V \in \mathcal{F}$ should therefore have a small distance to $T^\pi V$; in other words, we want $V$ to be the projection of $T^\pi V$ onto the space $\mathcal{F}$.

We want to find a $V \in \mathcal{F}$ such that:

$$V = \Pi_{\mathcal{F},\mu} T^\pi V$$

where $\Pi_{\mathcal{F},\mu}$ is the projection operator.

The projection operator $\Pi_{\mathcal{F},\mu}$ is a linear operator that takes a $V \in \mathcal{B}(\mathcal{X})$ and maps it to the closest point on $\mathcal{F}$, measured according to the $L_2(\mu)$ norm:

$$\Pi_{\mathcal{F},\mu} V \triangleq \operatorname*{argmin}_{V' \in \mathcal{F}} \|V' - V\|_{2,\mu}^2$$

It has some properties (checked numerically in the sketch after this list):

  1. The projection belongs to the function space $\mathcal{F}$: $\Pi_{\mathcal{F},\mu} V \in \mathcal{F}$.
  2. If $V \in \mathcal{F}$, the projection is $V$ itself: $V \in \mathcal{F} \Rightarrow \Pi_{\mathcal{F},\mu} V = V$.
  3. The projection operator onto a subspace is a non-expansion: $\|\Pi_{\mathcal{F},\mu} V_1 - \Pi_{\mathcal{F},\mu} V_2\|_{2,\mu}^2 \le \|V_1 - V_2\|_{2,\mu}^2$.
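Here is a small numerical sketch of the projection operator for a linear $\mathcal{F}$ (toy sizes and random features are assumptions): the projection is a $\mu$-weighted least-squares fit, and the properties above can be checked directly.

```python
import numpy as np

# Projection onto a linear F under the L2(mu) norm (illustrative toy sizes).
N, p = 6, 2
rng = np.random.default_rng(4)
Phi = rng.normal(size=(N, p))                # columns span F
mu = np.ones(N) / N

def project(V):
    """Pi_{F,mu} V: the mu-weighted least-squares fit of V by Phi w."""
    W = np.sqrt(mu)
    w, *_ = np.linalg.lstsq(W[:, None] * Phi, W * V, rcond=None)
    return Phi @ w

V1, V2 = rng.normal(size=N), rng.normal(size=N)
norm = lambda V: np.sqrt(np.sum(mu * V ** 2))

assert np.allclose(project(project(V1)), project(V1))            # projection lands in F
assert norm(project(V1) - project(V2)) <= norm(V1 - V2) + 1e-12  # non-expansion
```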

Population Version

We can define a loss function based on $V = \Pi_{\mathcal{F},\mu} T^\pi V$:

$$\|V - \Pi_{\mathcal{F},\mu} T^\pi V\|_{2,\mu}^2$$

This is called Projected Bellman Error or Mean Square Projected Bellman Error.

We find the value function by solving the following optimization problem:

$$V \leftarrow \operatorname*{argmin}_{V \in \mathcal{F}} \|V - \Pi_{\mathcal{F},\mu} T^\pi V\|_{2,\mu}^2$$

Since $V \in \mathcal{F}$ (so $\Pi_{\mathcal{F},\mu} V = V$) and the projection operator is linear:

$$V - \Pi_{\mathcal{F},\mu} T^\pi V = \Pi_{\mathcal{F},\mu} V - \Pi_{\mathcal{F},\mu} T^\pi V = \Pi_{\mathcal{F},\mu}(V - T^\pi V) = -\Pi_{\mathcal{F},\mu}(T^\pi V - V) = -\Pi_{\mathcal{F},\mu}\,BR(V)$$

So the objective can be rewritten as:

$$V \leftarrow \operatorname*{argmin}_{V \in \mathcal{F}} \|\Pi_{\mathcal{F},\mu}\,BR(V)\|_{2,\mu}^2$$

which is the norm of the projection of the Bellman Residual onto $\mathcal{F}$.


We can think of solving the Projected Bellman Error objective as solving the following two coupled optimization problems (see the sketch after this list):

  1. Find the projection point given the value function $V$: $V' = \operatorname*{argmin}_{V'' \in \mathcal{F}} \|V'' - T^\pi V\|_{2,\mu}^2$
  2. Find the value function $V \in \mathcal{F}$ that is closest to the projection point $V'$: $V = \operatorname*{argmin}_{V \in \mathcal{F}} \|V - V'\|_{2,\mu}^2$
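The sketch below follows exactly this two-step view on a small known MDP (toy values and names are assumptions): project $T^\pi V$ onto $\mathcal{F}$, then measure the distance of $V$ to that projection point.

```python
import numpy as np

# PBE for a given V = Phi w on a small known MDP (illustrative values).
N, p, gamma = 5, 2, 0.9
rng = np.random.default_rng(5)
P_pi = rng.dirichlet(np.ones(N), size=N)
r_pi = rng.normal(size=N)
mu = np.ones(N) / N
Phi = rng.normal(size=(N, p))

def project(V):
    """Pi_{F,mu} V via a mu-weighted least-squares fit."""
    W = np.sqrt(mu)
    w, *_ = np.linalg.lstsq(W[:, None] * Phi, W * V, rcond=None)
    return Phi @ w

w = rng.normal(size=p)
V = Phi @ w
TV = r_pi + gamma * P_pi @ V                 # T^pi V (may leave F)
V_proj = project(TV)                         # step 1: projection point
pbe = np.sum(mu * (V - V_proj) ** 2)         # step 2: ||V - Pi T^pi V||_{2,mu}^2
```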

Least Squares Temporal Difference Learning (Population Version)

If $\mathcal{F}$ is a linear FA with basis functions $\phi_1, \ldots, \phi_p$:

$$\mathcal{F} = \{x \mapsto \phi(x)^T w \; ; \; w \in \mathbb{R}^p\}$$

we can find a direct solution to the PBE objective, that is, we want to find $V \in \mathcal{F}$ such that:

$$V = \Pi_{\mathcal{F},\mu} T^\pi V$$

We assume that:

  • $\mathcal{X}$ is finite with $|\mathcal{X}| = N$ and $N \gg p$; each $\phi_i = [\phi_i(x_1), \ldots, \phi_i(x_N)]^T$ is an $N$-dimensional vector.

Then, we define $\Phi \in \mathbb{R}^{N \times p}$ as the matrix obtained by concatenating all feature vectors:

$$\Phi = [\phi_1 \; \cdots \; \phi_p] \in \mathbb{R}^{N \times p}$$

The value function $V \in \mathcal{F}$ is then:

$$V_{N \times 1} = \Phi_{N \times p}\, w_{p \times 1}$$

Our goal is to find a weight vector $w \in \mathbb{R}^p$ such that:

$$V = \Pi_{\mathcal{F},\mu} T^\pi V \iff \Phi w = \Pi_{\mathcal{F},\mu} T^\pi (\Phi w)$$

Let $M = \mathrm{diag}(\mu)$. Since:

$$\|V\|_{2,\mu}^2 = \langle V, V\rangle_\mu = \sum_{x \in \mathcal{X}} |V(x)|^2 \mu(x) = V^T M V$$

$$\langle V_1, V_2\rangle_\mu = \sum_{x \in \mathcal{X}} V_1(x) V_2(x)\, \mu(x) = V_1^T M V_2$$

the projection operator onto the linear $\mathcal{F}$ is:

$$\Pi_{\mathcal{F},\mu} V = \operatorname*{argmin}_{V' \in \mathcal{F}} \|V' - V\|_{2,\mu}^2 = \Phi w^*, \qquad w^* = \operatorname*{argmin}_{w \in \mathbb{R}^p} \|\Phi w - V\|_{2,\mu}^2 = \operatorname*{argmin}_{w \in \mathbb{R}^p} (\Phi w - V)^T M (\Phi w - V)$$

Taking the derivative with respect to $w$ and setting it to zero (assuming that $\Phi^T M \Phi$ is invertible):

$$\Phi^T M (\Phi w^* - V) = 0 \;\Rightarrow\; w^* = (\Phi^T M \Phi)^{-1} \Phi^T M V$$

Then the projection is:

$$\Pi_{\mathcal{F},\mu} V = \Phi w^* = \Phi (\Phi^T M \Phi)^{-1} \Phi^T M V$$

Since $(T^\pi V)(x) = r^\pi(x) + \gamma \sum_{x' \in \mathcal{X}} \mathcal{P}^\pi(x' \mid x) V(x')$, with $\mathcal{P}^\pi(x' \mid x) = \sum_{a \in \mathcal{A}} \pi(a \mid x)\, \mathcal{P}(x' \mid x, a)$, we can write it in vector form for all states:

$$(T^\pi V)_{N \times 1} = r^\pi_{N \times 1} + \gamma P^\pi_{N \times N} V_{N \times 1} \qquad (T^\pi \Phi w)_{N \times 1} = r^\pi_{N \times 1} + \gamma P^\pi_{N \times N} \Phi_{N \times p}\, w_{p \times 1}$$

Substituting $T^\pi \Phi w$ for $V$ in the projection equation:

$$\Pi_{\mathcal{F},\mu} T^\pi \Phi w = \Phi w = \Phi (\Phi^T M \Phi)^{-1} \Phi^T M (T^\pi \Phi w)$$
$$\Phi w = \big[\Phi (\Phi^T M \Phi)^{-1} \Phi^T M\big]\big[r^\pi + \gamma P^\pi \Phi w\big]$$

Multiplying both sides by $\Phi^T M$ and simplifying:

$$
\begin{aligned}
\Phi^T M \Phi w &= \big[\Phi^T M \Phi (\Phi^T M \Phi)^{-1} \Phi^T M\big]\big[r^\pi + \gamma P^\pi \Phi w\big] \\
\Phi^T M \Phi w &= \Phi^T M \big[r^\pi + \gamma P^\pi \Phi w\big] \\
\Phi^T M \big[r^\pi + \gamma P^\pi \Phi w - \Phi w\big] &= 0 \\
w_{TD} &= \big[\Phi^T M (\Phi - \gamma P^\pi \Phi)\big]^{-1} \Phi^T M r^\pi
\end{aligned}
$$

The solution $w_{TD}$ is called the Least Squares Temporal Difference (LSTD) solution.

Notice that the term $r^\pi + \gamma P^\pi \Phi w - \Phi w = T^\pi V - V = BR(V)$. Hence:

$$\Phi^T M\, BR(V) = 0 \iff \langle \phi_i, BR(V)\rangle_\mu = 0, \quad i = 1, \ldots, p$$

Thus, LSTD finds a $w$ such that the Bellman Residual is orthogonal (in the $L_2(\mu)$ sense) to the basis of $\mathcal{F}$, which is exactly what we want.
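A minimal numerical sketch of the population LSTD solution (the toy MDP and all names are assumptions): compute $w_{TD}$ from the closed-form expression above and verify that the Bellman Residual is orthogonal to every basis function under $\mu$.

```python
import numpy as np

# Population LSTD on a small known MDP (illustrative toy setup).
N, p, gamma = 5, 2, 0.9
rng = np.random.default_rng(6)
P_pi = rng.dirichlet(np.ones(N), size=N)
r_pi = rng.normal(size=N)
mu = np.ones(N) / N
Phi = rng.normal(size=(N, p))
M = np.diag(mu)

# w_TD = [Phi^T M (Phi - gamma P^pi Phi)]^{-1} Phi^T M r^pi
A = Phi.T @ M @ (Phi - gamma * P_pi @ Phi)
b = Phi.T @ M @ r_pi
w_td = np.linalg.solve(A, b)

# Check: the Bellman Residual is orthogonal (w.r.t. mu) to every basis function.
V = Phi @ w_td
BR = r_pi + gamma * P_pi @ V - V
assert np.allclose(Phi.T @ M @ BR, 0.0)
```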

Batch Version

We have shown that the solution to $V = \Pi_{\mathcal{F},\mu} T^\pi V$ with linear FA $V = \Phi w$ is:

$$w_{TD} = A_{p \times p}^{-1}\, b_{p \times 1}$$

where:

  • $A = \Phi^T M (\Phi - \gamma P^\pi \Phi)$
  • $b = \Phi^T M r^\pi$

Expanding $A$ and $b$ in terms of summations, we have:

$$A_{ij} = \sum_{n=1}^N \Phi_{ni}\, \mu(x_n)\, (\Phi - \gamma P^\pi \Phi)_{nj} \qquad b_i = \sum_{n=1}^N \Phi_{ni}\, \mu(x_n)\, r^\pi(x_n)$$

Thus, in terms of the current state $x$ and next state $x'$:

$$A = \sum_{x \in \mathcal{X}} \mu(x)\, \phi(x) \Big[\phi(x) - \gamma \sum_{x' \in \mathcal{X}} \mathcal{P}^\pi(x' \mid x)\, \phi(x')\Big]^T = \mathbb{E}_\mu\Big[\phi(X)\big[\phi(X) - \gamma\, \mathbb{E}_{\mathcal{P}^\pi}[\phi(X')]\big]^T\Big]$$

$$b = \sum_{x \in \mathcal{X}} \mu(x)\, \phi(x)\, r^\pi(x) = \mathbb{E}_\mu\big[\phi(X)\, r^\pi(X)\big]$$

Given a dataset $\mathcal{D}_n = \{(X_i, R_i, X_i')\}_{i=1}^n$ with $X_i \sim \mu$, $X_i' \sim \mathcal{P}^\pi(\cdot \mid X_i)$ and $R_i \sim \mathcal{R}^\pi(\cdot \mid X_i)$, we define the empirical estimators $\hat{A}_n, \hat{b}_n$ as:

$$\hat{A}_n = \frac{1}{n}\sum_{i=1}^n \phi(X_i)\big[\phi(X_i) - \gamma\, \phi(X_i')\big]^T \qquad \hat{b}_n = \frac{1}{n}\sum_{i=1}^n \phi(X_i)\, R_i$$
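A minimal sketch of batch LSTD, assuming a hand-picked polynomial feature map and synthetic transitions purely for illustration: build $\hat{A}_n$ and $\hat{b}_n$ from the samples and solve the linear system.

```python
import numpy as np

# Batch LSTD from transitions (X_i, R_i, X'_i); the feature map and data
# below are illustrative assumptions.
n, p, gamma = 1000, 3, 0.9
rng = np.random.default_rng(7)
phi = lambda x: np.stack([np.ones_like(x), x, x ** 2], axis=-1)  # assumed feature map

X = rng.uniform(-1, 1, size=n)                        # X_i ~ mu
Xp = np.clip(X + 0.1 * rng.normal(size=n), -1, 1)     # X'_i ~ P^pi(.|X_i)  (toy dynamics)
R = -X ** 2 + 0.1 * rng.normal(size=n)                # R_i ~ R^pi(.|X_i)   (toy rewards)

F, Fp = phi(X), phi(Xp)                               # phi(X_i), phi(X'_i) as rows
# (1/n) sum_i phi(X_i) [phi(X_i) - gamma phi(X'_i)]^T
A_hat = (F[:, :, None] * (F - gamma * Fp)[:, None, :]).mean(axis=0)
# (1/n) sum_i phi(X_i) R_i
b_hat = (F * R[:, None]).mean(axis=0)
w_td = np.linalg.solve(A_hat, b_hat)                  # LSTD weight vector
```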


We can use LSTD within an approximate policy iteration procedure to obtain a close-to-optimal policy; this is the Least Squares Policy Iteration (LSPI) algorithm.

Semi-Gradient TD

Suppose that we know the true value function $V^\pi$ and we want to find an approximation $\hat{V}$, parameterized by $w$. The population loss is:

$$V \leftarrow \operatorname*{argmin}_{\hat{V} \in \mathcal{F}} \frac{1}{2}\|V^\pi - \hat{V}\|_{2,\mu}^2$$

Using samples $X_t \sim \mu$, we can define an SGD procedure that updates $w_t$ as follows:

$$w_{t+1} \leftarrow w_t - \alpha_t \nabla_{w_t}\Big[\tfrac{1}{2}\big|V^\pi(X_t) - \hat{V}(X_t; w_t)\big|^2\Big] = w_t + \alpha_t\big(V^\pi(X_t) - \hat{V}(X_t; w_t)\big)\nabla_{w_t}\hat{V}(X_t; w_t)$$

If we select a proper step size $\alpha_t$, SGD converges to a stationary point.

When we do not know $V^\pi$, we can use a bootstrapped estimate (the TD target $\hat{T}^\pi V_t$) instead:

$$w_{t+1} \leftarrow w_t + \alpha_t\big(\hat{T}^\pi V_t(X_t) - \hat{V}(X_t; w_t)\big)\nabla_{w_t}\hat{V}(X_t; w_t)$$
$$w_{t+1} \leftarrow w_t + \alpha_t\big(R_t + \gamma \hat{V}(X_t'; w_t) - \hat{V}(X_t; w_t)\big)\nabla_{w_t}\hat{V}(X_t; w_t)$$

The substitution of $V^\pi(X_t)$ with $\hat{T}^\pi V_t(X_t)$ does not follow from the SGD of any loss function. The TD update is not a true SGD update; for that reason, we call it a semi-gradient update.
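To make the update concrete, here is a minimal sketch of semi-gradient TD(0) with a linear $\hat{V}(x; w) = \phi(x)^T w$; the feature map, toy dynamics, and step size are illustrative assumptions. Note that the gradient is taken only through $\hat{V}(X_t; w)$ and not through the bootstrapped target, which is exactly what makes it a semi-gradient method.

```python
import numpy as np

# Semi-gradient TD(0) with linear V_hat(x; w) = phi(x)^T w on a toy
# one-dimensional chain (feature map, dynamics, step size are illustrative).
gamma, alpha, T = 0.9, 0.05, 10_000
rng = np.random.default_rng(8)
phi = lambda x: np.array([1.0, x, x ** 2])            # assumed feature map

w = np.zeros(3)
x = 0.0
for t in range(T):
    x_next = np.clip(x + 0.1 * rng.normal(), -1, 1)   # X'_t ~ P^pi(.|X_t)
    r = -x ** 2 + 0.1 * rng.normal()                  # R_t
    v, v_next = phi(x) @ w, phi(x_next) @ w
    td_error = r + gamma * v_next - v                 # bootstrapped target minus V_hat(X_t)
    # Semi-gradient update: gradient only through V_hat(X_t; w), not the target.
    w = w + alpha * td_error * phi(x)
    x = x_next
```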