Value Function Approximation (1)

Value Function Approximation (Introduction)

In real-world problems, the state space \(\chi\) or the state-action space \(\chi \times A\) is often so large that we cannot represent quantities such as the value function or the policy exactly.

  • \(\chi \subseteq \mathbb{R}^d\) with \(d \geq 1\). Exact representation of an arbitrary function on \(\mathbb{R}^d\) on a computer is infeasible.
  • \(\chi\) is finite, but very large (millions or billions of states).

We need to approximate those functions using a representation that is feasible to manipulate on a computer. This is called function approximation.

Function approximation also helps with generalization: a good approximator assigns sensible values to states that never appeared in the data.

Linear Function Approximation

We may use a linear function approximator defined based on a set of basis functions:

\[\hat{V} (x) = \phi (x)^T w = \sum^p_{i=1} \phi_i (x) w_i\]

with \(w \in \mathbb{R}^p\) and \(\phi: \chi \rightarrow \mathbb{R}^p\)

Any \(\hat{V}\) belongs to the space of functions \(\mathbf{F}\):

\[\mathbf{F} = \{x \mapsto \phi (x)^T w : w \in \mathbb{R}^p\}\]

The function space \(\mathbf{F}\) is called the linear value function space, in this case it is a span of a set of features.
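As a concrete illustration (the feature map and weight vector below are hypothetical, chosen only for the example), a linear approximator \(\hat{V}(x) = \phi(x)^T w\) can be sketched as:

```python
import numpy as np

def phi(x):
    # Hypothetical feature map phi: R -> R^3 (polynomial features 1, x, x^2).
    return np.array([1.0, x, x**2])

def v_hat(x, w):
    # Linear value function approximation: V_hat(x) = phi(x)^T w.
    return phi(x) @ w

w = np.array([0.5, -1.0, 2.0])  # example weight vector w in R^3
print(v_hat(2.0, w))            # 0.5 - 2.0 + 8.0 = 6.5
```

Every choice of \(w\) picks out one member of \(\mathbf{F}\); learning amounts to searching over \(w\).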

\(L_p (v)\)-norms of Value Function

For a probability distribution \(v \in M(\chi)\) and a measurable function \(V\) on \(\chi\), we define the \(L_p (v)\)-norm of \(V\) with \(1 \leq p < \infty\) as:

\[\|V\|^p_{p, v} \triangleq \int_{x \in \chi} |V(x)|^p dv(x)\]

The \(L_{\infty} (\chi)\) norm is:

\[\|V\|_{\infty} \triangleq \sup_{x \in \chi} |V(x)|\]
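For a finite state space these norms reduce to weighted sums, which the following sketch computes (the values of \(V\) and the distribution \(v\) are made up for illustration):

```python
import numpy as np

def lp_norm(V, nu, p):
    # ||V||_{p,nu} = ( sum_x |V(x)|^p nu(x) )^(1/p) on a finite state space.
    return (np.sum(np.abs(V) ** p * nu)) ** (1.0 / p)

V = np.array([1.0, -2.0, 3.0])    # hypothetical values of V on 3 states
nu = np.array([0.5, 0.25, 0.25])  # probability distribution v over the states
print(lp_norm(V, nu, 2))          # sqrt(0.5*1 + 0.25*4 + 0.25*9) = sqrt(3.75)
print(np.max(np.abs(V)))          # L_infinity norm: 3.0
```

Note that states with small probability under \(v\) contribute little to the \(L_p(v)\)-norm, whereas the \(L_\infty\) norm weighs all states equally.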

Goal

Suppose that we happen to know one of \(V^{\pi}, Q^{\pi}, V^*, Q^*\) and we want to represent it with a function \(V \in \mathbf{F}\). Taking \(V^{\pi}\) as the example, our goal is to find:

\[V \approx V^{\pi}\]

To make \(V \approx V^{\pi}\) precise, we have to pick a distance function \(d\) between \(V\) and \(V^{\pi}\). Given such a distance, we can express our goal as:

\[\hat{V} = \underset{V \in \mathbf{F}}{\arg\min} \; d(V, V^{\pi})\]

A commonly used family of distances is based on the \(L_p (\mu)\)-norm w.r.t. a probability measure \(\mu \in M(\chi)\):

\[\hat{V} = \underset{V \in \mathbf{F}}{\arg\min} \; \|V - V^{\pi}\|^p_{p, \mu} \qquad \hat{Q} = \underset{Q \in \mathbf{F}}{\arg\min} \; \|Q - Q^{\pi}\|^p_{p, \mu}\]

A common choice is \(p=2\), which corresponds to the mean squared error loss used in regression.
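When \(\chi\) is finite and \(V^{\pi}\) is known, the \(p=2\) minimizer over a linear \(\mathbf{F}\) has a closed form: it is a weighted least-squares projection. A minimal sketch, with a made-up feature matrix, target, and weighting \(\mu\) (none of these come from the notes):

```python
import numpy as np

# Hypothetical finite problem: 4 states, 2 features. Project a known V^pi
# onto F in the L2(mu) norm via weighted least squares.
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])            # feature matrix; rows are phi(x)^T
V_pi = np.array([0.0, 1.1, 1.9, 3.2])   # target value function (made up)
mu = np.array([0.1, 0.4, 0.4, 0.1])     # weighting distribution mu

D = np.diag(mu)
# w* = argmin_w ||Phi w - V_pi||_{2,mu}^2 solves the normal equations
# (Phi^T D Phi) w = Phi^T D V_pi.
w_star = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ V_pi)
V_hat = Phi @ w_star
print(w_star, V_hat)
```

The residual \(\hat{V} - V^{\pi}\) is orthogonal to the feature columns in the \(\mu\)-weighted inner product, which is exactly the normal-equations condition.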

Batch RL

In general, \(V^{\pi}, Q^{\pi}, V^*, Q^*\) are unknown quantities, and we need to estimate them using data. We consider the batch data setting:

  • The data is already collected and we are interested in using it to estimate quantities such as \(Q^{\pi}, Q^{*}\)

  • Suppose that we have: \[D_N = \{(X_i, A_i, R_i, X^{\prime}_i)\}^N_{i=1}\]

    with \((X_i, A_i) \sim \mu\), \(X^{\prime}_i \sim P(\cdot | X_i, A_i)\), and \(R_i \sim R(\cdot | X_i, A_i)\). Here \(\mu\) can be any distribution; it is the one the \(L_p (\mu)\)-norm refers to, and it is often the stationary distribution induced by the policy \(\pi\).

  • The data could also be generated by following a behaviour policy \(\pi_b\).

  • In the batch setting, the agent does not interact with the environment while it is computing \(Q^{\pi}\) or \(Q^{*}\). This is in contrast with online methods such as TD or Q-learning, where the agent updates its estimate of the value function as each data point arrives.

Batch RL may collect a batch of data, process it, and then collect a new batch, possibly based on a policy resulting from the previous batch computation.

Empirical Risk Minimization

Suppose, we have a batch of data:

\[D_N = \{(X_i, A_i, G^{\pi}_i (X_i, A_i))\}^{N}_{i=1}\]

where \(G^{\pi}_i (X_i, A_i)\) is the return starting from state \(X_i\), taking action \(A_i\), and following the policy \(\pi\) afterwards (with \((X_i, A_i) \sim \mu\)). The return can be obtained using initial-state-only MC: select \((X_i, A_i) \sim \mu\) and then follow \(\pi\) until the end of the episode.
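The data-collection step above can be sketched as follows. The two-state MDP, the policy, and the truncation horizon are all assumptions made for the example; only the overall shape of \(D_N\) comes from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

def step(x, a):
    # Hypothetical 2-state MDP: the state flips w.p. 0.8; reward is x + a.
    x_next = 1 - x if rng.random() < 0.8 else x
    return x_next, float(x + a)

def pi(x):
    # Fixed policy to evaluate: always take action 0 (assumed for illustration).
    return 0

def mc_return(x, a, horizon=50):
    # G^pi(x, a): take action a at x, then follow pi; truncated discounted return.
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        x, r = step(x, a)
        g += discount * r
        discount *= gamma
        a = pi(x)
    return g

# Batch D_N of (X_i, A_i, G_i) with (X_i, A_i) drawn from mu (uniform here).
D_N = [(x, a, mc_return(x, a)) for x, a in
       zip(rng.integers(0, 2, 100), rng.integers(0, 2, 100))]
```

Each \(G^{\pi}_i\) is a noisy but (up to truncation) unbiased sample of \(Q^{\pi}(X_i, A_i)\).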

Recall the loss function:

\[\hat{Q} = \underset{Q \in \mathbf{F}}{\arg\min} \; \|Q - Q^{\pi}\|^p_{p, \mu}\]

  • We do not know the distribution \(\mu\); we only have samples from it.
  • We do not know \(Q^{\pi}\); we only have unbiased estimates of it at a finite number of data points.


The question is: can we replace \(Q^{\pi}\) with \(G^{\pi}\)?

For any \(Q\), we can decompose:

\[\begin{aligned} E_{\mu, G^{\pi}}[|Q(X, A) - G^{\pi}(X, A)|^2] &= E_{\mu, G^{\pi}}[|Q(X, A) - Q^{\pi} (X, A) + Q^{\pi} (X, A) - G^{\pi} (X, A)|^2]\\ &= E_{\mu, G^{\pi}}[|Q(X, A) - Q^{\pi}(X, A)|^2] + E_{\mu, G^{\pi}}[|Q^{\pi}(X, A) - G^{\pi}(X, A)|^2] + 2 E_{\mu, G^{\pi}}[(Q(X, A) - Q^{\pi} (X, A))(Q^{\pi} (X, A) - G^{\pi} (X, A))]\\ \end{aligned}\]

The first term:

\[E_{\mu, G^{\pi}}[|Q(X, A) - Q^{\pi}(X, A)|^2] = \int_{x, a} |Q(x, a) - Q^{\pi}(x, a)|^2 d\mu(x, a) = \|Q - Q^{\pi}\|^2_{2, \mu}\]

The second term:

\[E_{\mu, G^{\pi}}[|Q^{\pi}(X, A) - G^{\pi}(X, A)|^2] = E_{\mu}[E_{G^{\pi}}[|Q^{\pi}(X, A) - G^{\pi}(X, A)|^2 | X, A]] = E_{\mu}[Var[G^{\pi} (X, A) | X, A]]\]

The third term:

\[\begin{aligned} 2 E_{\mu, G^{\pi}}[(Q(X, A) - Q^{\pi} (X, A))(Q^{\pi} (X, A) - G^{\pi} (X, A))] &= 2E_{\mu}[E_{G^{\pi}} [(Q(X, A) - Q^{\pi} (X, A))(Q^{\pi}(X, A) - G^{\pi} (X, A)) | X, A]]\\ &= 2E_{\mu}[(Q(X, A) - Q^{\pi} (X, A))(Q^{\pi}(X, A) - \underbrace{E[G^{\pi} (X, A) | X, A]}_{Q^{\pi} (X, A)})]\\ &= 0 \end{aligned}\]

Thus:

\[\begin{aligned} \underset{Q \in \mathbf{F}}{\arg\min} \; E_{\mu, G^{\pi}}[|Q(X, A) - G^{\pi}(X, A)|^2] &= \underset{Q \in \mathbf{F}}{\arg\min} \; \|Q - Q^{\pi}\|^2_{2, \mu} + E_{\mu}[Var[G^{\pi} (X, A) | X, A]]\\ &= \underset{Q \in \mathbf{F}}{\arg\min} \; \|Q - Q^{\pi}\|^2_{2, \mu} \end{aligned}\]


We can see that if we could find the minimizer of \(\underset{Q \in \mathbf{F}}{\arg\min} \; E_{\mu, G^{\pi}}[|Q(X, A) - G^{\pi}(X, A)|^2]\), it would coincide with the minimizer of our target: the variance term does not depend on \(Q\). However, we cannot compute this expectation, because we do not know \(\mu\) or the environment dynamics \(P\) and \(R\); we only have samples. A common solution in ML is empirical risk minimization, which prescribes that we solve:

\[\hat{Q} = \underset{Q \in \mathbf{F}}{\arg\min} \; \frac{1}{N} \sum^{N}_{i=1} |Q(X_i, A_i) - G^{\pi} (X_i, A_i)|^2 = \underset{Q \in \mathbf{F}}{\arg\min} \; \|Q - G^{\pi}\|^2_{2, D_N}\]

This is indeed a regression problem with the squared error loss.
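With a linear \(\mathbf{F}\), this ERM problem is ordinary least squares on the batch. A minimal sketch on synthetic data (the one-hot features, the true \(Q\) values, and the noise model are all assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x, a):
    # Hypothetical state-action features: one-hot over 2 states x 2 actions.
    f = np.zeros(4)
    f[2 * x + a] = 1.0
    return f

# Synthetic batch: (X_i, A_i) uniform; G_i is a noisy return around a made-up Q^pi.
Q_true = np.array([1.0, 2.0, 0.5, 3.0])
X = rng.integers(0, 2, 500)
A = rng.integers(0, 2, 500)
G = Q_true[2 * X + A] + 0.1 * rng.standard_normal(500)

# ERM: minimize (1/N) sum_i |phi(X_i, A_i)^T w - G_i|^2 via least squares.
Phi = np.stack([phi(x, a) for x, a in zip(X, A)])
w, *_ = np.linalg.lstsq(Phi, G, rcond=None)
print(w)  # approximately recovers Q_true at each (x, a)
```

Because \(E[G^{\pi}|X, A] = Q^{\pi}(X, A)\), the regression fit converges to \(Q^{\pi}\) (within \(\mathbf{F}\)) as \(N\) grows, exactly as the bias-variance decomposition above predicts.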

Next: AVI, BRM, LSTD, etc.