Learning From Stream of Data

Learning From Stream of Data (Introduction)

In the previous DP settings, we know the transition distribution \(P\) and the reward distribution \(R\): everything about the environment dynamics is given. In the RL setting, however, we have no access to the environment dynamics; instead, we observe data generated by the agent interacting with its environment.

A stream of data (rollout or trajectory):

\[X_1, A_1, R_1, X_2, A_2, R_2, ...\]

with \(X_1 \sim \rho(\cdot)\), \(A_t \sim \pi(\cdot | X_t)\), \(R_t \sim R(\cdot | X_t, A_t)\), and \(X_{t+1} \sim P(\cdot | X_t, A_t)\).
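
For concreteness, a stream like this can be simulated for a small finite MDP. In the sketch below, the transition probabilities `P`, reward means `R_mean`, initial distribution `rho`, and the uniform policy are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# Made-up environment dynamics: P[x, a] is a distribution over next states,
# rewards are Gaussian around R_mean[x, a], and rho is the initial-state distribution.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R_mean = rng.normal(size=(n_states, n_actions))
rho = np.ones(n_states) / n_states

def pi(x):
    """Behaviour policy: uniform over actions (just for illustration)."""
    return int(rng.integers(n_actions))

def rollout(T=5):
    """Sample X_1, A_1, R_1, X_2, A_2, R_2, ... as defined above."""
    traj = []
    x = rng.choice(n_states, p=rho)              # X_1 ~ rho(.)
    for _ in range(T):
        a = pi(x)                                # A_t ~ pi(. | X_t)
        r = rng.normal(R_mean[x, a], 1.0)        # R_t ~ R(. | X_t, A_t)
        traj.append((x, a, r))
        x = rng.choice(n_states, p=P[x, a])      # X_{t+1} ~ P(. | X_t, A_t)
    return traj

print(rollout())
```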

The questions are:

  1. How can we learn the value of a policy \(\pi\)?
  2. How can we find the optimal policy \(\pi^{*}\)?

Stochastic Approximation of Mean

Let \(Z_1, ..., Z_t\) be i.i.d. random variables drawn from a distribution \(\nu\). How can we estimate the expectation of \(Z_i\)? One way is stochastic approximation (SA).

Define:

\[\theta_{t+1} = (1 - \alpha_t)\theta_t + \alpha_t Z_t\]

Note that \(\theta_{t+1}\) is a random variable, because \(Z_t\) is a random variable.
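
As a minimal sketch (the distribution of \(Z_t\), its mean, and the step size here are made up for illustration), the update is a single line inside a loop:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1      # constant step size
theta = 0.0      # theta_0

for t in range(1000):
    z = rng.normal(2.0, 1.0)                  # Z_t drawn i.i.d. from nu = N(2, 1)
    theta = (1 - alpha) * theta + alpha * z   # theta_{t+1} = (1 - alpha) theta_t + alpha Z_t

print(theta)  # stays near the true mean m = 2, but keeps fluctuating
```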

Expectation

Assume \(\alpha_t = \alpha\) is fixed, \(E[Z_t] = m\):

\[\begin{aligned} E[\theta_{t+1}] &= E[(1 - \alpha) \theta_t + \alpha Z_t]\\ &= (1 - \alpha) E[\theta_t] + \alpha E[Z_t]\\ &= (1 - \alpha) E[\theta_t] + \alpha m\\ \end{aligned}\]

Let \(E[\theta_t] = \bar{\theta_t}\), then:

\[E[\theta_{t+1}] = (1 - \alpha) E[\theta_t] + \alpha m \implies \bar{\theta_{t+1}} = (1 - \alpha) \bar{\theta_t} + \alpha m\]

Let \(\theta_0 = 0\), that is, \(\bar{\theta_0} = 0\):

\[\begin{aligned} \bar{\theta_{1}} &= \alpha m\\ \bar{\theta_{2}} &= (1 - \alpha) \alpha m + \alpha m\\ &\;\vdots\\ \bar{\theta_{t}} &= \alpha \sum_{i=0}^{t-1} (1 - \alpha)^i m = \frac{\alpha m \left(1 - (1 - \alpha)^t\right)}{1 - (1 - \alpha)} = m\left[1 - (1 - \alpha)^t\right]\\ \end{aligned}\]

As \(t \rightarrow \infty\), if \(0 < \alpha \leq 1\):

\[\lim_{t \rightarrow \infty} \bar{\theta_{t}} = \lim_{t \rightarrow \infty} m[1 - (1 - \alpha)^t] = m\]

This implies that \(\theta_t\) converges to \(m\) in expectation.

Variance

Since the \(Z_i\) are i.i.d., \(Z_t\) is independent of \(\theta_t\) (which depends only on the earlier samples), so, with \(\sigma^2 = V[Z_t]\):

\[V[\theta_{t+1}] = (1 - \alpha)^2 V[\theta_t] + \alpha^2 V[Z_t] \geq \alpha^2 V[Z_t] = \alpha^2 \sigma^2\]

We can show that:

\[\lim_{t \rightarrow \infty} V[\theta_t] = \frac{\alpha \sigma^2}{2 - \alpha}\]
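
To see where this limit comes from, assume the limit \(V_{\infty} = \lim_{t \rightarrow \infty} V[\theta_t]\) exists and take the limit on both sides of the recursion above:

\[\begin{aligned} V_{\infty} &= (1 - \alpha)^2 V_{\infty} + \alpha^2 \sigma^2\\ V_{\infty}\left[1 - (1 - \alpha)^2\right] &= \alpha^2 \sigma^2\\ V_{\infty} &= \frac{\alpha^2 \sigma^2}{2\alpha - \alpha^2} = \frac{\alpha \sigma^2}{2 - \alpha}\\ \end{aligned}\]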

For a constant \(\alpha\), the variance of \(\theta_t\) does not converge to zero. This means that \(\theta_t\) keeps fluctuating around its mean.
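
A quick simulation (with a made-up distribution \(N(2, 1)\), so \(m = 2\) and \(\sigma^2 = 1\)) illustrates both limits:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, m, sigma = 0.1, 2.0, 1.0
n_runs, T = 10_000, 500

# Run many independent copies of the recursion in parallel.
theta = np.zeros(n_runs)
for t in range(T):
    Z = rng.normal(m, sigma, size=n_runs)
    theta = (1 - alpha) * theta + alpha * Z

print(theta.mean())  # ~ m = 2.0
print(theta.var())   # ~ alpha * sigma**2 / (2 - alpha) ~= 0.0526
```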

SA Conditions

In order to make \(\theta_t\) converge, we need a time-dependent step size \(\alpha_t \rightarrow 0\) that follows a schedule satisfying:

\[\sum_{t=0}^{\infty} \alpha_t = \infty \]

\[\sum_{t=0}^{\infty} \alpha_t^2 < \infty \]
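
For example, \(\alpha_t = \frac{1}{t+1}\) satisfies both conditions; with this schedule and \(\theta_0 = 0\), the recursion reduces exactly to the running sample mean of the \(Z_t\). A small sketch (the distribution is again made up):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0

for t in range(100_000):
    z = rng.normal(2.0, 1.0)
    alpha_t = 1.0 / (t + 1)                       # satisfies the SA conditions
    theta = (1 - alpha_t) * theta + alpha_t * z   # equals the sample mean of z_0..z_t

print(theta)  # converges to the true mean m = 2; the fluctuations die out
```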

\(\epsilon\)-Greedy Policy

Problem With Greedy Policy

By selecting

\[a = \pi_g (x; Q) = argmax_{a \in A} Q(x, a)\]

we would choose the action that appears best at state \(x\). However, if \(Q(x, a)\) is an inaccurate estimate of \(Q^{\pi}(x, a)\), we may select a suboptimal action. It is also possible that we get stuck choosing this action forever, without any chance to improve its estimate.

Solution

One solution to this problem is to force the agent to regularly pick actions other than the one suggested by the greedy policy.

Thus, for \(\epsilon \in [0, 1]\), \(x \in X\), and an action-value function \(\hat{Q}\), we define the \(\epsilon\)-greedy policy \(\pi_{\epsilon}\) as:

\[ \pi_\epsilon (\cdot | x; \hat{Q}) = \begin{cases} \pi_g(x; \hat{Q}), & \text{w.p. } 1 - \epsilon \\ \text{Uniform}(A), & \text{w.p. } \epsilon \\ \end{cases} \]

  • The uniform choice of action in the \(\epsilon\)-greedy policy helps the agent explore all actions, even ones that currently seem suboptimal.
  • The greedy part of its action-selection mechanism exploits the agent's current knowledge and chooses the action with the highest estimated value.
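
A minimal sketch of this action-selection rule for a tabular \(\hat{Q}\) (the array `Q_hat`, the state, and the \(\epsilon\) value below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q_hat, x, epsilon=0.1):
    """Pick an action at state x: greedy w.p. 1 - epsilon, uniform w.p. epsilon."""
    n_actions = Q_hat.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: Uniform(A)
    return int(np.argmax(Q_hat[x]))           # exploit: pi_g(x; Q_hat)

# Made-up action-value estimates for 3 states and 2 actions.
Q_hat = np.array([[1.0, 0.5],
                  [0.2, 0.9],
                  [0.0, 0.0]])
print(epsilon_greedy(Q_hat, x=1, epsilon=0.1))
```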

Exploration and Exploitation Tradeoff

Exploiting our knowledge is a reasonable choice when our knowledge about the world is accurate (e.g., if we have \(Q^{*}\), we should always choose the greedy action at each state in order to maximize the return).

However, when we are uncertain about the world, we should not be overconfident in our knowledge and exploit it all the time (e.g., if we only have an estimate \(\hat{Q}\) of \(Q^{*}\), it is possible that \(argmax_{a \in A} \hat Q (x, a) \neq argmax_{a \in A} Q^{*} (x, a)\)). Instead, we should explore other available actions, which might happen to be better.

The tradeoff between exploration and exploitation is a major topic in RL and is an area of active research.

Off-Policy and On-Policy Sampling Scenarios

Suppose we have a rollout or trajectory obtained by following some policy \(\pi_b\) (e.g., an \(\epsilon\)-greedy policy):

\[X_1, A_1, R_1, X_2, ....\]

And suppose we have some policy \(\pi\) that we want to evaluate (e.g., the greedy policy):

  • If \(\pi_b = \pi\), we are in the on-policy sampling scenario, in which the agent is evaluating the same policy that it is following.
  • If \(\pi_b \neq \pi\), we are in the off-policy sampling scenario, in which the agent is evaluating a policy that is different from the one it is following.
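
As a small sketch (the names and numbers are made up): if the data is collected with an \(\epsilon\)-greedy behaviour policy while the policy we want to evaluate is the greedy one, the two policies assign different probabilities to the actions, so we are in the off-policy scenario.

```python
import numpy as np

def greedy_probs(Q_hat, x):
    """Target policy pi: greedy w.r.t. Q_hat, as a probability vector over actions."""
    p = np.zeros(Q_hat.shape[1])
    p[np.argmax(Q_hat[x])] = 1.0
    return p

def eps_greedy_probs(Q_hat, x, epsilon=0.1):
    """Behaviour policy pi_b: epsilon-greedy w.r.t. the same Q_hat."""
    n_actions = Q_hat.shape[1]
    p = np.full(n_actions, epsilon / n_actions)
    p[np.argmax(Q_hat[x])] += 1.0 - epsilon
    return p

Q_hat = np.array([[1.0, 0.5]])
pi_b = eps_greedy_probs(Q_hat, x=0, epsilon=0.1)  # policy generating the data
pi = greedy_probs(Q_hat, x=0)                      # policy we want to evaluate
print(np.allclose(pi_b, pi))  # False -> off-policy; True would mean on-policy
```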

Further

From here, we have all we need to start learning from streams of data!