Reinforcement Learning 12: Eligibility Traces
12 Eligibility Traces
12.1 The $\lambda$-return
The $\lambda$-return averages all of the n-step returns, weighting each $G_{t:t+n}$ proportionally to $\lambda^{n-1}$; the factor $(1 - \lambda)$ normalizes the weights to sum to 1:
\[G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^\infty \lambda^{n-1} G_{t:t+n}\]
For an episodic task that terminates at time $T$, all n-step returns with $n \geq T - t$ equal the conventional return $G_t$, so
\[G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{T - t - 1} \lambda^{n-1} G_{t:t+n} + \lambda^{T - t - 1} G_t\]
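As a quick illustration (a minimal sketch; the function name and the list-of-returns input are assumptions, not from the text), the episodic form can be computed directly from the n-step returns:

```python
import numpy as np

def episodic_lambda_return(n_step_returns, lam):
    """Compute G_t^lambda from the n-step returns [G_{t:t+1}, ..., G_{t:T}].

    n_step_returns[n-1] holds G_{t:t+n}; the last entry is the full
    return G_t, reached at termination (n = T - t).
    """
    horizon = len(n_step_returns)                    # T - t
    g = lam ** (horizon - 1) * n_step_returns[-1]    # residual weight on G_t
    for n in range(1, horizon):                      # n = 1, ..., T-t-1
        g += (1 - lam) * lam ** (n - 1) * n_step_returns[n - 1]
    return g

# Sanity checks: lambda = 0 gives the one-step return, lambda = 1 gives G_t.
assert np.isclose(episodic_lambda_return([1.0, 2.0, 3.0], lam=0.0), 1.0)
assert np.isclose(episodic_lambda_return([1.0, 2.0, 3.0], lam=1.0), 3.0)
```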
The off-line $\lambda$-return algorithm makes no changes during the episode; at the end of the episode it performs a sequence of semi-gradient updates toward the $\lambda$-return:
\[w_{t+1} = w_t + \alpha [G_t^{\lambda} - v(S_t, w_t)] \nabla v(S_t, w_t), \quad t = 0, \dots, T - 1\]
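A sketch of the off-line update loop itself (hypothetical helpers: `episodic_lambda_return` above supplies $G_t^\lambda$, and `feats[t]` stands in for $x(S_t)$ in the linear case, where $\nabla v(S_t, w) = x(S_t)$):

```python
import numpy as np

def offline_lambda_return_update(w, feats, lambda_returns, alpha):
    """Apply the off-line lambda-return updates once the episode has ended,
    for linear v(s, w) = w^T x(s), so that grad v(s, w) = x(s)."""
    for x_t, g_t in zip(feats, lambda_returns):   # t = 0, ..., T-1
        w = w + alpha * (g_t - w @ x_t) * x_t     # semi-gradient step toward G_t^lambda
    return w
```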
12.2 TD($\lambda$)
TD($\lambda$) improves over the off-line $\lambda$-return algorithm in three ways.
- First, it updates the weight vector on every step of an episode rather than only at the end, and thus its estimates may be better sooner.
- Second, its computations are equally distributed in time rather than all at the end of the episode.
- Third, it can be applied to continuing problems rather than just episodic problems.
With function approximation, the eligibility trace is a vector $z_t \in R^d$ with the same number of components as the weight vector $w_t$. Whereas the weight vector is a long-term memory, accumulating over the lifetime of the system, the eligibility trace is a short-term memory, typically lasting less time than the length of an episode.
Semi-gradient TD($\lambda$) for estimating $v \approx v_{\pi}$
Input: the policy $\pi$ to be evaluated
Input: a differentiable function $v : S \times R^d \rightarrow R$ such that $v(\text{terminal}, \cdot) = 0$
Initialize value-function weights $w$ arbitrarily (e.g., $w = 0$)
Repeat (for each episode):
- Initialize $S$
- $z \leftarrow 0$
- Repeat (for each step of episode):
- Choose $A \sim \pi(\cdot | S)$
- Take action $A$, observe $R, S'$
- $z \leftarrow \gamma \lambda z + \nabla v(S, w)$
- $\delta \leftarrow R + \gamma v(S', w) - v(S, w)$
- $w \leftarrow w + \alpha \delta z$
- $S \leftarrow S'$
- Until $S'$ is terminal
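The box above translates almost line for line into code. Below is a minimal sketch for the linear case $v(s, w) = w^T x(s)$, so that $\nabla v(S, w) = x(S)$; the `env`, `policy`, and `feature_fn` interfaces are assumptions, not part of the text:

```python
import numpy as np

def semi_gradient_td_lambda(env, policy, feature_fn, num_features,
                            alpha=0.01, gamma=1.0, lam=0.9, num_episodes=100):
    """Semi-gradient TD(lambda) with linear function approximation.

    Assumed interfaces: env.reset() -> state, env.step(a) -> (reward,
    next_state, done), policy(state) -> action, feature_fn(state) -> x(s).
    """
    w = np.zeros(num_features)                 # value-function weights
    for _ in range(num_episodes):
        s = env.reset()
        z = np.zeros(num_features)             # eligibility trace (short-term memory)
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = env.step(a)
            x = feature_fn(s)
            x_next = np.zeros(num_features) if done else feature_fn(s_next)  # v(terminal) = 0
            z = gamma * lam * z + x            # accumulate trace; grad v = x for linear v
            delta = r + gamma * (w @ x_next) - w @ x
            w += alpha * delta * z
            s = s_next
    return w
```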
12.3 n-step Truncated $\lambda$-return Methods
The truncated $\lambda$-return for time $t$, using data only up to some later horizon $h$:
\[G_{t:h}^{\lambda} = (1 - \lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{h-t-1} G_{t:h}\]
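Two limiting cases are worth checking against this definition (a worked consequence, not stated in the text): with the shortest horizon $h = t + 1$ the sum is empty and the truncated return reduces to the one-step return, and at $h = T$ it coincides with the episodic $\lambda$-return of Section 12.1:
\[G_{t:t+1}^{\lambda} = \lambda^{0} G_{t:t+1} = G_{t:t+1}, \qquad G_{t:T}^{\lambda} = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t = G_t^{\lambda}\]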
12.4 Redoing Updates: The Online $\lambda$-return Algorithm
The idea is that, on each time step, as you gather a new increment of data, you go back and redo all the updates since the beginning of the current episode.
\[w_0^1 \leftarrow w \\ w_1^1 = w_0^1 + \alpha [G_{0:1}^\lambda - v(S_0, w_0^1)] \nabla v(S_0, w_0^1) \\ w_0^2 \leftarrow w \\ w_1^2 = w_0^2 + \alpha [G_{0:2}^\lambda - v(S_0, w_0^2)] \nabla v(S_0, w_0^2) \\ w_2^2 = w_1^2 + \alpha [G_{1:2}^\lambda - v(S_1, w_1^2)] \nabla v(S_1, w_1^2) \\ w_0^3 \leftarrow w \\ w_1^3 = w_0^3 + \alpha [G_{0:3}^\lambda - v(S_0, w_0^3)] \nabla v(S_0, w_0^3) \\ w_2^3 = w_1^3 + \alpha [G_{1:3}^\lambda - v(S_1, w_1^3)] \nabla v(S_1, w_1^3) \\ w_3^3 = w_2^3 + \alpha [G_{2:3}^\lambda - v(S_2, w_2^3)] \nabla v(S_2, w_2^3)\]
Here $w$ is the weight vector inherited from the end of the previous episode; every horizon $h$ restarts its pass from this same $w_0^h = w$.
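A direct (and deliberately inefficient) sketch of this redo-everything scheme for the linear case, using the same hypothetical `env`/`policy`/`feature_fn` interfaces as before; as a simplification, every truncated return bootstraps from the weights of the current pass rather than from per-step weight vectors:

```python
import numpy as np

def online_lambda_return(env, policy, feature_fn, num_features,
                         alpha=0.01, gamma=1.0, lam=0.9):
    """One episode of the online lambda-return algorithm, linear case.

    At every new horizon h it restarts from the episode-initial weights
    and redoes the updates for t = 0, ..., h-1 with the truncated returns
    G_{t:h}^lambda, so the cost per step grows with the episode length.
    """
    w_init = np.zeros(num_features)      # weights inherited from earlier episodes
    w = w_init
    s = env.reset()
    feats, rewards = [feature_fn(s)], []
    done = False
    while not done:
        r, s_next, done = env.step(policy(s))
        rewards.append(r)
        if not done:
            feats.append(feature_fn(s_next))
        h = len(rewards)                 # current data horizon
        w = w_init.copy()                # redo all updates from the start of the episode
        for t in range(h):
            g, n_step = 0.0, 0.0         # g accumulates G_{t:h}^lambda
            for n in range(1, h - t + 1):
                n_step += gamma ** (n - 1) * rewards[t + n - 1]
                g_n = n_step
                if t + n < len(feats):   # bootstrap unless S_{t+n} is terminal
                    g_n += gamma ** n * (w @ feats[t + n])
                weight = lam ** (n - 1) * ((1 - lam) if n < h - t else 1.0)
                g += weight * g_n
            x_t = feats[t]
            w += alpha * (g - w @ x_t) * x_t
        s = s_next
    return w
```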
12.5 True Online TD($\lambda$)
The on-line $\lambda$-return algorithm just presented is currently the best performing temporal-difference algorithm, but it is computationally very expensive. For linear function approximation, true online TD($\lambda$) produces exactly the same sequence of weight vectors at a per-step cost comparable to TD($\lambda$).
The online $\lambda$-return algorithm produces a triangle of weight vectors, one row per time step:
\[w_0^0 \\ w_0^1, w_1^1 \\ w_0^2, w_1^2, w_2^2 \\ \vdots \\ w_0^T, w_1^T, \dots, w_T^T\]
Only the weight vectors on the diagonal, $w_t^t$, are actually needed. Let $w_t = w_t^t$; the strategy is to find a compact, efficient way of computing each $w_t$ from $w_{t-1}$.
For the linear approximation, $v(s, w) = w^T x(s)$:
\[w_{t+1} = w_t + \alpha \delta_t z_t + \alpha (w_t^T x_t - w_{t-1}^T x_t) (z_t - x_t),\]where $x_t = x(S_t)$, $\delta_t = R_{t+1} + \gamma w_t^T x_{t+1} - w_t^T x_t$ is the usual TD error, and
\[z_t = \gamma \lambda z_{t-1} + (1 - \alpha \gamma \lambda z_{t-1}^T x_t) x_t\]is a dutch trace.
True Online TD($\lambda$) for estimating $w^T x \approx v_{\pi}$
Input: the policy $\pi$ to be evaluated
Initialize value function weights $w$ arbitrarily
Repeat (for each episode):
- Initialize state and obtain initial feature vector $x$
- $z \leftarrow 0$
- $V_{old} \leftarrow 0$
- Repeat (for each step of episode):
- Choose $A \sim \pi$
- Take action $A$, observe $R, x'$ (feature vector of the next state)
- $V \leftarrow w^T x$
- $V' \leftarrow w^T x'$
- $\delta \leftarrow R + \gamma V' - V$
- $z \leftarrow \gamma \lambda z + (1 - \alpha \gamma \lambda z^T x) x$
- $w \leftarrow w + \alpha (\delta + V - V_{old}) z - \alpha (V - V_{old}) x$
- $V_{old} \leftarrow V'$
- $x \leftarrow x'$
- until $x' = 0$ (signaling arrival at a terminal state)
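The box above is equally mechanical to transcribe. A minimal sketch under the same hypothetical interfaces, with terminal states given the all-zero feature vector so that their value is zero:

```python
import numpy as np

def true_online_td_lambda(env, policy, feature_fn, num_features,
                          alpha=0.01, gamma=1.0, lam=0.9, num_episodes=100):
    """True Online TD(lambda) for linear v(s, w) = w^T x(s)."""
    w = np.zeros(num_features)
    for _ in range(num_episodes):
        s = env.reset()
        x = feature_fn(s)
        z = np.zeros(num_features)                 # dutch-style eligibility trace
        v_old = 0.0
        done = False
        while not done:
            r, s_next, done = env.step(policy(s))
            x_next = np.zeros(num_features) if done else feature_fn(s_next)
            v = w @ x
            v_next = w @ x_next
            delta = r + gamma * v_next - v
            z = gamma * lam * z + (1 - alpha * gamma * lam * (z @ x)) * x
            w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
            v_old = v_next
            x = x_next
            s = s_next
    return w
```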
12.6 Dutch Traces in Monte Carlo Learning
The linear version of the gradient Monte Carlo prediction algorithm makes the following update for every step of the episode, all at the end of the episode once the return $G$ is known:
\[w_{t+1} = w_t + \alpha [G - w_t^T x_t ] x_t\]
Two improvements:
- do some computation during each step and less at the end.
- remove the need to store the feature vectors at each step for use later at the end.
Starting from the update at the final step and rewriting it,
\[w_T = w_{T-1} + \alpha (G - w_{T-1}^T x_{T-1}) x_{T-1} = (I - \alpha x_{T-1} x_{T-1}^T) w_{T-1} + \alpha G x_{T-1} = F_{T-1} w_{T-1} + \alpha G x_{T-1},\]where $F_t = I - \alpha x_t x_t^T$ is a forgetting, or fading, matrix. Now, recursing,
\[w_T = F_{T-1} F_{T-2} \dots F_0 w_0 + \alpha G \sum_{k=0}^{T-1} F_{T-1} F_{T-2} \dots F_{k+1} x_k \\ w_T = a_{T-1} + \alpha G z_{T-1}\]where $a_t$ and $z_t$ are two auxiliary memory vectors that can be updated incrementally, without knowledge of $G$ and with $O(d)$ complexity per time step:
\[z_t = z_{t-1} + (1 - \alpha z_{t-1}^T x_t) x_t \\ a_t = a_{t-1} - \alpha x_t x_t^T a_{t-1}\]
with $z_0 = x_0$ and $a_0 = F_0 w_0$.
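The claimed equivalence is easy to check numerically. The sketch below (random data; all variable names are illustrative) applies the batch Monte Carlo updates and the incremental $a_t$, $z_t$ recursions to the same episode and confirms they produce the same final weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, alpha, G = 5, 8, 0.05, 1.7            # feature dim, episode length, step size, return
x = rng.normal(size=(T, d))                 # feature vectors x_0, ..., x_{T-1}
w0 = rng.normal(size=d)                     # initial weights

# Batch MC: apply w_{t+1} = w_t + alpha (G - w_t^T x_t) x_t for every step at the end.
w = w0.copy()
for t in range(T):
    w += alpha * (G - w @ x[t]) * x[t]

# Dutch-trace form: maintain a_t and z_t during the episode, then combine with G.
z = x[0].copy()                             # z_0 = x_0
a = w0 - alpha * x[0] * (x[0] @ w0)         # a_0 = F_0 w_0
for t in range(1, T):
    z = z + (1 - alpha * (z @ x[t])) * x[t]
    a = a - alpha * x[t] * (x[t] @ a)
w_dutch = a + alpha * G * z                 # w_T = a_{T-1} + alpha G z_{T-1}

assert np.allclose(w, w_dutch)
```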
12.7 Sarsa($\lambda$)
Sarsa($\lambda$) extends the eligibility-trace idea to action values: the trace accumulates $\nabla q(S, A, w)$ and the TD error is formed from action values (a linear-case sketch follows the algorithm title below).
Sarsa($\lambda$) with binary features and linear function approximation for estimating $q \approx q_\pi$ or $q \approx q_*$
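A minimal sketch of that algorithm with accumulating traces and an $\epsilon$-greedy policy; `env` and `active_features` (the indices of the active binary features, e.g. from tile coding) are assumed interfaces, not from the text:

```python
import numpy as np

def sarsa_lambda_binary(env, active_features, num_features, num_actions,
                        alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1,
                        num_episodes=100):
    """Sarsa(lambda) with binary features and linear function approximation.

    Assumed interfaces: env.reset() -> state, env.step(a) -> (reward,
    next_state, done); active_features(state, action) -> list of indices
    of the features that are 1 for that state-action pair.
    """
    w = np.zeros(num_features)
    q = lambda s, a: w[active_features(s, a)].sum()   # q(s, a, w) for binary features

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax([q(s, a) for a in range(num_actions)]))

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        z = np.zeros(num_features)
        done = False
        while not done:
            r, s_next, done = env.step(a)
            delta = r - q(s, a)
            z[active_features(s, a)] += 1.0           # accumulating traces
            if done:
                w += alpha * delta * z
                break
            a_next = eps_greedy(s_next)
            delta += gamma * q(s_next, a_next)
            w += alpha * delta * z
            z *= gamma * lam
            s, a = s_next, a_next
    return w
```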
Example 12.1 Traces in Gridworld
Example 12.2 Sarsa($\lambda$) on Mountain Car
True Online Sarsa($\lambda$) for estimating $w^T x \approx q_\pi$ or $q_*$
12.8 Variable $\lambda$ and $\gamma$
Each time step has a different $\lambda$ and $\gamma$, denoted $\lambda_t$ and $\gamma_t$.
We change notation so that $\lambda: S \times A \rightarrow [0, 1]$, such that $\lambda_t = \lambda(S_t, A_t)$, and $\gamma: S \times A \rightarrow [0, 1]$, such that $\gamma_t = \gamma(S_t, A_t)$.
The expected approximate action value of state $S_t$ under the target policy, used in the recursive returns below, is \(\bar{Q}_t = \sum_a \pi(a | S_t) q(S_t, a, w_{t-1})\).
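As a one-liner for reference (the array arguments are hypothetical):

```python
import numpy as np

def expected_action_value(pi_probs, q_values):
    """Q_bar for one state: sum over a of pi(a|S_t) * q(S_t, a, w),
    with pi_probs[a] and q_values[a] indexed by action."""
    return float(np.dot(pi_probs, q_values))
```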
12.9 Off-policy Eligibility Traces
12.10 Watkins’s Q($\lambda$) to Tree-Backup($\lambda$)
\[G_t^{\lambda a} = R_{t+1} + \gamma_{t+1} \Big( (1 - \lambda_{t+1}) \bar{Q}_{t+1} + \lambda_{t+1} \big[ \sum_{a \neq A_{t+1}} \pi(a | S_{t+1}) q(S_{t+1}, a, w_t) + \pi(A_{t+1} | S_{t+1}) G_{t+1}^{\lambda a} \big] \Big)\]
This is the $\lambda$-return used by Tree Backup($\lambda$). Written approximately in terms of the expectation form of the action-value TD error, $\delta_t^a = R_{t+1} + \gamma_{t+1} \bar{Q}_{t+1} - q(S_t, A_t, w_t)$:
\[G_t^{\lambda a} \approx q(S_t, A_t, w_t) + \sum_{k=t}^\infty \delta_k^a \prod_{i=t+1}^k \gamma_i \lambda_i \pi(A_i | S_i)\]
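The recursion can be evaluated backward over a recorded episode. The sketch below assumes the per-step quantities have already been computed and stored in arrays (all argument names are hypothetical); at the terminal step $\gamma_{t+1} = 0$, so the bracketed term drops out:

```python
import numpy as np

def tree_backup_lambda_returns(rewards, gammas, lams, q_bar, pi_a, q_excl):
    """Compute G_t^{lambda a} for t = 0, ..., T-1 from the recursion above.

    rewards[t] = R_{t+1}; gammas[t] = gamma_{t+1} (0 at termination);
    lams[t] = lambda_{t+1}; q_bar[t] = Q_bar_{t+1};
    pi_a[t] = pi(A_{t+1} | S_{t+1});
    q_excl[t] = sum over a != A_{t+1} of pi(a|S_{t+1}) q(S_{t+1}, a, w).
    """
    T = len(rewards)
    g = np.zeros(T)
    g_next = 0.0                          # G_{t+1}^{lambda a}; irrelevant once gamma is 0
    for t in reversed(range(T)):
        g[t] = rewards[t] + gammas[t] * (
            (1 - lams[t]) * q_bar[t]
            + lams[t] * (q_excl[t] + pi_a[t] * g_next))
        g_next = g[t]
    return g
```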
12.11 Stable Off-policy Methods with Traces
GTD($\lambda$): the eligibility-trace algorithm analogous to TDC, the better of the two state-value Gradient-TD prediction algorithms.
GQ($\lambda$): the Gradient-TD algorithm for action values with eligibility traces.
HTD($\lambda$): a hybrid state-value algorithm combining aspects of GTD($\lambda$) and TD($\lambda$).
Emphatic TD($\lambda$): an extension of the one-step Emphatic-TD algorithm.