12 Eligibility Traces

12.1 The $\lambda$-return

average different updates

\[G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^\infty \lambda^{n-1} G_{t:t+n} \\ G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{T - t - 1} \lambda^{n-1} G_{t:t+n} + \lambda^{T - t - 1} G_t\]

the off-line $\lambda$-return algorithm

\[w_{t+1} = w_t + \alpha [G_t^{\lambda} - v(S_t, w_t)] \nabla v(S_t, w_t), \quad t = 0, \dots, T - 1\]

12.2 TD($\lambda$)

TD($\lambda$) improves over the off-line $\lambda$-return algorithm in three ways.

First, it updates the weight vector on every step of an episode rather than only at the end, and thus its estimates may be better sooner.
Second, its computations are equally distributed in time rather that all at the end of the episode.
Third, it can be applied to continuing problems rather than just episodic problems.

With function approximation, the eligibility trace is a vector $z_t \in R^d$ with the same number of components as the weight vector $w_t$. Whereas the weight vector is a long-term memorym accumulating over the lifetime of the system, the eligibility trace is a short-term memory, typcially lasting less time than the length of an episode.

Semi-gradient TD($\lambda$) for esitimating $v \approx v_{\pi}$

Input: the policy $\pi$ to be evaluated

Input: a differentiable function $v : S \times R^d \rightarrow R$ such that $v(terminal,\dots) = 0$

Initialize value-function weights $w$ arbitrarily (e.g., $w = 0$)

Repeat (for each epsiode):

Initialize $S$
$z \leftarrow 0$
Repeat (for each step of epsiode):
- Choose $A \sim \pi( a | S)$
- Take action $A$, observe $R, S’$
- $z \leftarrow \gamma \lambda z + \nabla v(S, w)$
- $\delta \leftarrow R + \gamma v(S’, w) - v(S, w)$
- $w \leftarrow w + \alpha \delta z$
- $S \leftarrow S’$
Until $S’$ is terminal

12.3 n-step Truncated $\lambda$-return Methods

\[G_{t:h}^{\lambda} = (1 - \lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{h-t-1} G_{t-h}\]

12.4 Reding Updates: The Online $\lambda$-return Algorithm

The idea is that, on each time step as you gather a new increment of data, you go back and redo all the updates since the beginning of the current epsiode.

\[w_0^2 \leftarrow w \\ w_1^1 = w_0^1 + \alpha + [G_{0:1}^\lambda + v(S_0, w_0^1)] \nabla v(S_0, w_0^1) \\ w_0^2 \leftarrow w_1^1 \\ w_1^2 = w_0^2 + \alpha + [G_{0:2}^\lambda + v(S_0, w_0^2)] \nabla v(S_0, w_0^2) \\ w_2^2 = w_1^2 + \alpha + [G_{1:2}^\lambda + v(S_1, w_1^2)] \nabla v(S_1, w_1^2) \\ w_0^3 \leftarrow w_2^2 \\ w_1^3 = w_0^3 + \alpha + [G_{0:3}^\lambda + v(S_0, w_0^3)] \nabla v(S_0, w_0^3) \\ w_2^3 = w_1^3 + \alpha + [G_{1:3}^\lambda + v(S_1, w_1^3)] \nabla v(S_1, w_1^3) \\ w_3^3 = w_2^3 + \alpha + [G_{2:3}^\lambda + v(S_2, w_2^3)] \nabla v(S_2, w_2^3) \\\]

12.5 True Online TD($\lambda$)

The on-line $\lambda$-return algorithm is currently the best performing temporal-difference algorithm.

\[w_0^0 \\ w_0^1, w_1^1 \\ w_0^2, w_1^2, w_2^2 \\ \dots \dots \dots \dots \\ w_0^T, \dots, \dots, \dots, w_T^T\]

Let $w_t = w_t^t$, the strategy is to find a compact, efficient way of computing each $w_t$ from $w_{t-1}$.

For the linear approximation, $v(s, w) = w^T x(s)$:

\[w_{t+1} = w_t + \alpha \delta_t z_t + \alpha (w_t^T x_t - w_{t-1}^T x_t) (z_t - x_t),\]

where $x_t = x(S_t)$, and

\[z_t = \gamma \lambda z_{t-1} + (1 - \alpha \gamma z_{t-1}^T x_t) x_t.\]

True Online TD($\lambda$) for estimating $w^T x \approx v_{\pi}$

Input: the policy $\pi$ to be evaluated

Initialize value function weights $w$ arbitrarily

Repeat (for each episode):

Initialize state and obtain initial feature vector $x$
$z \leftarrow 0$
$V_{old} \leftarrow 0$
Repeat (for each step of episode):
- Choose $A \sim \pi$
- Take action $A$, observe $R, x’$(feature vector of the next state)
- $V \leftarrow w^T x$
- $V’ \leftarrow w^T x’$
- $\delta \leftarrow R + \gamma V’ - V$
- $z \leftarrow \gamma \lambda z + (1 - \alpha \gamma z^T x) x$
- $w \leftarrow w + \alpha (\delta + V - V_{old}) z - \alpha (V - V_{old}) x$
- $V_{old} \leftarrow V’$
- $x \leftarrow x’$
until $x’ = 0$ (signaling arrival at a terminal state)

12.6 Dutch Traces Monte Carlo Learning

The linear version of the gradient Monte Carlo prediction algorithm

\[w_{t+1} = w_t + \alpha [G - w_t^T x_t ] x_t\]

Two improvements:

do some computation during each step and less at the end.
remove the need to store the feature vectors at each step for use later at the end.

\[w_T = w_{T-1} + \alpha [G - w_{T-1}^T x_{T-1} ] x_{T-1} \\ w_T = (I - \alpha x_{T-1} x_{T-1}^T) w_{T-1} + \alpha G x_{T-1} w_T = F_{T-1} w_{T-1} + \alpha G x_{T-1}\]

where $F_t = I - \alpha x_t x_t^T$ is a forgetting, or fading, matrix. Now, recursing,

\[w_T = F_{T-1} F_{T-2} \dots F_0 w_0 + \alpha G \sum_{k=0}^{T-1} F_{T-1} F_{T-2} \dots F_{k+1} x_k \\ w_T = a_{T-1} + \alpha G z_{T-1}\]

where $a_t$ and $z_t$ are two auxilary memory vectors that can be unpdated incrementally without knowledge of $G$ and with $O(d)$ complexity per time step.

\[z_t = z_{t-1} + (1 - \alpha z_{t-1}^T x_t) x_t \\ a_t = a_{t-1} - \alpha x_t x_t^T a_{t-1}\]

12.7 Sarsa($\lambda$)

Sarsa($\lambda$) with binary features and linear function approximation for estimating $q \approx q_\pi$ or $q \approx q_*$

Example 12.1 Traces in Gridworld

Example 12.2 Sarsa($\lambda$) on Mountain Car

True Online Sarsa($\lambda$) for estimating $w^T x \approx q_\pi$ or $q_*$

12.8 Variable $\lambda$ and $\gamma$

Each time step has a different $\lambda$ and $\gamma$, denoted $\lambda_t$ and $\gamma_t$.

We change notation so that $\lambda: S \times A \rightarrow [0, 1]$, such that $\lambda_t = \lambda(S_t, A_t)$, and $\gamma: S \times A \rightarrow [0, 1]$, such that $\gamma_t = \gamma(S_t, A_t)$.

$\bar{Q}_t = \sum_a \pi(a | S_t) q(S_t, a, w_{t-1})$

12.9 Off-policy Eligibility Traces

12.10 Watkins’s Q($\lambda$) to Tree-Backup($\lambda$)

\[G_t^{\lambda a} = R_{t+1} + \gamma_{t+1} ( (1 - \lambda_{t+1}) \bar{Q}_{t+1} + \lambda_{t+1}(\sum \dots + \pi(A_{t+1} | S_{t+1}) G_{t+1}^{\lambda a}) )\] \[G_t^{\lambda a} \approx q(S_t, A_t, w_t) + \sum_{k=t}^\infty \delta_k^a \prod_{i=t+1}^k \gamma_i \lambda_i \pi(A_i | S_i)\]

12.11 Stable Off-policy Methods with Traces

GTD($\lambda$): the eligibility-trace algorithm analogous to TDC, the better of the two state-value Gradient-TD prediction algorithm.

GQ($\lambda$): the Gradient-TD algorithm for action values with eligibility traces.

HTD($\lambda$): a hybrid state-value algorithm combining aspects of GTD($\lambda$) and TD($\lambda$).

Emphatic TD($\lambda$): an extension of the one-step Emphatic-TD algorithm.

Reinforcement Learning 12: Eligibility Traces