# Reinforcement Learning 11: Off-policy Methods with Approximation

The challenge of off-policy learning can be divided into two parts, one that arises in the tabular case and one that arises only with function approximation. The first part of the challenge has to do with the target of the update (not to be confused with the target policy), and the second part has to do with the distribution of the updates.

## 11.1 Semi-gradient Methods
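
The semi-gradient off-policy TD(0) update for state values carries the per-step importance sampling ratio $\rho_t = \pi(A_t|S_t) / b(A_t|S_t)$; a minimal statement of the update, following the book's notation:

```
$$
\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t) \\
w_{t+1} = w_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t, w_t)
$$
```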

## 11.2 Examples of Off-policy Divergence

**Example 11.1 Tsitsiklis and Van Roy’s Counterexample** This example shows that linear function approximation can fail even with DP-style updates: the iteration diverges even when the exact least-squares solution is found at each step.

## 11.3 The Deadly Triad

- Function approximation
- Bootstrapping
- Off-policy training

The danger lies in the combination: instability arises only when all three are present, and removing any one of the three avoids it.

## 11.4 Linear Value-function Geometry

We can define the distance between value functions using the norm:

```
$$
||v||_\mu^2 = \sum_s \mu(s) v(s)^2
$$
```

We define a projection operator $\Pi$ that takes an arbitrary value function to the representable function that is closest in our norm:

```
$$
\Pi v = v_w \text{ where } w = \arg\min_w ||v - v_w||_\mu^2
$$
```

**The projection matrix**

For linear function approximation, the projection operation is linear, which implies that it can be represented as an $|S| \times |S|$ matrix: $\Pi = X (X^T D X)^{-1} X^T D$, where $D$ denotes the diagonal matrix with the $\mu(s)$ on the diagonal, and $X$ denotes the $|S| \times d$ matrix whose rows are the feature vectors $x(s)^T$, one for each state $s$.
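
As a quick numerical check of this formula, here is a minimal NumPy sketch; the three-state features and distribution $\mu$ are made-up illustrations, not from the text:

```python
import numpy as np

# Made-up example: 3 states, 2 features per state.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])        # |S| x d matrix, rows are x(s)^T
mu = np.array([0.5, 0.25, 0.25])  # state distribution mu(s)
D = np.diag(mu)

# Projection matrix: Pi = X (X^T D X)^{-1} X^T D
Pi = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D

v = np.array([2.0, -1.0, 3.0])    # an arbitrary value function

# Pi v equals the mu-weighted least-squares fit X w*.
w_star, *_ = np.linalg.lstsq(np.sqrt(mu)[:, None] * X,
                             np.sqrt(mu) * v, rcond=None)
assert np.allclose(Pi @ v, X @ w_star)
assert np.allclose(Pi @ Pi, Pi)   # a projection is idempotent
```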

Three error measures come up:

- Mean squared value error: $VE(w) = || v_w - v_\pi ||_\mu^2$.
- Bellman error: the expectation of the TD error at each state, collected into the vector $\bar{\delta}_w$; the mean squared Bellman error is $BE(w) = || \bar{\delta}_w ||_\mu^2$.
- Mean squared projected Bellman error: $PBE(w) = || \Pi \bar{\delta}_w ||_\mu^2$.

## 11.5 Stochastic Gradient Descent in the Bellman Error

In this and the next section we explore the origins and limits of the most popular proposed objective function, the one based on the Bellman error introduced in the previous section. Although this has been a popular and influential approach, the conclusion we reach here is that it is a misstep and yields no good learning algorithms.

**A-split example, showing the naivete of the naive residual-gradient algorithm**
The true values do not minimize the TDE; the TDE-minimizing values are smoother than the true values, so the naive algorithm converges to the wrong answer. There are two ways to make the residual-gradient algorithm genuinely work, i.e., to obtain the two independent next-state samples its gradient requires: use a deterministic environment, where the two samples necessarily agree, or draw two independent next states from a simulated environment. The update shown below uses a single sample and is therefore the naive version.
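
For reference, the naive residual-gradient update is exact SGD on the squared TD error (my transcription of the standard form, with $\rho_t$ covering the off-policy case):

```
$$
w_{t+1} = w_t + \alpha \rho_t \delta_t \left( \nabla \hat{v}(S_t, w_t) - \gamma \nabla \hat{v}(S_{t+1}, w_t) \right)
$$
```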

**A-presplit example, a counterexample for the BE**

## 11.6 The Bellman Error is Not Learnable

Two MRPs can generate exactly the same observable data yet have different optimal VE values, so the VE is not learnable from data; the parameter that minimizes it, however, is the same in both cases and is learnable.

**Counterexample to the learnability of the BE and its minima**: for the BE, even the minimizing parameter differs between two MRPs that generate the same data, so neither the BE nor its minimizer is learnable.

## 11.7 Gradient-TD Methods

```
$$
PBE(w) = || \Pi \bar{\delta}_w ||_\mu^2 \\
\nabla PBE(w) = 2 \nabla [X^T D \bar{\delta}_w]^T (X^T D X)^{-1} (X^T D \bar{\delta}_w) \\
\nabla PBE(w) = 2 E[\rho_t (\gamma x_{t+1} - x_t) x_t^T] \, E[x_t x_t^T]^{-1} \, E[\rho_t \delta_t x_t]
$$
```

We cannot simply sample these expectations and multiply them: the first and last factors are not independent (both depend on the next feature vector), so the product of their samples would be a biased estimate of the product of the expectations. One idea is to estimate the three expectations separately and then combine them, but that is too slow and memory-hungry. A better way is to estimate the product of the last two factors directly, storing the estimate and improving it incrementally.

```
$$
v \approx E[x_t x_t^T]^{-1} E[\rho_t \delta_t x_t] \\
v_{t+1} = v_t + \beta \rho_t (\delta_t - v_t^T x_t) x_t \\
w_{t+1} = w_t + \alpha \rho_t (x_t - \gamma x_{t+1}) x_t^T v_t
$$
```

This algorithm is called GTD2.
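
A minimal sketch of these two coupled updates with linear features; the class name and the convention that the caller supplies $x_t$, $x_{t+1}$, $R_{t+1}$, and $\rho_t$ per transition are my own assumptions:

```python
import numpy as np

class GTD2:
    """Linear GTD2: stochastic gradient descent on the PBE."""

    def __init__(self, d, alpha=0.01, beta=0.1, gamma=0.99):
        self.w = np.zeros(d)  # primary weights (value estimate)
        self.v = np.zeros(d)  # secondary weights, estimating
                              # E[x x^T]^{-1} E[rho delta x]
        self.alpha, self.beta, self.gamma = alpha, beta, gamma

    def update(self, x, r, x_next, rho):
        """One transition: features x -> x_next, reward r, ratio rho."""
        delta = r + self.gamma * self.w @ x_next - self.w @ x
        # Secondary update: v tracks the least-squares solution above.
        self.v += self.beta * rho * (delta - self.v @ x) * x
        # Primary update: sampled descent direction on the PBE.
        self.w += self.alpha * rho * (x - self.gamma * x_next) * (x @ self.v)
```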

## 11.8 Emphatic-TD Methods
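
The core of the section is the one-step emphatic-TD update, which scales each update by an emphasis $M_t$ that carries the importance sampling ratios backward in time; as I transcribe it from the book's notation, with interest $I_t$:

```
$$
\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t) \\
w_{t+1} = w_t + \alpha M_t \rho_t \delta_t \nabla \hat{v}(S_t, w_t) \\
M_t = \gamma \rho_{t-1} M_{t-1} + I_t
$$
```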

## 11.9 Reducing Variance

Off-policy learning inherently has greater variance than on-policy learning: the importance sampling ratios are one in expectation but can vary widely in practice, inflating the variance of the updates.

## 11.10 Summary

In this chapter we divided the challenge of off-policy learning into two parts. The first part, correcting the targets of learning for the behavior policy, is straightforwardly dealt with using the techniques devised earlier for the tabular case, albeit at the cost of increasing the variance of the updates and thereby slowing learning. High variance will probably always remain a challenge for off-policy learning.

The second part of the challenge of off-policy learning emerges as the instability of semi-gradient TD methods that involve bootstrapping.