Reinforcement Learning 8: Planning and Learning with Tabular Methods
model-based methods: rely primarily on planning
model-free methods: rely primarily on learning
8.1 Models and Planning
distribution models: produce all possible next states and rewards with their probabilities
sample models: produce one next state and reward, sampled according to those probabilities
Two distinct approaches to planning
- state-space planning
- plan-space planning
Random-sample one-step tabular Q-planning
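A minimal sketch of this planning loop, assuming a deterministic sample model stored as a dict from $(s, a)$ to $(r, s')$ and a tabular Q as a dict of dicts; the name `q_planning_step` and its defaults are illustrative, not from the book:

```python
import random

def q_planning_step(Q, model, alpha=0.1, gamma=0.95):
    """One iteration of random-sample one-step tabular Q-planning.

    Q:     dict state -> dict action -> value
    model: dict (state, action) -> (reward, next_state); a deterministic
           sample model is assumed for simplicity
    """
    # 1. Select a previously observed state-action pair at random
    s, a = random.choice(list(model.keys()))
    # 2. Ask the sample model for a reward and next state
    r, s_next = model[(s, a)]
    # 3. Apply a one-step tabular Q-learning update to the simulated transition
    q_sa = Q.setdefault(s, {}).setdefault(a, 0.0)
    best_next = max(Q.get(s_next, {}).values(), default=0.0)
    Q[s][a] = q_sa + alpha * (r + gamma * best_next - q_sa)
```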
8.2 Dyna: Integrating Planning, Acting and Learning
Tabular Dyna-Q
Example 8.1: Dyna Maze
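A compact sketch of one Dyna-Q time step (acting, direct RL, model learning, and n planning updates), assuming Q is a `defaultdict(lambda: defaultdict(float))`, an environment callable `env_step(s, a) -> (r, s')`, and ε-greedy action selection; the names and defaults are mine, not the book's pseudocode:

```python
import random

def dyna_q_step(Q, model, state, env_step, actions,
                n_planning=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    """One acting step plus n planning steps of Tabular Dyna-Q."""
    # epsilon-greedy action selection in the current state
    if random.random() < epsilon or not Q[state]:
        action = random.choice(actions)
    else:
        action = max(Q[state], key=Q[state].get)

    # take the action in the real environment (assumed interface)
    reward, next_state = env_step(state, action)

    # direct RL: one-step Q-learning from real experience
    best_next = max(Q[next_state].values(), default=0.0)
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

    # model learning: remember what the environment did (deterministic model)
    model[(state, action)] = (reward, next_state)

    # planning: n Q-learning updates from simulated experience
    for _ in range(n_planning):
        s, a = random.choice(list(model.keys()))
        r, s2 = model[(s, a)]
        best = max(Q[s2].values(), default=0.0)
        Q[s][a] += alpha * (r + gamma * best - Q[s][a])

    return next_state
```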
8.3 When the Model Is Wrong
Example 8.2: Blocking Maze
Example 8.3: Shortcut Maze
- Dyna-Q+: if a transition has not been tried in $\tau$ time steps, planning updates are done as if that transition produced a reward of $r + \kappa\sqrt{\tau}$, for some small $\kappa$ (sketched below)
Exercise 8.4
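A small sketch of how the bonus can enter a Dyna-Q+ planning update, assuming a `last_tried` table recording the time step at which each $(s, a)$ was last taken in the real environment; `kappa` and the other names are illustrative:

```python
import math

def dyna_q_plus_planning_update(Q, model, last_tried, s, a, t,
                                kappa=1e-3, alpha=0.1, gamma=0.95):
    """One Dyna-Q+ planning update for the simulated pair (s, a).

    last_tried: dict (s, a) -> time step of the last real visit
    t:          current time step; tau = t - last_tried is how long
                the transition has gone untried
    """
    r, s_next = model[(s, a)]
    tau = t - last_tried.get((s, a), 0)
    # bonus reward r + kappa * sqrt(tau) encourages revisiting stale transitions
    bonus_r = r + kappa * math.sqrt(tau)
    q_sa = Q.setdefault(s, {}).setdefault(a, 0.0)
    best_next = max(Q.get(s_next, {}).values(), default=0.0)
    Q[s][a] = q_sa + alpha * (bonus_r + gamma * best_next - q_sa)
```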
8.4 Prioritized Sweeping
Work backward only from states whose values have changed.
Prioritized sweeping for a deterministic environment
Example 8.4: Prioritized Sweeping on Mazes
Example 8.5: Rod Maneuvering
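A sketch of the planning loop for the deterministic case, assuming a learned deterministic model, a table of predecessor pairs for each state, and a `heapq`-based queue storing negated priorities; all names are illustrative, and states/actions are assumed orderable so heap ties can be broken:

```python
import heapq

def prioritized_sweeping_planning(Q, model, predecessors, pqueue,
                                  n_updates=50, theta=1e-4,
                                  alpha=0.1, gamma=0.95):
    """Run up to n_updates backups in order of priority.

    model:        dict (s, a) -> (r, s'), a learned deterministic model
    predecessors: dict s -> set of (s_pred, a_pred) predicted to lead to s
    pqueue:       heap of (-priority, s, a) entries
    """
    for _ in range(n_updates):
        if not pqueue:
            break
        _, s, a = heapq.heappop(pqueue)
        # Back up the highest-priority pair with a one-step Q-learning update
        r, s_next = model[(s, a)]
        q_sa = Q.setdefault(s, {}).setdefault(a, 0.0)
        best_next = max(Q.get(s_next, {}).values(), default=0.0)
        Q[s][a] = q_sa + alpha * (r + gamma * best_next - q_sa)

        # Work backward from s: re-prioritize transitions predicted to reach it
        for s_pred, a_pred in predecessors.get(s, ()):
            r_pred, _ = model[(s_pred, a_pred)]
            best_s = max(Q.get(s, {}).values(), default=0.0)
            priority = abs(r_pred + gamma * best_s
                           - Q.get(s_pred, {}).get(a_pred, 0.0))
            if priority > theta:
                heapq.heappush(pqueue, (-priority, s_pred, a_pred))
```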
8.5 Expected vs. Sample Updates
One-step updates vary along three binary dimensions:
- update state values or action values
- estimate the value of the optimal policy or of an arbitrary given policy
- expected updates or sample updates
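For the case of approximating $q_*$, the contrast can be written out directly (with $\hat p$ the distribution model's estimate and $R, S'$ a reward and next state drawn from a sample model):

$$Q(s,a) \leftarrow \sum_{s',r} \hat{p}(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} Q(s', a')\bigr] \quad \text{(expected update)}$$

$$Q(s,a) \leftarrow Q(s,a) + \alpha\,\bigl[R + \gamma \max_{a'} Q(S', a') - Q(s,a)\bigr] \quad \text{(sample update)}$$

The expected update is exact with respect to the model but costs roughly a branching-factor more computation per update than a sample update.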
8.6 Trajectory Sampling
8.7 Real-time Dynamic Programming
Example 8.6: RTDP on the Racetrack
8.8 Planning at Decision Time
background planning: using planning to gradually improve a policy or value function
decision-time planning: using planning to select an action for the current state
8.9 Heuristic Search
8.10 Rollout Algorithms
8.11 Monte Carlo Tree Search
- Selection
- Expansion
- Simulation
- Backup
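A bare-bones sketch of one MCTS iteration over these four steps, with UCT selection and a random-playout default policy; the `Node` class and the game-state API (`legal_actions`, `step`, `random_playout`) are assumptions of this sketch, not the GomokuZero code, and perspective flipping for two-player games is omitted:

```python
import math

class Node:
    """Search-tree node; the game-state API used here is an assumption."""
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.untried = list(state.legal_actions())  # assumed game method
        self.visits = 0
        self.value = 0.0  # total simulation return backed up through this node

    def ucb(self, c=1.4):
        # UCT score: average value plus an exploration bonus
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts_iteration(root):
    node = root
    # 1. Selection: follow UCB-maximizing children while fully expanded
    while not node.untried and node.children:
        node = max(node.children, key=lambda n: n.ucb())
    # 2. Expansion: add one child for an as-yet-untried action
    if node.untried:
        action = node.untried.pop()
        child = Node(node.state.step(action), parent=node, action=action)
        node.children.append(child)
        node = child
    # 3. Simulation: rollout from the new node with the default (random) policy
    result = node.state.random_playout()  # assumed to return a scalar outcome
    # 4. Backup: propagate the visit count and outcome up to the root
    while node is not None:
        node.visits += 1
        node.value += result
        node = node.parent
```

Repeating `mcts_iteration(root)` within a time budget and then picking the root child with the most visits gives the decision-time action.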
My implementation of MCTS + ResNet for Gomoku: GomokuZero