Reinforcement Learning 8: Planning and Learning with Tabular Methods
model-based methods: rely primarily on planning
model-free methods: rely primarily on learning
8.1 Models and Planning
distribution models: produce all possible next states and rewards with their probabilities
sample models: produce one next state and reward, sampled according to those probabilities
Two distinct approaches to planning
- state-space planning
- plan-space planning
Random-sample one-step tabular Q-planning
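A minimal sketch of this planning loop, assuming a deterministic sample model stored as a dict from $(s, a)$ to $(r, s')$ and a tabular Q as a dict of dicts; the name `q_planning_step` and its defaults are illustrative, not from the book:

```python
import random

def q_planning_step(Q, model, alpha=0.1, gamma=0.95):
    """One iteration of random-sample one-step tabular Q-planning.

    Q:     dict state -> dict action -> value
    model: dict (state, action) -> (reward, next_state); a deterministic
           sample model is assumed for simplicity
    """
    # 1. Select a previously observed state-action pair at random
    s, a = random.choice(list(model.keys()))
    # 2. Ask the sample model for a reward and next state
    r, s_next = model[(s, a)]
    # 3. Apply a one-step tabular Q-learning update to the simulated transition
    q_sa = Q.setdefault(s, {}).setdefault(a, 0.0)
    best_next = max(Q.get(s_next, {}).values(), default=0.0)
    Q[s][a] = q_sa + alpha * (r + gamma * best_next - q_sa)
```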
8.2 Dyna: Integrating Planning, Acting and Learning
Tabular Dyna-Q
Example 8.1: Dyna Maze
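A compact sketch of one Dyna-Q time step (acting, direct RL, model learning, and n planning updates), assuming Q is a `defaultdict(lambda: defaultdict(float))`, an environment callable `env_step(s, a) -> (r, s')`, and ε-greedy action selection; the names and defaults are mine, not the book's pseudocode:

```python
import random

def dyna_q_step(Q, model, state, env_step, actions,
                n_planning=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    """One acting step plus n planning steps of Tabular Dyna-Q."""
    # epsilon-greedy action selection in the current state
    if random.random() < epsilon or not Q[state]:
        action = random.choice(actions)
    else:
        action = max(Q[state], key=Q[state].get)

    # take the action in the real environment (assumed interface)
    reward, next_state = env_step(state, action)

    # direct RL: one-step Q-learning from real experience
    best_next = max(Q[next_state].values(), default=0.0)
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

    # model learning: remember what the environment did (deterministic model)
    model[(state, action)] = (reward, next_state)

    # planning: n Q-learning updates from simulated experience
    for _ in range(n_planning):
        s, a = random.choice(list(model.keys()))
        r, s2 = model[(s, a)]
        best = max(Q[s2].values(), default=0.0)
        Q[s][a] += alpha * (r + gamma * best - Q[s][a])

    return next_state
```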
8.3 When the Model Is Wrong
Example 8.2: Blocking Maze
Example 8.3: Shortcut Maze
- Dyna-Q+: if a transition has not been tried in $\tau$ time steps, planning updates are done as if that transition produced a reward of $r + \kappa\sqrt{\tau}$, for some small $\kappa$ (sketched below)
Exercise 8.4
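A small sketch of how the bonus can enter a Dyna-Q+ planning update, assuming a `last_tried` table recording the time step at which each $(s, a)$ was last taken in the real environment; `kappa` and the other names are illustrative:

```python
import math

def dyna_q_plus_planning_update(Q, model, last_tried, s, a, t,
                                kappa=1e-3, alpha=0.1, gamma=0.95):
    """One Dyna-Q+ planning update for the simulated pair (s, a).

    last_tried: dict (s, a) -> time step of the last real visit
    t:          current time step; tau = t - last_tried is how long
                the transition has gone untried
    """
    r, s_next = model[(s, a)]
    tau = t - last_tried.get((s, a), 0)
    # bonus reward r + kappa * sqrt(tau) encourages revisiting stale transitions
    bonus_r = r + kappa * math.sqrt(tau)
    q_sa = Q.setdefault(s, {}).setdefault(a, 0.0)
    best_next = max(Q.get(s_next, {}).values(), default=0.0)
    Q[s][a] = q_sa + alpha * (bonus_r + gamma * best_next - q_sa)
```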
8.4 Prioritized Sweeping
Work backward only from states whose values have changed.
Prioritized sweeping for a deterministic environment
Example 8.4: Prioritized Sweeping on Mazes
Example 8.5: Rod Maneuvering
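A sketch of the planning loop for the deterministic case, assuming a learned deterministic model, a table of predecessor pairs for each state, and a `heapq`-based queue storing negated priorities; all names are illustrative, and states/actions are assumed orderable so heap ties can be broken:

```python
import heapq

def prioritized_sweeping_planning(Q, model, predecessors, pqueue,
                                  n_updates=50, theta=1e-4,
                                  alpha=0.1, gamma=0.95):
    """Run up to n_updates backups in order of priority.

    model:        dict (s, a) -> (r, s'), a learned deterministic model
    predecessors: dict s -> set of (s_pred, a_pred) predicted to lead to s
    pqueue:       heap of (-priority, s, a) entries
    """
    for _ in range(n_updates):
        if not pqueue:
            break
        _, s, a = heapq.heappop(pqueue)
        # Back up the highest-priority pair with a one-step Q-learning update
        r, s_next = model[(s, a)]
        q_sa = Q.setdefault(s, {}).setdefault(a, 0.0)
        best_next = max(Q.get(s_next, {}).values(), default=0.0)
        Q[s][a] = q_sa + alpha * (r + gamma * best_next - q_sa)

        # Work backward from s: re-prioritize transitions predicted to reach it
        for s_pred, a_pred in predecessors.get(s, ()):
            r_pred, _ = model[(s_pred, a_pred)]
            best_s = max(Q.get(s, {}).values(), default=0.0)
            priority = abs(r_pred + gamma * best_s
                           - Q.get(s_pred, {}).get(a_pred, 0.0))
            if priority > theta:
                heapq.heappush(pqueue, (-priority, s_pred, a_pred))
```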
8.5 Expected vs. Sample Updates
One-step updates vary along three binary dimensions:
- update state values or action values
- estimate the value of the optimal policy or of an arbitrary given policy
- expected updates or sample updates
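For the case of approximating $q_*$, the contrast can be written out directly (with $\hat p$ the distribution model's estimate and $R, S'$ a reward and next state drawn from a sample model):

$$Q(s,a) \leftarrow \sum_{s',r} \hat{p}(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} Q(s', a')\bigr] \quad \text{(expected update)}$$

$$Q(s,a) \leftarrow Q(s,a) + \alpha\,\bigl[R + \gamma \max_{a'} Q(S', a') - Q(s,a)\bigr] \quad \text{(sample update)}$$

The expected update is exact with respect to the model but costs roughly a branching-factor more computation per update than a sample update.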
8.6 Trajectory Sampling
8.7 Real-time Dynamic Programming
Example 8.6: RTDP on the Racetrack
8.8 Planning at Decision Time
background planning: using planning to gradually improve a policy or value function
decision-time planning: using planning to select an action for the current state
8.9 Heuristic Search
8.10 Rollout Algorithms
8.11 Monte Carlo Tree Search
- Selection
- Expansion
- Simulation
- Backup
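A bare-bones sketch of one MCTS iteration over these four steps, with UCT selection and a random-playout default policy; the `Node` class and the game-state API (`legal_actions`, `step`, `random_playout`) are assumptions of this sketch, not the GomokuZero code, and perspective flipping for two-player games is omitted:

```python
import math

class Node:
    """Search-tree node; the game-state API used here is an assumption."""
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.untried = list(state.legal_actions())  # assumed game method
        self.visits = 0
        self.value = 0.0  # total simulation return backed up through this node

    def ucb(self, c=1.4):
        # UCT score: average value plus an exploration bonus
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts_iteration(root):
    node = root
    # 1. Selection: follow UCB-maximizing children while fully expanded
    while not node.untried and node.children:
        node = max(node.children, key=lambda n: n.ucb())
    # 2. Expansion: add one child for an as-yet-untried action
    if node.untried:
        action = node.untried.pop()
        child = Node(node.state.step(action), parent=node, action=action)
        node.children.append(child)
        node = child
    # 3. Simulation: rollout from the new node with the default (random) policy
    result = node.state.random_playout()  # assumed to return a scalar outcome
    # 4. Backup: propagate the visit count and outcome up to the root
    while node is not None:
        node.visits += 1
        node.value += result
        node = node.parent
```

Repeating `mcts_iteration(root)` within a time budget and then picking the root child with the most visits gives the decision-time action.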
My implementation of MCTS + ResNet for Gomoku: GomokuZero