16 Applications and Case Studies
16.1 TD-Gammon
- TD($\lambda$) + neural network (NN)
- The TD error is backpropagated through the network.
198 input units -> hidden units (40-80) -> predicted probability of winning -> TD error
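The TD($\lambda$)-with-backpropagation update above can be sketched with a one-hidden-layer network. This is a minimal illustration, not the original program: the class name `TDLambdaNet`, the sigmoid units, the step size, and the values of $\lambda$ and $\gamma$ are all assumptions chosen for the example. Only the 198 inputs and 40 hidden units come from the notes.

```python
import numpy as np

class TDLambdaNet:
    """One-hidden-layer value network trained with semi-gradient TD(lambda).

    Hypothetical sketch: sizes default to 198 inputs / 40 hidden units as in
    the notes; every other hyperparameter is an assumed placeholder.
    """

    def __init__(self, n_in=198, n_hidden=40, alpha=0.1, lam=0.7, gamma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))  # input -> hidden
        self.w2 = rng.normal(0.0, 0.1, n_hidden)          # hidden -> output
        self.alpha, self.lam, self.gamma = alpha, lam, gamma
        self.reset_traces()

    def reset_traces(self):
        # One eligibility trace per weight; call at the start of each episode.
        self.z1 = np.zeros_like(self.W1)
        self.z2 = np.zeros_like(self.w2)

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def value(self, x):
        """Return (predicted probability of winning, hidden activations)."""
        h = self._sigmoid(self.W1 @ x)
        return self._sigmoid(self.w2 @ h), h

    def update(self, x, reward, x_next, terminal):
        v, h = self.value(x)
        v_next = 0.0 if terminal else self.value(x_next)[0]
        delta = reward + self.gamma * v_next - v          # TD error

        # Backpropagation: gradient of the output v with respect to each weight.
        dv = v * (1.0 - v)
        grad_w2 = dv * h
        grad_W1 = np.outer(dv * self.w2 * h * (1.0 - h), x)

        # Accumulate traces, then move every weight along its trace,
        # scaled by the TD error.
        self.z2 = self.gamma * self.lam * self.z2 + grad_w2
        self.z1 = self.gamma * self.lam * self.z1 + grad_W1
        self.w2 += self.alpha * delta * self.z2
        self.W1 += self.alpha * delta * self.z1
        return delta
```

A typical loop would call `reset_traces()` at the start of each game and `update(...)` after every move, with a reward of 1 only on a winning terminal transition.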
16.2 Samuel’s Checkers Player
linear function approximation + TD + minimax search + alpha-beta cutoff
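As a rough illustration of how a linear evaluation function combines with minimax search and alpha-beta cutoffs, the sketch below searches a toy game tree. The tree encoding (lists for internal nodes, tuples of feature values for leaves), the function name `alphabeta`, and the `visited` argument are assumptions made for this example. None of it is taken from Samuel's program.

```python
import math
import numpy as np

def alphabeta(node, weights, maximizing=True,
              alpha=-math.inf, beta=math.inf, visited=None):
    """Minimax value of `node` with alpha-beta cutoffs.

    Internal nodes are lists of children; leaves are tuples of features
    scored by a linear evaluation function, dot(weights, features).
    """
    if isinstance(node, tuple):                 # leaf position
        if visited is not None:
            visited.append(node)                # record which leaves were scored
        return float(np.dot(weights, node))

    best = -math.inf if maximizing else math.inf
    for child in node:
        val = alphabeta(child, weights, not maximizing, alpha, beta, visited)
        if maximizing:
            best = max(best, val)
            alpha = max(alpha, best)
        else:
            best = min(best, val)
            beta = min(beta, best)
        if alpha >= beta:                       # cutoff: remaining children cannot matter
            break
    return best

# Two-ply toy tree with one feature per leaf and unit weight.
tree = [[(3.0,), (5.0,)], [(2.0,), (9.0,)]]
seen = []
root_value = alphabeta(tree, np.array([1.0]), visited=seen)
```

In this toy tree the leaf `(9.0,)` is never evaluated: once the second branch is known to be worth at most 2, it cannot beat the first branch's value of 3.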
16.3 Watson’s Daily-Double Wagering
16.4 Optimizing Memory Control
A reinforcement learning memory controller
MDP actions: precharge, activate, read, write, and NoOp.
Using Sarsa to learn an action-value function.
States were represented by six integer-valued features.
The linear function approximation was implemented by tile coding with hashing.
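The combination of Sarsa, integer-valued state features, and hashed tile coding can be sketched as follows. This is only an illustration under assumed settings: the class names, number of tilings, tile width, hash-table size, step size, discount, and exploration rate are all placeholders, and Python's built-in `hash` stands in for whatever hash function the real controller used.

```python
import numpy as np

ACTIONS = ["precharge", "activate", "read", "write", "noop"]

class HashedTileCoder:
    """Maps (integer feature vector, action) to a few indices in a fixed-size table."""

    def __init__(self, n_tilings=4, tile_width=4, table_size=4096):
        self.n_tilings = n_tilings
        self.tile_width = tile_width
        self.table_size = table_size

    def active_indices(self, state, action):
        indices = []
        for t in range(self.n_tilings):
            # Each tiling is shifted by t before the features are cut into tiles.
            coords = tuple((s + t) // self.tile_width for s in state)
            # Hashing folds the large tile space into table_size slots.
            indices.append(hash((t, action, coords)) % self.table_size)
        return indices

class LinearSarsa:
    """Sarsa with a linear action-value function over the active tiles."""

    def __init__(self, coder, n_actions, alpha=0.1, gamma=0.95, epsilon=0.05, seed=0):
        self.coder = coder
        self.n_actions = n_actions
        self.w = np.zeros(coder.table_size)
        self.alpha = alpha / coder.n_tilings    # split the step size across tilings
        self.gamma = gamma
        self.epsilon = epsilon
        self.rng = np.random.default_rng(seed)

    def q(self, state, action):
        return sum(self.w[i] for i in self.coder.active_indices(state, action))

    def act(self, state):
        # Epsilon-greedy action selection.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.n_actions))
        return int(np.argmax([self.q(state, a) for a in range(self.n_actions)]))

    def update(self, s, a, r, s_next, a_next, done=False):
        # On-policy target: uses the action actually selected in the next state.
        target = r if done else r + self.gamma * self.q(s_next, a_next)
        delta = target - self.q(s, a)
        for i in self.coder.active_indices(s, a):
            self.w[i] += self.alpha * delta
        return delta
```

Actions are passed as integer indices into `ACTIONS`, and a state is a tuple of six integers, for example `agent.update((1, 2, 3, 4, 5, 6), 2, 1.0, next_state, next_action)`.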
16.5 Human-level Video Game Play
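DQN modifies standard Q-learning in three ways:
- experience replay: store transitions and train on mini-batches sampled uniformly at random from the stored experience.
- target network: after every C updates, copy the network weights into a duplicate network that is held fixed for the next C updates and supplies the Q-learning targets.
- clip the TD error to the interval [-1, 1].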
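The three modifications can be seen together in a small sketch. To keep it self-contained, a linear Q-function over a feature vector replaces the deep convolutional network, so this is not DQN itself. The class name `TinyDQN`, the parameter `sync_every` (playing the role of C), and all hyperparameter values are assumptions for illustration.

```python
import random
from collections import deque
import numpy as np

class TinyDQN:
    """Q-learning with replay, a periodically synced target copy, and error clipping.

    Assumption: a linear Q-function (one weight row per action) stands in for
    the deep network so that the example stays short.
    """

    def __init__(self, n_features, n_actions, alpha=0.01, gamma=0.99,
                 sync_every=100, capacity=10_000, seed=0):
        self.W = np.zeros((n_actions, n_features))   # online weights
        self.W_target = self.W.copy()                # frozen copy used for targets
        self.replay = deque(maxlen=capacity)         # experience replay memory
        self.alpha, self.gamma = alpha, gamma
        self.sync_every = sync_every                 # "C" in the notes
        self.n_updates = 0
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next, done):
        self.replay.append((np.asarray(s, dtype=float), a, r,
                            np.asarray(s_next, dtype=float), done))

    def train_step(self, batch_size=32):
        # (1) Experience replay: sample stored transitions uniformly at random.
        k = min(batch_size, len(self.replay))
        batch = self.rng.sample(list(self.replay), k)
        for s, a, r, s_next, done in batch:
            # (2) Targets come from the frozen copy, not the online weights.
            target = r if done else r + self.gamma * np.max(self.W_target @ s_next)
            # (3) Clip the TD error to [-1, 1] before applying it.
            err = float(np.clip(target - self.W[a] @ s, -1.0, 1.0))
            self.W[a] += self.alpha * err * s
        self.n_updates += 1
        if self.n_updates % self.sync_every == 0:
            self.W_target = self.W.copy()            # refresh the target every C updates
```

With a reward of 10 and zero initial weights, the first update moves the weight by only `alpha * 1` because the error is clipped, and `W_target` stays at zero until `sync_every` updates have run.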
16.6 Mastering the Game of Go
MCTS + ResNet
16.6.2 AlphaGo Zero
16.7 Personalized Web Services
The objective is to maximize the click-through rate, formulated as a contextual bandit problem.
Greedy optimization: maximize only the probability of an immediate click.
Life-time value (LTV) optimization: increase the number of clicks users make over multiple visits to a website.
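The greedy formulation can be illustrated with a tabular epsilon-greedy contextual bandit that tracks the click-through rate of each offer in each context. This is a toy sketch under assumed simplifications: contexts are discrete integer ids, and the class name, method names, and exploration rate are hypothetical. It does not cover life-time value optimization, which needs a sequential (MDP) treatment.

```python
import numpy as np

class EpsilonGreedyContextualBandit:
    """Chooses the offer with the highest estimated immediate click probability."""

    def __init__(self, n_contexts, n_offers, epsilon=0.1, seed=0):
        self.clicks = np.zeros((n_contexts, n_offers))  # clicks per (context, offer)
        self.shows = np.zeros((n_contexts, n_offers))   # impressions per (context, offer)
        self.n_offers = n_offers
        self.epsilon = epsilon
        self.rng = np.random.default_rng(seed)

    def ctr(self, context):
        # Estimated click-through rate; offers never shown get an estimate of 0.
        return self.clicks[context] / np.maximum(self.shows[context], 1)

    def choose(self, context):
        if self.rng.random() < self.epsilon:            # explore
            return int(self.rng.integers(self.n_offers))
        return int(np.argmax(self.ctr(context)))        # exploit: greedy on immediate clicks

    def record(self, context, offer, clicked):
        self.shows[context, offer] += 1
        self.clicks[context, offer] += int(clicked)
```

Each visit calls `choose(context)` to pick an offer and `record(context, offer, clicked)` once the outcome is known.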