3 Finite Markov Decision Processes

This chapter introduces the formal problem of finite MDPs, which the rest of the book tries to solve.

Central theme: the trade-off between immediate and delayed reward.

3.1 The Agent-Environment Interface

3.2 Goals and Rewards

3.3 Returns and Episodes

3.4 Unified Notation for Episodic and Continuing Tasks

3.5 Policies and Value Functions

policy: $\pi(a|s)$ is the probability that $A_t = a$ if $S_t = s$.
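A tabular stochastic policy can be stored as a row-stochastic matrix with one row per state; a minimal sketch (the state/action counts and the uniform probabilities are made up for illustration):

```python
import numpy as np

n_states, n_actions = 4, 2               # hypothetical sizes
rng = np.random.default_rng(0)

# pi[s, a] = probability of taking action a in state s; each row sums to 1.
pi = np.full((n_states, n_actions), 1.0 / n_actions)

def sample_action(pi, s):
    """Draw A_t ~ pi(. | S_t = s)."""
    return rng.choice(pi.shape[1], p=pi[s])

a = sample_action(pi, s=0)
```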

value function for policy $\pi$: $v_{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$, the expected return when starting in state $s$ and following $\pi$ thereafter.

action-value function for policy $\pi$: $q_{\pi}(s,a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$, the expected return from taking action $a$ in state $s$ and thereafter following $\pi$.

The value functions $v_{\pi}$ and $q_{\pi}$ can be estimated from experience, e.g., with Monte Carlo methods that average many sampled returns.
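A minimal sketch of first-visit Monte Carlo estimation of $v_{\pi}$, assuming each episode is recorded as a list of (state, reward) pairs obtained by following $\pi$ (the `episodes` structure and `gamma` value are assumptions for illustration):

```python
from collections import defaultdict

def mc_state_values(episodes, gamma=0.9):
    """First-visit Monte Carlo: estimate v_pi(s) by averaging sampled returns.

    episodes: list of episodes; each episode is a list of (state, reward)
              pairs, reward being the reward received after leaving that state.
    """
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        first_visit_return = {}
        # Walk backwards so G accumulates the discounted return from each step;
        # later overwrites keep the return of the FIRST visit to each state.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_visit_return[state] = G
        for state, G in first_visit_return.items():
            returns[state].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

With enough episodes, the average return for each visited state converges to $v_{\pi}(s)$.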

\[v_{\pi}(s) = \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a)\left[r + \gamma v_{\pi}(s')\right]\]
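This is the Bellman equation for $v_{\pi}$; treating it as an update rule gives iterative policy evaluation (developed in Chapter 4). A sketch assuming tabular arrays, with the sum over rewards folded into an expected-reward table `r[s, a, s']` (all array names here are assumptions):

```python
import numpy as np

def policy_evaluation(pi, p, r, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- sum_a pi(a|s) sum_{s'} p(s'|s,a) [r(s,a,s') + gamma v(s')].

    pi: (S, A) action probabilities, p: (S, A, S) transition probabilities,
    r:  (S, A, S) expected reward for each (s, a, s') transition.
    """
    v = np.zeros(pi.shape[0])
    while True:
        # One full Bellman expectation backup for every state at once.
        v_new = np.einsum('sa,sat,sat->s', pi, p, r + gamma * v)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```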

Example 3.6 Gridworld

Example 3.7 Golf

3.6 Optimal Policies and Optimal Value Functions

optimal policy $\pi_*$ (there may be more than one, but they all share the same value functions)

optimal state-value function $v_*(s) = \max_\pi v_\pi(s)$

optimal action-value function $q_*(s,a) = \max_{\pi} q_\pi(s,a)$
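Both satisfy the Bellman optimality equations (standard forms, written out here for reference):

\[v_*(s) = \max_a \sum_{s', r} p(s', r | s, a)\left[r + \gamma v_*(s')\right]\]

\[q_*(s,a) = \sum_{s', r} p(s', r | s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]\]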

Because an optimal policy is greedy with respect to its own value function, can we learn by an EM-like alternation: for the current value function, compute the greedy policy; for the current policy, update the value function? This is essentially policy iteration, covered in Chapter 4; a sketch follows below.
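A minimal sketch of that alternation in code, reusing the hypothetical tabular `p` and `r` arrays from the policy-evaluation sketch above; with a one-step evaluation and a greedy improvement the loop reduces to value iteration, a simple instance of the evaluation/improvement interplay developed in Chapter 4:

```python
import numpy as np

def greedy_policy(q):
    """Deterministic policy that is greedy with respect to q[s, a]."""
    pi = np.zeros_like(q)
    pi[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
    return pi

def alternate(p, r, gamma=0.9, sweeps=1000):
    """Alternate greedy improvement with a one-step evaluation of the new policy.

    p: (S, A, S) transition probabilities, r: (S, A, S) expected rewards.
    """
    S, A, _ = p.shape
    pi = np.full((S, A), 1.0 / A)        # start from the uniform random policy
    v = np.zeros(S)
    for _ in range(sweeps):
        q = np.einsum('sat,sat->sa', p, r + gamma * v)   # back up q from current v
        pi = greedy_policy(q)                            # improvement: act greedily
        v = np.einsum('sa,sa->s', pi, q)                 # evaluation: one backup of new pi
    return pi, v
```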

3.7 Optimality and Approximation

3.8 Summary