Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 4
→ Last time we covered policy evaluation in the model-free setting
How can an agent start making good decisions when it doesn't know how the world works? That is, how do we learn to make a "good decision"?
Today we consider settings where either of the following holds:
→ the MDP model is unknown, but can be sampled
→ the MDP model is known, but is too large to use directly, except through sampling
→ On-policy: learn from direct experience, i.e., from trajectories generated by following the policy being evaluated and improved
Let us recall policy iteration in the case where the model is known. You would:
Initialize policy $\pi$
Loop :
→ compute $V^\pi$ (policy evaluation)
→ update the policy $\pi$ (policy improvement):
$\pi'(s) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V^\pi(s') \right] = \arg\max_a Q^\pi(s,a)$
Since there are at most $|A|^{|S|}$ distinct deterministic policies, this loop terminates after at most $|A|^{|S|}$ iterations (a minimal code sketch of the loop follows below).
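To make the loop concrete, here is a minimal sketch in NumPy, not from the lecture: the function name `policy_iteration` and the dense-array layout of `P` and `R` are assumptions for illustration. Evaluation is done exactly by solving the linear Bellman system, and improvement is the greedy $\arg\max_a Q^\pi(s,a)$ from above.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration for a known MDP (illustrative sketch).

    P: (S, A, S) array, P[s, a, s'] = transition probability.
    R: (S, A) array, R[s, a] = expected immediate reward.
    (These names and array layouts are assumptions, not from the lecture.)
    """
    num_states, num_actions, _ = P.shape
    pi = np.zeros(num_states, dtype=int)  # arbitrary initial policy

    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        R_pi = R[np.arange(num_states), pi]   # (S,) rewards under pi
        P_pi = P[np.arange(num_states), pi]   # (S, S) transitions under pi
        V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R_pi)

        # Policy improvement: act greedily with respect to Q^pi.
        # Q[s, a] = R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
        Q = R + gamma * P @ V                 # (S, A)
        pi_new = Q.argmax(axis=1)

        if np.array_equal(pi_new, pi):        # policy stable -> done
            return pi, V
        pi = pi_new
```

The loop stops when the greedy policy no longer changes; because each improvement step is monotone and there are only finitely many deterministic policies, this matches the $|A|^{|S|}$ bound above.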