Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 4

→ Last time we evaluated a policy in the model-free setting

How can an agent start making good decisions when it doesn’t know how the world works? How do we make a “good decision”?


Learning to Control Involves...

Today we consider one of the two situations below:

→ MDP model is unknown but can be sampled

→ MDP model is known, but it is infeasible to use directly, except through sampling

On-Policy and Off-Policy Learning

On-Policy: Learn from direct experience


Generalized Policy Iteration

Let us recall policy iteration in the case where the model is known. You would

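We iterate this loop at most $|A|^{|S|}$ times, since there are only $|A|^{|S|}$ deterministic policies and each improvement step never returns to a worse one.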
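
As a reminder of what the known-model loop looks like, here is a minimal sketch of policy iteration in Python. It is not code from the lecture: the two-state, two-action MDP, the names `P`, `R`, `policy_evaluation`, `policy_improvement`, and `policy_iteration`, and the choice of `gamma = 0.9` are all assumptions made for illustration. Policy evaluation is done by simple iterative sweeps rather than by solving the linear system exactly.

```python
# Hypothetical toy MDP (an assumption, not from the lecture):
# P[s][a] is the list of next-state probabilities, R[s][a] the reward.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
R = [
    [0.0, 0.0],  # no reward for any action taken in state 0
    [1.0, 0.0],  # reward 1 for staying in state 1
]


def q_value(s, a, V, P, R, gamma):
    """One-step lookahead: R(s,a) + gamma * sum_s' P(s'|s,a) V(s')."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))


def policy_evaluation(policy, P, R, gamma, tol=1e-10):
    """Iteratively compute V^pi for a deterministic policy (list of actions)."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_new = q_value(s, policy[s], V, P, R, gamma)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


def policy_improvement(V, P, R, gamma):
    """Return the policy that is greedy with respect to V."""
    new_policy = []
    for s in range(len(P)):
        q = [q_value(s, a, V, P, R, gamma) for a in range(len(P[s]))]
        new_policy.append(max(range(len(q)), key=q.__getitem__))
    return new_policy


def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = [0] * len(P)  # arbitrary initial policy
    iterations = 0
    while True:
        V = policy_evaluation(policy, P, R, gamma)
        new_policy = policy_improvement(V, P, R, gamma)
        iterations += 1
        if new_policy == policy:
            return policy, V, iterations
        policy = new_policy


policy, V, iterations = policy_iteration(P, R, gamma=0.9)
print(policy, iterations)
```

On this toy problem the loop stops well within the $|A|^{|S|} = 2^2 = 4$ bound. Note that every step above reads `P` and `R` directly, which is exactly what is unavailable in the model-free setting.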