Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 4
→ Last time we covered policy evaluation in the model-free setting
How can an agent start making good decisions when it doesn't know how the world works? That is, how do we learn to make a "good decision"?
Today we consider settings where either of the following holds:
→ the MDP model is unknown, but can be sampled
→ the MDP model is known, but is too large to use directly, except through sampling
→ On-policy: learn from direct experience, i.e., from trajectories generated by following the policy being evaluated and improved
Let us recall policy iteration in the case where the model is known. You would:
Initialize policy $\pi$
Loop :
→ compute $V^\pi$ (policy evaluation)
→ update the policy $\pi$ (policy improvement):
$\pi'(s) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V^\pi(s') \right] = \arg\max_a Q^\pi(s,a)$
Since there are at most $|A|^{|S|}$ distinct deterministic policies, this loop terminates after at most $|A|^{|S|}$ iterations (a minimal code sketch of the loop follows below).
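To make the loop concrete, here is a minimal sketch in NumPy, not from the lecture: the function name `policy_iteration` and the dense-array layout of `P` and `R` are assumptions for illustration. Evaluation is done exactly by solving the linear Bellman system, and improvement is the greedy $\arg\max_a Q^\pi(s,a)$ from above.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration for a known MDP (illustrative sketch).

    P: (S, A, S) array, P[s, a, s'] = transition probability.
    R: (S, A) array, R[s, a] = expected immediate reward.
    (These names and array layouts are assumptions, not from the lecture.)
    """
    num_states, num_actions, _ = P.shape
    pi = np.zeros(num_states, dtype=int)  # arbitrary initial policy

    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        R_pi = R[np.arange(num_states), pi]   # (S,) rewards under pi
        P_pi = P[np.arange(num_states), pi]   # (S, S) transitions under pi
        V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R_pi)

        # Policy improvement: act greedily with respect to Q^pi.
        # Q[s, a] = R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
        Q = R + gamma * P @ V                 # (S, A)
        pi_new = Q.argmax(axis=1)

        if np.array_equal(pi_new, pi):        # policy stable -> done
            return pi, V
        pi = pi_new
```

The loop stops when the greedy policy no longer changes; because each improvement step is monotone and there are only finitely many deterministic policies, this matches the $|A|^{|S|}$ bound above.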