Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 7
there are occasions where rewards are sparse in time, or where each iteration of gathering experience is super expensive or risky
→ autonomous driving kind of stuff
So we summon an expert to demonstrate trajectories
Our Problem Setup
we are given the state space, action space, and transition model, but no reward function $R$ — instead we get a set of expert demonstrations $(s_0, a_0, s_1, a_1, ...)$
we will talk about three methods below: behavioral cloning (directly learn the expert's policy via supervised learning), inverse RL (recover the expert's reward function $R$), and apprenticeship learning (use the recovered $R$ to generate a good policy)
the first one, behavioral cloning, seems familiar... a lot like simple supervised learning
we fix a policy class (neural network, decision tree, ...) and estimate the policy from the demonstration set
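to make this concrete, here's a minimal behavioral-cloning sketch (my own illustration, not the lecture's code; the demonstration data is synthetic, and the MLP policy class, state dimension, and expert labels are all assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Hypothetical demonstration set: 1000 expert (state, action) pairs,
# 4-dimensional states, 2 discrete actions. The labels are a synthetic
# stand-in for a real expert's choices.
states = rng.normal(size=(1000, 4))
actions = (states[:, 0] > 0).astype(int)

# "Fix a policy class and estimate the policy" is just an ordinary
# supervised fit of actions given states.
policy = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
policy.fit(states, actions)

# At execution time, the cloned policy predicts what the expert would do.
print(policy.predict(rng.normal(size=(1, 4))))
```

note that nothing RL-specific happens here: the policy never interacts with the environment during training, which is exactly where the trouble below comes from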
let’s go over two notable models
ALVINN
→ short for Autonomous Land Vehicle In a Neural Network: Pomerleau's 1989 system that learned to steer a vehicle from camera images by imitating a human driver
ALVINN-style behavioral cloning runs into a major problem: compounding errors
→ this comes from supervised learning's basic assumption that all data are i.i.d. (independent and identically distributed). But our data $(s_0, a_0, s_1, a_1, ...)$ are sequential and correlated: a single wrong action drifts the agent into states the expert never demonstrated, and errors then accumulate along the trajectory. If the per-step error is $\epsilon$, the total error can grow as $O(\epsilon T^2)$ over a horizon $T$, rather than the $O(\epsilon T)$ we'd get in the i.i.d. case.
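here's a toy simulation of the compounding-error effect (again my own sketch with made-up dynamics, not from the lecture): the cloned policy slips with probability $\epsilon$ on states the expert visited, and once it drifts off the demonstrated distribution it has no data there, so it keeps erring:

```python
import numpy as np

def rollout(T, eps, rng):
    """Count mistaken steps over horizon T: err with prob eps while
    on the expert's distribution, always err once off it."""
    on_track, mistakes = True, 0
    for _ in range(T):
        if on_track:
            if rng.random() < eps:  # one small slip...
                on_track = False    # ...leaves the demonstrated states
                mistakes += 1
        else:
            mistakes += 1           # no training data out here
    return mistakes

rng = np.random.default_rng(0)
eps = 0.001
for T in (10, 100, 1000):
    avg = np.mean([rollout(T, eps, rng) for _ in range(2000)])
    print(f"T={T:5d}  avg mistakes={avg:8.3f}  eps*T^2/2={eps * T * T / 2:8.3f}")
```

for small $\epsilon T$ the average tracks $\epsilon T^2 / 2$ rather than $\epsilon T$, which is exactly the quadratic blow-up above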