Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 7

Imitation Learning

there are occasions where rewards are sparse in time, or where collecting experience (each rollout) is very expensive or dangerous

→ autonomous driving kind of stuff, where trial-and-error exploration is too risky

So instead, we bring in an expert to demonstrate trajectories

Our Problem Setup

- given: state space, action space, transition model $P(s'|s,a)$
- NOT given: the reward function $R$
- instead we get one or more teacher demonstrations $(s_0, a_0, s_1, a_1, ...)$, where actions are drawn from the teacher's policy $\pi^*$

we will talk about three methods below, each with a different goal: behavioral cloning (directly learn the teacher's policy), inverse RL (recover the reward function $R$ from demonstrations), and apprenticeship learning (use the recovered $R$ to generate a good policy)


Behavioral Cloning

seems familiar... it is essentially simple supervised learning

we fix a policy class (e.g., neural network, decision tree, ...) and estimate a policy from the demonstration set of $(state, action)$ pairs
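A minimal sketch of this idea, assuming a made-up 1-D "stay in the lane" task (the expert, dynamics, and gain below are illustrative, not from the lecture): behavioral cloning here is just least-squares regression from states to expert actions over a fixed linear policy class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert: steer proportionally back toward the lane center.
states = rng.uniform(-1.0, 1.0, size=(500, 1))      # lateral offset from center
expert_actions = -0.8 * states[:, 0]                # expert steering command

# Behavioral cloning = supervised learning: fit the policy class
# (linear with bias) to the (state, action) demonstration pairs.
X = np.hstack([states, np.ones((len(states), 1))])  # add bias feature
w, *_ = np.linalg.lstsq(X, expert_actions, rcond=None)

def policy(s):
    """Cloned policy: linear map from state to action."""
    return w[0] * s + w[1]

print(round(w[0], 3))  # → -0.8: the cloned policy recovers the expert's gain
```

Swapping the linear fit for a neural network or decision tree changes the policy class but not the recipe: it is ordinary supervised learning on demonstration data.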

let’s go over two notable models

ALVINN

(ALVINN: a neural network that maps camera/sensor input to steering commands, trained on human driving demonstrations)

ALVINN (and behavioral cloning in general) encounters a major problem of compounding errors

→ supervised learning assumes all data are iid (independent and identically distributed), but our dataset $(s_0,a_0,s_1,a_1,...)$ is sequential and correlated. The cloned policy is only trained on states the expert visits, so one small mistake drifts it into unfamiliar states where it errs again; errors accumulate quadratically in the horizon $T$ rather than linearly.