Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 7
there are occasions where rewards are sparse in time, or where each iteration of gathering experience is super expensive or risky
→ autonomous driving kind of stuff
So we summon an expert to demonstrate trajectories
Our Problem Setup
we are given the state space, action space, and transition model, but no reward function $R$ — instead we get a set of expert demonstrations $(s_0, a_0, s_1, a_1, ...)$
we will talk about three methods below: behavioral cloning (directly learn the expert's policy via supervised learning), inverse RL (recover the expert's reward function $R$), and apprenticeship learning (use the recovered $R$ to generate a good policy)
the first one, behavioral cloning, seems familiar... a lot like simple supervised learning
we fix a policy class (neural network, decision tree, ...) and estimate the policy from the demonstration set
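to make this concrete, here's a minimal behavioral-cloning sketch (my own illustration, not the lecture's code; the demonstration data is synthetic, and the MLP policy class, state dimension, and expert labels are all assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Hypothetical demonstration set: 1000 expert (state, action) pairs,
# 4-dimensional states, 2 discrete actions. The labels are a synthetic
# stand-in for a real expert's choices.
states = rng.normal(size=(1000, 4))
actions = (states[:, 0] > 0).astype(int)

# "Fix a policy class and estimate the policy" is just an ordinary
# supervised fit of actions given states.
policy = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
policy.fit(states, actions)

# At execution time, the cloned policy predicts what the expert would do.
print(policy.predict(rng.normal(size=(1, 4))))
```

note that nothing RL-specific happens here: the policy never interacts with the environment during training, which is exactly where the trouble below comes from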
let’s go over two notable models
ALVINN
→ short for Autonomous Land Vehicle In a Neural Network: Pomerleau's 1989 system that learned to steer a vehicle from camera images by imitating a human driver
ALVINN-style behavioral cloning runs into a major problem: compounding errors
→ this comes from supervised learning's basic assumption that all data are i.i.d. (independent and identically distributed). But our data $(s_0, a_0, s_1, a_1, ...)$ are sequential and correlated: a single wrong action drifts the agent into states the expert never demonstrated, and errors then accumulate along the trajectory. If the per-step error is $\epsilon$, the total error can grow as $O(\epsilon T^2)$ over a horizon $T$, rather than the $O(\epsilon T)$ we'd get in the i.i.d. case.
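here's a toy simulation of the compounding-error effect (again my own sketch with made-up dynamics, not from the lecture): the cloned policy slips with probability $\epsilon$ on states the expert visited, and once it drifts off the demonstrated distribution it has no data there, so it keeps erring:

```python
import numpy as np

def rollout(T, eps, rng):
    """Count mistaken steps over horizon T: err with prob eps while
    on the expert's distribution, always err once off it."""
    on_track, mistakes = True, 0
    for _ in range(T):
        if on_track:
            if rng.random() < eps:  # one small slip...
                on_track = False    # ...leaves the demonstrated states
                mistakes += 1
        else:
            mistakes += 1           # no training data out here
    return mistakes

rng = np.random.default_rng(0)
eps = 0.001
for T in (10, 100, 1000):
    avg = np.mean([rollout(T, eps, rng) for _ in range(2000)])
    print(f"T={T:5d}  avg mistakes={avg:8.3f}  eps*T^2/2={eps * T * T / 2:8.3f}")
```

for small $\epsilon T$ the average tracks $\epsilon T^2 / 2$ rather than $\epsilon T$, which is exactly the quadratic blow-up above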