Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 1
How an intelligent agent learns to make good sequences of decisions through repeated interactions with the world
Key aspects of RL
Optimization
→ goal is to find an optimal way to make decisions!
Delayed consequences
→ decisions now can impact future situations...
Exploration
→ agent has to learn about the world by interacting with it
→ only gets censored data (a reward for the decision made): the agent doesn't know what would have happened had it chosen differently (see the bandit sketch after this list)
→ decisions impact what the agent learns
Generalization
→ a policy is a mapping from past experience to action
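A toy sketch of censored feedback and exploration, assuming a 3-armed bandit with made-up success probabilities and an ε-greedy rule (neither comes from the lecture):

```python
import random

# Toy 3-armed bandit; the true success probabilities below are made up
# for illustration and are unknown to the agent.
TRUE_MEANS = [0.3, 0.5, 0.8]

def pull(arm):
    # Censored feedback: the agent observes a reward only for the arm it pulled.
    return 1.0 if random.random() < TRUE_MEANS[arm] else 0.0

counts = [0, 0, 0]        # how many times each arm has been tried
values = [0.0, 0.0, 0.0]  # running estimate of each arm's mean reward
epsilon = 0.1             # exploration rate (hypothetical choice)

for t in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(3)                     # explore: random arm
    else:
        arm = max(range(3), key=lambda a: values[a])  # exploit: best estimate
    r = pull(arm)
    counts[arm] += 1
    values[arm] += (r - values[arm]) / counts[arm]    # incremental mean update

print(values)  # estimates are accurate only for arms the agent actually tried
```

Note the agent never learns anything about arms it does not pull, which is exactly why its decisions shape what it can learn.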
Comparing RL with similar AI procedures
AI Planning (e.g., Go)
→ involves Optimization, Generalization, Delayed Consequences
→ does not require Exploration, since a model of the world is already given
Supervised Machine Learning
→ involves Optimization, Generalization
→ does not involve Exploration or Delayed Consequences, since the dataset and labels are given
→ the learner immediately sees the result of each decision (e.g., right or wrong in a classification problem)
Unsupervised Machine Learning
→ involves Optimization, Generalization
→ does not involve Exploration or Delayed Consequences, since a dataset is given (just without labels)
Imitation Learning
→ involves Optimization, Generalization, Delayed Consequences
→ does not require Exploration
→ observes and learns from another agent's experiences (demonstrations)
Goal: select actions to maximize total expected future reward
→ may require strategic behavior to achieve high reward (need to balance immediate vs. long-term rewards)
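A minimal sketch of this trade-off, assuming a discount factor $\gamma$ (discounting itself is introduced later in the course):

```python
def discounted_return(rewards, gamma=0.9):
    """Total discounted future reward: G = sum_k gamma^k * r_k."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Myopic choice: big reward now, nothing later.
print(discounted_return([10, 0, 0, 0]))   # 10.0
# Strategic choice: small cost now, larger rewards later.
print(discounted_return([-1, 5, 5, 5]))   # ~11.2, better despite the early loss
```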
Agent & World Interaction (Discrete Time)
Each time step $t$:
→ agent takes action $a_t$
→ world updates given action $a_t$, emits observation $o_t$ and reward $r_t$
→ agent receives observation $o_t$ and reward $r_t$
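A minimal sketch of this interaction loop; the `World` dynamics and the `Agent` methods (`act`, `observe`) below are hypothetical stand-ins, not anything from the lecture:

```python
import random

class World:
    """Stand-in environment with made-up dynamics, for illustration only."""
    def step(self, action):
        observation = random.random()          # o_t: what the agent gets to see
        reward = 1.0 if action == 1 else 0.0   # r_t: scalar feedback
        return observation, reward

class Agent:
    """Hypothetical agent interface."""
    def act(self, observation):
        return random.choice([0, 1])           # a_t: chosen action
    def observe(self, observation, reward):
        pass                                   # would update history / estimates

world, agent = World(), Agent()
obs = 0.0
for t in range(5):
    action = agent.act(obs)           # agent takes action a_t
    obs, reward = world.step(action)  # world emits observation o_t and reward r_t
    agent.observe(obs, reward)        # agent receives o_t and r_t
```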