Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 1
How an intelligent agent learns to make good sequences of decisions through repeated interactions with the world
Key aspects of RL
Optimization
→ goal is to find an optimal way to make decisions!
Delayed consequences
→ decisions now can impact future situations...
Exploration
→ agent has to learn about the world by interacting with it
→ only gets censored data (a reward for the decision made): the agent doesn't know what would have happened had it chosen differently (see the bandit sketch after this list)
→ decisions impact what the agent learns
Generalization
→ a policy is a mapping from past experience to action
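A toy sketch of censored feedback and exploration, assuming a 3-armed bandit with made-up success probabilities and an ε-greedy rule (neither comes from the lecture):

```python
import random

# Toy 3-armed bandit; the true success probabilities below are made up
# for illustration and are unknown to the agent.
TRUE_MEANS = [0.3, 0.5, 0.8]

def pull(arm):
    # Censored feedback: the agent observes a reward only for the arm it pulled.
    return 1.0 if random.random() < TRUE_MEANS[arm] else 0.0

counts = [0, 0, 0]        # how many times each arm has been tried
values = [0.0, 0.0, 0.0]  # running estimate of each arm's mean reward
epsilon = 0.1             # exploration rate (hypothetical choice)

for t in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(3)                     # explore: random arm
    else:
        arm = max(range(3), key=lambda a: values[a])  # exploit: best estimate
    r = pull(arm)
    counts[arm] += 1
    values[arm] += (r - values[arm]) / counts[arm]    # incremental mean update

print(values)  # estimates are accurate only for arms the agent actually tried
```

Note the agent never learns anything about arms it does not pull, which is exactly why its decisions shape what it can learn.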
Comparing RL with similar AI procedures
AI Planning (e.g., Go)
→ involves Optimization, Generalization, Delayed Consequences
→ does not require Exploration, since a model of the world is already given
Supervised Machine Learning
→ involves Optimization, Generalization
→ does not involve Exploration or Delayed Consequences, since the dataset and labels are given
→ the learner immediately sees the result of each decision (e.g., right or wrong in a classification problem)
Unsupervised Machine Learning
→ involves Optimization, Generalization
→ does not involve Exploration or Delayed Consequences, since a dataset is given (just without labels)
Imitation Learning
→ involves Optimization, Generalization, Delayed Consequences
→ does not require Exploration
→ observes and learns from another agent's experiences (demonstrations)
Goal: select actions to maximize total expected future reward
→ may require strategic behavior to achieve high reward (need to balance immediate vs. long-term rewards)
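A minimal sketch of this trade-off, assuming a discount factor $\gamma$ (discounting itself is introduced later in the course):

```python
def discounted_return(rewards, gamma=0.9):
    """Total discounted future reward: G = sum_k gamma^k * r_k."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Myopic choice: big reward now, nothing later.
print(discounted_return([10, 0, 0, 0]))   # 10.0
# Strategic choice: small cost now, larger rewards later.
print(discounted_return([-1, 5, 5, 5]))   # ~11.2, better despite the early loss
```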
Agent & World Interaction (Discrete Time)
Each time step $t$:
→ agent takes action $a_t$
→ world updates given action $a_t$, emits observation $o_t$ and reward $r_t$
→ agent receives observation $o_t$ and reward $r_t$
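A minimal sketch of this interaction loop; the `World` dynamics and the `Agent` methods (`act`, `observe`) below are hypothetical stand-ins, not anything from the lecture:

```python
import random

class World:
    """Stand-in environment with made-up dynamics, for illustration only."""
    def step(self, action):
        observation = random.random()          # o_t: what the agent gets to see
        reward = 1.0 if action == 1 else 0.0   # r_t: scalar feedback
        return observation, reward

class Agent:
    """Hypothetical agent interface."""
    def act(self, observation):
        return random.choice([0, 1])           # a_t: chosen action
    def observe(self, observation, reward):
        pass                                   # would update history / estimates

world, agent = World(), Agent()
obs = 0.0
for t in range(5):
    action = agent.act(obs)           # agent takes action a_t
    obs, reward = world.step(action)  # world emits observation o_t and reward r_t
    agent.observe(obs, reward)        # agent receives o_t and r_t
```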