Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 8

Policy-Based Reinforcement Learning

Recall last lecture, where we learned to approximate the state-value function ($V$) or the state-action value function ($Q$) with parameters $w$ (or $\theta$), and then used that approximation to derive a (hopefully) optimal policy ($\pi$).

Today we’ll parameterize the policy $\pi_\theta$ itself with $\theta$:

$$ \pi_\theta(s,a)=P[a|s;\theta] $$

Our goal is to find a policy (i.e., parameters $\theta$) with maximal value function $V^{\pi_\theta}$.
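
As a minimal sketch (not from the lecture), one common way to realize $\pi_\theta(s,a)=P[a|s;\theta]$ is a softmax over linear feature scores. The feature map `phi` and the discrete action set below are assumptions for illustration only:

```python
import numpy as np

def softmax_policy(theta, phi, state, actions):
    """pi_theta(a|s): softmax over linear scores theta . phi(s, a).

    `phi(state, action)` is an assumed feature map returning a vector of the
    same length as `theta`; the lecture only defines pi_theta abstractly.
    """
    scores = np.array([theta @ phi(state, a) for a in actions])
    scores -= scores.max()            # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def sample_action(theta, phi, state, actions, rng):
    """Draw a ~ pi_theta(. | state)."""
    probs = softmax_policy(theta, phi, state, actions)
    return actions[rng.choice(len(actions), p=probs)]

# Toy usage (made-up features and actions, purely illustrative):
rng = np.random.default_rng(0)
phi = lambda s, a: np.array([s == 0, a == "right"], dtype=float)
theta = np.array([0.5, 1.0])
print(softmax_policy(theta, phi, state=0, actions=["left", "right"]))
print(sample_action(theta, phi, 0, ["left", "right"], rng))
```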

Benefits and Demerits

Advantages
- Better convergence properties
- Effective in high-dimensional or continuous action spaces
- Can learn stochastic policies (needed under state aliasing; see below)

Disadvantages
- Typically converges to a local rather than a global optimum
- Evaluating a policy is typically inefficient and high-variance

Stochastic policies are needed in aliased environments

(Figure: aliased gridworld example)

Under aliasing, compare value-based RL, which approximates a value function over features $\phi(s,a)$, with policy-based RL, which parameterizes the policy directly:

$$ Q_\theta(s,a)=f(\phi(s,a),\theta) $$

$$ \pi_\theta(s,a)=g(\phi(s,a),\theta) $$

If the policy were deterministic, the agent would get stuck at either A or B, whereas a stochastic policy keeps positive probability of moving in both directions, so it eventually escapes.

(Figures: the aliased gridworld under a deterministic policy vs. a stochastic policy)
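
A toy numerical sketch of this point (all numbers and the two-element feature map are made-up assumptions): because the grey cells are aliased they share the same features, so a greedy policy derived from $Q_\theta$ must pick the same action in both, while a softmax policy $\pi_\theta$ keeps probability on both directions:

```python
import numpy as np

# The aliased grey cells share the same features, so phi(s, a) is the same
# function of the action in both cells (assumed toy feature map).
phi = {"left": np.array([1.0, 0.0]), "right": np.array([0.0, 1.0])}
theta = np.array([0.3, 0.7])          # made-up parameters

# Value-based + greedy => deterministic: the same action in BOTH grey cells,
# so from one of them the agent walks into a dead end and gets stuck.
q = {a: float(theta @ f) for a, f in phi.items()}
greedy_action = max(q, key=q.get)     # 'right' in both aliased cells

# Policy-based + softmax => stochastic: both directions keep probability,
# so the agent eventually escapes whichever dead end it wanders toward.
scores = np.array([theta @ f for f in phi.values()])
probs = np.exp(scores - scores.max())
probs /= probs.sum()
pi = dict(zip(phi.keys(), probs))

print(greedy_action)                  # right
print(pi)                             # {'left': ~0.40, 'right': ~0.60}
```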

Policy Objective Function