Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 8
Recall the last lecture → we learned to approximate the state-value function ($V$) or state-action value function ($Q$) with parameters $w$ (or $\theta$), and then used those estimates to build a (hopefully) optimal policy ($\pi$).
Today we parameterize the policy itself with $\theta$ and work with $\pi_\theta$ directly:
$$ \pi_\theta(s,a)=P[a \mid s;\theta] $$
Our goal is to find the parameters $\theta$ whose policy has the maximal value function $V^{\pi_\theta}$.
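As a concrete sketch of such a parameterization (my own illustration, not code from the lecture), a softmax over linear state-action scores gives a valid distribution $P[a \mid s;\theta]$ for any $\theta$; the one-hot feature map `phi` below is an assumed example choice:

```python
import numpy as np

def phi(s, a, n_states, n_actions):
    """One-hot state-action feature vector (illustrative choice)."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def softmax_policy(theta, s, n_states, n_actions):
    """pi_theta(a | s): softmax over linear scores theta . phi(s, a)."""
    scores = np.array([theta @ phi(s, a, n_states, n_actions)
                       for a in range(n_actions)])
    scores -= scores.max()            # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Even random parameters define a proper probability distribution over actions.
n_states, n_actions = 4, 2
theta = np.random.randn(n_states * n_actions)
print(softmax_policy(theta, s=0, n_states=n_states, n_actions=n_actions))
```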
Advantages and Disadvantages
| Advantages | Disadvantages |
| --- | --- |
| Better convergence properties | Typically converges to a local rather than a global optimum |
| Can learn stochastic policies (see the aliased example below) | Evaluating a policy is often inefficient and high-variance |
Stochastic policies are needed in aliased environments, i.e., when distinct states produce identical features $\phi(s,a)$ and the agent cannot tell them apart. Compare value-based and policy-based function approximation over those features:
Value-based: approximate the action-value from the features,
$$ Q_\theta(s,a)=f(\phi(s,a),\theta) $$
Policy-based: parameterize the policy from the same features,
$$ \pi_\theta(s,a)=g(\phi(s,a),\theta) $$
If the policy were deterministic, the agent would pick the same action in every aliased state and could get stuck at either A or B, whereas a stochastic policy keeps positive probability of moving in both directions, so it can eventually escape.
(Figures from the slides: the deterministic policy vs. the stochastic policy in the aliased gridworld.)
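A tiny numerical sketch of this point (my own illustration, with made-up features and parameters): because two aliased states share the same feature vector, any greedy policy over those features must choose the same single action in both, while a softmax policy keeps probability on both directions.

```python
import numpy as np

# Two aliased states share the same feature vector, so any policy that is a
# function of phi must act identically in both of them.
phi_aliased = np.array([1.0, 0.0])      # features seen in BOTH aliased states
theta = np.array([[0.3, -0.2],          # score weights for action "left"
                  [0.1,  0.5]])         # score weights for action "right"

scores = theta @ phi_aliased            # one score per action

# Deterministic (greedy) policy: the SAME single action in both aliased states,
# so the agent can get stuck on one side.
greedy_action = int(np.argmax(scores))

# Stochastic (softmax) policy: probability mass on both directions,
# so the agent can eventually move either way and escape.
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print("greedy action in both aliased states:", greedy_action)  # -> 0 ("left")
print("softmax probabilities [left, right]:", probs)           # -> ~[0.55, 0.45]
```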
Policy Objective Function