Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 8

Policy-Based Reinforcement Learning

Recall last lecture, where we learned to approximate the state-value function ($V$) or the state-action value function ($Q$) with parameters $w$ (or $\theta$), and then used that approximation to derive a (hopefully) optimal policy ($\pi$).

Today we’ll parameterize the policy $\pi_\theta$ itself with $\theta$:

$$ \pi_\theta(s,a)=P[a|s;\theta] $$

Our goal is to find a policy (i.e., parameters $\theta$) with maximal value function $V^{\pi_\theta}$.
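
As a minimal sketch (not from the lecture), one common way to realize $\pi_\theta(s,a)=P[a|s;\theta]$ is a softmax over linear feature scores. The feature map `phi` and the discrete action set below are assumptions for illustration only:

```python
import numpy as np

def softmax_policy(theta, phi, state, actions):
    """pi_theta(a|s): softmax over linear scores theta . phi(s, a).

    `phi(state, action)` is an assumed feature map returning a vector of the
    same length as `theta`; the lecture only defines pi_theta abstractly.
    """
    scores = np.array([theta @ phi(state, a) for a in actions])
    scores -= scores.max()            # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def sample_action(theta, phi, state, actions, rng):
    """Draw a ~ pi_theta(. | state)."""
    probs = softmax_policy(theta, phi, state, actions)
    return actions[rng.choice(len(actions), p=probs)]

# Toy usage (made-up features and actions, purely illustrative):
rng = np.random.default_rng(0)
phi = lambda s, a: np.array([s == 0, a == "right"], dtype=float)
theta = np.array([0.5, 1.0])
print(softmax_policy(theta, phi, state=0, actions=["left", "right"]))
print(sample_action(theta, phi, 0, ["left", "right"], rng))
```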

Benefits and Demerits

Advantages
- Better convergence properties
- Effective in high-dimensional or continuous action spaces
- Can learn stochastic policies (needed under state aliasing; see below)

Disadvantages
- Typically converges to a local rather than a global optimum
- Evaluating a policy is typically inefficient and high-variance

Stochastic policies are needed in aliased environments

(Figure: aliased gridworld example)

Under aliasing, compare value-based RL, which approximates a value function over features $\phi(s,a)$, with policy-based RL, which parameterizes the policy directly:

$$ Q_\theta(s,a)=f(\phi(s,a),\theta) $$

$$ \pi_\theta(s,a)=g(\phi(s,a),\theta) $$

If the policy were deterministic, the agent would get stuck at either A or B, whereas a stochastic policy keeps positive probability of moving in both directions, so it eventually escapes.

(Figures: the aliased gridworld under a deterministic policy vs. a stochastic policy)
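
A toy numerical sketch of this point (all numbers and the two-element feature map are made-up assumptions): because the grey cells are aliased they share the same features, so a greedy policy derived from $Q_\theta$ must pick the same action in both, while a softmax policy $\pi_\theta$ keeps probability on both directions:

```python
import numpy as np

# The aliased grey cells share the same features, so phi(s, a) is the same
# function of the action in both cells (assumed toy feature map).
phi = {"left": np.array([1.0, 0.0]), "right": np.array([0.0, 1.0])}
theta = np.array([0.3, 0.7])          # made-up parameters

# Value-based + greedy => deterministic: the same action in BOTH grey cells,
# so from one of them the agent walks into a dead end and gets stuck.
q = {a: float(theta @ f) for a, f in phi.items()}
greedy_action = max(q, key=q.get)     # 'right' in both aliased cells

# Policy-based + softmax => stochastic: both directions keep probability,
# so the agent eventually escapes whichever dead end it wanders toward.
scores = np.array([theta @ f for f in phi.values()])
probs = np.exp(scores - scores.max())
probs /= probs.sum()
pi = dict(zip(phi.keys(), probs))

print(greedy_action)                  # right
print(pi)                             # {'left': ~0.40, 'right': ~0.60}
```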

Policy Objective Function