Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 9
Continuing the discussion of policy gradient methods, recall that our goal is to converge as quickly as possible to a local optimum.
We want our policy update to be a monotonic improvement.
→ guarantees convergence (at least empirically)
→ we simply don’t want to get fired...
Recall from last time that we expressed the gradient of the value function as follows.
This estimator is unbiased but very noisy, so we reduce its variance in two ways. First, exploit the temporal structure of the problem, weighting each action only by the rewards that come after it (the reward-to-go):
$$ \nabla_\theta V(\theta)=\nabla_\theta E_\tau[R]=E_\tau\left[\sum_{t=0}^{T-1}\nabla_\theta \log\pi(a_t|s_t, \theta)\left(\sum_{t'=t}^{T-1}r_{t'}\right)\right] $$
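As an illustration, here is a minimal sketch (not the lecture's code) of this reward-to-go estimator for a tabular softmax policy. The toy environment, `sample_trajectory`, and all sizes are made-up assumptions; `pg_estimate` mirrors the double sum above.

```python
# Minimal sketch of the reward-to-go policy gradient estimator (REINFORCE-style).
# Tabular softmax policy; toy environment with made-up dynamics and rewards.
import numpy as np

n_states, n_actions, T = 4, 2, 5
rng = np.random.default_rng(0)
theta = np.zeros((n_states, n_actions))      # policy parameters

def pi(s, theta):
    """Softmax policy pi(a | s, theta)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(s, a, theta):
    """Gradient of log pi(a | s, theta) w.r.t. theta (tabular softmax)."""
    g = np.zeros_like(theta)
    g[s] = -pi(s, theta)
    g[s, a] += 1.0
    return g

def sample_trajectory(theta):
    """Toy environment (assumption): random transitions, reward 1 for action 0."""
    s, traj = 0, []
    for _ in range(T):
        a = rng.choice(n_actions, p=pi(s, theta))
        r = 1.0 if a == 0 else 0.0
        traj.append((s, a, r))
        s = rng.integers(n_states)
    return traj

def pg_estimate(traj, theta):
    """Single-trajectory estimate of grad V(theta) using reward-to-go."""
    rewards = np.array([r for (_, _, r) in traj])
    grad = np.zeros_like(theta)
    for t, (s, a, _) in enumerate(traj):
        G_t = rewards[t:].sum()               # sum_{t'=t}^{T-1} r_{t'}
        grad += grad_log_pi(s, a, theta) * G_t
    return grad

# Average the per-trajectory estimates to approximate the expectation over tau.
grad = np.mean([pg_estimate(sample_trajectory(theta), theta) for _ in range(100)], axis=0)
print(grad)
```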
Second, we introduce a baseline $b(s_t)$ to further reduce variance:
$$ \nabla_\theta E_\tau[R]=E_\tau\left[\sum_{t=0}^{T-1}\nabla_\theta \log\pi(a_t|s_t, \theta)\left(\sum_{t'=t}^{T-1}r_{t'}-b(s_t)\right)\right] $$
For any choice of $b$ that depends only on the state (not the action), the gradient estimator remains unbiased, and a good choice of $b$ gives lower variance.
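A quick sanity check of the unbiasedness claim (a standard derivation, not spelled out in the notes): for any state-dependent baseline, the extra term has zero expectation, since

$$ E_{a_t\sim\pi(\cdot|s_t,\theta)}\big[\nabla_\theta \log\pi(a_t|s_t,\theta)\,b(s_t)\big]=b(s_t)\sum_{a}\nabla_\theta\pi(a|s_t,\theta)=b(s_t)\,\nabla_\theta\sum_{a}\pi(a|s_t,\theta)=b(s_t)\,\nabla_\theta 1=0 $$

A common and near-optimal choice is an estimate of the expected return from state $s_t$, i.e., $b(s_t)\approx V^\pi(s_t)$.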