Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 10
continuing our discussion of Updating Parameters Given the Gradient.
we couldn’t compute the equation above because we have no idea what $\tilde{\pi}$ is yet. So, as an approximation, we replace that term with the one induced by the previous policy.
we take policy $\pi^i$, run it to collect $D$ trajectories, use them to estimate the state distribution $\mu$, and then use that estimate in the computation for $\pi^{i+1}$.
we substitute $\mu_\pi(s)$ in place of $\mu_{\tilde{\pi}}(s)$ purely to make the computation possible.
so we “just say” this is an objective function, something that can be optimized.
if you evaluate this function under the same policy, you just get that policy's value back: $L_\pi(\pi) = V^\pi$.
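as a concrete tabular sketch (the function name `surrogate_objective` and the array shapes are illustrative, not from the lecture), this is what the approximation computes once $\mu_\pi$ and $A_\pi$ have been estimated from the $D$ trajectories collected under $\pi$:

```python
import numpy as np

def surrogate_objective(v_pi, mu_pi, pi_tilde, adv_pi):
    """Approximate objective L_pi(pi_tilde) = V^pi + sum_s mu_pi(s) sum_a pi_tilde(a|s) A_pi(s,a).

    v_pi     : scalar estimate of V^pi, the value of the current policy
    mu_pi    : (S,) state visitation distribution estimated from trajectories run under pi
    pi_tilde : (S, A) candidate policy, pi_tilde[s, a] = pi_tilde(a|s)
    adv_pi   : (S, A) advantage estimates A_pi(s, a) computed under the current policy
    """
    return v_pi + np.sum(mu_pi[:, None] * pi_tilde * adv_pi)
```

evaluating it at the current policy itself, `surrogate_objective(v_pi, mu_pi, pi_old, adv_pi)`, returns (up to estimation error) just `v_pi`, since $\sum_a \pi(a|s)A_\pi(s,a)=0$, which is the "same value" observation above.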
Conservative Policy Iteration
we define a new policy as a blend of the current policy and some other policy: $\pi_{new}(a|s) = (1-\alpha)\,\pi_{old}(a|s) + \alpha\,\pi'(a|s)$.
again, if $\alpha = 0$, $\pi_{new}=\pi_{old}$ and so $V^{\pi_{new}}=L_{\pi_{old}}(\pi_{new})=L_{\pi_{old}}(\pi_{old})=V^{\pi_{old}}$
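a minimal sketch of that blended update (the name `cpi_mixture` and the toy two-state, two-action policies are my own):

```python
import numpy as np

def cpi_mixture(pi_old, pi_prime, alpha):
    """Conservative policy iteration blend:
    pi_new(a|s) = (1 - alpha) * pi_old(a|s) + alpha * pi_prime(a|s)."""
    return (1.0 - alpha) * pi_old + alpha * pi_prime

pi_old = np.array([[0.7, 0.3],
                   [0.2, 0.8]])
pi_prime = np.array([[0.1, 0.9],
                     [0.5, 0.5]])

# with alpha = 0 the blend is exactly the old policy, so V^{pi_new} = V^{pi_old}
assert np.allclose(cpi_mixture(pi_old, pi_prime, 0.0), pi_old)
```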
for any pair of stochastic policies $\pi$ and $\tilde{\pi}$, define the surrogate objective
$$ L_{\pi}(\tilde{\pi})=V(\theta)+\sum_s \mu_\pi(s)\sum_a\tilde{\pi}(a|s)A_\pi(s,a) $$
then you can get a bound on the performance of $\tilde{\pi}$:
$$ V^{\tilde{\pi}} \ge L_{\pi}(\tilde{\pi}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\big(D^{max}_{TV}(\pi,\tilde{\pi})\big)^2, \qquad \epsilon=\max_{s,a}\left|A_\pi(s,a)\right| $$
$D^{max}_{TV}$ denotes the maximum over states of the total variation distance between the action distributions of the two policies: $D^{max}_{TV}(\pi,\tilde{\pi}) = \max_s D_{TV}\big(\pi(\cdot|s), \tilde{\pi}(\cdot|s)\big)$.
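putting the pieces together, here is a sketch of the right-hand side of the bound (the helper name `performance_lower_bound` is illustrative; `L_value` would be the surrogate estimate from above):

```python
import numpy as np

def performance_lower_bound(L_value, adv_pi, pi_old, pi_new, gamma):
    """Right-hand side of the theorem:
    V^{pi_new} >= L_{pi_old}(pi_new) - 4*eps*gamma / (1-gamma)^2 * (D_TV^max)^2,
    with eps = max_{s,a} |A_{pi_old}(s,a)| and
    D_TV^max = max_s 0.5 * sum_a |pi_old(a|s) - pi_new(a|s)|."""
    eps = np.max(np.abs(adv_pi))
    d_tv_max = np.max(0.5 * np.sum(np.abs(pi_old - pi_new), axis=1))
    penalty = 4.0 * eps * gamma * d_tv_max ** 2 / (1.0 - gamma) ** 2
    return L_value - penalty
```

the larger the per-state shift between the two policies, the bigger the penalty subtracted from the surrogate, which is what motivates keeping each update close to the previous policy.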
This theorem implies that