Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 10

Continuing our discussion of Updating Parameters Given the Gradient.

Local Approximation

We couldn’t calculate the equation above because we have no idea what $\tilde{\pi}$ is. So, as an approximation, we replace the state distribution term with the one induced by the previous policy.

$$ L_{\pi}(\tilde{\pi})=V(\theta)+\sum_s \mu_\pi(s)\sum_a\tilde{\pi}(a|s)A_\pi(s,a) $$

We take policy $\pi^i$, run it out to collect a set of trajectories $D$, use them to estimate the state distribution $\mu_{\pi^i}$, and then use that to compute the objective for the next policy $\pi^{i+1}$.

The only change is that, purely to make the computation possible, $\mu_\pi(s)$ is plugged in where $\mu_{\tilde{\pi}}(s)$ would be.

So we “just say” that this is an objective function, something that can be optimized.

If you evaluate the approximation at the current policy itself ($\tilde{\pi}=\pi$), you just get back the true value, since $\sum_a \pi(a|s)A_\pi(s,a)=0$ and hence $L_\pi(\pi)=V^\pi$.
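To make the sampling procedure concrete, here is a minimal sketch of estimating $L_{\pi}(\tilde{\pi})$ from states visited under the current policy. The function name, the advantage estimator, and the policy interfaces are assumptions for illustration, not code from the lecture.

```python
import numpy as np

def estimate_local_objective(states, pi_tilde_probs, advantage, v_pi, n_actions):
    """Monte Carlo sketch of L_pi(pi_tilde).

    states         : states visited while running the current policy pi
                     (their empirical distribution stands in for mu_pi)
    pi_tilde_probs : callable, pi_tilde_probs(s) -> array of action probabilities under pi_tilde
    advantage      : callable, advantage(s, a) -> estimate of A_pi(s, a)
    v_pi           : scalar estimate of V^pi
    n_actions      : size of the (discrete) action space
    """
    total = 0.0
    for s in states:
        probs = np.asarray(pi_tilde_probs(s))                       # tilde{pi}(a|s)
        adv = np.array([advantage(s, a) for a in range(n_actions)])
        total += float(np.dot(probs, adv))                          # sum_a tilde{pi}(a|s) A_pi(s, a)
    # Averaging over visited states approximates the weighting by mu_pi(s).
    return v_pi + total / len(states)
```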

Conservative Policy Iteration

We form the new policy as a blend of the current policy and some other policy.

$$ \pi_{new}(a|s) = (1-\alpha)\,\pi_{old}(a|s) + \alpha\,\pi'(a|s) $$

Again, if $\alpha = 0$, then $\pi_{new}=\pi_{old}$, and so $V^{\pi_{new}}=L_{\pi_{old}}(\pi_{new})=L_{\pi_{old}}(\pi_{old})=V^{\pi_{old}}$.
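As a sketch of the blending step, assuming discrete actions and that both policies expose an action-probability vector per state (names are illustrative):

```python
import numpy as np

def mixture_policy(pi_old_probs, pi_prime_probs, alpha):
    """Return pi_new with pi_new(a|s) = (1 - alpha) * pi_old(a|s) + alpha * pi_prime(a|s)."""
    def pi_new_probs(s):
        return (1.0 - alpha) * np.asarray(pi_old_probs(s)) + alpha * np.asarray(pi_prime_probs(s))
    return pi_new_probs

# alpha = 0 recovers the old policy exactly, consistent with
# V^{pi_new} = L_{pi_old}(pi_new) = V^{pi_old} in that case.
```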

For any pair of stochastic policies (not just mixtures), you can get a bound on the performance. With the local approximation defined as before,

$$ L_{\pi}(\tilde{\pi})=V(\theta)+\sum_s \mu_\pi(s)\sum_a\tilde{\pi}(a|s)A_\pi(s,a) $$

the following lower bound holds:

$$ V^{\tilde{\pi}} \geq L_{\pi}(\tilde{\pi}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\big(D^{max}_{TV}(\pi,\tilde{\pi})\big)^2, \qquad \epsilon = \max_{s,a}|A_\pi(s,a)| $$

$D^{max}_{TV}$ denotes the maximum, over states, of the total variation distance between the action distributions of the two policies.
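A small sketch of how this bound could be evaluated numerically, assuming the lower bound stated above and approximating $D^{max}_{TV}$ by a max over sampled states (all names are made up for illustration):

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete action distributions."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def performance_lower_bound(states, pi_probs, pi_tilde_probs, L_pi_tilde, eps, gamma):
    """Evaluate L_pi(pi_tilde) - 4*eps*gamma / (1 - gamma)**2 * (D_TV^max)**2,
    where eps = max_{s,a} |A_pi(s, a)| and D_TV^max is approximated over sampled states.
    """
    d_tv_max = max(tv_distance(pi_probs(s), pi_tilde_probs(s)) for s in states)
    return L_pi_tilde - 4.0 * eps * gamma / (1.0 - gamma) ** 2 * d_tv_max ** 2
```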

This theorem implies that