Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 3
Recap: Dynamic Programming for policy evaluation in an MDP
Initialize $V_0 (s)=0$ for all s
for k = 1 until convergence
for all $s$ in $S$
$V^\pi_k(s)=r(s,\pi(s)) + \gamma \sum_{s' \in S} P(s'|s,\pi(s))V^\pi_{k-1}(s')$
and we iterate until it converges → $||V^\pi_k-V^\pi_{k-1}||<\epsilon$
if k is finite
→ $V^\pi_k(s)$ is the exact k-horizon value of state $s$ under policy $\pi$
as $k \to \infty$
→ $V^\pi_k(s)$ is an estimate of the infinite-horizon value of state $s$ under policy $\pi$
→ $V^\pi_k(s) \leftarrow E_\pi [r_t+\gamma V_{k-1}(s_{t+1}) \mid s_t=s]$
the next state reached after an action is handled as an expectation over the dynamics, given the state-action pair.
“when we know the model, we can compute the immediate reward and the exact expected sum of future rewards. Then, instead of expanding $V_{k-1}$ as a sum of rewards, we can bootstrap and use the current estimate $V_{k-1}$”
*Bootstrapping: using an estimate to update an estimate of the same kind (i.e., updating a prediction with another prediction)
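A minimal NumPy sketch of this iterative (dynamic programming) policy evaluation for a known tabular MDP; the array names `P`, `R`, `pi` and the deterministic-policy assumption are illustrative, not from the lecture:

```python
import numpy as np

def dp_policy_evaluation(P, R, pi, gamma=0.9, eps=1e-6):
    """Iterative policy evaluation for a known tabular MDP.

    P[s, a, s']: transition probability, R[s, a]: expected immediate reward,
    pi[s]: action taken by the (deterministic) policy in state s.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)                    # V_0(s) = 0 for all s
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            a = pi[s]
            # Bellman backup: bootstrap on the previous estimate V_{k-1}
            V_new[s] = R[s, a] + gamma * P[s, a] @ V
        if np.max(np.abs(V_new - V)) < eps:   # ||V_k - V_{k-1}|| < eps
            return V_new
        V = V_new
```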
Monte Carlo policy evaluation: when we do not know the reward & dynamics model
→ we sample trajectories by following the policy and average the returns we observe
no bootstrapping
can only be applied to episodic MDPs
→ averaging over returns from complete episodes
→ requires episode to terminate
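A minimal sketch of first-visit Monte Carlo policy evaluation under these assumptions: episodes are given as lists of (state, reward) pairs collected by rolling out the policy until termination, and the function and variable names are illustrative.

```python
from collections import defaultdict

def first_visit_mc_evaluation(episodes, gamma=0.9):
    """First-visit Monte Carlo policy evaluation (model-free, no bootstrapping).

    episodes: complete episodes collected by rolling out the policy, each a
    list of (state, reward) pairs ending when the episode terminates.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Index of the first visit to each state in this episode
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk backwards so G is the discounted return from step t onward
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:
                returns_sum[s] += G
                returns_count[s] += 1
    # V(s) = average of the returns observed from first visits to s
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```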
<Evaluation Metric>
Consider a statistic $\hat{\theta}$ that provides an estimate of $\theta$ and is a function of observed data $x$ → $\hat{\theta}=f(x)$
| | Definition |
|---|---|
| Bias | $\text{Bias}_\theta(\hat{\theta}) = E_{x \mid \theta}[\hat{\theta}] - \theta$ |
| Variance | $\text{Var}_\theta(\hat{\theta}) = E_{x \mid \theta}\big[(\hat{\theta} - E_{x \mid \theta}[\hat{\theta}])^2\big]$ |
| MSE | $\text{MSE}(\hat{\theta}) = \text{Var}_\theta(\hat{\theta}) + \text{Bias}_\theta(\hat{\theta})^2$ |
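A quick numerical check of the decomposition $\text{MSE} = \text{Var} + \text{Bias}^2$, using the sample mean of Gaussian draws as the estimator $\hat{\theta}$ (a toy example, not from the lecture; the true mean, noise scale, and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.0                       # true parameter (mean of the distribution)
n_samples, n_trials = 10, 100_000

# Each trial: draw n_samples observations and compute the sample-mean estimator
estimates = rng.normal(loc=theta, scale=2.0, size=(n_trials, n_samples)).mean(axis=1)

bias = estimates.mean() - theta
variance = estimates.var()
mse = np.mean((estimates - theta) ** 2)

# MSE should match Var + Bias^2 up to Monte Carlo noise
print(f"bias={bias:.4f}  var={variance:.4f}  mse={mse:.4f}  var+bias^2={variance + bias**2:.4f}")
```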