Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 3

Recap: MDP policy evaluation with Dynamic Programming

Dynamic Programming

Initialize $V^\pi_0(s)=0$ for all $s$

for k = 1 until convergence

for all $s$ in $S$

$V^\pi_k(s)=r(s,\pi(s)) + \gamma \sum_{s' \in S} P(s'|s,\pi(s))V^\pi_{k-1}(s')$

and we iterate until convergence → $||V^\pi_k - V^\pi_{k-1}||_\infty < \epsilon$
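
As a concrete reference, here is a minimal sketch of this backup in Python, assuming a tabular model given as arrays `P[s, a, s']` and `R[s, a]` and a deterministic policy `policy[s]`; the function name and array layout are illustrative, not from the lecture.

```python
import numpy as np

def dp_policy_evaluation(P, R, policy, gamma=0.9, eps=1e-6):
    """Iterative policy evaluation with a known model (tabular).

    P[s, a, s'] : probability of reaching s' from s under action a
    R[s, a]     : expected immediate reward for action a in state s
    policy[s]   : deterministic action chosen by pi in state s
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)              # V_0(s) = 0 for all s
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            a = policy[s]
            # Bellman expectation backup for the fixed policy pi
            V_new[s] = R[s, a] + gamma * P[s, a] @ V
        if np.max(np.abs(V_new - V)) < eps:   # ||V_k - V_{k-1}|| < eps
            return V_new
        V = V_new
```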


Model-Free Policy Evaluation

we do not know the reward model or the dynamics model, so we must estimate $V^\pi$ from sampled experience under the policy

Monte Carlo (MC) Policy Evaluation


→ we consider the trajectories we can sample by following the policy and average the returns observed from them to estimate $V^\pi(s)$
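
A first-visit Monte Carlo version of this averaging can be sketched as follows; `sample_episode` is a hypothetical helper that rolls out one episode under $\pi$ and returns its (state, reward) pairs.

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, gamma=0.9, n_episodes=1000):
    """First-visit Monte Carlo policy evaluation.

    sample_episode() : hypothetical helper returning [(s_0, r_1), (s_1, r_2), ...]
                       generated by following the policy pi.
    """
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(n_episodes):
        episode = sample_episode()
        G = 0.0
        G_from = {}
        # walk backwards so G accumulates the discounted return from each step;
        # overwriting G_from[s] leaves the return from the FIRST visit of s
        for s, r in reversed(episode):
            G = r + gamma * G
            G_from[s] = G
        for s, G_s in G_from.items():
            returns_sum[s] += G_s
            returns_cnt[s] += 1
    # average the sampled returns per state
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```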

<Evaluation Metrics>

Consider a statistic $\hat{\theta}$ that provides an estimate of $\theta$ and is a function of observed data $x$ → $\hat{\theta}=f(x)$

Definition
Bias $Bias_\theta(\hat{\theta}) = E_{x|\theta}[\hat{\theta}] - \theta$
Variance $Var(\hat{\theta}) = E_{x|\theta}\big[(\hat{\theta} - E[\hat{\theta}])^2\big]$
MSE $MSE(\hat{\theta})=Var(\hat{\theta})+Bias_\theta(\hat{\theta})^2$
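
For completeness, the MSE identity follows by expanding the squared error around $E_{x|\theta}[\hat{\theta}]$ (a standard derivation):

$$
\begin{aligned}
MSE(\hat{\theta}) &= E_{x|\theta}\big[(\hat{\theta}-\theta)^2\big] \\
&= E_{x|\theta}\big[(\hat{\theta}-E[\hat{\theta}]+E[\hat{\theta}]-\theta)^2\big] \\
&= E_{x|\theta}\big[(\hat{\theta}-E[\hat{\theta}])^2\big] + 2\big(E[\hat{\theta}]-\theta\big)\underbrace{E_{x|\theta}\big[\hat{\theta}-E[\hat{\theta}]\big]}_{=0} + \big(E[\hat{\theta}]-\theta\big)^2 \\
&= Var(\hat{\theta}) + Bias_\theta(\hat{\theta})^2
\end{aligned}
$$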