Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 3
Recap: Dynamic Programming for policy evaluation in an MDP
Initialize $V_0 (s)=0$ for all s
for k = 1 until convergence
for all $s$ in $S$
$V^\pi_k(s)=r(s,\pi(s)) + \gamma \sum_{s' \in S} P(s'|s,\pi(s))V^\pi_{k-1}(s')$
and we iterate until it converges → $||V^\pi_k-V^\pi_{k-1}||<\epsilon$
if k is finite
→ $V^\pi_k(s)$ is the exact k-horizon value of state $s$ under policy $\pi$
as $k \to \infty$
→ $V^\pi_k(s)$ is an estimate of the infinite-horizon value of state $s$ under policy $\pi$
→ $V^\pi_k(s) \leftarrow E_\pi [r_t+\gamma V_{k-1}(s_{t+1}) \mid s_t=s]$
the next state reached after an action is handled as an expectation over the dynamics, given the state-action pair.
“when we know the model, we can compute the immediate reward and the exact expected sum of future rewards. Then, instead of expanding $V_{k-1}$ as a sum of rewards, we can bootstrap and use the current estimate $V_{k-1}$”
*Bootstrapping: using an estimate to update an estimate of the same kind (i.e., updating a prediction with another prediction)
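A minimal NumPy sketch of this iterative (dynamic programming) policy evaluation for a known tabular MDP; the array names `P`, `R`, `pi` and the deterministic-policy assumption are illustrative, not from the lecture:

```python
import numpy as np

def dp_policy_evaluation(P, R, pi, gamma=0.9, eps=1e-6):
    """Iterative policy evaluation for a known tabular MDP.

    P[s, a, s']: transition probability, R[s, a]: expected immediate reward,
    pi[s]: action taken by the (deterministic) policy in state s.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)                    # V_0(s) = 0 for all s
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            a = pi[s]
            # Bellman backup: bootstrap on the previous estimate V_{k-1}
            V_new[s] = R[s, a] + gamma * P[s, a] @ V
        if np.max(np.abs(V_new - V)) < eps:   # ||V_k - V_{k-1}|| < eps
            return V_new
        V = V_new
```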
Monte Carlo policy evaluation: when we do not know the reward & dynamics model
→ we sample trajectories by following the policy and average the returns we observe
no bootstrapping
can only be applied to episodic MDPs
→ averaging over returns from complete episodes
→ requires episode to terminate
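A minimal sketch of first-visit Monte Carlo policy evaluation under these assumptions: episodes are given as lists of (state, reward) pairs collected by rolling out the policy until termination, and the function and variable names are illustrative.

```python
from collections import defaultdict

def first_visit_mc_evaluation(episodes, gamma=0.9):
    """First-visit Monte Carlo policy evaluation (model-free, no bootstrapping).

    episodes: complete episodes collected by rolling out the policy, each a
    list of (state, reward) pairs ending when the episode terminates.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Index of the first visit to each state in this episode
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk backwards so G is the discounted return from step t onward
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:
                returns_sum[s] += G
                returns_count[s] += 1
    # V(s) = average of the returns observed from first visits to s
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```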
<Evaluation Metric>
Consider a statistic $\hat{\theta}$ that provides an estimate of $\theta$ and is a function of observed data $x$ → $\hat{\theta}=f(x)$
| | Definition |
|---|---|
| Bias | $\text{Bias}_\theta(\hat{\theta}) = E_{x \mid \theta}[\hat{\theta}] - \theta$ |
| Variance | $\text{Var}_\theta(\hat{\theta}) = E_{x \mid \theta}\big[(\hat{\theta} - E_{x \mid \theta}[\hat{\theta}])^2\big]$ |
| MSE | $\text{MSE}(\hat{\theta}) = \text{Var}_\theta(\hat{\theta}) + \text{Bias}_\theta(\hat{\theta})^2$ |
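A quick numerical check of the decomposition $\text{MSE} = \text{Var} + \text{Bias}^2$, using the sample mean of Gaussian draws as the estimator $\hat{\theta}$ (a toy example, not from the lecture; the true mean, noise scale, and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.0                       # true parameter (mean of the distribution)
n_samples, n_trials = 10, 100_000

# Each trial: draw n_samples observations and compute the sample-mean estimator
estimates = rng.normal(loc=theta, scale=2.0, size=(n_trials, n_samples)).mean(axis=1)

bias = estimates.mean() - theta
variance = estimates.var()
mse = np.mean((estimates - theta) ** 2)

# MSE should match Var + Bias^2 up to Monte Carlo noise
print(f"bias={bias:.4f}  var={variance:.4f}  mse={mse:.4f}  var+bias^2={variance + bias**2:.4f}")
```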