Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 2

Given the model of the world


Markov Property → a stochastic process evolving over time (whether or not I invest in stocks, the stock market changes)

Markov Chain

Let $S$ be set of states ($s \in S$) and $P$ a transition model that specifies $P(s_{t+1}=s'|s_t=s)$
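As a sketch of this definition, the Markov property means a trajectory can be sampled by looking only at the current state. The 3-state transition model below is a made-up example, not the one from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state transition model (illustration only):
# row i gives P(s_{t+1} = s' | s_t = i).
P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.1, 0.0, 0.9],
])

def sample_chain(P, s0, T, rng):
    """Sample a length-T trajectory; the next state depends
    only on the current state (Markov property)."""
    states = [s0]
    for _ in range(T):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

traj = sample_chain(P, s0=0, T=10, rng=rng)
print(traj)
```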

For a finite number ($N$) of states, we get an $N \times N$ transition matrix $P$

(figure: transition matrix $P$ for the example below)

Example discussed last section (we omit rewards and actions for easier understanding)

(figure: example Markov chain over states $s_1, \dots, s_7$)

At state $s_1$ we have a 0.4 chance of transitioning to $s_2$ ($P(s_2|s_1) = 0.4$) and a 0.6 probability of staying in $s_1$ ($P(s_1|s_1) = 0.6$). These probabilities form the matrix $P$ shown above.

Let’s say we start at $s_1$. We can compute the distribution over the agent’s next state as the product of the row vector $[1,0,0,0,0,0,0]$ with $P$ above. As a result we get $[0.6,0.4,0,0,0,0,0]$.
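This one-step update can be checked numerically. Only the first row of the matrix below comes from the example above; the remaining rows are made-up values for illustration:

```python
import numpy as np

# 7-state transition matrix. The first row matches the example
# (P(s1|s1)=0.6, P(s2|s1)=0.4); the other rows are assumed values.
P = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.2, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.2, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.2, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.2, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.2, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.6],
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a distribution

mu0 = np.array([1, 0, 0, 0, 0, 0, 0])  # start deterministically in s1
mu1 = mu0 @ P                           # distribution after one step
print(mu1)                              # first row of P: 0.6, 0.4, 0, ...

# Distribution after k steps is mu0 @ P^k.
mu3 = mu0 @ np.linalg.matrix_power(P, 3)
print(mu3)
```

Multiplying repeatedly by $P$ propagates the state distribution forward in time, which is the basis for the value computations in the MRP section below.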


Markov Reward Process (MRP)

For a finite number ($N$) of states:

$S$ : set of states ($s \in S$)