Outline
- 0 Review
- 1 Different Variations of MDPs
- 1.1 State-Action Rewards
- 1.2 Finite Horizon MDPs
- 2 Linear Dynamical Systems
- 2.1 Linear Quadratic Regulation
- 2.2 Riccati Equation
0 Review
An MDP is a five-tuple $(S, A, \{P_{sa}\}, \gamma, R)$: states, actions, state-transition probabilities, discount factor, and reward function.
Value iteration converges to $V^\star$ asymptotically:
$$
V(s) := R(s) + \gamma \max_a \sum_{s^\prime} P_{sa}(s^\prime) V(s^\prime)
$$
MDPs with infinite, continuous state spaces:
Value iteration cannot be used directly, because with a continuous-state MDP the value function has to be approximated (e.g. by fitted value iteration).
In fact, the value function can sometimes be represented exactly even when you have an infinite, continuous state space:
there are special classes of infinite-state MDPs that make the formulation easier.
1.1 State-Action Rewards
sequence: $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} \cdots$
total payoff: $R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots$
It turns out that, given an MDP with state-action rewards, you can reduce it back to an MDP whose rewards are a function of the state alone by augmenting the definition of the state.
State-action rewards allow you to model more directly problems in which different actions have different costs.
e.g.
- robot recharging
- driving: compared with a paved road, rocky roads and grass cost more
value iteration: $V(s) := \max_a \left[ R(s, a) + \gamma \sum_{s^\prime} P_{sa}(s^\prime) V(s^\prime) \right]$ — the $\max_a$ now also covers the reward term.
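A minimal sketch of value iteration with state-action rewards, assuming a discount factor; the two-state, two-action MDP below is a made-up toy example, not from the lecture:

```python
import numpy as np

# Toy MDP with state-action rewards R(s, a); all numbers are invented.
n_states, n_actions = 2, 2
gamma = 0.9

# P[a][s, s'] = P_sa(s'); each row sums to 1.
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
# R[s, a]: different actions carry different costs in the same state.
R = np.array([[1.0, 0.5],
              [0.0, -0.5]])

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman backup: V(s) := max_a [ R(s, a) + gamma * sum_s' P_sa(s') V(s') ]
    Q = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:  # stop once numerically converged
        V = V_new
        break
    V = V_new

pi = Q.argmax(axis=1)  # greedy policy w.r.t. the converged values
print(V, pi)
```

The only change from ordinary value iteration is that the $\max_a$ also covers the reward term, since the reward now depends on the action.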
1.2 Finite Horizon MDPs
comprises a five-tuple $(S, A, \{P_{sa}\}, T, R)$, with the discount factor $\gamma$ replaced by a horizon $T$
T is the horizon time; concretely, it means we take actions in the MDP for only a total of T time steps
sequence: $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} \cdots \xrightarrow{a_{T-1}} s_T$
total payoff: $R(s_0, a_0) + R(s_1, a_1) + \cdots + R(s_T, a_T)$ (no discount factor is needed over a finite horizon)
The world only exists for T time steps.
The optimal policy is non-stationary, which means it depends on time: roughly, the optimal action to take will be different at different time steps.
e.g.
- a game where moving left scores one point and moving right scores ten, but the right side is about to disappear
The state transition probabilities can also be a function of time.
e.g.
- aircraft: burning fuel changes the mass, so the dynamics change over time
- traffic forecasting
This is the algorithm (dynamic programming, working backward from $t = T$):
$$
\text{define:} \quad V_t^\star(s) = E\left[\, R^{(t)}(s_t, a_t) + R^{(t+1)}(s_{t+1}, a_{t+1}) + \cdots + R^{(T)}(s_T, a_T) \mid \Pi^\star, s_t = s \,\right] \\
V_T^\star(s) = \max_a R^{(T)}(s, a) \\
V_t^\star(s) = \max_a \Big[ \underbrace{R^{(t)}(s, a)}_{\text{immediate reward}} + \underbrace{\sum_{s^\prime} P_{sa}^{(t)}(s^\prime)\, V_{t+1}^\star(s^\prime)}_{\text{expected future payoff}} \Big] \\
\Pi_t^\star(s) = \arg \max_a \Big[ R^{(t)}(s, a) + \sum_{s^\prime} P_{sa}^{(t)}(s^\prime)\, V_{t+1}^\star(s^\prime) \Big]
$$
Recurse once per time step, from $t = T$ down to $t = 0$, and you have the exact value function, rather than converging to it asymptotically.
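A minimal sketch of this backward recursion; the rewards and transitions are random made-up values, and both are allowed to depend on $t$ (covering the time-varying cases above):

```python
import numpy as np

n_states, n_actions, T = 2, 2, 5
rng = np.random.default_rng(0)

# Time-varying rewards R[t, s, a] and transitions P[t, a, s, s'] (rows sum to 1).
R = rng.normal(size=(T + 1, n_states, n_actions))
P = rng.random(size=(T + 1, n_actions, n_states, n_states))
P /= P.sum(axis=-1, keepdims=True)

V = np.zeros((T + 1, n_states))
pi = np.zeros((T + 1, n_states), dtype=int)

# Base case: V_T(s) = max_a R_T(s, a)
V[T] = R[T].max(axis=1)
pi[T] = R[T].argmax(axis=1)

# Backward recursion: V_t(s) = max_a [ R_t(s, a) + sum_s' P_t,sa(s') V_{t+1}(s') ]
for t in range(T - 1, -1, -1):
    Q = R[t] + np.stack([P[t, a] @ V[t + 1] for a in range(n_actions)], axis=1)
    V[t] = Q.max(axis=1)
    pi[t] = Q.argmax(axis=1)  # the policy is non-stationary: it depends on t
```

One sweep from $t = T$ down to $t = 0$ gives the exact value function; no asymptotic convergence is involved.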
2.1 Linear Quadratic Regulation (LQR)
Apply dynamic programming for finite-horizon MDPs to
continuous state/action spaces $S = \mathbb{R}^n$, $A = \mathbb{R}^d$, with quadratic reward $R^{(t)}(s_t, a_t) = -s_t^T U_t s_t - a_t^T V_t a_t$, where $U_t$ and $V_t$ are positive semidefinite (so the reward is always non-positive).
e.g.
- helicopter: we want the state to stay close to a desired one, and the quadratic cost penalizes deviations
At state $s_t$, choose action $a_t$; then $s_{t+1} = A s_t + B a_t$, with $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times d}$.
general case: $s_{t+1} = A_t s_t + B_t a_t + w_t$, with process noise $w_t \sim \mathcal{N}(0, \Sigma_w)$ and possibly time-varying $A_t$, $B_t$
so, $A$ and $B$ can be estimated from observed trajectories by linear regression: $\min_{A,B} \sum_t \left\| s_{t+1} - \left( A s_t + B a_t \right) \right\|^2$
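A minimal sketch of this least-squares fit; the trajectory data below is synthetic (generated from a known $A$, $B$ plus a little noise), and all dimensions are made up:

```python
import numpy as np

# Estimate A, B from observed transitions by linear regression, i.e.
# minimize sum_t || s_{t+1} - (A s_t + B a_t) ||^2. The data is synthetic.
rng = np.random.default_rng(0)
n, d, m = 3, 2, 500  # state dim, action dim, number of observed transitions

A_true = 0.3 * rng.normal(size=(n, n))
B_true = 0.3 * rng.normal(size=(n, d))
S = rng.normal(size=(m, n))       # states s_t
U = rng.normal(size=(m, d))       # actions a_t
S_next = S @ A_true.T + U @ B_true.T + 0.01 * rng.normal(size=(m, n))

# Stack [s_t; a_t] and solve a single least-squares problem for [A B].
X = np.hstack([S, U])                      # shape (m, n + d)
Theta, *_ = np.linalg.lstsq(X, S_next, rcond=None)
A_hat, B_hat = Theta[:n].T, Theta[n:].T    # recover A (n x n) and B (n x d)

print(np.max(np.abs(A_hat - A_true)))      # small: the fit recovers the dynamics
```

With enough transitions the estimate converges to the true dynamics; in practice the data would come from flying the real system or a simulator.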
Alternatively, linearize a non-linear model $s_{t+1} = f(s_t, a_t)$:
e.g.
- inverted pendulum
cartoon: omitted
Take the derivative of the function $f$ at a point $(\bar{s}_t, \bar{a}_t)$ that the system spends much of its time near, under the assumption that the state stays in the vicinity of that point.
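The linearization above is a first-order Taylor expansion; writing the non-linear model as $s_{t+1} = f(s_t, a_t)$ and expanding around $(\bar{s}_t, \bar{a}_t)$:

$$
s_{t+1} \approx f(\bar{s}_t, \bar{a}_t) + \nabla_s f(\bar{s}_t, \bar{a}_t)^T (s_t - \bar{s}_t) + \nabla_a f(\bar{s}_t, \bar{a}_t)^T (a_t - \bar{a}_t)
$$

The right-hand side is linear in $s_t$ and $a_t$, so collecting terms gives the matrices $A$ and $B$ (plus a constant offset, which can be absorbed by augmenting the state).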
Derivation of the computation
Compute $V^\star$ recursively.
Suppose that
$$
V_{t+1}^\star(s_{t+1}) = s_{t+1}^T \Phi_{t+1} s_{t+1} + \Psi_{t+1},
$$
in which $\Phi_{t+1} \in \mathbb{R}^{n \times n}$ and $\Psi_{t+1} \in \mathbb{R}$.
One can show that, for some $\Phi_t$ and $\Psi_t$, $V_t^\star$ has the same quadratic form. In the
dynamic programming step below, the expression inside the $\max_{a_t}$
simplifies to a quadratic function of $a_t$, which can be maximized in closed form:
$$
V_t^\star(s_t) = \max_{a_t} \left[ - s_t^T U_t s_t - a_t^T V_t a_t + E_{\, s_{t+1} \sim \underbrace{\mathcal{N}(A_t s_t + B_t a_t,\ \Sigma_w)}_{\text{gaussian distribution } P_{s_t a_t}}} \underbrace{\left[ s_{t+1}^T \Phi_{t+1} s_{t+1} + \Psi_{t+1} \right]}_{V_{t+1}^\star(s_{t+1})} \right]
$$
(Here $V_t$ without a star denotes the action-cost matrix from the reward, not the value function.)
Stated without proof (it follows by setting the derivative with respect to $a_t$ to zero), the maximizing action is
$$
a_t = \underbrace{\left( V_t - B_t^T \Phi_{t+1} B_t \right)^{-1} B_t^T \Phi_{t+1} A_t}_{L_t} \, s_t
$$
So the optimal policy $\Pi_t^\star(s_t) = L_t s_t$ is a linear function of the state. That is exact, not an approximation.
2.2 Riccati Equation
The discrete-time Riccati equation propagates $\Phi_t$ and $\Psi_t$ backward in time:
$$
\Phi_t = A_t^T \left( \Phi_{t+1} + \Phi_{t+1} B_t \left( V_t - B_t^T \Phi_{t+1} B_t \right)^{-1} B_t^T \Phi_{t+1} \right) A_t - U_t, \qquad \Psi_t = \Psi_{t+1} + \mathrm{tr}\left( \Sigma_w \Phi_{t+1} \right)
$$
with base case $\Phi_T = -U_T$ and $\Psi_T = 0$ (at the last step the best action is $a_T = 0$). Note that $\Phi_t$ does not depend on $\Psi_t$, so if only the optimal policy is needed the $\Psi$ recursion can be skipped.
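The backward pass can be sketched as follows, using the conventions of this section (reward $-s_t^T U_t s_t - a_t^T V_t a_t$, dynamics $s_{t+1} = A s_t + B a_t + w_t$); `Vc` stands for the action-cost matrix $V_t$, and all numeric values are toy choices of mine:

```python
import numpy as np

n, d, T = 2, 1, 20
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])       # toy linear dynamics (a discretized double integrator)
B = np.array([[0.0],
              [0.1]])
U = np.eye(n)                    # state-cost matrix U_t (time-invariant here)
Vc = np.eye(d)                   # action-cost matrix V_t (not the value function)
Sigma_w = 0.01 * np.eye(n)       # process-noise covariance

# Base case: V_T(s) = max_a [-s^T U s - a^T Vc a]  =>  a_T = 0, Phi_T = -U, Psi_T = 0.
Phi, Psi = -U, 0.0
L = None
for t in range(T - 1, -1, -1):
    K = Vc - B.T @ Phi @ B                 # positive definite since Phi <= 0
    L = np.linalg.solve(K, B.T @ Phi @ A)  # optimal gain: a_t = L @ s_t
    Psi = Psi + np.trace(Sigma_w @ Phi)    # uses Phi_{t+1}, so update before Phi
    Phi = A.T @ (Phi + Phi @ B @ np.linalg.solve(K, B.T @ Phi)) @ A - U
```

Note that `Sigma_w` enters only through $\Psi_t$, so the optimal gains $L_t$ are the same as in the noise-free problem.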
To summarize:
The very cool thing about the solution of discrete-time LQR problems is this: even though the problem has an infinite, continuous state space, under the assumption that the value function is a quadratic function of the state we can solve it exactly, just by computing these matrices.
This applies easily even to problems with very large state spaces; in practice, variations of these algorithms are often applied to high-dimensional problems, with state spaces of 12 or more dimensions, and they work very well.