Changchun Master Li

CS229 Machine Learning Lecture 18
Study Notes

2017-03-03

Outline

  • 0 Review
  • 1 Different Variations of MDPs
    • 1.1 State-Action Rewards
    • 1.2 Finite Horizon MDPs
  • 2 Linear Dynamical Systems
    • 2.1 Linear Quadratic Regulation
    • 2.2 Riccati Equation

0 Review

An MDP is a five-tuple $(S, A, \{P_{sa}\}, \gamma, R)$: states, actions, state-transition probabilities, discount factor, and reward function.

Value iteration converges to $V^\star$ asymptotically.
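For reference, the standard value-iteration update (Bellman backup) from the earlier lectures is:

$$
V(s) := R(s) + \gamma \max_a \sum_{s^\prime} P_{sa}(s^\prime) V(s^\prime)
$$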

Infinite, continuous-state MDPs:
value iteration cannot be applied directly; for a continuous-state MDP the value function has to be approximated (e.g. with fitted value iteration).

In fact, you can still represent the value function exactly even with an infinite, continuous state space, for special classes of infinite-state MDPs whose structure makes the formulation easier.

1.1 State-Action Rewards

sequence: $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} \cdots$

total payoff: $R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots$

It actually turns out that, given an MDP with state-action rewards, you can redefine the states (folding the action into the state definition) and reduce it back to an MDP whose rewards are a function of the state alone.

State-action rewards allow you to model more directly problems in which different actions may have different costs.

e.g.

  • A robot charging its battery
  • Driving: compared with a paved road, rocky roads and grass cost more

value iteration becomes: $V(s) := \max_a \left[ R(s,a) + \gamma \sum_{s^\prime} P_{sa}(s^\prime) V(s^\prime) \right]$
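As a concrete illustration, here is a minimal numpy sketch of this update on a small made-up tabular MDP; the sizes, the random transitions `P`, the rewards `R`, and the discount `gamma` are placeholder values, not from the lecture.

```python
import numpy as np

# Hypothetical tabular MDP with state-action rewards (all values illustrative).
# P[s, a, s'] = P_{sa}(s'), R[s, a] = R(s, a).
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition probs
R = rng.normal(size=(n_states, n_actions))                        # state-action rewards

V = np.zeros(n_states)
for _ in range(1000):  # iterate until (approximate) asymptotic convergence
    # Bellman backup with state-action rewards:
    # V(s) := max_a [ R(s, a) + gamma * sum_{s'} P_sa(s') V(s') ]
    Q = R + gamma * P @ V          # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

pi = Q.argmax(axis=1)  # greedy policy w.r.t. the converged V
print(V, pi)
```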

1.2 Finite Horizon MDPs

A finite-horizon MDP comprises a five-tuple $(S, A, \{P_{sa}\}, T, R)$; the discount factor $\gamma$ is replaced by the horizon $T$.

$T$ is the horizon time; concretely, it means you take actions in the MDP for only a total of $T$ time steps.

sequence: $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} \cdots \xrightarrow{a_{T-1}} s_T$

total payoff: $R(s_0, a_0) + R(s_1, a_1) + \cdots + R(s_T, a_T)$ (no discounting)

The world only exists for $T$ time steps.

The optimal policy is non-stationary, which means it depends on time (roughly: the optimal action to take may be different at different time steps).

e.g.

  • A game: moving left scores 1 point, moving right scores 10 points, but the right side is about to disappear

The state transition probabilities can also be a function of time.

e.g.

  • An aircraft burning fuel (its dynamics change over time as the fuel mass decreases)
  • Traffic forecasting (congestion patterns change with the time of day)

Here is the dynamic programming algorithm:

$$
\begin{aligned}
\text{define:} \quad V_t^\star(s) &= E\left[ R^{(t)}(s_t,a_t) + R^{(t+1)}(s_{t+1},a_{t+1}) + \cdots + R^{(T)}(s_T,a_T) \mid \Pi^\star,\, s_t = s \right] \\
V_t^\star(s) &= \max_a \Big[ \underbrace{R^{(t)}(s,a)}_{\text{immediate reward}} + \underbrace{\textstyle\sum_{s^\prime} P_{sa}^{(t)}(s^\prime)\, V_{t+1}^\star(s^\prime)}_{\text{expected future payoff}} \Big] \\
V_T^\star(s) &= \max_a R^{(T)}(s,a) \\
\Pi_t^\star(s) &= \arg\max_a \Big[ R^{(t)}(s,a) + \textstyle\sum_{s^\prime} P_{sa}^{(t)}(s^\prime)\, V_{t+1}^\star(s^\prime) \Big]
\end{aligned}
$$

You recurse backwards just once, from $t = T$ down to $t = 0$, and then you have the exact value function, rather than converging to it asymptotically.

The finite horizon plays a role similar to discounting, and either a horizon or a discount factor is a strong assumption.
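As a concrete illustration, here is a minimal numpy sketch of this backward recursion on a small made-up finite-horizon MDP; the sizes, the time-dependent transitions `P[t]`, and the rewards `R[t]` are placeholder values, not from the lecture.

```python
import numpy as np

# Hypothetical finite-horizon MDP; all numbers are illustrative.
n_states, n_actions, T = 3, 2, 5
rng = np.random.default_rng(1)
# Time-dependent transitions P[t, s, a, s'] and rewards R[t, s, a].
P = rng.dirichlet(np.ones(n_states), size=(T + 1, n_states, n_actions))
R = rng.normal(size=(T + 1, n_states, n_actions))

V = np.zeros((T + 1, n_states))
pi = np.zeros((T + 1, n_states), dtype=int)

# Base case: V_T*(s) = max_a R^(T)(s, a)
V[T] = R[T].max(axis=1)
pi[T] = R[T].argmax(axis=1)

# Backward recursion: V_t*(s) = max_a [ R^(t)(s,a) + sum_{s'} P_sa^(t)(s') V_{t+1}*(s') ]
for t in range(T - 1, -1, -1):
    Q = R[t] + P[t] @ V[t + 1]   # Q[s, a]
    V[t] = Q.max(axis=1)
    pi[t] = Q.argmax(axis=1)     # non-stationary policy: depends on t

print(V[0], pi[0])
```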

2 Linear Dynamical Systems

2.1 Linear Quadratic Regulation (LQR)

LQR applies the dynamic programming algorithm for finite-horizon MDPs to continuous state/action spaces.

e.g.

  • Helicopter: you want the helicopter to hover near a target state; choose a quadratic reward that penalizes deviation from the target and large control inputs; then solve for the optimal controller.

general case: linear dynamics $s_{t+1} = A s_t + B a_t + w_t$ with Gaussian noise $w_t \sim \mathcal{N}(0, \Sigma_w)$, where $s_t \in \mathbb{R}^n$ and $a_t \in \mathbb{R}^d$, and quadratic rewards $R(s_t, a_t) = -\left( s_t^T U s_t + a_t^T V a_t \right)$ with $U \succeq 0$, $V \succeq 0$ (so rewards are always non-positive).

If the dynamics are unknown, $A$ and $B$ can be estimated from recorded trajectories by linear regression, choosing them to minimize $\sum_i \sum_t \big\| s_{t+1}^{(i)} - \big( A s_t^{(i)} + B a_t^{(i)} \big) \big\|^2$.
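A minimal numpy sketch of that regression on synthetic trajectory data; the helper name `fit_linear_dynamics` and every number here are illustrative, not from the lecture.

```python
import numpy as np

def fit_linear_dynamics(states, actions):
    """Least-squares fit of s_{t+1} ~ A s_t + B a_t from one trajectory.

    states:  array of shape (T+1, n)   -- s_0 ... s_T
    actions: array of shape (T, d)     -- a_0 ... a_{T-1}
    """
    X = np.hstack([states[:-1], actions])       # rows are [s_t, a_t]
    Y = states[1:]                               # rows are s_{t+1}
    # Solve min_W ||X W - Y||^2, where W stacks A^T on top of B^T.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    n = states.shape[1]
    A = W[:n].T
    B = W[n:].T
    return A, B

# Illustrative synthetic data (not from any real system).
rng = np.random.default_rng(2)
n, d, T = 4, 2, 200
A_true = 0.9 * np.eye(n)
B_true = rng.normal(size=(n, d))
s = np.zeros((T + 1, n))
a = rng.normal(size=(T, d))
for t in range(T):
    s[t + 1] = A_true @ s[t] + B_true @ a[t] + 0.01 * rng.normal(size=n)

A_hat, B_hat = fit_linear_dynamics(s, a)
print(np.round(A_hat, 2))
```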

linearise a non-linear model:

e.g.

  • Inverted pendulum

cartoon: omitted

Take the derivative of the function $f$ at a chosen point $(\bar{s}_t, \bar{a}_t)$ and approximate $f$ by its tangent there.

Choose the linearization point in the vicinity of the states the system actually spends its time in, under the assumption that the state stays close to that point.
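In symbols, this is the first-order Taylor expansion around the chosen point $(\bar{s}_t, \bar{a}_t)$:

$$
s_{t+1} \approx f(\bar{s}_t, \bar{a}_t) + \nabla_s f(\bar{s}_t, \bar{a}_t)^T (s_t - \bar{s}_t) + \nabla_a f(\bar{s}_t, \bar{a}_t)^T (a_t - \bar{a}_t)
$$

Collecting the coefficients of $s_t$ and $a_t$ (and absorbing the constant terms) recovers the linear form $s_{t+1} \approx A s_t + B a_t$.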

Derivation of the computation

Recursively compute $V^\star$, working backwards from $t = T$:

$V_T^\star$ can be represented as a quadratic function of the state: $V_T^\star(s_T) = \max_{a_T} R(s_T, a_T) = -s_T^T U_T s_T$ (the maximum is attained at $a_T = 0$).

Suppose that $V_{t+1}^\star$ is a quadratic function of the state, $V_{t+1}^\star(s_{t+1}) = s_{t+1}^T \Phi_{t+1} s_{t+1} + \Psi_{t+1}$, in which $\Phi_{t+1}$ is a matrix and $\Psi_{t+1}$ is a real number (so $\Phi_T = -U_T$, $\Psi_T = 0$).

One could show that, for some $\Phi_t$ and $\Psi_t$, $V_t^\star(s_t) = s_t^T \Phi_t s_t + \Psi_t$, so the quadratic form is preserved at every step of the recursion.

The dynamic programming step expresses $V_t^\star$ in terms of $V_{t+1}^\star$; the term inside the max simplifies to a quadratic function of $a_t$, which can be maximized in closed form:

$$
V_t^\star(s_t) = \max_{a_t} \left[ - s_t^T U_t s_t - a_t^T V_t a_t + E_{ s_{t+1} \sim \underbrace{\mathcal{N}(A_t s_t + B_t a_t,\, \Sigma_w)}_{\text{Gaussian distribution } P_{s_t a_t}} } \underbrace{\left[ s_{t+1}^T \Phi_{t+1} s_{t+1} + \Psi_{t+1} \right]}_{V_{t+1}^\star(s_{t+1})} \right]
$$

Stated without proof: the maximizing action is linear in the state, $a_t^\star = L_t s_t$, for a gain matrix $L_t$ computed from $A_t$, $B_t$, $V_t$ and $\Phi_{t+1}$ (see the Riccati recursion below).

Therefore the optimal policy $\Pi_t^\star(s_t) = L_t s_t$ is a linear (straight-line) function of the state, and this is exact, not an approximation.

2.2 Riccati Equation

The discrete-time Riccati equation:
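Written out, one consistent form of the backward recursion, obtained by maximizing the quadratic expression above (sign conventions differ between references, so treat this as a sketch rather than the canonical statement):

$$
\begin{aligned}
\Phi_T &= -U_T, \qquad \Psi_T = 0 \\
a_t^\star &= \underbrace{\left( V_t - B_t^T \Phi_{t+1} B_t \right)^{-1} B_t^T \Phi_{t+1} A_t}_{L_t}\, s_t \\
\Phi_t &= A_t^T \left( \Phi_{t+1} - \Phi_{t+1} B_t \left( B_t^T \Phi_{t+1} B_t - V_t \right)^{-1} B_t^T \Phi_{t+1} \right) A_t - U_t \\
\Psi_t &= \Psi_{t+1} + \operatorname{tr}\!\left( \Sigma_w \Phi_{t+1} \right)
\end{aligned}
$$

Note that $\Phi_t$, and therefore $L_t$, does not depend on $\Sigma_w$ or $\Psi_t$: the noise changes the value but not the optimal policy.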

To summarize:

The very cool thing about the solution of discrete-time LQR problems is that, even though this is an MDP with an infinitely large, continuous state space, under the assumption that the value function is a quadratic function of the state we can prove that this holds and compute it exactly. Just by computing the matrices $\Phi_t$ and the real numbers $\Psi_t$, you can represent the value function exactly even for these infinitely large, continuous state spaces, and the computation scales only polynomially in the number of state variables: LQR scales roughly like the cube of the dimension of the problem.

It therefore applies easily even to problems with very large state spaces; in practice, variations of this algorithm are often applied to high-dimensional problems, with state spaces of 12 or more dimensions, and this works very well.
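To make the whole procedure concrete, here is a compact numpy sketch of the LQR backward pass under the conventions used above (negative quadratic rewards, known $A$, $B$, $U$, $V$); the double-integrator example at the bottom is an illustrative placeholder, not a system from the lecture.

```python
import numpy as np

def lqr_backward_pass(A, B, U, V, T):
    """Finite-horizon LQR under reward R(s,a) = -(s'Us + a'Va).

    Returns the gain matrices L[t] with a_t = L[t] @ s_t and the
    quadratic value matrices Phi[t] with V_t*(s) = s' Phi[t] s + Psi_t.
    (Psi is omitted: it does not affect the policy.)
    """
    Phi = [None] * (T + 1)
    L = [None] * T
    Phi[T] = -U                      # base case: V_T*(s) = -s'Us
    for t in range(T - 1, -1, -1):
        M = B.T @ Phi[t + 1] @ B - V
        L[t] = np.linalg.solve(-M, B.T @ Phi[t + 1] @ A)   # a_t* = L[t] s_t
        # Riccati update for the quadratic value matrix.
        Phi[t] = A.T @ (Phi[t + 1]
                        - Phi[t + 1] @ B @ np.linalg.solve(M, B.T @ Phi[t + 1])) @ A - U
    return L, Phi

# Illustrative problem: a 2-D double integrator with one control input.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
U = np.eye(2)          # penalize deviation of the state from 0
V = 0.1 * np.eye(1)    # penalize large actions
L, Phi = lqr_backward_pass(A, B, U, V, T=50)
print(L[0])            # linear feedback gain at t = 0
```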
