Data-driven policy iteration schemes based on linear programming
Learning how to optimally regulate a dynamical system from data is a fundamental problem in control theory. This project focuses on investigating new theory and methods for policy iteration schemes based on linear programming.
Keywords: Optimization, Approximate Dynamic Programming, Policy Iteration, Linear Programming
Optimal control problems arise in most engineering and scientific disciplines, from robotics to bioengineering to finance, to name just a few. One of the main approaches to solving optimal control problems was developed in the 1950s by Bellman. It relies on the so-called principle of optimality to derive a functional equation in the state and input variables, the Bellman equation, which characterizes the optimal feedback controller. The methods for solving the Bellman equation are collectively known as dynamic programming (DP) and typically rely on three fundamental techniques: value iteration, policy iteration (PI), and linear programming (LP). The main difficulty affecting DP methods is that the computation needed to solve the Bellman equation grows rapidly with the number of variables involved, a phenomenon known as the curse of dimensionality. Over the years, several approximation techniques have been developed to mitigate this problem, collectively known as approximate dynamic programming (ADP).
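For concreteness, in a discounted infinite-horizon setting (one standard formulation; the notation below is assumed here rather than taken from the project description) the Bellman equation reads

\[
V^*(x) \;=\; \min_{u \in \mathcal{U}} \Big\{ \ell(x,u) + \gamma\, \mathbb{E}\big[ V^*(x^+) \mid x, u \big] \Big\},
\]

where \(V^*\) is the optimal value function, \(\ell\) the stage cost, \(\gamma \in (0,1)\) the discount factor, and \(x^+\) the (possibly random) successor state.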
In many real-world applications one has to operate a system without knowledge of its dynamics. Learning how to optimally regulate a system is possible by letting it interact with the environment and collecting information about the resulting state transitions and costs (or rewards) over time, an approach classically known as reinforcement learning (RL). In this context, one can reformulate the Bellman equation in terms of the Q-function. The main advantage of the Q-function formulation is that one can extract a policy directly from Q, without knowledge of the system dynamics or stage cost.
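In the same assumed discounted setting, the Bellman equation for the optimal Q-function and the associated greedy policy read

\[
Q^*(x,u) \;=\; \ell(x,u) + \gamma\, \mathbb{E}\Big[ \min_{u'} Q^*(x^+,u') \,\Big|\, x, u \Big],
\qquad
\pi^*(x) \in \arg\min_{u} Q^*(x,u),
\]

which makes the advantage explicit: once \(Q^*\) is known, the minimization defining \(\pi^*\) involves neither the dynamics nor the stage cost.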
In this work, we focus on the PI and LP methods for Q-learning. The PI algorithm aims to compute an optimal stationary policy via an iterative procedure. It starts from an initial policy and, at each iteration, performs a policy evaluation step (i.e., computing the Q-function associated with the current policy), followed by a policy improvement step. Under certain assumptions, it can be shown that the sequence of policies generated by the PI algorithm converges to the optimal one. The LP method, on the other hand, is based on the idea that the solution to the Bellman equation can be cast as the optimizer of an LP. When dealing with infinite state and action spaces, the exact infinite-dimensional LPs have to be approximated by finite-dimensional ones. The two methods can be combined to learn a sequence of Q-functions by formulating the policy evaluation step as an LP.
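As a minimal sketch of how the two steps fit together, the following tabular toy example solves the policy evaluation step as an LP and then performs the greedy improvement step. It assumes a finite MDP with known transition tensor P[x, u, x'], stage cost L[x, u], and discount factor gamma; all names are illustrative and this is not the project's actual algorithm.

import numpy as np
from scipy.optimize import linprog

def lp_policy_evaluation(P, L, gamma, pi):
    """Compute Q_pi, the Q-function of policy pi, as the maximizer of an LP.

    Q_pi is the unique fixed point of
        Q(x,u) = L(x,u) + gamma * sum_x' P(x'|x,u) * Q(x', pi(x')),
    and can be recovered by maximizing sum(Q) subject to the
    componentwise Bellman inequalities
        Q(x,u) <= L(x,u) + gamma * sum_x' P(x'|x,u) * Q(x', pi(x')).
    """
    nX, nU, _ = P.shape
    n = nX * nU                   # decision variables: Q flattened as x*nU + u
    A = np.eye(n)                 # builds (I - gamma * P_pi) Q <= L row by row
    for x in range(nX):
        for u in range(nU):
            row = x * nU + u
            for x_next in range(nX):
                A[row, x_next * nU + pi[x_next]] -= gamma * P[x, u, x_next]
    b = L.reshape(n)
    # linprog minimizes, so maximize sum(Q) via c = -1; Q may be negative,
    # hence the explicit free bounds (linprog defaults to Q >= 0).
    res = linprog(c=-np.ones(n), A_ub=A, b_ub=b, bounds=(None, None))
    return res.x.reshape(nX, nU)

def policy_iteration_lp(P, L, gamma, max_iter=50):
    """Exact PI with LP-based policy evaluation on a finite MDP."""
    nX, nU, _ = P.shape
    pi = np.zeros(nX, dtype=int)  # arbitrary initial policy
    for _ in range(max_iter):
        Q = lp_policy_evaluation(P, L, gamma, pi)  # policy evaluation (LP)
        pi_new = Q.argmin(axis=1)                  # greedy policy improvement
        if np.array_equal(pi_new, pi):             # policies match: done
            break
        pi = pi_new
    return Q, pi

In the data-driven setting targeted by the project, the exact transition probabilities in the constraints would be replaced by sampled transitions, and the infinite-dimensional LPs arising from continuous state and input spaces by finite-dimensional approximations; constructing those approximations is precisely where the questions studied here arise.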
The main question in this PI-LP strategy is how to construct the linear programs so that the scheme converges to the optimal Q-function in the minimum number of iterations. In this project, we investigate two novel PI-LP algorithms and analyze their performance in terms of optimality and rate of convergence.
1. Deep understanding of the Bellman operator and the fundamental concepts in the theory of ADP and its methods.
2. Focus on the linear programming approach and its finite dimensional approximation, with particular attention to the problem of constructing an LP with bounded solutions.
3. Coding/implementation: convergence and performance analysis of two novel PI-LP iterative algorithms and comparison with existing methods.
4. Theory (advanced): theoretical error bounds associated with the proposed algorithms.
Lucia Falconi (lfalconi@control.ee.ethz.ch),
Andrea Martinelli (andremar@control.ee.ethz.ch)