Data-driven inverse polynomial stochastic optimal control with applications to autonomous quadcopter flight
This project considers discrete-time Markov decision processes and addresses the inverse problem of inferring a cost function from observed optimal behavior. The proposed method will avoid repeatedly solving the forward problem, and its relevance will be illustrated on a quadcopter flight problem.
In forward discrete-time stochastic optimal control problems posed in the Markov decision process (MDP) formalism, an agent makes decisions under uncertainty at each stage of a multi-stage process so as to minimize a cost criterion, and the cost function is typically assumed to be given. In many cases, however, one cannot specify the cost or reward of a task directly, since it may not correspond to any direct intuition or may not even exist in explicit form, but one can observe optimal behavior. It is therefore natural to consider the _inverse_ stochastic optimal control problem, which consists of inferring a cost function in an MDP from expert demonstrations. Inverse stochastic optimal control in the context of MDPs is an active research area with a wide spectrum of applications in fields such as engineering, operations research and biology. There are two main motivations behind inverse decision making. The first concerns situations where the cost function is of interest in itself, e.g., for scientific inquiry, modeling of human and animal behavior, or modeling of other cooperative or adversarial agents. The second concerns imitation or apprenticeship learning: first recovering the expert's cost function and then using it to reproduce and synthesize the optimal behavior. In engineering, for instance, inverse MDPs can be used to explain and imitate observed expert behavior in tasks such as highway driving, parking lot navigation and urban navigation. Other examples can be found in humanoid robotics and in the study of human locomotion.
Most existing inverse MDP methods are either designed exclusively for MDPs with finite state and action spaces, or rely on oracle access to an MDP solver called in the inner loop of an iterative procedure. The latter introduces a significant computational burden for MDPs with continuous state and action spaces, since solving a continuous MDP is a challenging and computationally expensive problem in its own right, especially when the dynamics are unknown. As a result, inverse optimal control for MDPs over uncountable spaces remains largely unexplored. The goal of this project is to contribute to this line of research. Under the assumption that the control model is polynomial, the student will develop a method that avoids repeatedly solving the forward problem while providing probabilistic guarantees on the quality of the recovered solution. The approach will build on the linear programming (LP) formulation of MDPs, complementary slackness optimality conditions, recent developments in polynomial optimization, and uniform finite-sample bounds from statistical learning theory. The relevance of the approach will be illustrated on an autonomous flight problem, first in simulation and, if time permits, in experiments using the existing quadrotor test-bed available at IfA.
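To give a flavor of the LP-based route, the following is a minimal sketch of the standard linear-programming characterization of a discounted MDP and the complementary-slackness condition it induces. The notation (state space $X$, action space $U$, transition kernel $Q$, initial distribution $\nu$, discount factor $\gamma$) is generic and not taken from the project description.

```latex
% Primal LP over value functions v for a discounted MDP with stage cost c:
\begin{align*}
  \sup_{v}\ & \int_X v(x)\,\nu(dx) \\
  \text{s.t.}\ & v(x) \le c(x,u) + \gamma \int_X v(y)\, Q(dy \mid x,u)
      \quad \forall (x,u) \in X \times U .
\end{align*}
% The dual LP minimizes \int c \, d\mu over occupation measures \mu, and
% complementary slackness forces the Bellman inequality to be tight on the
% support of the optimal occupation measure:
\[
  \mu^\star\!\left(\Big\{(x,u) : v^\star(x) < c(x,u)
      + \gamma \int_X v^\star(y)\, Q(dy \mid x,u)\Big\}\right) = 0 .
\]
% Inverse problem sketch: given demonstrations (x_i, u_i) from the expert,
% search over polynomial pairs (c, v) so that the Bellman inequality holds
% for all (x,u) -- an SOS/LMI constraint when the model is polynomial --
% and is (approximately) tight at the observed pairs, with no forward solve.
```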
The learning objectives of this project are the following. The student will:
- learn and understand several concepts related to the topic, including polynomial optimization (sum-of-squares programming, linear matrix inequalities), the LP approach to MDPs, and tools from statistical learning theory;
- gain experience in doing research; in particular, the student will develop mathematical proofs and formal arguments;
- become familiar with semidefinite programming (SDP) solvers (e.g., SeDuMi, Mosek) and the MATLAB toolbox SOSTOOLS, and illustrate the relevance of the proposed method on toy examples (a minimal warm-up sketch is given after this list);
- (optional) acquire knowledge of the modeling and simulation of quadcopters, implement the algorithm on real quadcopters, and gain experience with how decisions in the modeling and design stages affect real-world performance.
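As a warm-up for the SDP/SOSTOOLS objective above, here is a minimal sketch of the basic step those tools automate: certifying that a polynomial is a sum of squares (hence globally nonnegative) by solving an SDP feasibility problem over its Gram matrix. It is written in Python with cvxpy as a stand-in for the MATLAB SeDuMi/Mosek/SOSTOOLS workflow named in the objectives; the test polynomial is the classic SOSTOOLS demo example and is purely illustrative.

```python
# Minimal SOS-as-SDP sketch (illustrative, not the project's method).
# Certify p(x, y) = 2x^4 + 2x^3 y - x^2 y^2 + 5y^4 >= 0 by finding a
# positive semidefinite Gram matrix Q with p = z^T Q z, z = [x^2, xy, y^2].
import cvxpy as cp
import numpy as np

Q = cp.Variable((3, 3), symmetric=True)

constraints = [
    Q >> 0,                       # Gram matrix must be positive semidefinite
    Q[0, 0] == 2,                 # coefficient of x^4
    2 * Q[0, 1] == 2,             # coefficient of x^3 y
    2 * Q[0, 2] + Q[1, 1] == -1,  # coefficient of x^2 y^2
    2 * Q[1, 2] == 0,             # coefficient of x y^3
    Q[2, 2] == 5,                 # coefficient of y^4
]

prob = cp.Problem(cp.Minimize(0), constraints)  # pure feasibility problem
prob.solve()  # cvxpy dispatches to an installed SDP solver (e.g., SCS)

if prob.status == cp.OPTIMAL:
    # Any PSD Gram matrix certifies p >= 0 everywhere; factoring Q = L^T L
    # recovers an explicit sum-of-squares decomposition of p.
    print("p is a sum of squares; Gram matrix:\n", np.round(Q.value, 3))
```

The same Gram-matrix mechanics underlie the SOS/LMI constraints that would enforce a Bellman-type inequality globally when the control model is polynomial.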
The project is well suited for a student who enjoys mathematics. A solid background in analysis and convex optimization is required. Basic knowledge of Markov decision processes and/or the modeling and control of quadcopters is desirable but not mandatory.