Humanoid Locomotion Learning and Finetuning from Human Feedback
In the burgeoning field of deep reinforcement learning (RL), agents autonomously develop complex behaviors through trial and error. Yet applying RL across domains faces notable hurdles, particularly in devising appropriate reward functions. Traditional approaches often resort to sparse rewards for simplicity, though these prove inadequate for training efficient agents. Providing denser reward signals in the real world, in turn, may require elaborate setups, such as accelerometers for detecting door interactions, thermal imaging for action recognition, or motion capture systems for precise object tracking. Even with such instrumentation, crafting an ideal reward function remains challenging due to the propensity of RL algorithms to exploit the reward in unforeseen ways: agents may satisfy the stated objective in unintended manners, which highlights the difficulty of encoding desired behaviors, such as adherence to social norms, into a reward function.
An alternative strategy, imitation learning, circumvents the intricacies of reward engineering by having the agent learn through the emulation of expert behavior. However, acquiring a sufficient number of high-quality demonstrations for this purpose is often impractically costly. Humans, in contrast, learn with remarkable autonomy, benefiting from intermittent guidance from educators who provide tailored feedback based on the learner's progress. This interactive learning model holds promise for artificial agents, offering a customized learning trajectory that mitigates reward exploitation without extensive reward function engineering. The challenge lies in ensuring the feedback process is both manageable for humans and rich enough to be effective. Despite its potential, the implementation of human-in-the-loop (HiL) RL remains limited in practice. Our research endeavors to significantly lessen the human labor involved in HiL learning, leveraging both unsupervised pre-training and preference-based learning to enhance agent development with minimal human intervention.
Keywords: reinforcement learning from human feedback, preference learning
**Work packages**
Literature research
Reinforcement learning from human feedback
Preference learning
**Requirements**
Strong programming skills in Python
Experience in reinforcement learning frameworks
**Publication**
This project will focus mostly on algorithm design and system integration. Promising results will be submitted to robotics or machine learning conferences that highlight outstanding robotic performance.
**Related literature**
Christiano, Paul F., et al. "Deep reinforcement learning from human preferences." Advances in neural information processing systems 30 (2017).
Lee, Kimin, Laura Smith, and Pieter Abbeel. "PEBBLE: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training." arXiv preprint arXiv:2106.05091 (2021).
Wang, Xiaofei, et al. "Skill preferences: Learning to extract and execute robotic skills from human feedback." Conference on Robot Learning. PMLR, 2022.
Li, Chenhao, et al. "FLD: Fourier Latent Dynamics for Structured Motion Representation and Learning." arXiv preprint arXiv:2402.13820 (2024).
The goal of the project is to learn and finetune humanoid locomotion policies using reinforcement learning from human feedback. The challenge lies in learning effective reward models from an efficient representation of motion clips, as opposed to single-state frames. The tentative pipeline works as follows:
1. A self-supervised motion representation pretraining phase that learns efficient trajectory representations, potentially using Fourier Latent Dynamics, with data generated by some initial policies (a rough embedding sketch follows this list).
2. Reward learning from human feedback, conditioned on the trajectory representation learned in the first step. Human preferences, expressed by comparing visualized motions, are thus embedded in this latent trajectory representation (a reward-model sketch follows this list).
3. Policy training with the learned reward. The trajectories induced by the learned policy are then used to augment the training data for the first two steps, and the process repeats.
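As a rough illustration of step 1, the snippet below computes a Fourier-style embedding of a motion clip. It is only a minimal stand-in loosely inspired by Fourier Latent Dynamics (Li et al., 2024), which learns its latent parameterization with an autoencoder rather than a fixed FFT; the function name, clip shape, and number of modes are assumptions made for illustration.

```python
# Illustrative stand-in for a compact trajectory representation (step 1).
# FLD learns this parameterization end to end; a plain FFT is used here only
# to make the idea of an amplitude/phase-based clip embedding concrete.
import torch


def fourier_embedding(clip: torch.Tensor, num_modes: int = 4) -> torch.Tensor:
    """Embed a motion clip of shape (T, D) into low-frequency Fourier features.

    Returns a flat vector with the amplitudes and phases of the `num_modes`
    lowest non-constant frequencies of each state dimension,
    i.e. a vector of length 2 * num_modes * D.
    """
    spectrum = torch.fft.rfft(clip, dim=0)      # (T // 2 + 1, D), complex
    modes = spectrum[1:1 + num_modes]           # drop the DC component
    amplitudes = modes.abs() / clip.shape[0]
    phases = torch.angle(modes)
    return torch.cat([amplitudes.flatten(), phases.flatten()])


# Example: embed a 120-step clip of a 30-dimensional humanoid state.
z = fourier_embedding(torch.randn(120, 30))     # shape: (240,)
```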
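For step 2, the reward model can be fit to human preferences over pairs of motion clips with the Bradley-Terry objective of Christiano et al. (2017), applied to latent trajectory embeddings rather than single states. The sketch below assumes a fixed embedding dimension and a simple MLP; the class and function names are placeholders, not an existing API or the project's final design.

```python
# Minimal sketch of preference-based reward learning on trajectory embeddings
# (step 2). Architecture and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrajectoryRewardModel(nn.Module):
    """Maps a latent trajectory embedding to a scalar reward."""

    def __init__(self, latent_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z).squeeze(-1)


def preference_loss(model: TrajectoryRewardModel,
                    z_a: torch.Tensor,
                    z_b: torch.Tensor,
                    prefs: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss over a batch of labeled clip pairs.

    z_a, z_b: embeddings of the two compared clips, shape (B, latent_dim).
    prefs:    1.0 where the human preferred clip A, 0.0 where clip B.
    """
    logits = model(z_a) - model(z_b)   # log-odds that A is preferred over B
    return F.binary_cross_entropy_with_logits(logits, prefs)
```

During policy training (step 3), each rollout clip can then be embedded and scored by such a model to supply the learned reward, and the resulting trajectories are added back to the pretraining and preference datasets, closing the loop described above.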
Please include your CV and transcript in the submission.
**Chenhao Li**
https://breadli428.github.io/
chenhli@ethz.ch
**Xin Chen**
https://www.xccyn.com/
xin.chen@inf.ethz.ch