Beyond Value Functions: Stable Robot Learning with Monte-Carlo GRPO
Robotics is dominated by on-policy reinforcement learning: the paradigm of training a robot controller by iteratively interacting with the environment and maximizing some objective. A crucial ingredient that makes this work is the advantage function. On each policy update, algorithms typically sum the gradients of the log-probabilities of the actions taken in the robot simulation. The advantage function increases or decreases the probabilities of these actions by comparing their “goodness” against a baseline. Current advantage estimation methods use a value function to aggregate robot experience and thereby reduce variance. This improves sample efficiency at the cost of introducing some bias.
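To make this concrete, below is a minimal sketch of a value-function-baseline advantage inside a vanilla policy-gradient loss (PyTorch; tensor shapes and all function names are hypothetical and not taken from any specific codebase):

```python
import torch

def advantage_with_value_baseline(returns, values):
    # A(s_t, a_t) ≈ G_t - V(s_t): observed return minus the learned value baseline.
    # Subtracting the baseline reduces variance, but any error in V introduces bias.
    return returns - values.detach()

def policy_gradient_loss(log_probs, returns, values):
    # Weight each taken action's log-probability by its advantage:
    # positive advantage -> increase that action's probability, negative -> decrease it.
    advantages = advantage_with_value_baseline(returns, values)
    return -(advantages * log_probs).mean()
```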
Stably training large language models via reinforcement learning is well known to be a challenging task. A line of recent work [1, 2] has used Group-Relative Policy Optimization (GRPO) to achieve this feat. In GRPO, a group of candidate answers is generated for each query. The advantage of an answer is computed from how much better it is than the average answer to that query. In this formulation, no value function is required.
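As an illustration, a group-relative advantage of this kind could be computed roughly as in the sketch below (the tensor layout, the normalization by the group standard deviation, and all names are assumptions rather than a verbatim transcription of [1, 2]):

```python
import torch

def group_relative_advantage(rewards, eps=1e-8):
    # rewards: (num_queries, group_size) -- one scalar reward per sampled answer.
    group_mean = rewards.mean(dim=1, keepdim=True)
    group_std = rewards.std(dim=1, keepdim=True)
    # Each answer is scored by how much better it is than the average answer
    # to the same query; no value function is involved.
    return (rewards - group_mean) / (group_std + eps)
```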
Can we adapt GRPO to robot learning? Value functions are known to cause training-stability issues [3] and to result in biased advantage estimates [4]. We are in the age of GPU-accelerated RL [5], training policies by simulating thousands of robot instances simultaneously. This makes a new Monte Carlo (MC) approach to RL timely, feasible and appealing. In this project, the student will first investigate the limitations of value-function-based advantage estimation. Using GRPO as a starting point, the student will then develop MC-based algorithms that exploit the GPU’s parallel simulation capabilities to achieve unbiased variance reduction and stable RL training while maintaining competitive wall-clock time.
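Purely as an illustration of the direction (not the algorithm to be developed in this project), a Monte Carlo baseline formed from groups of parallel rollouts that share an initial state might look like the following; every name and shape here is a hypothetical placeholder:

```python
import torch

def parallel_mc_advantage(returns, group_ids):
    # returns:   (num_envs,) Monte Carlo return of each parallel environment instance.
    # group_ids: (num_envs,) index of the shared initial state each instance started from.
    baselines = torch.zeros_like(returns)
    for g in group_ids.unique():
        mask = group_ids == g
        # Use the mean return of rollouts from the same start state as a
        # value-function-free baseline.
        baselines[mask] = returns[mask].mean()
    return returns - baselines
```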
Keywords: Robot Learning, Reinforcement Learning, Monte Carlo RL, GRPO, Advantage Estimation
Co-supervised by Jing Yuan Luo (Mujoco)
- Literature research
- Investigate the bias and variance properties of the PPO value function
- Design and implement a novel algorithm that achieves variance reduction through Monte Carlo sampling via massive environment parallelism
- Re-implement existing SOTA algorithms as benchmarks
- Bonus: provide theoretical insights to justify your proposed Monte Carlo method
- Background in machine learning
- Excellent knowledge of Python