Learning to Manipulate Objects using Natural Language and Visual Inputs on a Robotic Manipulator
Natural language algorithms and Large Language Models, exemplified by GPT-4, have shown remarkable prowess across diverse domains. However, achieving human-like communication with robots remains a challenge. This project addresses that gap by enhancing the interface between natural language algorithms and robotic systems. Building on an existing ChatGPT-based interface, we aim to introduce a vision component for dynamic environmental adaptation and to use it to assess task success. This success metric, in turn, will be used to fine-tune the language algorithm via reinforcement learning. The project's goal is a real-world demonstration of these advancements on a robotic manipulator, marking a significant stride towards more autonomous systems and more capable artificial intelligence.
The capabilities of natural language algorithms and Large Language Models like GPT-4 have been impressive across various domains and disciplines, beyond what we believed was possible. However, despite their success, we still cannot communicate with robots the way we communicate with other humans. For humans, language is our way of communicating, conveying instructions, and coordinating teams. Yet how to use language to relay information to machines for performing a task is still an open question. Despite recent efforts, only preliminary results exist on how to interface natural language with robotic systems. Current approaches often lack feedback through vision and/or are trained in an end-to-end fashion, making these systems hard to interpret and challenging to use in safety-critical environments. Moreover, the language-robot interface creates ample opportunities. For instance, since task success (as assessed through vision) can be used to fine-tune language models via reinforcement learning, we can further expand on grounding natural language algorithms for specific tasks. Interacting with our robots through natural language is an important step towards more autonomous systems and more capable artificial intelligence.
In this project, we aim to address some of these gaps. A ready-to-use interface already exists in which ChatGPT can be used to convey instructions to a robotic manipulator in a simulation environment; the manipulator then realizes the desired trajectories via Model Predictive Control (MPC).
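To make the interface concrete, below is a minimal sketch of how such a language-to-control pipeline might look. It assumes the OpenAI Python client (openai >= 1.0); the MPCController wrapper, the prompt, and the waypoint format are hypothetical placeholders for illustration, not the project's actual code.

```python
# Illustrative sketch only: a minimal language-to-MPC pipeline.
# MPCController and the waypoint format are hypothetical placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You control a robotic manipulator. Given an instruction, reply only with "
    "a JSON list of end-effector waypoints [[x, y, z], ...] in meters."
)

def instruction_to_waypoints(instruction: str) -> list[list[float]]:
    """Ask the language model to translate an instruction into waypoints."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": instruction},
        ],
    )
    return json.loads(response.choices[0].message.content)

# waypoints = instruction_to_waypoints("Pick up the red cube and place it on the tray.")
# mpc = MPCController(robot_model)  # hypothetical wrapper around the MPC tracker
# mpc.track(waypoints)              # realize the desired trajectory via MPC
```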
The goal of this project is to enhance the existing architecture by (1) introducing a vision interface so that the robot can operate in dynamic environments, and (2) using the vision interface to assess task success, relying on this metric to fine-tune the language algorithm via reinforcement learning. The expected outcome is a demonstration of these capabilities on a real-world hardware implementation of the robotic manipulator. The project tasks include:
• Introducing an object detection network (such as YOLO) as an additional source of input beyond language in the current simulation environment (a detection sketch follows this list).
• Testing the proposed architecture on hardware and assessing how vision improves robustness and generalizability. The assessment can use the current setup (without vision) as a baseline, measuring the number of successful task completions achieved with vision under different perturbations of the environment, such as changes in object placement, object types, etc. (an evaluation sketch follows this list).
• Designing and implementing a reinforcement learning from (human) feedback (RLHF) algorithm using Proximal Policy Optimization (PPO) to fine-tune the language algorithm based on task success, as assessed by the vision algorithm (a PPO sketch follows this list).
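For the first task, a hedged sketch of the vision input is shown below: detect objects with a pretrained YOLO model and summarize them as text that can be appended to the language prompt. It assumes the ultralytics package and a COCO-pretrained checkpoint; the scene-description format is an illustrative choice, not the project's.

```python
# Sketch: turn YOLO detections into a textual scene summary for the LLM.
# Assumes the ultralytics package; the summary format is illustrative.
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # pretrained COCO checkpoint

def describe_scene(image_path: str) -> str:
    """Run object detection and return a textual scene summary."""
    result = detector(image_path)[0]
    lines = []
    for box, cls, conf in zip(result.boxes.xyxy, result.boxes.cls, result.boxes.conf):
        x1, y1, x2, y2 = box.tolist()
        name = result.names[int(cls)]
        lines.append(
            f"{name} (conf {float(conf):.2f}) at bbox "
            f"({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})"
        )
    return "Detected objects:\n" + "\n".join(lines)

# prompt = instruction + "\n" + describe_scene("camera_frame.png")
```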
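For the second task, the baseline-vs-vision comparison could be organized as below. This is only a sketch: run_task() is a hypothetical, stubbed hook into the existing setup, and the perturbation names are illustrative.

```python
# Sketch of the evaluation protocol: success rates with and without vision
# across environment perturbations. run_task() is a hypothetical stub.
PERTURBATIONS = ["nominal", "shifted_objects", "novel_object_types"]

def run_task(use_vision: bool, perturbation: str) -> bool:
    """Hypothetical: execute one manipulation trial, return True on success."""
    raise NotImplementedError

def success_rate(use_vision: bool, perturbation: str, n_trials: int = 20) -> float:
    """Fraction of successful task completions over repeated trials."""
    successes = sum(
        run_task(use_vision=use_vision, perturbation=perturbation)
        for _ in range(n_trials)
    )
    return successes / n_trials

for p in PERTURBATIONS:
    baseline = success_rate(use_vision=False, perturbation=p)
    with_vision = success_rate(use_vision=True, perturbation=p)
    print(f"{p}: baseline {baseline:.0%} -> with vision {with_vision:.0%}")
```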
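For the third task, one possible shape of the PPO fine-tuning loop is sketched below, assuming an open-weight language model (a closed API model like ChatGPT cannot be fine-tuned this way) and the Hugging Face trl library's classic PPOTrainer API (trl < 0.12). The vision-based success check, wrapped here in a hypothetical task_succeeded() stub, stands in for human feedback.

```python
# Hedged sketch of RLHF via PPO on an open-weight LLM, using trl's classic
# PPOTrainer API. gpt2 is a placeholder model; task_succeeded() is hypothetical.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # placeholder; the project would use its own LLM
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(
    PPOConfig(batch_size=1, mini_batch_size=1),
    model,
    ref_model=None,
    tokenizer=tokenizer,
)

def task_succeeded(response_tokens) -> bool:
    """Hypothetical hook: decode the generated plan, execute it, and return
    the vision-based success verdict."""
    ...

instruction = "Pick up the red cube and place it on the tray."
query = tokenizer(instruction, return_tensors="pt").input_ids[0]
response = ppo_trainer.generate([query], return_prompt=False, max_new_tokens=32)[0]

# Binary reward from the vision-based success check.
reward = torch.tensor(1.0 if task_succeeded(response) else 0.0)

# One PPO update from (query, response, reward) triples.
stats = ppo_trainer.step([query], [response], [reward])
```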
Dr. Carmen Amo Alonso (camoalonso@ethz.ch)
Dr. Andrea Carron (carrona@ethz.ch)
René Zurbrügg (zrene@ai.ethz.ch)