Combining Low-Level Sensors with Large Vision-Language Models for Context-Aware Human-Robot Collaboration
The goal of this thesis is to build a context-aware robot-arm support system for manual assembly that combines multiple RGB-D cameras with IMU sensors attached to the human worker. By combining this low-level sensor information with large vision-language models such as CLIP and GPT-4, the thesis will explore how to extract context awareness from manual workflows and use it for optimal autonomous robot assistance.
Keywords: Computer Vision, Machine Learning, Deep Learning, Robotics, Human-Robot Collaboration, CLIP, GPT, Large Vision-Language Models
This project is done in collaboration with the Accenture Digital Experiences Lab.
With the emergence of collaborative robots (cobots), robotic systems can work in direct interaction with humans and assist them during manual workflows. However, to enable seamless collaboration, robot systems need to be context-aware and make sense of their environment and the human actions within it. To achieve this, the goal of this project is to create a system that combines information from visual sensors (multiple RGB-D cameras) and IMU sensors with large vision-language models into a context-aware system. The system will feed low-level predictions, such as recognized activities and extracted visual bounding boxes, forward to large vision-language models to make sense of the environment and understand the temporal context of the human. Subsequently, an LLM-based robotic task planner will use this context to plan the next assistive robot action.
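As a rough illustration of how the low-level perception could connect to a vision-language model, the sketch below runs CLIP zero-shot over a single RGB frame to guess the current assembly activity. This is only a minimal sketch: the prompt list, the model checkpoint, and the assumption that frames arrive as PIL images (e.g. converted from the ZED 2i SDK) are illustrative placeholders, not part of the actual project setup.

```python
# Minimal sketch: zero-shot activity guessing with CLIP on a single RGB frame.
# The activity prompts below are hypothetical stand-ins for the real assembly steps.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ACTIVITY_PROMPTS = [
    "a person picking up a screwdriver",
    "a person fastening a screw",
    "a person joining two assembly parts",
    "an idle workbench",
]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def classify_activity(frame: Image.Image) -> str:
    """Return the activity prompt with the highest CLIP image-text similarity."""
    inputs = processor(text=ACTIVITY_PROMPTS, images=frame,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_prompts)
    return ACTIVITY_PROMPTS[logits.softmax(dim=-1).argmax().item()]
```

In the full system, such zero-shot labels would be fused with IMU-based activity recognition and detected bounding boxes before being passed on to the reasoning stage.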
You will be introduced to our robot system and the sensors available (ZED 2i cameras, smartwatch IMU). Your tasks include:
- Implementing Low-Level ML: Human Pose Estimation, Semantic Segmentation, Activity Recognition
- Integration with Large Vision-Language Models: GPT, LLaMA, CLIP, ...
- Implementing a Robot Task Planner based on context-aware predictions from the LLM (see the sketch after this list)
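As a rough sketch of the planning task, the snippet below shows one possible shape of an LLM-based planner: the low-level context (current activity, detected objects, action history) is serialized into a prompt, and the model is constrained to a small discrete action set. The action names, the `plan_next_action` helper, and the use of the OpenAI chat API with a `gpt-4` model are assumptions for illustration; the actual planner could equally build on LLaMA or another model listed above.

```python
# Minimal sketch of an LLM-based task planner over a fixed robot action vocabulary.
# All identifiers are illustrative, not part of the lab's codebase.
from openai import OpenAI

ROBOT_ACTIONS = ["hand_over_screwdriver", "hold_part_in_place",
                 "fetch_next_component", "wait"]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def plan_next_action(activity: str, visible_objects: list[str],
                     history: list[str]) -> str:
    """Ask the LLM to pick the next assistive robot action from a fixed set."""
    prompt = (
        f"The worker is currently: {activity}.\n"
        f"Objects detected on the bench: {', '.join(visible_objects)}.\n"
        f"Previous robot actions: {', '.join(history) or 'none'}.\n"
        f"Answer with exactly one action from: {', '.join(ROBOT_ACTIONS)}."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are a task planner for an assistive robot arm."},
            {"role": "user", "content": prompt},
        ],
    )
    answer = resp.choices[0].message.content.strip()
    # Fall back to a safe default if the reply is not in the allowed action set.
    return answer if answer in ROBOT_ACTIONS else "wait"
```

Constraining the output to a known action vocabulary keeps the LLM's free-form text compatible with the robot's executable skills and makes failures easy to detect.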
- Strong programming skills (Python, C#, C++, …)
- An interest in robotics
- Experience with machine learning, data science or computer vision
- The ability to take initiative and shape the direction of the project
- Enthusiasm for tackling practical challenges
As part of our research at the AR Lab within the Human Behavior Group, we work on automatically analyzing a user's interaction with their environment in scenarios such as surgery or industrial machine operation. By collecting real-world datasets in these scenarios and using them for machine learning tasks such as activity recognition, object pose estimation, or image segmentation, we gain an understanding of how a user performed a given task. We can then use this information to provide the user with real-time feedback through mixed reality devices, such as the Microsoft HoloLens, that guide them and prevent mistakes.
- Collaboration with Accenture Digital Experiences Lab
- Master Thesis
- ML / CV
- Large Vision-Language Models
- Robot Task Planning
- Human-Robot Collaboration
Please send me your CV and Master's grades (ktistaks@ethz.ch).