Combining Low-Level Sensors with Large Vision-Language Models for Context-Aware Human-Robot Collaboration
The goal of this thesis is to build a system that combines multiple RGB-D cameras with IMU sensors worn by a human to create a context-aware robot-arm support system for manual assembly. By fusing this low-level sensor information with large vision-language models, such as CLIP and GPT-4, this thesis will explore how to extract context awareness from manual workflows and use it for optimal autonomous robot assistance.
Keywords: Computer Vision, Machine Learning, Deep Learning, Robotics, Human-Robot Collaboration, CLIP, GPT, Large Vision-Language Models
This project is done in a collaboration with the Accenture Digital Experiences Lab.
With the emergence of collaborative robots (cobots), robotic systems can work in direct interaction with humans and assist them during manual workflows. However, to enable seamless collaboration, robot systems need to be context-aware: they must make sense of their environment and the human actions within it. To achieve this, the goal of this project is to create a system that combines information from visual sensors (multiple RGB-D cameras) and IMU sensors with large vision-language models. The system will feed low-level predictions, such as recognized activities and extracted visual bounding boxes, forward to large vision-language models to make sense of the environment and understand the human's temporal context. A robotic task planner will then use this understanding to plan the next assistive robot action. The system will consist of the following components:
**Low-Level Machine Learning:**
1. Semantic image segmentation & bounding-box extraction (e.g., Facebook's Segment Anything)
2. Human skeleton extraction (e.g., OpenPose, ZED camera API)
3. Activity recognition from IMU and skeleton data (e.g., LSTMs)
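A common first step for the activity-recognition component is to fuse the IMU channels with the extracted skeleton joints into one feature stream and cut it into overlapping windows, the usual input shape for a sequence model such as an LSTM. The channel counts and window sizes below are illustrative assumptions, not values prescribed by the project:

```python
import numpy as np

def sliding_windows(stream: np.ndarray, win: int, stride: int) -> np.ndarray:
    """Split a (T, F) sensor stream into overlapping (win, F) windows."""
    starts = range(0, stream.shape[0] - win + 1, stride)
    return np.stack([stream[s:s + win] for s in starts])

# Hypothetical fused stream: 6 IMU channels + 2D coordinates of 18 skeleton joints.
T, F = 300, 6 + 18 * 2
fused = np.random.randn(T, F)

# Each window would be one classifier input (e.g., for an LSTM).
windows = sliding_windows(fused, win=60, stride=30)
print(windows.shape)  # (9, 60, 42)
```

The resulting `(num_windows, win, F)` tensor can be fed directly to a recurrent classifier; window length and stride would be tuned on the recorded assembly data.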
**Large Vision-Language Models:**
1. CLIP model for spatial context-awareness
2. GPT-4 for temporal context-awareness
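For the temporal side, one plausible design is to serialize the timestamped low-level predictions into a textual prompt for GPT-4. The event schema and prompt wording below are illustrative assumptions about how such a bridge could look:

```python
def build_context_prompt(events: list[dict]) -> str:
    """Serialize timestamped low-level predictions (activity labels,
    detected objects) into a prompt asking a language model for the
    current workflow state. The event schema is an assumption."""
    lines = [
        f"[t={e['t']:.1f}s] activity={e['activity']}, objects={e['objects']}"
        for e in events
    ]
    return (
        "You observe a manual assembly workflow.\n"
        + "\n".join(lines)
        + "\nWhat assembly step is the worker currently performing?"
    )

events = [
    {"t": 0.0, "activity": "reach", "objects": ["screwdriver"]},
    {"t": 2.5, "activity": "fasten", "objects": ["screw", "bracket"]},
]
print(build_context_prompt(events))
```

The returned string would then be sent to the GPT-4 API; the model's answer provides the temporal context that the task planner consumes.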
**Context-Aware Robot Task Planner:**
1. Use GPT-4 to predict next robot tasks
2. Use existing robot API to execute task on the robot
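The planner step can be sketched as a dispatch from a structured task prediction (e.g., parsed from GPT-4's JSON output) onto the robot API. The task names and `RobotArm` methods here are illustrative assumptions; a real system would call the cobot vendor's SDK instead:

```python
class RobotArm:
    """Stand-in for the existing robot API; records calls for illustration."""
    def __init__(self):
        self.log: list[str] = []

    def pick(self, item: str) -> None:
        self.log.append(f"pick({item})")

    def hand_over(self, item: str) -> None:
        self.log.append(f"hand_over({item})")

def execute_predicted_task(robot: RobotArm, task: dict) -> None:
    """Dispatch a structured task prediction to the matching robot call."""
    actions = {"fetch_tool": robot.pick, "deliver": robot.hand_over}
    fn = actions.get(task["action"])
    if fn is None:
        raise ValueError(f"unknown task: {task!r}")
    fn(task["item"])

arm = RobotArm()
execute_predicted_task(arm, {"action": "fetch_tool", "item": "screwdriver"})
print(arm.log)  # ['pick(screwdriver)']
```

Keeping the dispatch table explicit makes it easy to validate GPT-4's output: any predicted action outside the known set is rejected before the robot moves.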
1. System Integration: ZED 2i cameras, Smartwatch IMU
2. Implementing Low-Level ML: Human Pose Estimation, Semantic Segmentation, Activity Recognition
3. Integration with Large Visual-Language Models: GPT-4 and CLIP
4. Implement a Robot Task Planner based on context-aware predictions from GPT-4
... a very autonomous and methodical way of working. You know how to structure a project, how to derive meaningful work packages and how to systematically develop solutions. … Previous project experience with computer vision and machine learning (PyTorch, OpenCV, Open3D, ...)
As part of our research at the AR Lab within the Human Behavior Group, we work on automatically analyzing a user's interaction with their environment in scenarios such as surgery or industrial machine operation. By collecting real-world datasets in these scenarios and using them for machine learning tasks such as activity recognition, object pose estimation, or image segmentation, we gain an understanding of how a user performed a given task. We can then use this information to provide the user with real-time feedback through mixed reality devices, such as the Microsoft HoloLens, guiding them and preventing mistakes.
- Collaboration with Accenture Digital Experiences Lab - Master Thesis - ML / CV - Large Vision-Language Models - Robot Task Planning - Human-Robot Collaboration
Please send me your CV and Master's grades (ktistaks@ethz.ch).