Human-Robot Communication with Text Prompts and 3D Scene Graphs
This project extends previous work [a] on calculating similarity scores between text prompts and 3D scene graphs representing environments. The current method identifies potential locations based on user descriptions, aiding human-agent communication, but is limited by its coarse localization and inability to refine estimates incrementally. This project aims to enhance the method by enabling it to return potential locations within a 3D map and incorporate additional user information to improve localization accuracy incrementally until a confident estimate is achieved.
Keywords: 3D scene graph, LLM, localization
The project builds on the work recently published in [a], which addresses the problem of calculating a similarity score between a text prompt and a scene graph representing an environment. In [a], the algorithm interprets user descriptions, such as "I see a blue chair next to a table with two monitors standing on it," and identifies potential locations (e.g., rooms) within the environment that correspond to the description. This approach is particularly useful for enhancing human-agent communication, enabling commands like directing an agent to a specific location to perform a task. However, the current algorithm has limitations: it only provides a coarse estimate of the location and cannot refine this estimate incrementally as more information becomes available.
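To make the retrieval step concrete, below is a minimal sketch, not the actual method of [a]: it assumes each room's scene-graph subgraph has been verbalized into a short textual description and scores those descriptions against the user prompt with off-the-shelf sentence embeddings. The room names, descriptions, and the choice of the all-MiniLM-L6-v2 model are illustrative assumptions.

```python
# Minimal sketch of text-prompt-to-room retrieval. This is NOT the scoring
# of [a]; it only illustrates the idea of ranking rooms by similarity to a
# user description. Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical verbalizations of per-room scene-graph subgraphs.
rooms = {
    "office_1": "a blue chair next to a table with two monitors on it",
    "kitchen":  "a fridge beside a counter with a sink and a kettle",
    "lounge":   "a red sofa facing a television above a low shelf",
}

prompt = "I see a blue chair next to a table with two monitors standing on it"

room_ids = list(rooms)
room_emb = model.encode([rooms[r] for r in room_ids], convert_to_tensor=True)
prompt_emb = model.encode(prompt, convert_to_tensor=True)

# One cosine-similarity score per room; higher means a better match.
scores = util.cos_sim(prompt_emb, room_emb)[0]
ranked = sorted(zip(room_ids, scores.tolist()), key=lambda x: -x[1])
for room, score in ranked:
    print(f"{room}: {score:.3f}")
```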
The primary objectives of this project are twofold. First, the method will be extended to return a set of potential locations within a 3D map of the environment. Second, the project will develop a system for incremental localization, allowing the method to incorporate additional user-provided information when initial confidence in the localization is low, thereby improving the accuracy of the estimate until a confident localization is achieved. In practice, the user keeps describing their surroundings until the algorithm is confident enough to localize them.
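The incremental loop could look like the following sketch. The stopping rule (an absolute score threshold plus a top-1/top-2 margin), the parameter values, and the ask_user/score_rooms helpers are all illustrative assumptions, not choices fixed by the project.

```python
# Sketch of the proposed incremental localization loop: gather user clues
# until the retrieval is confident. Thresholds and helpers are hypothetical.

def localize_incrementally(ask_user, score_rooms,
                           threshold=0.8, margin=0.15, max_turns=5):
    """ask_user(prompt) -> str returns one user description;
    score_rooms(text) -> dict[room_id, score] wraps the retrieval of [a]."""
    clues = []
    ranked = []
    for _ in range(max_turns):
        clues.append(ask_user("Describe your surroundings: "))
        # Fuse all clues gathered so far into a single query description.
        scores = score_rooms(". ".join(clues))
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])
        top_score = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
        # Confident if the best room scores high AND clearly beats the
        # runner-up; otherwise ask the user for more detail.
        if top_score >= threshold and top_score - runner_up >= margin:
            return ranked[0][0], ranked
    return None, ranked  # no confident estimate within the turn budget
```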
[a] Chen, J., Barath, D., Armeni, I., Pollefeys, M., & Blum, H. (2024). "Where am I?" Scene Retrieval with Language. ECCV 2024.