The objective of this project is to determine the metric relative pose, comprising a 3D rotation and a metrically scaled translation, between two images. Classical computer vision techniques cannot recover the scale of the translation because two-view geometry provides insufficient constraints. This limitation significantly complicates tasks such as 3D reconstruction, where the translation scale is critical for positioning the cameras. With the development of semantic segmentation models, object-level image segmentations have become readily available. Our project seeks to leverage object-level segmentation cues to achieve accurate metric relative pose estimation by matching objects and local features. Specifically, object-level information allows us to extract object-aware local features and to handle the large differences in apparent scale caused by extreme viewpoint changes, leading to more accurate correspondence matching. Moreover, identifying common items of roughly known size, such as monitors or sofas, enables us to derive an approximate scale of the observed scene. With this approximate scale, we can recover the metric relative pose from the matched correspondences using an additional pre-trained monocular depth estimation model.
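The final scale-recovery step could be sketched roughly as follows. This is a minimal illustrative example, not part of the project description: the function name, the median-based scale estimate, and the synthetic numbers are all assumptions. It shows how an up-to-scale translation from relative pose estimation might be promoted to metric units by comparing triangulated (up-to-scale) keypoint depths with metric depths predicted by a monocular depth model.

```python
import numpy as np

def recover_metric_translation(t_unit, depths_rel, depths_metric):
    """Scale a unit-norm translation vector to metric units.

    t_unit:        (3,) unit translation from relative pose estimation
    depths_rel:    (N,) up-to-scale depths of matched keypoints
    depths_metric: (N,) metric depths of the same keypoints from a
                   monocular depth model
    """
    # Per-keypoint scale estimates; the median is robust to outlier matches.
    ratios = depths_metric / depths_rel
    scale = np.median(ratios)
    return scale * t_unit, scale

# Synthetic check (hypothetical numbers): the up-to-scale depths are the
# metric depths divided by a ground-truth scale of 2.5.
metric = np.array([1.0, 2.0, 3.0, 4.0])
rel = metric / 2.5
t_unit = np.array([0.0, 0.0, 1.0])
t_metric, s = recover_metric_translation(t_unit, rel, metric)
print(s)         # 2.5
print(t_metric)  # [0.  0.  2.5]
```

In practice the per-keypoint ratios would be noisy, so a robust estimator (median, or RANSAC over the ratios) is preferable to a plain mean.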
Not specified
This project is a collaboration with Qunjie and Laura from NVIDIA.
Daniel Barath (dbarath@ethz.ch)
Qunjie Zhou (qunjiez@nvidia.com)
Laura Leal-Taixe (llealtaixe@nvidia.com)