A Close Look at Domain Shift in Point Cloud Registration
3D sensors, like stereo cameras or laser scanners, capture 3D point clouds. Aligning such point clouds recorded from different viewpoints is the basis of 3D scene reconstruction, and thus an elementary capability in computer vision, computer graphics, robotics and mapping. The core component of the alignment process is feature matching: the ability to find corresponding features in different point clouds by robustly describing and comparing the local geometry. This task is complicated by variations of the scene context and the sensors' sampling patterns. Modern feature descriptors are learned in a data-driven manner with deep neural networks, and fall into two groups. Fully convolutional methods (like FCGF[1], D3Feat[3] and PREDATOR[2]) are computationally more efficient, but do not generalise as well; for instance, when trained on indoor data they do not work as well on outdoor data. Patch-based descriptors (like PerfectMatch[4], DIP[7] and SpinNet[5]) are less efficient, but generalise well across different datasets. This trade-off raises our main research question: what makes fully convolutional features less generalisable than their patch-based counterparts? Our hypothesis is that the problem is related to the very large receptive field of fully convolutional methods.
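To make the matching stage concrete, here is a minimal sketch of descriptor matching and rigid alignment in Python/NumPy. It assumes per-point descriptors have already been computed by some backbone network; all function names are illustrative and not taken from any of the cited methods.

```python
import numpy as np

def mutual_nearest_matches(feats_a, feats_b):
    # Pairwise L2 distances in descriptor space, shape (Na, Nb).
    d = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=-1)
    nn_ab = d.argmin(axis=1)  # best match in B for each point of A
    nn_ba = d.argmin(axis=0)  # best match in A for each point of B
    # Keep only mutual nearest neighbours (a common correspondence filter).
    keep = nn_ba[nn_ab] == np.arange(len(feats_a))
    return np.nonzero(keep)[0], nn_ab[keep]

def kabsch(src, dst):
    # Closed-form least-squares rigid transform (R, t) with R @ src_i + t ~ dst_i.
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ S @ U.T
    return R, c_dst - R @ c_src

# Usage, given clouds pts_a, pts_b and their descriptors feats_a, feats_b:
# idx_a, idx_b = mutual_nearest_matches(feats_a, feats_b)
# R, t = kabsch(pts_a[idx_a], pts_b[idx_b])
```

In practice the putative matches are further filtered, e.g. with RANSAC, before the transform is estimated; the sketch shows only the noise-free core of the pipeline.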
Keywords: domain adaptation, point cloud, feature matching, 3D vision, machine learning
In the thesis, we aim to answer the following questions:
- What is the effective receptive field size of fully convolutional local features (see the probing sketch after this list)? Are these features less general because the large receptive field gives them too much awareness of the scene-specific global context?
- To what extent are fully convolutional features semantics-aware? Are they overly reliant on scene-specific semantic object types, like cars (outdoors) or furniture (indoors)? How does their semantic awareness compare to patch-based features?
- Can more robust features be found by jointly training the feature extractor over large-scale indoor and outdoor datasets? How much semantic information is left after such joint training?
- Can multi-scale features and adaptive feature selection[6] help to balance local shape against global context, and achieve better generalisation?
- Can continuous adaptation of the feature extractor, and test-time learning, help to improve 3D point cloud registration?
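A simple way to probe the first question empirically is sketched below: backpropagate from a single output feature to the input coordinates and measure how far non-negligible gradients reach. The backbone `model`, mapping an (N, 3) cloud to (N, C) per-point features, is a hypothetical placeholder; the sketch further assumes the network is differentiable with respect to point coordinates (as KPConv-style networks are), whereas sparse voxel backbones would need a perturbation-based variant.

```python
import torch

def erf_sensitivity(model, points, center_idx):
    """Gradient-based probe of the effective receptive field: backpropagate
    the norm of the feature at `center_idx` to all input coordinates; the
    per-point gradient magnitude indicates how strongly each input point
    influences that single output feature."""
    pts = points.clone().requires_grad_(True)  # (N, 3) input coordinates
    feats = model(pts)                         # (N, C) dense per-point features
    feats[center_idx].norm().backward()        # scalar objective for one feature
    return pts.grad.norm(dim=1)                # (N,) sensitivity per input point
```

The effective receptive field size can then be summarised, for instance, as the radius around the probed point that captures most of the sensitivity mass, and compared between fully convolutional and patch-based extractors.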
For development and empirical evaluation, an indoor dataset (3DMatch) and an outdoor dataset (KITTI) are available in analysis-ready form. After an initial analysis with those two datasets, we aim to move on to a new, more diverse dataset available at ETH (called Nothing Stands Still), which contains multiple revisits of the same, dynamically changing scenes and additionally features temporal domain shift.

With the thesis, we aim to 1) develop a better understanding of the limitations and the potential of fully convolutional features; 2) understand the role of semantic information in feature matching; and 3) develop novel training recipes or network architectures that enable alignment and 3D scene reconstruction across a range of scene types.
Supervisors:
Shengyu Huang, ETH Zurich, shengyu.huang@geod.baug.ethz.ch
Xuyang Bai, HKUST, xbaiad@connect.ust.hk
Dr. Theodora Kontogianni, ETH Zurich, theodora.kontogianni@inf.ethz.ch
Prof. Dr. Konrad Schindler, ETH Zurich, konrad.schindler@geod.baug.ethz.ch
References:
[1] Choy, C., Park, J., & Koltun, V. (2019). Fully convolutional geometric features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8958-8966).
[2] Huang, S., Gojcic, Z., Usvyatsov, M., Wieser, A., & Schindler, K. (2021). Predator: Registration of 3d point clouds with low overlap. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4267-4276).
[3] Bai, X., Luo, Z., Zhou, L., Fu, H., Quan, L., & Tai, C. L. (2020). D3feat: Joint learning of dense detection and description of 3d local features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6359-6367).
[4] Gojcic, Z., Zhou, C., Wegner, J. D., & Wieser, A. (2019). The perfect match: 3d point cloud matching with smoothed densities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5545-5554).
[5] Ao, S., Hu, Q., Yang, B., Markham, A., & Guo, Y. (2021). Spinnet: Learning a general surface descriptor for 3d point cloud registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11753-11762).
[6] Zhu, L., Guan, H., Lin, C., & Han, R. (2022). Neighborhood-aware Geometric Encoding Network for Point Cloud Registration. arXiv preprint arXiv:2201.12094.
[7] Poiesi, F., & Boscaini, D. (2021). Generalisable and distinctive 3D local deep descriptors for point cloud registration. arXiv preprint arXiv:2105.10382.