Description
The recent advent of latent diffusion models [1, 2, 3] has boosted and facilitated great advances in generative modeling for both images and videos.
In this project, we plan to make use of modern pixel-to-sequence interfaces [4] to create a general interface for multiple vision tasks, combining i) diffusion for discrete data [5] (necessary for generating discrete segmentation masks) and ii) panoptic diffusion [6] to train a universal panoptic video segmentation system.
The goal of the thesis is to identify the most promising network architecture and fine-tune it on our dataset, which was collected in construction-site-like environments. The performance and training behaviour should be analysed and compared to existing work in the domain of panoptic segmentation conducted in our lab in the past year.
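To make the discrete-diffusion idea in [5] concrete, the sketch below shows the analog-bits trick: integer class or instance ids are encoded as bit vectors in {-1, +1} so that a standard continuous diffusion model can generate them, and are decoded back by simple thresholding. The function names and the NumPy formulation are illustrative assumptions, not code from the referenced paper.

```python
import numpy as np

def ids_to_analog_bits(ids, num_bits):
    """Encode integer ids (e.g. panoptic mask labels) as analog bits in {-1, +1}."""
    bits = ((ids[..., None] >> np.arange(num_bits)) & 1).astype(np.float32)
    return bits * 2.0 - 1.0  # map {0, 1} -> {-1, +1} so the data looks continuous

def analog_bits_to_ids(bits):
    """Decode model output back to integer ids by thresholding each bit at 0."""
    hard = (bits > 0).astype(np.int64)
    return (hard << np.arange(hard.shape[-1])).sum(axis=-1)
```

A diffusion model then denoises the resulting (H, W, num_bits) tensor exactly as it would an RGB image; self-conditioning in [5] additionally feeds the previous clean-sample estimate back into the denoiser.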
References
[1] High-Resolution Image Synthesis with Latent Diffusion Models
[2] Imagen Video: High Definition Video Generation with Diffusion Models, Google
[3] Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions, Google
[4] A Unified Sequence Interface for Vision Tasks, Google Brain
[5] Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning, Google Brain
[6] A Generalist Framework for Panoptic Segmentation of Images and Videos, Google Brain
- Identify the most promising architectures
- Generate segmentation masks with a discrete diffusion model
- Explore trade-offs in diffusion models: samplers, diffusion steps, architectures
- Fine-tune it on our dataset
- Evaluation and deployment of the video segmentation system, and comparison to prior work
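To illustrate the sampler and diffusion-step trade-off listed above, here is a minimal deterministic DDIM-style sampling loop with a stand-in denoiser; the linear schedule and the function signature are simplified assumptions for illustration, not a specific library's interface.

```python
import numpy as np

def ddim_sample(denoise_fn, shape, num_steps, seed=0):
    """Deterministic DDIM-style sampler: num_steps trades compute for fidelity.

    denoise_fn(x_t, alpha_bar) must return an estimate of the clean sample x0.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # start from pure noise
    # alpha_bar schedule: ~0 at the noisy end, ~1 at the clean end (simplified, linear)
    alpha_bar = np.linspace(1.0 - 1e-4, 1e-4, num_steps + 1)
    for t in range(num_steps, 0, -1):
        a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
        x0_hat = denoise_fn(x, a_t)                                   # predict clean sample
        eps_hat = (x - np.sqrt(a_t) * x0_hat) / np.sqrt(1.0 - a_t)    # implied noise
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
    return x
```

With a perfect denoiser the loop recovers the target even with very few steps; with a learned network, lowering num_steps speeds up inference at the cost of mask quality, which is exactly the trade-off to quantify in this work package.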
- Experience in training neural networks
- Experience with diffusion/generative models is a plus
- Experience with ROS is a plus