Quantizing Vision and Language Foundation Models for Efficient Inference
Foundation models are a breakthrough in the field of artificial intelligence. These models are characterized by their massive size, reaching billions or even trillions of parameters, and by their ability to be adapted to a wide variety of tasks without being trained from scratch. Their development marks a pivotal shift in AI research and application, pushing the boundaries of what machines can understand and do. However, because of their huge size, foundation models are very demanding in terms of computation, memory footprint, and bandwidth, and therefore face significant computational challenges. They are typically trained on massive clusters equipped with thousands of advanced GPUs, and they often rely on cloud services for inference as well.
Quantization is an effective technique for reducing the stored size of foundation models and accelerating their inference. For example, a 70B Llama model with 16-bit floating-point weights needs approximately 150 GB of GPU memory, which requires two A100 80 GB GPUs for inference. If the model is quantized to 4 bits, the required GPU memory drops to roughly 35 GB, allowing the model to fit on a single GPU with less memory. In this project, we aim to further unleash the power of quantization, shrinking vision and language foundation models and accelerating their inference. We will consider vision and language foundation models such as SAM2 and LLaVA.
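As a back-of-the-envelope illustration of these numbers (weights only; activations, KV cache, and runtime overhead are ignored, which is why the 16-bit figure is closer to 150 GB in practice), the memory footprint can be estimated as follows:

```python
# Rough weight-memory estimate (weights only; ignores activations,
# KV cache, optimizer states, and framework overhead).
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    return num_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

for bits in (16, 8, 4):
    print(f"70B parameters at {bits}-bit: ~{weight_memory_gb(70e9, bits):.0f} GB")
# 70B parameters at 16-bit: ~140 GB (about 150 GB in practice with overhead)
# 70B parameters at 8-bit:  ~70 GB
# 70B parameters at 4-bit:  ~35 GB
```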
Keywords: Foundation model, SAM2, model quantization, VLM, LLM
**Methods**: The project targets mixed-precision quantization methods within a quantization-aware training framework, specifically adapted to the BitNet training and QLoRA finetuning processes. Mixed-precision quantization allows different parts of the model to be quantized to different bit-widths, enabling a tailored approach in which critical components retain higher precision to preserve essential information and model integrity, while less critical components are quantized more aggressively to achieve greater reductions in memory usage and computational demand. The motivation behind mixed-precision quantization lies in its potential to find Pareto-optimal points in the trade-off between model size, computational efficiency, and accuracy. By selectively applying different quantization strategies across the model, it becomes possible to maintain, or even enhance, the model's effectiveness while still benefiting from the efficiency gains of lower-bit quantization.
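As a minimal illustration of the idea, the sketch below wires per-layer bit-width assignment into fake quantization with a straight-through estimator in PyTorch. The layer names, bit-width configuration, and toy training step are hypothetical placeholders; this is not the BitNet or QLoRA recipe itself, only a small example of how mixed-precision quantization-aware training can be set up.

```python
# Minimal mixed-precision quantization-aware training sketch in PyTorch.
import torch
import torch.nn as nn


class FakeQuantize(torch.autograd.Function):
    """Symmetric per-tensor fake quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, w, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass the gradient through the rounding step.
        return grad_output, None


class QuantLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized to a configurable bit-width."""

    def __init__(self, in_features, out_features, bits=8, **kwargs):
        super().__init__(in_features, out_features, **kwargs)
        self.bits = bits

    def forward(self, x):
        w_q = FakeQuantize.apply(self.weight, self.bits)
        return nn.functional.linear(x, w_q, self.bias)


# Hypothetical mixed-precision assignment: keep the first and last layers at
# 8 bits and quantize the middle layer aggressively to 2 bits.
bit_config = {"fc1": 8, "fc2": 2, "fc3": 8}
model = nn.Sequential(
    QuantLinear(128, 256, bits=bit_config["fc1"]),
    nn.ReLU(),
    QuantLinear(256, 256, bits=bit_config["fc2"]),
    nn.ReLU(),
    QuantLinear(256, 10, bits=bit_config["fc3"]),
)

# One quantization-aware training step on random data.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```

In an actual mixed-precision search, the per-layer bit-widths in `bit_config` would be chosen by a sensitivity or Pareto analysis rather than fixed by hand.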
**Materials and Resources**: The candidate will join a research team with extensive experience in machine learning and computer vision. The candidate will have the opportunity to work with active researchers and to be supervised by world-leading professors and senior researchers. Access to high-performance supercomputers equipped with ample GPU resources will be provided.
**Nature of the Thesis**:
- Literature review: 10%;
- Model building: 70%;
- Model validation: 10%;
- Results analysis: 10%
**Requirements**:
- Familiarity with Python and PyTorch;
- Knowledge of machine learning and deep learning;
- Experience with training deep learning models;
- Knowledge of Transformers and Mamba is a bonus;
- Knowledge of PyTorch Lightning is a bonus.
**Supervisors**:
- Dr. Yawei Li (yawli@ethz.ch)
- Dr. Guolei Sun (guolei.sun@vision.ee.ethz.ch)
**Professors**:
- Prof. Luca Benini (lbenini@iis.ee.ethz.ch)
- Prof. Ender Konukoglu (kender@vision.ee.ethz.ch)
**Institutes**:
- Integrated System Lab & Computer Vision Lab, D-ITET, ETH Zurich
**References**:
[1] Pandey, Nilesh Prasad, et al. "A practical mixed precision algorithm for post-training quantization." arXiv preprint arXiv:2302.05397 (2023).
[2] Van Baalen, Mart, et al. "Bayesian bits: Unifying quantization and pruning." Advances in neural information processing systems 33 (2020): 5741-5752.
[3] Hu, Edward J., et al. "Lora: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).
[4] Wang, Hongyu, et al. "Bitnet: Scaling 1-bit transformers for large language models." arXiv preprint arXiv:2310.11453 (2023).
[5] Ma, Shuming, et al. "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits." arXiv preprint arXiv:2402.17764 (2024).