<p align="center">
  <img src="assets/pic/PUMA.png" width="230">
</p>

# PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation

<div align="center">

<a href="https://rongyaofang.github.io/puma/"><img src="https://img.shields.io/badge/Project-Homepage-green" alt="Home"></a>
<a href="https://arxiv.org/abs/2410.13861"><img src="https://img.shields.io/badge/ArXiv-2410.13861-red"></a>
<img src="https://visitor-badge.laobi.icu/badge?page_id=rongyaofang/PUMA" alt="visitors">

[Rongyao Fang](https://scholar.google.com/citations?user=FtH3CW4AAAAJ&hl=en)<sup>1\*</sup>, [Chengqi Duan](https://scholar.google.com/citations?user=r9qb4ZwAAAAJ&hl=zh-CN)<sup>2\*</sup>, [Kun Wang]()<sup>3</sup>, [Hao Li](https://scholar.google.com/citations?user=qHqQsY4AAAAJ&hl=zh-CN)<sup>1,4</sup>, [Hao Tian]()<sup>3</sup>, [Xingyu Zeng]()<sup>3</sup>, [Rui Zhao]()<sup>3</sup>, [Jifeng Dai](https://jifengdai.org/)<sup>4,5</sup>, [Hongsheng Li](https://www.ee.cuhk.edu.hk/~hsli/)<sup>1 :envelope:</sup>, [Xihui Liu](https://xh-liu.github.io/)<sup>2 :envelope:</sup>

<sup>1</sup>CUHK MMLab, <sup>2</sup>HKU MMLab, <sup>3</sup>SenseTime, <sup>4</sup>Shanghai AI Laboratory, <sup>5</sup>Tsinghua University

\*Equal contribution, :envelope:Corresponding authors

</div>

## <a name="env"></a>Environment Setup

```shell
conda create -n puma python=3.8
conda activate puma
pip install -r requirements.txt
```

## <a name="checkpoint"></a>Checkpoint Download

```shell
# First replace <token> with your Hugging Face access token
python download_ckpt.py
```

For manual downloads, please download the checkpoints from [here](https://huggingface.co/LucasFang/PUMA) and place them under **./ckpts**.
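For scripted manual downloads, a small helper can wrap the Hugging Face Hub client. This is a hedged sketch: only the repo id (`LucasFang/PUMA`) and the target directory (`./ckpts`) come from this README; the function name and structure are assumptions, not the contents of `download_ckpt.py`.

```python
# Hypothetical helper; only the repo id (LucasFang/PUMA) and the target
# directory (./ckpts) come from this README -- the rest is an assumption.
def fetch_puma_ckpts(token, repo_id="LucasFang/PUMA", local_dir="./ckpts"):
    """Download every file of the PUMA checkpoint repo into local_dir."""
    from huggingface_hub import snapshot_download  # third-party dependency
    return snapshot_download(repo_id=repo_id, local_dir=local_dir, token=token)
```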

## <a name="multi-granular"></a>Multi-granular Visual Decoding

```shell
python infer_detokenizer.py --num_tokens <chosen number from [1, 4, 16, 64, 256]>
```
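The valid `--num_tokens` values can be enforced with a standard `argparse` choices list. The snippet below is a hypothetical sketch of such validation, not the actual interface of `infer_detokenizer.py`:

```python
import argparse

# Hypothetical sketch of --num_tokens validation; the real
# infer_detokenizer.py may parse its arguments differently.
parser = argparse.ArgumentParser(description="Multi-granular visual decoding")
parser.add_argument("--num_tokens", type=int, required=True,
                    choices=[1, 4, 16, 64, 256],
                    help="visual token budget, coarse (1) to fine (256)")
args = parser.parse_args(["--num_tokens", "64"])
print(args.num_tokens)  # 64
```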

## <a name="abstract"></a>Abstract

> **PUMA** introduces a unified multimodal large language model framework designed to integrate multi-granular visual generation and understanding. The model excels at a variety of visual tasks, including diverse text-to-image generation, precise image editing, conditional image generation, and visual understanding, and strikes a balance between generation diversity and controllability.

Read the full paper [here](https://arxiv.org/abs/2410.13861).

## <a name="framework"></a>Framework

<p align="center">
  <img src="assets/pic/main_figure.jpg" width="920">
</p>

- PUMA leverages multi-granular visual representations as unified inputs and outputs for the MLLM, allowing it to handle a variety of visual tasks, including text-to-image generation, image editing, inpainting, colorization, conditional generation, and image understanding.

## <a name="decoding"></a>Multi-granular Semantic Visual Decoding

<p align="center">
  <img src="assets/pic/rec.jpg" width="920">
</p>

- PUMA's visual decoding process spans five granular image representations (f<sub>0</sub> to f<sub>4</sub>) and corresponding decoders (D<sub>0</sub> to D<sub>4</sub>), which are trained using SDXL. This allows PUMA to achieve precise image reconstruction and semantic-guided generation, supporting both control and diversity in image generation tasks.
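As a rough illustration (an assumption about the layout, not code from the repo), the five token budgets match the `--num_tokens` choices above and, if the granularities f<sub>0</sub> to f<sub>4</sub> are square token grids, the grid side doubles at each scale:

```python
# Illustrative sketch only: the five granularities f0..f4 are assumed to be
# square token grids. The token budgets 1, 4, 16, 64, 256 match the
# --num_tokens choices above, giving grid sides 1, 2, 4, 8, 16.
token_counts = [4 ** level for level in range(5)]
grid_sides = [int(n ** 0.5) for n in token_counts]
print(token_counts)  # [1, 4, 16, 64, 256]
print(grid_sides)    # [1, 2, 4, 8, 16]
```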

## <a name="t2i"></a>Diverse Text-to-image Generation

<p align="center">
  <img src="assets/pic/gen.jpg" width="920">
</p>

## <a name="image_editing"></a>Image Editing

<p align="center">
  <img src="assets/pic/edit.jpg" width="920">
</p>

## <a name="cond_gen"></a>Image Conditional Generation

<p align="center">
  <img src="assets/pic/cond_gen.jpg" width="920">
</p>
## <a name="citation"></a>Citation

If you find PUMA useful in your research, please consider citing us:

```bibtex
@article{fang2024puma,
  title   = {PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation},
  author  = {Fang, Rongyao and Duan, Chengqi and Wang, Kun and Li, Hao and Tian, Hao and Zeng, Xingyu and Zhao, Rui and Dai, Jifeng and Li, Hongsheng and Liu, Xihui},
  journal = {arXiv preprint arXiv:2410.13861},
  year    = {2024}
}
```

## <a name="license"></a>License

This project is released under the [Apache 2.0 license](LICENSE).