---
license: apache-2.0
---

<p align="center">
  <img src="assets/star_logo.png" alt="STAR" width="560"/>
</p>

<p align="center">
  <a href="https://arxiv.org/abs/xxxx.xxxxx">
    <img
      src="https://img.shields.io/badge/STAR-Paper-red?logo=arxiv&logoColor=red"
      alt="STAR Paper on arXiv"
    />
  </a>
  <a href="#">
    <img
      src="https://img.shields.io/badge/STAR-Project-0A66C2?logo=safari&logoColor=white"
      alt="STAR Project"
    />
  </a>
  <a href="#">
    <img
      src="https://img.shields.io/badge/STAR-Models-yellow?logo=huggingface&logoColor=yellow"
      alt="STAR Models"
    />
  </a>
  <a href="#">
    <img
      src="https://img.shields.io/badge/STAR-Demo-blue?logo=googleplay&logoColor=blue"
      alt="STAR Demo"
    />
  </a>
  <a href="#">
    <img
      src="https://img.shields.io/badge/STAR-Space-orange?logo=huggingface&logoColor=yellow"
      alt="STAR HuggingFace Space"
    />
  </a>
</p>

# **STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning**

Welcome to the official repository for our paper, "STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning".

## **Abstract**
Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, unifying multimodal understanding and generation in a single model remains challenging due to optimization conflicts and performance trade-offs. To enhance generative performance while preserving existing comprehension capabilities, we introduce ***STAR***: *a **ST**acked **A**uto**R**egressive scheme for task-progressive unified multimodal learning*. This approach decomposes multimodal learning into successive stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, STAR avoids cross-task interference while expanding the model's capabilities. In addition, we introduce a high-capacity VQ to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (**0.91**), DPG-Bench (**87.44**), and ImgEdit (**4.34**), validating its efficacy for unified multimodal learning.

<div align="center">
  <img src="assets/teaser.png" width="100%"/>
</div>
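
The task-progressive stacking schedule described in the abstract (freeze what exists, then stack a new isomorphic AR stage per task) can be sketched in a few lines of Python. This is only an illustration of the training schedule; the class and stage names below are hypothetical and do not come from the STAR codebase.

```python
# Illustrative sketch of task-progressive stacking; names are hypothetical.

class ARModule:
    """Stand-in for one isomorphic autoregressive block."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

    def freeze(self):
        self.trainable = False

class StackedAR:
    """Freeze all existing stages, then stack a fresh trainable one."""
    def __init__(self):
        # stage 1: the fundamental AR model (understanding)
        self.stages = [ARModule("understanding")]

    def add_stage(self, task):
        for stage in self.stages:  # frozen stages cannot be disturbed
            stage.freeze()         # by the new task's gradients
        self.stages.append(ARModule(task))

model = StackedAR()
model.add_stage("generation")  # stage 2: text-to-image
model.add_stage("editing")     # stage 3: instruction-based editing
print([(s.name, s.trainable) for s in model.stages])
```

Only the newest stage is trainable at any point, which is how cross-task interference is avoided while capabilities grow.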

## 🌟 Model Checkpoints

| Model Name | Checkpoint | Config |
| :--------: | :--------: | :----: |
| STAR-3B | [Link](#) | [Config](star/configs/STAR_Qwen2.5-VL-3B.json) |
| STAR-7B | [Link](#) | [Config](star/configs/STAR_Qwen2.5-VL-7B.json) |
| VQ Model | [Link](#) | - |


## 📚 Preparation

### Prepare the environment

1. Set up the environment:

   ```shell
   git clone <repository-url>
   cd STAR
   conda create -n star python=3.11 -y
   conda activate star
   ```

2. Install the required packages:

   ```shell
   # upgrade pip and setuptools if necessary
   pip install -U pip setuptools
   # install the required packages
   pip install -r requirements.txt
   ```

### Download Pre-trained Models
Download the necessary pre-trained models before running inference, and place them at the following paths:

```
STAR/checkpoints/STAR-7B.pt
STAR/checkpoints/VQ-Model.pt
```

### Configuration

The model configuration file `star/configs/STAR_Qwen2.5-VL-7B.json` contains all necessary parameters for model initialization. Make sure to update the paths in the configuration file to match your local setup.

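Updating the paths can also be done with a short script. This is a generic sketch using only the standard library; the keys `model_name` and `vq_model_path` are placeholders, since the real key names in `STAR_Qwen2.5-VL-7B.json` may differ.

```python
# Sketch: patch path entries in a JSON config file.
# Key names here are hypothetical placeholders.
import json
import os
import tempfile

def patch_config(path, overrides):
    """Load a JSON config, apply overrides, and write it back."""
    with open(path) as f:
        cfg = json.load(f)
    cfg.update(overrides)
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg

# demo on a throwaway file standing in for the real config
demo = os.path.join(tempfile.mkdtemp(), "config.json")
with open(demo, "w") as f:
    json.dump({"model_name": "STAR-7B", "vq_model_path": "/old/VQ-Model.pt"}, f)

cfg = patch_config(demo, {"vq_model_path": "checkpoints/VQ-Model.pt"})
print(cfg["vq_model_path"])  # checkpoints/VQ-Model.pt
```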
## 🔥 Quick Start

### Demo

Run the interactive demo interface with Gradio:

```shell
python3 gradio_app.py
```

### Inference

### 1. Image Understanding

For visual question answering and image understanding tasks:

```shell
python3 inference_understand.py \
    --image-path "path/to/your/image.jpg" \
    --question "What is in this image? Describe it in detail." \
    --max-new-tokens 256 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --device "cuda:0"
```

**Parameters:**
- `--image-path`: Path to the input image
- `--question`: Question or instruction for the model
- `--max-new-tokens`: Maximum number of tokens to generate (default: 256)
- `--model-config`: Path to the model configuration file
- `--checkpoint`: Path to the model checkpoint
- `--device`: Device to run inference on

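To drive this CLI from Python (e.g. for batch evaluation), the command line can be assembled programmatically. The sketch below uses only the flags documented above, with the example values from this README as defaults; it assumes you run it from the repository root.

```python
# Sketch: build the documented inference_understand.py command line.
# Default paths are the example values from this README.
import subprocess

def understand_cmd(image_path, question, max_new_tokens=256,
                   model_config="star/configs/STAR_Qwen2.5-VL-7B.json",
                   checkpoint="checkpoints/STAR-7B.pt", device="cuda:0"):
    return [
        "python3", "inference_understand.py",
        "--image-path", image_path,
        "--question", question,
        "--max-new-tokens", str(max_new_tokens),
        "--model-config", model_config,
        "--checkpoint", checkpoint,
        "--device", device,
    ]

cmd = understand_cmd("path/to/your/image.jpg", "What is in this image?")
# subprocess.run(cmd, check=True)  # uncomment to actually run inference
print(" ".join(cmd))
```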
### 2. Text-to-Image Generation

For generating images from text prompts:

```shell
python3 inference_generation.py \
    --prompt "a photo of a cute cat" \
    --save-path "./outputs/a photo of a cute cat.jpg" \
    --num-images 1 \
    --cfg 1.1 \
    --topk 1000 \
    --topp 0.8 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --diffusion-as-decoder \
    --device "cuda:0"
```

**Parameters:**
- `--prompt`: Text prompt for image generation
- `--save-path`: Path to save the generated image
- `--num-images`: Number of images to generate (default: 1)
- `--cfg`: Classifier-free guidance scale (default: 1.0)
- `--topk`: Top-k sampling parameter (default: 1000)
- `--topp`: Top-p sampling parameter (default: 0.8)
- `--diffusion-as-decoder`: Use a diffusion model as the decoder for high-quality generation

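For intuition about what `--cfg`, `--topk`, and `--topp` control, here is the textbook logic behind those knobs in plain Python. This is a generic illustration, not STAR's actual decoding code: CFG pushes logits toward the text-conditioned branch, top-k keeps the k highest-scoring tokens, and top-p then keeps the smallest prefix of those whose probability mass reaches p.

```python
# Generic illustration of CFG + top-k/top-p filtering, not STAR's code.
import math

def cfg_logits(cond, uncond, scale):
    """Classifier-free guidance: interpolate past the unconditional logits."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def top_k_top_p(logits, k, p):
    """Keep the top-k tokens, then the smallest prefix with prob mass >= p."""
    order = sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    exps = [math.exp(logits[i]) for i in order]
    total = sum(exps)
    kept, mass = [], 0.0
    for idx, e in zip(order, exps):
        kept.append(idx)
        mass += e / total
        if mass >= p:
            break
    return kept  # candidate token ids to sample from

logits = cfg_logits(cond=[2.0, 1.0, 0.1], uncond=[1.0, 1.0, 1.0], scale=1.1)
print(top_k_top_p(logits, k=2, p=0.8))  # → [0, 1]
```

A larger `--cfg` sharpens adherence to the prompt, while smaller `--topk`/`--topp` values restrict sampling to fewer, higher-probability tokens.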
### 3. Image Editing

For editing images based on text instructions:

```shell
python3 inference_edit.py \
    --image-path "./outputs/a photo of a cute cat.jpg" \
    --instruction "change the color of the cat to blue" \
    --save-path "./outputs/edited_image.jpg" \
    --cfg 1.1 \
    --topk 1000 \
    --topp 0.8 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --diffusion-as-decoder \
    --device "cuda:0"
```

**Parameters:**
- `--image-path`: Path to the input image to be edited
- `--instruction`: Text instruction describing the desired edit
- `--save-path`: Path to save the edited image
- `--cfg`: Classifier-free guidance scale for editing
- `--topk`: Top-k sampling parameter
- `--topp`: Top-p sampling parameter
- `--diffusion-as-decoder`: Use a diffusion model for high-quality image decoding


## ✍️ Citation

```bibtex
@article{2025star,
  title   = {STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning},
  author  = {Qin, Jie and Huang, Jiancheng and Qiao, Limeng and Ma, Lin},
  journal = {arXiv preprint arXiv:},
  year    = {2025}
}
```

## 📜 License
STAR is licensed under the Apache License 2.0.