---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

<p align="center">
  <img src="assets/star_logo.png" alt="STAR" width="560"/>
</p>

<p align="center">
  <a href="https://arxiv.org/abs/xxxx.xxxxx">
    <img
      src="https://img.shields.io/badge/STAR-Paper-red?logo=arxiv&logoColor=red"
      alt="STAR Paper on arXiv"
    />
  </a>
  <a href="#">
    <img
      src="https://img.shields.io/badge/STAR-Project-0A66C2?logo=safari&logoColor=white"
      alt="STAR Project"
    />
  </a>
  <a href="#">
    <img
      src="https://img.shields.io/badge/STAR-Models-yellow?logo=huggingface&logoColor=yellow"
      alt="STAR Models"
    />
  </a>
  <a href="#">
    <img
      src="https://img.shields.io/badge/STAR-Demo-blue?logo=googleplay&logoColor=blue"
      alt="STAR Demo"
    />
  </a>
  <a href="#">
    <img
      src="https://img.shields.io/badge/STAR-Space-orange?logo=huggingface&logoColor=yellow"
      alt="STAR HuggingFace Space"
    />
  </a>
</p>

# **STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning**

Welcome to the official repository for our paper, "STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning".

## **Abstract**

Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, unifying multimodal understanding and generation in a single model remains challenging due to optimization conflicts and performance trade-offs. To enhance generative performance while preserving existing comprehension capabilities, we introduce ***STAR***: *a **ST**acked **A**uto**R**egressive scheme for task-progressive unified multimodal learning*. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the base autoregressive (AR) model and progressively stacking isomorphic AR modules, STAR avoids cross-task interference while expanding the model's capabilities. In addition, we introduce a high-capacity VQ model to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (**0.91**), DPG-Bench (**87.44**), and ImgEdit (**4.34**), validating its efficacy for unified multimodal learning.

<div align="center">
  <img src="assets/teaser.png" width="100%"/>
</div>


## 🌟 Model Checkpoints

| Model Name | Checkpoint | Config |
| :--------: | :--------: | :----: |
| STAR-3B | [Link](#) | [Config](star/configs/STAR_Qwen2.5-VL-3B.json) |
| STAR-7B | [Link](#) | [Config](star/configs/STAR_Qwen2.5-VL-7B.json) |
| VQ Model | [Link](#) | - |


## 📚 Preparation

### Prepare the environment

1. Set up the environment:
   ```shell
   git clone <repository-url>
   cd STAR
   conda create -n star python=3.11 -y
   conda activate star
   ```

2. Install the required packages:
   ```shell
   # upgrade pip and setuptools if necessary
   pip install -U pip setuptools
   # install required packages
   pip install -r requirements.txt
   ```

### Download Pre-trained Models

Download the necessary pre-trained models before proceeding to inference, and place them at the following paths:

```
STAR/checkpoints/STAR-7B.pt
STAR/checkpoints/VQ-Model.pt
```
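
A quick pre-flight check that the checkpoints are where the inference scripts expect them can save a failed run; a minimal sketch based on the layout above:

```python
from pathlib import Path

# Checkpoint files the inference scripts expect, relative to the repo root.
REQUIRED = [
    Path("checkpoints/STAR-7B.pt"),
    Path("checkpoints/VQ-Model.pt"),
]

def missing_checkpoints(root: Path = Path(".")) -> list:
    """Return the required checkpoint files that are not present under `root`."""
    return [p for p in REQUIRED if not (root / p).is_file()]
```

Run `missing_checkpoints()` from the repository root before launching inference; an empty list means both files are in place.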

### Configuration

The model configuration file `star/configs/STAR_Qwen2.5-VL-7B.json` contains all necessary parameters for model initialization. Make sure to update the paths in the configuration file to match your local setup.
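
If you prefer to patch the config programmatically rather than by hand, a minimal sketch (the key names in the example call are hypothetical; use the keys that actually appear in your config file):

```python
import json
from pathlib import Path

def update_config_paths(config_path: str, overrides: dict) -> dict:
    """Load a JSON config, apply key overrides, and write it back."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config.update(overrides)  # replace path entries with local values
    path.write_text(json.dumps(config, indent=2))
    return config

# Example (hypothetical key name -- check your config file):
# update_config_paths(
#     "star/configs/STAR_Qwen2.5-VL-7B.json",
#     {"checkpoint_path": "checkpoints/STAR-7B.pt"},
# )
```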

## 🔥 Quick Start

### Demo

Run the interactive demo interface using Gradio:

```shell
python3 gradio_app.py
```

### Inference

### 1. Image Understanding

For visual question answering and image understanding tasks:

```shell
python3 inference_understand.py \
    --image-path "path/to/your/image.jpg" \
    --question "What is in this image? Describe it in detail." \
    --max-new-tokens 256 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --device "cuda:0"
```

**Parameters:**
- `--image-path`: Path to the input image
- `--question`: Question or instruction for the model
- `--max-new-tokens`: Maximum number of tokens to generate (default: 256)
- `--model-config`: Path to the model configuration file
- `--checkpoint`: Path to the model checkpoint
- `--device`: Device to run inference on
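
To run understanding over a whole folder of images, the CLI can be wrapped in a small driver script; a minimal sketch, assuming the flags listed above:

```python
import subprocess
from pathlib import Path

def build_understand_cmd(image_path, question,
                         config="star/configs/STAR_Qwen2.5-VL-7B.json",
                         checkpoint="checkpoints/STAR-7B.pt",
                         device="cuda:0"):
    """Assemble the inference_understand.py command line for one image."""
    return [
        "python3", "inference_understand.py",
        "--image-path", str(image_path),
        "--question", question,
        "--model-config", config,
        "--checkpoint", checkpoint,
        "--device", device,
    ]

def run_folder(folder: str, question: str) -> None:
    # Invoke the understanding CLI once per .jpg image in the folder.
    for image in sorted(Path(folder).glob("*.jpg")):
        subprocess.run(build_understand_cmd(image, question), check=True)
```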

### 2. Text-to-Image Generation

For generating images from text prompts:

```shell
python3 inference_generation.py \
    --prompt "a photo of a cute cat" \
    --save-path "./outputs/a photo of a cute cat.jpg" \
    --num-images 1 \
    --cfg 1.1 \
    --topk 1000 \
    --topp 0.8 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --diffusion-as-decoder \
    --device "cuda:0"
```

**Parameters:**
- `--prompt`: Text prompt for image generation
- `--save-path`: Path to save the generated image
- `--num-images`: Number of images to generate (default: 1)
- `--cfg`: Classifier-free guidance scale (default: 1.0)
- `--topk`: Top-k sampling parameter (default: 1000)
- `--topp`: Top-p sampling parameter (default: 0.8)
- `--diffusion-as-decoder`: Use a diffusion model as the decoder for high-quality generation
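
The `--topk` and `--topp` flags correspond to standard top-k and nucleus (top-p) filtering of the next-token distribution. As an illustration of what these knobs control (a simplified sketch, not the repository's actual sampling code):

```python
import random

def filter_top_k_top_p(probs, k=1000, p=0.8):
    """Keep the top-k tokens, then the smallest prefix of them whose
    cumulative mass reaches p; renormalize the survivors."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)[:k]
    kept, total = [], 0.0
    for idx, prob in ranked:
        kept.append((idx, prob))
        total += prob
        if total >= p:
            break
    norm = sum(prob for _, prob in kept)
    return {idx: prob / norm for idx, prob in kept}

def sample_token(probs, k=1000, p=0.8, rng=random):
    """Draw one token id from the filtered, renormalized distribution."""
    filtered = filter_top_k_top_p(probs, k, p)
    ids, weights = zip(*filtered.items())
    return rng.choices(ids, weights=weights, k=1)[0]
```

Lower `--topp` or `--topk` restricts sampling to fewer, higher-probability tokens, trading diversity for stability.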

### 3. Image Editing

For editing images based on text instructions:

```shell
python3 inference_edit.py \
    --image-path "./outputs/a photo of a cute cat.jpg" \
    --instruction "change the color of the cat to blue" \
    --save-path "./outputs/edited_image.jpg" \
    --cfg 1.1 \
    --topk 1000 \
    --topp 0.8 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --diffusion-as-decoder \
    --device "cuda:0"
```

**Parameters:**
- `--image-path`: Path to the input image to be edited
- `--instruction`: Text instruction describing the desired edit
- `--save-path`: Path to save the edited image
- `--cfg`: Classifier-free guidance scale for editing
- `--topk`: Top-k sampling parameter
- `--topp`: Top-p sampling parameter
- `--diffusion-as-decoder`: Use a diffusion model for high-quality image decoding
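
Both the generation and editing commands take `--cfg`, the classifier-free guidance scale. A common formulation, shown here only as an illustration (the repository's exact implementation may differ), interpolates between unconditional and conditional logits:

```python
def apply_cfg(cond_logits, uncond_logits, scale):
    """Classifier-free guidance: push logits toward the conditional
    prediction. scale = 1.0 returns the conditional logits unchanged;
    scale > 1.0 amplifies the effect of the condition."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]
```

Under this formulation, `--cfg 1.1` slightly amplifies the prompt's influence relative to plain conditional decoding.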

## ✍️ Citation

```bibtex
@article{2025star,
  title   = {STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning},
  author  = {Qin, Jie and Huang, Jiancheng and Qiao, Limeng and Ma, Lin},
  journal = {arXiv preprint arXiv:},
  year    = {2025}
}
```

## 📜 License

STAR is licensed under the Apache License 2.0.