---
license: apache-2.0
---


<p align="center">
  <img src="assets/star_logo.png" alt="STAR" width="560"/>
</p>

<p align="center">
  <a href="https://arxiv.org/abs/2512.13752">
    <img
      src="https://img.shields.io/badge/STAR-Paper-red?logo=arxiv&logoColor=red"
      alt="STAR Paper on arXiv"
    />
  </a>
  <a href="https://star-mm-ai.github.io/">
    <img
      src="https://img.shields.io/badge/STAR-Project-0A66C2?logo=safari&logoColor=white"
      alt="STAR Project"
    />
  </a>
  <a href="https://huggingface.co/spaces/MM-MVR/STAR">
    <img 
        src="https://img.shields.io/badge/STAR-Space-orange?logo=huggingface&logoColor=yellow" 
        alt="STAR Demo"
    />
  </a>
  <a href="https://huggingface.co/MM-MVR/STAR-7B">
    <img 
        src="https://img.shields.io/badge/STAR-Models-yellow?logo=huggingface&logoColor=yellow" 
        alt="STAR Models"
    />
  </a>
</p>

# **STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning**


Welcome to the official repository for our paper, "STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning".


## **Abstract**
Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving a unified objective for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce ***STAR***: a **ST**acked **A**uto**R**egressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into three stages: understanding, generation, and editing. By freezing the parameters of the base autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity VQ model to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (**0.91**), DPG-Bench (**87.44**), and ImgEdit (**4.34**), validating its efficacy for unified multimodal learning.

<div align="center">
  <img src="assets/teaser.png" width="100%"/>
</div>


## 🌟 Model Checkpoints


| Model Name | Checkpoint | 
| :--------: | :--------: | 
| STAR-3B | [Link](https://huggingface.co/MM-MVR/STAR-3B)  |
| STAR-7B | [Link](https://huggingface.co/MM-MVR/STAR-7B) | 
| VQ Model | [Link](https://huggingface.co/MM-MVR/STAR-VQ) | 


## πŸ“š Preparation

### Prepare the environment

1. Set up the environment:
```shell
git clone <repository-url>
cd STAR
conda create -n star python=3.11 -y
conda activate star
```

2. Install the required packages:
```shell
# upgrade pip and setuptools if necessary
pip install -U pip setuptools
# install required packages
pip install -r requirements.txt
```

### Download Pre-trained Models
Download the pre-trained models (see the checkpoint table above) and place them in the following layout before running inference:

```
STAR/checkpoints/STAR-7B.pt
STAR/checkpoints/VQ-Model.pt
```
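As a quick sanity check before launching inference, you can verify that the expected files are in place. This is a minimal sketch; the paths assume the layout above and the helper name is ours:

```shell
# Sanity-check sketch: report whether each expected checkpoint is present.
check_file() {
    if [ -f "$1" ]; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

check_file "checkpoints/STAR-7B.pt"
check_file "checkpoints/VQ-Model.pt"
```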

### Configuration

The model configuration file `star/configs/STAR_Qwen2.5-VL-7B.json` contains all necessary parameters for model initialization. Make sure to update the paths in the configuration file to match your local setup.

## πŸ”₯ Quick Start

### Demo

Launch the interactive Gradio demo:

```shell
python3 gradio_app.py 
```

### Inference

#### 1. Image Understanding

For visual question answering and image understanding tasks:

```shell
python3 inference_understand.py \
    --image-path "path/to/your/image.jpg" \
    --question "What is in this image? Describe it in detail." \
    --max-new-tokens 256 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --device "cuda:0"
```

**Parameters:**
- `--image-path`: Path to the input image
- `--question`: Question or instruction for the model
- `--max-new-tokens`: Maximum number of tokens to generate (default: 256)
- `--model-config`: Path to model configuration file
- `--checkpoint`: Path to model checkpoint
- `--device`: Device to run inference on
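To ask the same question over a folder of images, a simple loop over the CLI works. The sketch below only prints one command per image (drop the `echo` to execute them); the folder name and helper are our assumptions:

```shell
# Batch VQA sketch: print one inference command per .jpg in a folder
# (folder name is hypothetical). Remove `echo` to actually run them.
batch_understand() {
    dir="$1"
    question="$2"
    for img in "$dir"/*.jpg; do
        [ -e "$img" ] || continue  # skip if the glob matched nothing
        echo python3 inference_understand.py \
            --image-path "$img" \
            --question "$question" \
            --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
            --checkpoint "checkpoints/STAR-7B.pt" \
            --device "cuda:0"
    done
}

batch_understand "./images" "What is in this image? Describe it in detail."
```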

#### 2. Text-to-Image Generation

For generating images from text prompts:

```shell
python3 inference_generation.py \
    --prompt "a photo of a cute cat" \
    --save-path "./outputs/a photo of a cute cat.jpg" \
    --num-images 1 \
    --cfg 1.1 \
    --topk 1000 \
    --topp 0.8 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --diffusion-as-decoder \
    --device "cuda:0"
```

**Parameters:**
- `--prompt`: Text prompt for image generation
- `--save-path`: Path to save the generated image
- `--num-images`: Number of images to generate (default: 1)
- `--cfg`: Classifier-free guidance scale (default: 1.0)
- `--topk`: Top-k sampling parameter (default: 1000)
- `--topp`: Top-p sampling parameter (default: 0.8)
- `--diffusion-as-decoder`: Use diffusion model as decoder for high-quality generation
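For batch generation over several prompts, deriving the output filename from the prompt avoids spaces in paths (the example above embeds the prompt verbatim). A minimal sketch with helper names of our own; the loop only prints the commands, so remove the `echo` to run them:

```shell
# Batch generation sketch: one command per prompt, with a
# filesystem-safe output name (spaces and slashes -> underscores).
slugify() { printf '%s' "$1" | tr ' /' '__'; }

generate_all() {
    for prompt in "$@"; do
        echo python3 inference_generation.py \
            --prompt "$prompt" \
            --save-path "./outputs/$(slugify "$prompt").jpg" \
            --cfg 1.1 --topk 1000 --topp 0.8 \
            --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
            --checkpoint "checkpoints/STAR-7B.pt" \
            --diffusion-as-decoder --device "cuda:0"
    done
}

generate_all "a photo of a cute cat" "a red bicycle at sunset"
```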

#### 3. Image Editing

For editing images based on text instructions:

```shell
python3 inference_edit.py \
    --image-path "./outputs/a photo of a cute cat.jpg" \
    --instruction "change the color of cat to blue" \
    --save-path "./outputs/edited_image.jpg" \
    --cfg 1.1 \
    --topk 1000 \
    --topp 0.8 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --diffusion-as-decoder \
    --device "cuda:0"
```

**Parameters:**
- `--image-path`: Path to the input image to be edited
- `--instruction`: Text instruction describing the desired edit
- `--save-path`: Path to save the edited image
- `--cfg`: Classifier-free guidance scale for editing
- `--topk`: Top-k sampling parameter
- `--topp`: Top-p sampling parameter
- `--diffusion-as-decoder`: Use diffusion model for high-quality image decoding




## ✍️ Citation

```bibtex
@article{2025star,
  title   = {STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning},
  author  = {Qin, Jie and Huang, Jiancheng and Qiao, Limeng and Ma, Lin},
  journal = {arXiv preprint arXiv:2512.13752},
  year    = {2025}
}
```


## πŸ“œ License
STAR is licensed under the Apache License 2.0.