---
license: apache-2.0
---
<p align="center">
<img src="assets/star_logo.png" alt="STAR" width="560"/>
</p>
<p align="center">
<a href="https://arxiv.org/abs/2512.13752">
<img
src="https://img.shields.io/badge/STAR-Paper-red?logo=arxiv&logoColor=red"
alt="STAR Paper on arXiv"
/>
</a>
<a href="https://star-mm-ai.github.io/">
<img
src="https://img.shields.io/badge/STAR-Project-0A66C2?logo=safari&logoColor=white"
alt="STAR Project"
/>
</a>
<a href="https://huggingface.co/spaces/MM-MVR/STAR">
<img
src="https://img.shields.io/badge/STAR-Space-orange?logo=huggingface&logoColor=yellow"
alt="STAR Demo"
/>
</a>
<a href="https://huggingface.co/MM-MVR/STAR-7B">
<img
src="https://img.shields.io/badge/STAR-Models-yellow?logo=huggingface&logoColor=yellow"
alt="STAR Models"
/>
</a>
</p>
# **STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning**
Welcome to the official repository for our paper: "STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning"
## **Abstract**
Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving a unified objective for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce ***STAR***: a **ST**acked **A**uto**R**egressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity VQ model to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (**0.91**), DPG-Bench (**87.44**), and ImgEdit (**4.34**), validating its efficacy for unified multimodal learning.
<div align="center">
<img src="assets/teaser.png" width="100%" />
</div>
## Model Checkpoints
| Model Name | Checkpoint |
| :--------: | :--------: |
| STAR-3B | [Link](https://huggingface.co/MM-MVR/STAR-3B) |
| STAR-7B | [Link](https://huggingface.co/MM-MVR/STAR-7B) |
| VQ Model | [Link](https://huggingface.co/MM-MVR/STAR-VQ) |
## Preparation
### Prepare the environment
1. Set up the environment:
```shell
git clone <repository-url>
cd STAR
conda create -n star python==3.11 -y
conda activate star
```
2. Install the required packages:
```shell
# upgrade pip and setuptools if necessary
pip install -U pip setuptools
# install required packages
pip install -r requirements.txt
```
### Download Pre-trained Models
Download the necessary pre-trained models before running inference and place them at the following paths:
```shell
STAR/checkpoints/STAR-7B.pt
STAR/checkpoints/VQ-Model.pt
```
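If you have the `huggingface_hub` package installed, the weights can be fetched with `huggingface-cli`. The sketch below (download commands left commented out) assumes the file names inside each Hub repo match the local paths above; check the repo pages for the actual names.

```shell
# Create the directory layout the inference scripts expect.
mkdir -p checkpoints
# Fetch the released weights from the Hugging Face Hub (uncomment to run;
# file names inside each repo are assumptions based on the paths above):
# huggingface-cli download MM-MVR/STAR-7B STAR-7B.pt --local-dir checkpoints
# huggingface-cli download MM-MVR/STAR-VQ VQ-Model.pt --local-dir checkpoints
```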
### Configuration
The model configuration file `star/configs/STAR_Qwen2.5-VL-7B.json` contains all necessary parameters for model initialization. Make sure to update the paths in the configuration file to match your local setup.
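Since the exact schema of the config file is release-specific, a generic check can help when updating paths: the sketch below walks the JSON and prints every string value that looks like a filesystem path, flagging any that do not exist locally. The path heuristic (a `/` or a `.pt` suffix) is an assumption, not part of STAR.

```shell
# List every string value in the model config that looks like a filesystem
# path, so each one can be checked against your local setup.
python3 - <<'EOF'
import json, os

cfg_file = "star/configs/STAR_Qwen2.5-VL-7B.json"
if not os.path.exists(cfg_file):
    print(f"config not found: {cfg_file}")
else:
    def walk(obj, key=""):
        if isinstance(obj, dict):
            for k, v in obj.items():
                walk(v, k)
        elif isinstance(obj, list):
            for v in obj:
                walk(v, key)
        elif isinstance(obj, str) and ("/" in obj or obj.endswith(".pt")):
            status = "ok" if os.path.exists(obj) else "MISSING"
            print(f"{key}: {obj} [{status}]")
    walk(json.load(open(cfg_file)))
EOF
```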
## Quick Start
### Demo
Run the interactive demo interface using Gradio.
```shell
python3 gradio_app.py
```
### Inference
#### 1. Image Understanding
For visual question answering and image understanding tasks:
```shell
python3 inference_understand.py \
--image-path "path/to/your/image.jpg" \
--question "What is in this image? Describe it in detail." \
--max-new-tokens 256 \
--model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
--checkpoint "checkpoints/STAR-7B.pt" \
--device "cuda:0"
```
**Parameters:**
- `--image-path`: Path to the input image
- `--question`: Question or instruction for the model
- `--max-new-tokens`: Maximum number of tokens to generate (default: 256)
- `--model-config`: Path to model configuration file
- `--checkpoint`: Path to model checkpoint
- `--device`: Device to run inference on
#### 2. Text-to-Image Generation
For generating images from text prompts:
```shell
python3 inference_generation.py \
--prompt "a photo of a cute cat" \
--save-path "./outputs/a photo of a cute cat.jpg" \
--num-images 1 \
--cfg 1.1 \
--topk 1000 \
--topp 0.8 \
--model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
--checkpoint "checkpoints/STAR-7B.pt" \
--diffusion-as-decoder \
--device "cuda:0"
```
**Parameters:**
- `--prompt`: Text prompt for image generation
- `--save-path`: Path to save the generated image
- `--num-images`: Number of images to generate (default: 1)
- `--cfg`: Classifier-free guidance scale (default: 1.0)
- `--topk`: Top-k sampling parameter (default: 1000)
- `--topp`: Top-p sampling parameter (default: 0.8)
- `--diffusion-as-decoder`: Use diffusion model as decoder for high-quality generation
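The flags above compose naturally for quick experiments. As a sketch, the loop below sweeps the classifier-free guidance scale for one prompt; the leading `echo` prints each command instead of running it, so drop it to execute.

```shell
# Sweep the classifier-free guidance scale for a single prompt.
# Remove the leading `echo` to actually run each command.
for cfg in 1.0 1.1 1.3; do
  echo python3 inference_generation.py \
    --prompt "a photo of a cute cat" \
    --save-path "./outputs/cat_cfg_${cfg}.jpg" \
    --cfg "${cfg}" \
    --topk 1000 \
    --topp 0.8 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --diffusion-as-decoder \
    --device "cuda:0"
done
```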
#### 3. Image Editing
For editing images based on text instructions:
```shell
python3 inference_edit.py \
--image-path "./outputs/a photo of a cute cat.jpg" \
--instruction "change the color of the cat to blue" \
--save-path "./outputs/edited_image.jpg" \
--cfg 1.1 \
--topk 1000 \
--topp 0.8 \
--model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
--checkpoint "checkpoints/STAR-7B.pt" \
--diffusion-as-decoder \
--device "cuda:0"
```
**Parameters:**
- `--image-path`: Path to the input image to be edited
- `--instruction`: Text instruction describing the desired edit
- `--save-path`: Path to save the edited image
- `--cfg`: Classifier-free guidance scale for editing
- `--topk`: Top-k sampling parameter
- `--topp`: Top-p sampling parameter
- `--diffusion-as-decoder`: Use diffusion model for high-quality image decoding
## Citation
```bibtex
@article{2025star,
title = {STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning},
author = {Qin, Jie and Huang, Jiancheng and Qiao, Limeng and Ma, Lin},
journal = {arXiv preprint arXiv:2512.13752},
year = {2025}
}
```
## License
STAR is licensed under the Apache License 2.0.