---
license: apache-2.0
---
<p align="center">
<img src="assets/star_logo.png" alt="STAR" width="560"/>
</p>
<p align="center">
<a href="https://arxiv.org/abs/2512.13752">
<img
src="https://img.shields.io/badge/STAR-Paper-red?logo=arxiv&logoColor=red"
alt="STAR Paper on arXiv"
/>
</a>
<a href="https://star-mm-ai.github.io/">
<img
src="https://img.shields.io/badge/STAR-Project-0A66C2?logo=safari&logoColor=white"
alt="STAR Project"
/>
</a>
<a href="https://huggingface.co/spaces/MM-MVR/STAR">
<img
src="https://img.shields.io/badge/STAR-Space-orange?logo=huggingface&logoColor=yellow"
alt="STAR Demo"
/>
</a>
<a href="https://huggingface.co/MM-MVR/STAR-7B">
<img
src="https://img.shields.io/badge/STAR-Models-yellow?logo=huggingface&logoColor=yellow"
alt="STAR Models"
/>
</a>
</p>
# **STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning**
Welcome to the official repository for our paper: "STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning"
## **Abstract**
Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving a unified objective for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce ***STAR***: a **ST**acked **A**uto**R**egressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity VQ model to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (**0.91**), DPG-Bench (**87.44**), and ImgEdit (**4.34**), validating its efficacy for unified multimodal learning.
<div align="center">
<img src="assets/teaser.png" width="100%" />
</div>
## Model Checkpoints
| Model Name | Checkpoint |
| :--------: | :--------: |
| STAR-3B | [Link](https://huggingface.co/MM-MVR/STAR-3B) |
| STAR-7B | [Link](https://huggingface.co/MM-MVR/STAR-7B) |
| VQ Model | [Link](https://huggingface.co/MM-MVR/STAR-VQ) |
## Preparation
### Prepare the environment
1. Set up the environment:
```shell
git clone <repository-url>
cd STAR
conda create -n star python==3.11 -y
conda activate star
```
2. Install the required packages:
```shell
# upgrade pip and setuptools if necessary
pip install -U pip setuptools
# install required packages
pip install -r requirements.txt
```
### Download Pre-trained Models
Download the necessary pre-trained models before running inference and place them at the following paths:
```shell
STAR/checkpoints/STAR-7B.pt
STAR/checkpoints/VQ-Model.pt
```
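If you have the `huggingface_hub` package installed, the weights can be fetched with `huggingface-cli`. The sketch below (download commands left commented out) assumes the file names inside each Hub repo match the local paths above; check the repo pages for the actual names.

```shell
# Create the directory layout the inference scripts expect.
mkdir -p checkpoints
# Fetch the released weights from the Hugging Face Hub (uncomment to run;
# file names inside each repo are assumptions based on the paths above):
# huggingface-cli download MM-MVR/STAR-7B STAR-7B.pt --local-dir checkpoints
# huggingface-cli download MM-MVR/STAR-VQ VQ-Model.pt --local-dir checkpoints
```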
### Configuration
The model configuration file `star/configs/STAR_Qwen2.5-VL-7B.json` contains all necessary parameters for model initialization. Make sure to update the paths in the configuration file to match your local setup.
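Since the exact schema of the config file is release-specific, a generic check can help when updating paths: the sketch below walks the JSON and prints every string value that looks like a filesystem path, flagging any that do not exist locally. The path heuristic (a `/` or a `.pt` suffix) is an assumption, not part of STAR.

```shell
# List every string value in the model config that looks like a filesystem
# path, so each one can be checked against your local setup.
python3 - <<'EOF'
import json, os

cfg_file = "star/configs/STAR_Qwen2.5-VL-7B.json"
if not os.path.exists(cfg_file):
    print(f"config not found: {cfg_file}")
else:
    def walk(obj, key=""):
        if isinstance(obj, dict):
            for k, v in obj.items():
                walk(v, k)
        elif isinstance(obj, list):
            for v in obj:
                walk(v, key)
        elif isinstance(obj, str) and ("/" in obj or obj.endswith(".pt")):
            status = "ok" if os.path.exists(obj) else "MISSING"
            print(f"{key}: {obj} [{status}]")
    walk(json.load(open(cfg_file)))
EOF
```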
## Quick Start
### Demo
Run the interactive demo interface using Gradio.
```shell
python3 gradio_app.py
```
### Inference
#### 1. Image Understanding
For visual question answering and image understanding tasks:
```shell
python3 inference_understand.py \
--image-path "path/to/your/image.jpg" \
--question "What is in this image? Describe it in detail." \
--max-new-tokens 256 \
--model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
--checkpoint "checkpoints/STAR-7B.pt" \
--device "cuda:0"
```
**Parameters:**
- `--image-path`: Path to the input image
- `--question`: Question or instruction for the model
- `--max-new-tokens`: Maximum number of tokens to generate (default: 256)
- `--model-config`: Path to model configuration file
- `--checkpoint`: Path to model checkpoint
- `--device`: Device to run inference on
#### 2. Text-to-Image Generation
For generating images from text prompts:
```shell
python3 inference_generation.py \
--prompt "a photo of a cute cat" \
--save-path "./outputs/a photo of a cute cat.jpg" \
--num-images 1 \
--cfg 1.1 \
--topk 1000 \
--topp 0.8 \
--model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
--checkpoint "checkpoints/STAR-7B.pt" \
--diffusion-as-decoder \
--device "cuda:0"
```
**Parameters:**
- `--prompt`: Text prompt for image generation
- `--save-path`: Path to save the generated image
- `--num-images`: Number of images to generate (default: 1)
- `--cfg`: Classifier-free guidance scale (default: 1.0)
- `--topk`: Top-k sampling parameter (default: 1000)
- `--topp`: Top-p sampling parameter (default: 0.8)
- `--diffusion-as-decoder`: Use diffusion model as decoder for high-quality generation
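The flags above compose naturally for quick experiments. As a sketch, the loop below sweeps the classifier-free guidance scale for one prompt; the leading `echo` prints each command instead of running it, so drop it to execute.

```shell
# Sweep the classifier-free guidance scale for a single prompt.
# Remove the leading `echo` to actually run each command.
for cfg in 1.0 1.1 1.3; do
  echo python3 inference_generation.py \
    --prompt "a photo of a cute cat" \
    --save-path "./outputs/cat_cfg_${cfg}.jpg" \
    --cfg "${cfg}" \
    --topk 1000 \
    --topp 0.8 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --diffusion-as-decoder \
    --device "cuda:0"
done
```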
#### 3. Image Editing
For editing images based on text instructions:
```shell
python3 inference_edit.py \
--image-path "./outputs/a photo of a cute cat.jpg" \
--instruction "change the color of the cat to blue" \
--save-path "./outputs/edited_image.jpg" \
--cfg 1.1 \
--topk 1000 \
--topp 0.8 \
--model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
--checkpoint "checkpoints/STAR-7B.pt" \
--diffusion-as-decoder \
--device "cuda:0"
```
**Parameters:**
- `--image-path`: Path to the input image to be edited
- `--instruction`: Text instruction describing the desired edit
- `--save-path`: Path to save the edited image
- `--cfg`: Classifier-free guidance scale for editing
- `--topk`: Top-k sampling parameter
- `--topp`: Top-p sampling parameter
- `--diffusion-as-decoder`: Use diffusion model for high-quality image decoding
## Citation
```bibtex
@article{2025star,
title = {STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning},
author = {Qin, Jie and Huang, Jiancheng and Qiao, Limeng and Ma, Lin},
journal = {arXiv preprint arXiv:2512.13752},
year = {2025}
}
```
## License
STAR is licensed under the Apache License 2.0.