|
|
--- |
|
|
license: apache-2.0 |
|
|
--- |
|
|
|
|
|
|
|
|
<p align="center"> |
|
|
<img src="assets/star_logo.png" alt="STAR" width="560"/> |
|
|
</p> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://arxiv.org/abs/2512.13752"> |
|
|
<img |
|
|
src="https://img.shields.io/badge/STAR-Paper-red?logo=arxiv&logoColor=red" |
|
|
alt="STAR Paper on arXiv" |
|
|
/> |
|
|
</a> |
|
|
<a href="https://star-mm-ai.github.io/"> |
|
|
<img |
|
|
src="https://img.shields.io/badge/STAR-Project-0A66C2?logo=safari&logoColor=white" |
|
|
alt="STAR Project" |
|
|
/> |
|
|
</a> |
|
|
<a href="https://huggingface.co/spaces/MM-MVR/STAR"> |
|
|
<img |
|
|
src="https://img.shields.io/badge/STAR-Space-orange?logo=huggingface&logoColor=yellow" |
|
|
alt="STAR Demo" |
|
|
/> |
|
|
</a> |
|
|
<a href="https://huggingface.co/MM-MVR/STAR-7B"> |
|
|
<img |
|
|
src="https://img.shields.io/badge/STAR-Models-yellow?logo=huggingface&logoColor=yellow" |
|
|
alt="STAR Models" |
|
|
/> |
|
|
</a> |
|
|
</p> |
|
|
|
|
|
# **STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning** |
|
|
|
|
|
|
|
|
Welcome to the official repository for our paper, "STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning".
|
|
|
|
|
|
|
|
## **Abstract** |
|
|
Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, unifying multimodal understanding and generation in a single model remains challenging due to optimization conflicts and performance trade-offs. To enhance generative performance while preserving existing comprehension capabilities, we introduce ***STAR***: a **ST**acked **A**uto**R**egressive scheme for task-progressive unified multimodal learning. STAR decomposes multimodal learning into successive stages: understanding, generation, and editing. By freezing the parameters of the base autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. In addition, we introduce a high-capacity vector quantizer (VQ) to refine the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (**0.91**), DPG-Bench (**87.44**), and ImgEdit (**4.34**), validating its efficacy for unified multimodal learning.
|
|
|
|
|
<div align="center"> |
|
|
<img src="assets/teaser.png" width=100%></img> |
|
|
</div> |
|
|
|
|
|
|
|
|
## 🚀 Model Checkpoints
|
|
|
|
|
|
|
|
| Model Name | Checkpoint | |
|
|
| :--------: | :--------: | |
|
|
| STAR-3B | [Link](https://huggingface.co/MM-MVR/STAR-3B) | |
|
|
| STAR-7B | [Link](https://huggingface.co/MM-MVR/STAR-7B) | |
|
|
| VQ Model | [Link](https://huggingface.co/MM-MVR/STAR-VQ) | |
|
|
|
|
|
|
|
|
## 🛠️ Preparation
|
|
|
|
|
### Prepare the Environment
|
|
|
|
|
1. Set up the environment:
|
|
```shell |
|
|
git clone <repository-url> |
|
|
cd STAR |
|
|
conda create -n star python=3.11 -y
|
|
conda activate star |
|
|
``` |
|
|
|
|
|
2. Install the required packages: |
|
|
```shell |
|
|
# upgrade pip and setuptools if necessary |
|
|
pip install -U pip setuptools |
|
|
# install required packages |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
|
|
|
### Download Pre-trained Models |
|
|
Download the pre-trained models listed above before running inference, and place them so the repository layout matches:
|
|
|
|
|
```shell |
|
|
STAR/checkpoints/STAR-7B.pt |
|
|
STAR/checkpoints/VQ-Model.pt |
|
|
``` |
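
If you prefer the command line, the repositories can be fetched with `huggingface-cli` (a sketch; the exact filenames inside each Hub repository may differ from the layout above, so move or rename the downloaded files accordingly):

```shell
# requires the huggingface_hub CLI
pip install -U "huggingface_hub[cli]"

# download each model repository into a local folder under checkpoints/
huggingface-cli download MM-MVR/STAR-7B --local-dir checkpoints/STAR-7B
huggingface-cli download MM-MVR/STAR-VQ --local-dir checkpoints/STAR-VQ
```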
|
|
|
|
|
### Configuration |
|
|
|
|
|
The model configuration file `star/configs/STAR_Qwen2.5-VL-7B.json` contains all necessary parameters for model initialization. Make sure to update the paths in the configuration file to match your local setup. |
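
To review which paths the config references before editing it, you can pretty-print the JSON with the Python standard library (no extra dependencies):

```shell
# pretty-print the config to inspect the paths it references
python3 -m json.tool star/configs/STAR_Qwen2.5-VL-7B.json
```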
|
|
|
|
|
## 🔥 Quick Start
|
|
|
|
|
### Demo |
|
|
|
|
|
Run the interactive demo interface built with Gradio:
|
|
|
|
|
```shell |
|
|
python3 gradio_app.py |
|
|
``` |
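
Gradio honors the standard `GRADIO_SERVER_NAME` and `GRADIO_SERVER_PORT` environment variables (unless `gradio_app.py` overrides them in `launch()`), so, for example, you can expose the demo on all interfaces:

```shell
# bind the demo to all interfaces on port 7860 (standard Gradio env vars)
GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 python3 gradio_app.py
```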
|
|
|
|
|
### Inference |
|
|
|
|
|
#### 1. Image Understanding
|
|
|
|
|
For visual question answering and image understanding tasks: |
|
|
|
|
|
```shell |
|
|
python3 inference_understand.py \ |
|
|
--image-path "path/to/your/image.jpg" \ |
|
|
--question "What is in this image? Describe it in detail." \ |
|
|
--max-new-tokens 256 \ |
|
|
--model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \ |
|
|
--checkpoint "checkpoints/STAR-7B.pt" \ |
|
|
--device "cuda:0" |
|
|
``` |
|
|
|
|
|
**Parameters:** |
|
|
- `--image-path`: Path to the input image |
|
|
- `--question`: Question or instruction for the model |
|
|
- `--max-new-tokens`: Maximum number of tokens to generate (default: 256) |
|
|
- `--model-config`: Path to model configuration file |
|
|
- `--checkpoint`: Path to model checkpoint |
|
|
- `--device`: Device to run inference on |
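
To run the same question over a directory of images, the script can be wrapped in a small shell loop (a sketch using only the flags documented above; adjust the glob and paths to your setup):

```shell
# caption every JPEG under ./images/ with the same instruction
for img in ./images/*.jpg; do
  python3 inference_understand.py \
    --image-path "$img" \
    --question "What is in this image? Describe it in detail." \
    --max-new-tokens 256 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --device "cuda:0"
done
```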
|
|
|
|
|
#### 2. Text-to-Image Generation
|
|
|
|
|
For generating images from text prompts: |
|
|
|
|
|
```shell |
|
|
python3 inference_generation.py \ |
|
|
--prompt "a photo of a cute cat" \ |
|
|
--save-path "./outputs/a photo of a cute cat.jpg" \ |
|
|
--num-images 1 \ |
|
|
--cfg 1.1 \ |
|
|
--topk 1000 \ |
|
|
--topp 0.8 \ |
|
|
--model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \ |
|
|
--checkpoint "checkpoints/STAR-7B.pt" \ |
|
|
--diffusion-as-decoder \ |
|
|
--device "cuda:0" |
|
|
``` |
|
|
|
|
|
**Parameters:** |
|
|
- `--prompt`: Text prompt for image generation |
|
|
- `--save-path`: Path to save the generated image |
|
|
- `--num-images`: Number of images to generate (default: 1) |
|
|
- `--cfg`: Classifier-free guidance scale (default: 1.0) |
|
|
- `--topk`: Top-k sampling parameter (default: 1000) |
|
|
- `--topp`: Top-p sampling parameter (default: 0.8) |
|
|
- `--diffusion-as-decoder`: Use a diffusion model as the decoder for higher-quality generation
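
To sweep several prompts with the same sampling settings, here is a bash sketch (it assumes a `prompts.txt` with one prompt per line and uses only the flags documented above):

```shell
# generate one image per line of prompts.txt, naming outputs after the prompt
mkdir -p ./outputs
while IFS= read -r prompt; do
  python3 inference_generation.py \
    --prompt "$prompt" \
    --save-path "./outputs/${prompt// /_}.jpg" \
    --num-images 1 \
    --cfg 1.1 --topk 1000 --topp 0.8 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --diffusion-as-decoder \
    --device "cuda:0"
done < prompts.txt
```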
|
|
|
|
|
#### 3. Image Editing
|
|
|
|
|
For editing images based on text instructions: |
|
|
|
|
|
```shell |
|
|
python3 inference_edit.py \ |
|
|
--image-path "./outputs/a photo of a cute cat.jpg" \ |
|
|
--instruction "change the color of the cat to blue" \
|
|
--save-path "./outputs/edited_image.jpg" \ |
|
|
--cfg 1.1 \ |
|
|
--topk 1000 \ |
|
|
--topp 0.8 \ |
|
|
--model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \ |
|
|
--checkpoint "checkpoints/STAR-7B.pt" \ |
|
|
--diffusion-as-decoder \ |
|
|
--device "cuda:0" |
|
|
``` |
|
|
|
|
|
**Parameters:** |
|
|
- `--image-path`: Path to the input image to be edited |
|
|
- `--instruction`: Text instruction describing the desired edit |
|
|
- `--save-path`: Path to save the edited image |
|
|
- `--cfg`: Classifier-free guidance scale for editing |
|
|
- `--topk`: Top-k sampling parameter |
|
|
- `--topp`: Top-p sampling parameter |
|
|
- `--diffusion-as-decoder`: Use a diffusion model for high-quality image decoding
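
The generation and editing scripts compose naturally: generate an image first, then pass its path to `inference_edit.py`. A sketch chaining the two commands above:

```shell
# step 1: generate a base image
python3 inference_generation.py \
  --prompt "a photo of a cute cat" \
  --save-path "./outputs/cat.jpg" \
  --cfg 1.1 --topk 1000 --topp 0.8 \
  --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
  --checkpoint "checkpoints/STAR-7B.pt" \
  --diffusion-as-decoder --device "cuda:0"

# step 2: edit the generated image
python3 inference_edit.py \
  --image-path "./outputs/cat.jpg" \
  --instruction "change the color of the cat to blue" \
  --save-path "./outputs/cat_blue.jpg" \
  --cfg 1.1 --topk 1000 --topp 0.8 \
  --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
  --checkpoint "checkpoints/STAR-7B.pt" \
  --diffusion-as-decoder --device "cuda:0"
```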
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## ✒️ Citation
|
|
|
|
|
```bibtex |
|
|
@article{2025star, |
|
|
title = {STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning}, |
|
|
author = {Qin, Jie and Huang, Jiancheng and Qiao, Limeng and Ma, Lin}, |
|
|
journal = {arXiv preprint arXiv:2512.13752}, |
|
|
year = {2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
|
|
|
## 📄 License
|
|
STAR is licensed under the Apache License 2.0.