|
|
--- |
|
|
license: apache-2.0 |
|
|
--- |
|
|
|
|
|
|
|
|
<p align="center"> |
|
|
<img src="assets/star_logo.png" alt="STAR" width="560"/> |
|
|
</p> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://arxiv.org/abs/2512.13752"> |
|
|
<img |
|
|
src="https://img.shields.io/badge/STAR-Paper-red?logo=arxiv&logoColor=red" |
|
|
alt="STAR Paper on arXiv" |
|
|
/> |
|
|
</a> |
|
|
<a href="https://star-mm-ai.github.io/"> |
|
|
<img |
|
|
src="https://img.shields.io/badge/STAR-Project-0A66C2?logo=safari&logoColor=white" |
|
|
alt="STAR Project" |
|
|
/> |
|
|
</a> |
|
|
<a href="https://huggingface.co/spaces/MM-MVR/STAR"> |
|
|
<img |
|
|
src="https://img.shields.io/badge/STAR-Space-orange?logo=huggingface&logoColor=yellow" |
|
|
alt="STAR Demo" |
|
|
/> |
|
|
</a> |
|
|
<a href="https://huggingface.co/MM-MVR/STAR-7B"> |
|
|
<img |
|
|
src="https://img.shields.io/badge/STAR-Models-yellow?logo=huggingface&logoColor=yellow" |
|
|
alt="STAR Models" |
|
|
/> |
|
|
</a> |
|
|
</p> |
|
|
|
|
|
# **STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning** |
|
|
|
|
|
|
|
|
Welcome to the official repository for our paper, "STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning".
|
|
|
|
|
|
|
|
## **Abstract** |
|
|
Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, unifying multimodal understanding and generation in a single model remains challenging due to optimization conflicts and performance trade-offs. To enhance generative performance while preserving existing comprehension capabilities, we introduce ***STAR***: a **ST**acked **A**uto**R**egressive scheme for task-progressive unified multimodal learning. STAR decomposes multimodal learning into successive stages: understanding, generation, and editing. By freezing the parameters of the base autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. In addition, we introduce a high-capacity vector quantizer (VQ) to refine the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (**0.91**), DPG-Bench (**87.44**), and ImgEdit (**4.34**), validating its efficacy for unified multimodal learning.
|
|
|
|
|
<div align="center"> |
|
|
<img src="assets/teaser.png" width=100%></img> |
|
|
</div> |
|
|
|
|
|
|
|
|
## 🚀 Model Checkpoints
|
|
|
|
|
|
|
|
| Model Name | Checkpoint | |
|
|
| :--------: | :--------: | |
|
|
| STAR-3B | [Link](https://huggingface.co/MM-MVR/STAR-3B) | |
|
|
| STAR-7B | [Link](https://huggingface.co/MM-MVR/STAR-7B) | |
|
|
| VQ Model | [Link](https://huggingface.co/MM-MVR/STAR-VQ) | |
|
|
|
|
|
|
|
|
## 🛠️ Preparation
|
|
|
|
|
### Prepare the Environment
|
|
|
|
|
1. Set up the environment:
|
|
```shell |
|
|
git clone <repository-url> |
|
|
cd STAR |
|
|
conda create -n star python=3.11 -y
|
|
conda activate star |
|
|
``` |
|
|
|
|
|
2. Install the required packages: |
|
|
```shell |
|
|
# upgrade pip and setuptools if necessary |
|
|
pip install -U pip setuptools |
|
|
# install required packages |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
|
|
|
### Download Pre-trained Models |
|
|
Download the pre-trained models listed above before running inference, and place them so the repository layout matches:
|
|
|
|
|
```shell |
|
|
STAR/checkpoints/STAR-7B.pt |
|
|
STAR/checkpoints/VQ-Model.pt |
|
|
``` |
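
If you prefer the command line, the repositories can be fetched with `huggingface-cli` (a sketch; the exact filenames inside each Hub repository may differ from the layout above, so move or rename the downloaded files accordingly):

```shell
# requires the huggingface_hub CLI
pip install -U "huggingface_hub[cli]"

# download each model repository into a local folder under checkpoints/
huggingface-cli download MM-MVR/STAR-7B --local-dir checkpoints/STAR-7B
huggingface-cli download MM-MVR/STAR-VQ --local-dir checkpoints/STAR-VQ
```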
|
|
|
|
|
### Configuration |
|
|
|
|
|
The model configuration file `star/configs/STAR_Qwen2.5-VL-7B.json` contains all necessary parameters for model initialization. Make sure to update the paths in the configuration file to match your local setup. |
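
To review which paths the config references before editing it, you can pretty-print the JSON with the Python standard library (no extra dependencies):

```shell
# pretty-print the config to inspect the paths it references
python3 -m json.tool star/configs/STAR_Qwen2.5-VL-7B.json
```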
|
|
|
|
|
## 🔥 Quick Start
|
|
|
|
|
### Demo |
|
|
|
|
|
Run the interactive demo interface built with Gradio:
|
|
|
|
|
```shell |
|
|
python3 gradio_app.py |
|
|
``` |
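
Gradio honors the standard `GRADIO_SERVER_NAME` and `GRADIO_SERVER_PORT` environment variables (unless `gradio_app.py` overrides them in `launch()`), so, for example, you can expose the demo on all interfaces:

```shell
# bind the demo to all interfaces on port 7860 (standard Gradio env vars)
GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 python3 gradio_app.py
```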
|
|
|
|
|
### Inference |
|
|
|
|
|
#### 1. Image Understanding
|
|
|
|
|
For visual question answering and image understanding tasks: |
|
|
|
|
|
```shell |
|
|
python3 inference_understand.py \ |
|
|
--image-path "path/to/your/image.jpg" \ |
|
|
--question "What is in this image? Describe it in detail." \ |
|
|
--max-new-tokens 256 \ |
|
|
--model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \ |
|
|
--checkpoint "checkpoints/STAR-7B.pt" \ |
|
|
--device "cuda:0" |
|
|
``` |
|
|
|
|
|
**Parameters:** |
|
|
- `--image-path`: Path to the input image |
|
|
- `--question`: Question or instruction for the model |
|
|
- `--max-new-tokens`: Maximum number of tokens to generate (default: 256) |
|
|
- `--model-config`: Path to model configuration file |
|
|
- `--checkpoint`: Path to model checkpoint |
|
|
- `--device`: Device to run inference on |
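
To run the same question over a directory of images, the script can be wrapped in a small shell loop (a sketch using only the flags documented above; adjust the glob and paths to your setup):

```shell
# caption every JPEG under ./images/ with the same instruction
for img in ./images/*.jpg; do
  python3 inference_understand.py \
    --image-path "$img" \
    --question "What is in this image? Describe it in detail." \
    --max-new-tokens 256 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --device "cuda:0"
done
```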
|
|
|
|
|
#### 2. Text-to-Image Generation
|
|
|
|
|
For generating images from text prompts: |
|
|
|
|
|
```shell |
|
|
python3 inference_generation.py \ |
|
|
--prompt "a photo of a cute cat" \ |
|
|
--save-path "./outputs/a photo of a cute cat.jpg" \ |
|
|
--num-images 1 \ |
|
|
--cfg 1.1 \ |
|
|
--topk 1000 \ |
|
|
--topp 0.8 \ |
|
|
--model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \ |
|
|
--checkpoint "checkpoints/STAR-7B.pt" \ |
|
|
--diffusion-as-decoder \ |
|
|
--device "cuda:0" |
|
|
``` |
|
|
|
|
|
**Parameters:** |
|
|
- `--prompt`: Text prompt for image generation |
|
|
- `--save-path`: Path to save the generated image |
|
|
- `--num-images`: Number of images to generate (default: 1) |
|
|
- `--cfg`: Classifier-free guidance scale (default: 1.0) |
|
|
- `--topk`: Top-k sampling parameter (default: 1000) |
|
|
- `--topp`: Top-p sampling parameter (default: 0.8) |
|
|
- `--diffusion-as-decoder`: Use a diffusion model as the decoder for higher-quality generation
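
To sweep several prompts with the same sampling settings, here is a bash sketch (it assumes a `prompts.txt` with one prompt per line and uses only the flags documented above):

```shell
# generate one image per line of prompts.txt, naming outputs after the prompt
mkdir -p ./outputs
while IFS= read -r prompt; do
  python3 inference_generation.py \
    --prompt "$prompt" \
    --save-path "./outputs/${prompt// /_}.jpg" \
    --num-images 1 \
    --cfg 1.1 --topk 1000 --topp 0.8 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --diffusion-as-decoder \
    --device "cuda:0"
done < prompts.txt
```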
|
|
|
|
|
#### 3. Image Editing
|
|
|
|
|
For editing images based on text instructions: |
|
|
|
|
|
```shell |
|
|
python3 inference_edit.py \ |
|
|
--image-path "./outputs/a photo of a cute cat.jpg" \ |
|
|
--instruction "change the color of the cat to blue" \
|
|
--save-path "./outputs/edited_image.jpg" \ |
|
|
--cfg 1.1 \ |
|
|
--topk 1000 \ |
|
|
--topp 0.8 \ |
|
|
--model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \ |
|
|
--checkpoint "checkpoints/STAR-7B.pt" \ |
|
|
--diffusion-as-decoder \ |
|
|
--device "cuda:0" |
|
|
``` |
|
|
|
|
|
**Parameters:** |
|
|
- `--image-path`: Path to the input image to be edited |
|
|
- `--instruction`: Text instruction describing the desired edit |
|
|
- `--save-path`: Path to save the edited image |
|
|
- `--cfg`: Classifier-free guidance scale for editing |
|
|
- `--topk`: Top-k sampling parameter |
|
|
- `--topp`: Top-p sampling parameter |
|
|
- `--diffusion-as-decoder`: Use a diffusion model for high-quality image decoding
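
The generation and editing scripts compose naturally: generate an image first, then pass its path to `inference_edit.py`. A sketch chaining the two commands above:

```shell
# step 1: generate a base image
python3 inference_generation.py \
  --prompt "a photo of a cute cat" \
  --save-path "./outputs/cat.jpg" \
  --cfg 1.1 --topk 1000 --topp 0.8 \
  --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
  --checkpoint "checkpoints/STAR-7B.pt" \
  --diffusion-as-decoder --device "cuda:0"

# step 2: edit the generated image
python3 inference_edit.py \
  --image-path "./outputs/cat.jpg" \
  --instruction "change the color of the cat to blue" \
  --save-path "./outputs/cat_blue.jpg" \
  --cfg 1.1 --topk 1000 --topp 0.8 \
  --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
  --checkpoint "checkpoints/STAR-7B.pt" \
  --diffusion-as-decoder --device "cuda:0"
```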
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## ✒️ Citation
|
|
|
|
|
```bibtex |
|
|
@article{2025star, |
|
|
title = {STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning}, |
|
|
author = {Qin, Jie and Huang, Jiancheng and Qiao, Limeng and Ma, Lin}, |
|
|
journal = {arXiv preprint arXiv:2512.13752}, |
|
|
year = {2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
|
|
|
## 📄 License
|
|
STAR is licensed under the Apache License 2.0.