---
license: apache-2.0
---
# **STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning**
Welcome to the official repository for our paper: "STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning"
## **Abstract**
Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving a unified objective for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce ***STAR***: a **ST**acked **A**uto**R**egressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity VQ (vector quantization) model to enhance the granularity of image representations, and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (**0.91**), DPG-Bench (**87.44**), and ImgEdit (**4.34**), validating its efficacy for unified multimodal learning.
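The freeze-and-stack idea can be sketched in a few lines of PyTorch. This is an illustrative toy, not the paper's actual implementation: `stack_ar_module` is a hypothetical helper, and the small transformer layers merely stand in for the isomorphic AR modules.

```python
import torch.nn as nn

def stack_ar_module(base: nn.Module, new_module: nn.Module) -> nn.Sequential:
    """Freeze a trained base AR model and stack a trainable isomorphic module on top."""
    # Freeze the base so new-task training cannot interfere with existing capabilities
    for p in base.parameters():
        p.requires_grad = False
    # The new module shares the base's architecture ("isomorphic") and stays trainable
    return nn.Sequential(base, new_module)

# Toy stand-ins for the understanding-stage base and the generation-stage module
base_ar = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
gen_ar = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
model = stack_ar_module(base_ar, gen_ar)
```

Only the newly stacked module receives gradients, which is what lets each stage (understanding, generation, editing) be added without degrading the previous one.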
## 🌟 Model Checkpoint
| Model Name | Checkpoint |
| :--------: | :--------: |
| STAR-3B | [Link](https://huggingface.co/MM-MVR/STAR-3B) |
| STAR-7B | [Link](https://huggingface.co/MM-MVR/STAR-7B) |
| VQ Model | [Link](https://huggingface.co/MM-MVR/STAR-VQ) |
## 📚 Preparation
### Prepare the environment
1. Set up environment
```shell
git clone
cd STAR
conda create -n star python=3.11 -y
conda activate star
```
2. Install the required packages:
```shell
# upgrade pip and setuptools if necessary
pip install -U pip setuptools
# install required packages
pip install -r requirements.txt
```
### Download Pre-trained Models
Download the necessary pre-trained models (links in the checkpoint table above) before proceeding to inference, and place them under `checkpoints/` so the directory looks like:
```shell
STAR/checkpoints/STAR-3B.pt
STAR/checkpoints/VQ-Model.pt
```
### Configuration
The model configuration file `star/configs/STAR_Qwen2.5-VL-3B.json` contains all necessary parameters for model initialization. Make sure to update the paths in the configuration file to match your local setup.
## 🔥 Quick Start
### Demo
Run the interactive demo interface using Gradio.
```shell
python3 gradio_app.py
```
### Inference
#### 1. Image Understanding
For visual question answering and image understanding tasks:
```shell
python3 inference_understand.py \
--image-path "path/to/your/image.jpg" \
--question "What is in this image? Describe it in detail." \
--max-new-tokens 256 \
--model-config "star/configs/STAR_Qwen2.5-VL-3B.json" \
--checkpoint "checkpoints/STAR-7B.pt" \
--device "cuda:0"
```
**Parameters:**
- `--image-path`: Path to the input image
- `--question`: Question or instruction for the model
- `--max-new-tokens`: Maximum number of tokens to generate (default: 256)
- `--model-config`: Path to model configuration file
- `--checkpoint`: Path to model checkpoint
- `--device`: Device to run inference on
#### 2. Text-to-Image Generation
For generating images from text prompts:
```shell
python3 inference_generation.py \
--prompt "a photo of a cute cat" \
--save-path "./outputs/a photo of a cute cat.jpg" \
--num-images 1 \
--cfg 1.1 \
--topk 1000 \
--topp 0.8 \
--model-config "star/configs/STAR_Qwen2.5-VL-3B.json" \
--checkpoint "checkpoints/STAR-7B.pt" \
--diffusion-as-decoder \
--device "cuda:0"
```
**Parameters:**
- `--prompt`: Text prompt for image generation
- `--save-path`: Path to save the generated image
- `--num-images`: Number of images to generate (default: 1)
- `--cfg`: Classifier-free guidance scale (default: 1.0)
- `--topk`: Top-k sampling parameter (default: 1000)
- `--topp`: Top-p sampling parameter (default: 0.8)
- `--diffusion-as-decoder`: Use diffusion model as decoder for high-quality generation
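To clarify what `--cfg`, `--topk`, and `--topp` control, here is a common implementation of classifier-free guidance followed by top-k/top-p (nucleus) filtering over next-token logits. This is an illustrative sketch of the standard operators, not the repository's actual sampling code; `cfg_topk_topp_sample` is a hypothetical name.

```python
import torch

def cfg_topk_topp_sample(cond_logits, uncond_logits, cfg=1.1, topk=1000, topp=0.8):
    # Classifier-free guidance: push logits away from the unconditional prediction
    logits = uncond_logits + cfg * (cond_logits - uncond_logits)
    # Top-k: keep only the k highest-scoring tokens
    k = min(topk, logits.size(-1))
    kth = torch.topk(logits, k).values[..., -1, None]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p (nucleus): keep the smallest set of tokens whose cumulative prob covers topp
    sorted_logits, idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum = torch.cumsum(probs, dim=-1)
    mask = cum - probs > topp  # drop tokens that start beyond the nucleus
    sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))
    logits = torch.full_like(logits, float("-inf")).scatter(-1, idx, sorted_logits)
    # Sample one token id from the filtered distribution
    return torch.multinomial(torch.softmax(logits, dim=-1), 1)
```

Larger `--cfg` values bind the sample more tightly to the prompt, while `--topk`/`--topp` trade diversity against fidelity; the defaults above mirror the CLI defaults listed here.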
#### 3. Image Editing
For editing images based on text instructions:
```shell
python3 inference_edit.py \
--image-path "./outputs/a photo of a cute cat.jpg" \
--instruction "change the color of cat to blue" \
--save-path "./outputs/edited_image.jpg" \
--cfg 1.1 \
--topk 1000 \
--topp 0.8 \
--model-config "star/configs/STAR_Qwen2.5-VL-3B.json" \
--checkpoint "checkpoints/STAR-7B.pt" \
--diffusion-as-decoder \
--device "cuda:0"
```
**Parameters:**
- `--image-path`: Path to the input image to be edited
- `--instruction`: Text instruction describing the desired edit
- `--save-path`: Path to save the edited image
- `--cfg`: Classifier-free guidance scale for editing
- `--topk`: Top-k sampling parameter
- `--topp`: Top-p sampling parameter
- `--diffusion-as-decoder`: Use diffusion model for high-quality image decoding
## ✍️ Citation
```bibtex
@article{2025star,
title = {STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning},
author = {Qin, Jie and Huang, Jiancheng and Qiao, Limeng and Ma, Lin},
journal = {arXiv preprint arXiv:2512.13752},
year = {2025}
}
```
## 📜 License
STAR is licensed under the Apache License 2.0.