---
license: apache-2.0
---

<p align="center">
  <img src="assets/star_logo.png" alt="STAR" width="560"/>
</p>

<p align="center">
  <a href="https://arxiv.org/abs/xxxx.xxxxx">
    <img src="https://img.shields.io/badge/STAR-Paper-red?logo=arxiv&logoColor=red" alt="STAR Paper on arXiv"/>
  </a>
  <a href="#">
    <img src="https://img.shields.io/badge/STAR-Project-0A66C2?logo=safari&logoColor=white" alt="STAR Project"/>
  </a>
  <a href="#">
    <img src="https://img.shields.io/badge/STAR-Models-yellow?logo=huggingface&logoColor=yellow" alt="STAR Models"/>
  </a>
  <a href="#">
    <img src="https://img.shields.io/badge/STAR-Demo-blue?logo=googleplay&logoColor=blue" alt="STAR Demo"/>
  </a>
  <a href="#">
    <img src="https://img.shields.io/badge/STAR-Space-orange?logo=huggingface&logoColor=yellow" alt="STAR HuggingFace Space"/>
  </a>
</p>

# **STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning**

Welcome to the official repository for our paper: "STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning".

## **Abstract**
Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, unifying multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce ***STAR***: *a **ST**acked **A**uto**R**egressive scheme for task-progressive unified multimodal learning*. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity VQ model to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (**0.91**), DPG-Bench (**87.44**), and ImgEdit (**4.34**), validating its efficacy for unified multimodal learning.

<div align="center">
<img src="assets/teaser.png" width="100%"/>
</div>

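The freeze-then-stack scheme above can be illustrated with a toy sketch: each trained stage is frozen before an isomorphic module for the next task is stacked on top. Class and method names here are hypothetical; this is not the repository's implementation.

```python
class ARModule:
    """Stand-in for one isomorphic autoregressive block."""
    def __init__(self, name):
        self.name = name
        self.frozen = False

    def freeze(self):
        # Frozen parameters are excluded from further training.
        self.frozen = True


class StackedAR:
    """Task-progressive stacking: freeze every trained module, then
    stack a new one (understanding -> generation -> editing)."""
    def __init__(self):
        # The fundamental AR model handles understanding.
        self.modules = [ARModule("understanding")]

    def add_stage(self, task_name):
        for m in self.modules:  # freeze all previously trained modules
            m.freeze()
        self.modules.append(ARModule(task_name))

    def trainable(self):
        return [m.name for m in self.modules if not m.frozen]


model = StackedAR()
model.add_stage("generation")
model.add_stage("editing")
```

Only the most recently stacked module is trainable at any stage, which is how cross-task interference with the frozen base is avoided.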
## Model Checkpoint

| Model Name | Checkpoint | Config |
| :--------: | :--------: | :----: |
| STAR-3B | [Link](#) | [Config](star/configs/STAR_Qwen2.5-VL-3B.json) |
| STAR-7B | [Link](#) | [Config](star/configs/STAR_Qwen2.5-VL-7B.json) |
| VQ Model | [Link](#) | - |

## Preparation

### Prepare the environment

1. Set up the environment:
```shell
git clone <repository-url>
cd STAR
conda create -n star python=3.11 -y
conda activate star
```

2. Install the required packages:
```shell
# Upgrade pip and setuptools if necessary
pip install -U pip setuptools
# Install the required packages
pip install -r requirements.txt
```

### Download Pre-trained Models
Download the necessary pre-trained models before running inference, and place them at the following paths:

```shell
STAR/checkpoints/STAR-7B.pt
STAR/checkpoints/VQ-Model.pt
```

### Configuration

The model configuration file `star/configs/STAR_Qwen2.5-VL-7B.json` contains all necessary parameters for model initialization. Make sure to update the paths in the configuration file to match your local setup.

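Because the config is plain JSON, a quick pre-flight check can catch stale paths before a long model load. The sketch below is illustrative only: the path-detection heuristic and any field names are assumptions, not part of the STAR codebase.

```python
import json
from pathlib import Path


def find_missing_paths(cfg):
    """Return config keys whose string values look like local paths
    (contain '/' or end in .pt/.json) but do not exist on disk.
    The heuristic is a guess at what counts as a path field."""
    return [key for key, value in cfg.items()
            if isinstance(value, str)
            and ("/" in value or value.endswith((".pt", ".json")))
            and not Path(value).exists()]


def check_config(path="star/configs/STAR_Qwen2.5-VL-7B.json"):
    """Load the JSON config and list path fields that need updating."""
    cfg = json.loads(Path(path).read_text())
    return find_missing_paths(cfg)
```

Running `check_config()` after editing the file returns the keys whose paths still do not resolve on your machine.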
## 🔥 Quick Start

### Demo

Run the interactive demo interface using Gradio:

```shell
python3 gradio_app.py
```

### Inference

### 1. Image Understanding

For visual question answering and image-understanding tasks:

```shell
python3 inference_understand.py \
    --image-path "path/to/your/image.jpg" \
    --question "What is in this image? Describe it in detail." \
    --max-new-tokens 256 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --device "cuda:0"
```

**Parameters:**
- `--image-path`: Path to the input image
- `--question`: Question or instruction for the model
- `--max-new-tokens`: Maximum number of tokens to generate (default: 256)
- `--model-config`: Path to the model configuration file
- `--checkpoint`: Path to the model checkpoint
- `--device`: Device to run inference on

### 2. Text-to-Image Generation

To generate images from text prompts:

```shell
python3 inference_generation.py \
    --prompt "a photo of a cute cat" \
    --save-path "./outputs/a photo of a cute cat.jpg" \
    --num-images 1 \
    --cfg 1.1 \
    --topk 1000 \
    --topp 0.8 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --diffusion-as-decoder \
    --device "cuda:0"
```

**Parameters:**
- `--prompt`: Text prompt for image generation
- `--save-path`: Path to save the generated image
- `--num-images`: Number of images to generate (default: 1)
- `--cfg`: Classifier-free guidance scale (default: 1.0)
- `--topk`: Top-k sampling parameter (default: 1000)
- `--topp`: Top-p sampling parameter (default: 0.8)
- `--diffusion-as-decoder`: Use a diffusion model as the decoder for high-quality generation

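The `--topk`/`--topp` flags refer to the standard top-k plus nucleus (top-p) filtering applied when sampling image tokens. A minimal pure-Python sketch of that filtering step (not the repo's actual sampler) is:

```python
import math


def top_k_top_p_filter(logits, topk=1000, topp=0.8):
    """Keep the top-k logits, then keep the smallest prefix of them
    whose cumulative probability reaches topp (nucleus sampling).
    Returns renormalized (token_index, probability) pairs."""
    # Sort token indices by logit, descending, and keep the top k.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:topk]
    # Softmax over the kept logits (shifted by the max for stability).
    m = max(logits[i] for i in order)
    exps = [math.exp(logits[i] - m) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus cut-off: smallest prefix with cumulative mass >= topp.
    kept, cum = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cum += p
        if cum >= topp:
            break
    # Renormalize over the surviving candidates.
    z = sum(p for _, p in kept)
    return [(i, p / z) for i, p in kept]
```

With the README defaults (`topk=1000`, `topp=0.8`), sampling then draws one token from the renormalized survivors; lower `topp` trades diversity for fidelity.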
### 3. Image Editing

To edit images based on text instructions:

```shell
python3 inference_edit.py \
    --image-path "./outputs/a photo of a cute cat.jpg" \
    --instruction "change the color of cat to blue" \
    --save-path "./outputs/edited_image.jpg" \
    --cfg 1.1 \
    --topk 1000 \
    --topp 0.8 \
    --model-config "star/configs/STAR_Qwen2.5-VL-7B.json" \
    --checkpoint "checkpoints/STAR-7B.pt" \
    --diffusion-as-decoder \
    --device "cuda:0"
```

**Parameters:**
- `--image-path`: Path to the input image to be edited
- `--instruction`: Text instruction describing the desired edit
- `--save-path`: Path to save the edited image
- `--cfg`: Classifier-free guidance scale for editing
- `--topk`: Top-k sampling parameter
- `--topp`: Top-p sampling parameter
- `--diffusion-as-decoder`: Use a diffusion model for high-quality image decoding

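Both generation and editing expose a `--cfg` flag. The standard classifier-free guidance update it refers to mixes conditional and unconditional logits; how STAR applies it internally is not documented here, so the snippet below is only the textbook formula:

```python
def apply_cfg(cond_logits, uncond_logits, cfg_scale):
    """Classifier-free guidance: push the conditional logits away from
    the unconditional ones by cfg_scale times their difference.
    cfg_scale == 1.0 leaves the conditional logits unchanged."""
    return [u + cfg_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]
```

This is why the README's default of 1.0 means "no guidance", while the example value 1.1 nudges samples slightly toward the prompt.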
## ✍️ Citation

```bibtex
@article{2025star,
  title   = {STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning},
  author  = {Qin, Jie and Huang, Jiancheng and Qiao, Limeng and Ma, Lin},
  journal = {arXiv preprint arXiv:},
  year    = {2025}
}
```

## License
STAR is licensed under the Apache License 2.0.