---
license: apache-2.0
language:
  - en
tags:
  - image-generation
  - image-understanding
  - image-editing
  - multimodal
  - autoregressive
  - text-to-image
  - unified-model
pipeline_tag: image-to-text
base_model: ShareLab-SII/UniAR-SFT
---

# UniAR: Unified Multimodal Autoregressive Modeling with Shared Context--Visual Tokenizer is Key to Unification (ICML 2026)

**UniAR** is a unified autoregressive multimodal model that handles **image understanding**, **image generation**, and **image editing** in a single Transformer. This checkpoint, UniAR-RL, is obtained by applying reinforcement learning (GRPO) to [UniAR-SFT](https://huggingface.co/ShareLab-SII/UniAR-SFT), and achieves state-of-the-art text rendering and instruction-following performance among unified models.

[![arXiv](https://img.shields.io/badge/arXiv-2606.18249-b31b1b.svg)](https://arxiv.org/abs/2606.18249)
[![Project Page](https://img.shields.io/badge/Project-Page-blue.svg)](https://sharelab-sii.github.io/uniar-web)
[![Code](https://img.shields.io/badge/GitHub-Code-black.svg)](https://github.com/ShareLab-SII/UniAR)

## Model Description

UniAR uses a single discrete visual tokenizer (BSQ) as the key bridge between understanding and generation, enabling a shared context where the model can directly interpret its own generated visual tokens. Key components:

- **Backbone:** Qwen3-8B
- **Visual Tokenizer:** BSQ-quantized SigLIP2-So400M ViT with DeepStack connections
- **Visual Decoder:** SD3.5-Medium DiT with SigLIP feature injection
- **Training:** Pre-training (1T tokens) → SFT → RL (GRPO with multi-reward stack)

This checkpoint (`UniAR-RL`) is the final RL-finetuned model with improved generation quality.
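
To make the role of the discrete tokenizer more concrete, here is a minimal sketch of the general binary spherical quantization (BSQ) idea: a latent vector is projected onto the unit sphere and only the sign of each coordinate is kept, so a `d`-dimensional latent becomes a `d`-bit token index. The helper names `bsq_quantize` and `bsq_lookup` are hypothetical, and the bit width, the projection layers, and the tie-breaking rule at zero are assumptions of this sketch, not details taken from the UniAR code.

```python
import math


def bsq_quantize(vector):
    """Quantize one latent vector (assumed already projected to a low dimension).

    Returns the quantized unit-norm code and its integer token index.
    """
    dim = len(vector)
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    unit = [x / norm for x in vector]  # point on the unit sphere

    # Keep only the sign of each coordinate (zero goes to the positive side
    # here, which is an arbitrary choice made for this sketch).
    bits = [1 if x >= 0 else 0 for x in unit]

    # Each code entry is +/- 1/sqrt(dim), so the code also has unit norm.
    scale = 1.0 / math.sqrt(dim)
    code = [scale if b else -scale for b in bits]

    # Pack the bits, most significant first, into one token index.
    index = 0
    for b in bits:
        index = (index << 1) | b
    return code, index


def bsq_lookup(index, dim):
    """Recover the quantized code from a token index without a stored codebook."""
    scale = 1.0 / math.sqrt(dim)
    return [scale if (index >> (dim - 1 - k)) & 1 else -scale for k in range(dim)]


code, index = bsq_quantize([0.3, -1.2, 0.5, 2.0])
print(index, code)
```

Because the code can be rebuilt from the index alone, the vocabulary grows as `2**d` without a learned codebook table, which is one reason this family of quantizers is convenient for autoregressive token prediction.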

## Checkpoint Contents

This is a self-contained checkpoint with all components needed for both understanding and generation:

| Component | Path | Description |
|-----------|------|-------------|
| AR model | `*.safetensors` | Unified autoregressive model weights |
| BSQ encoder | `bsq_encoder/` | BSQ-quantized image tokenizer |
| SD3 transformer | `sd3_transformer/` | SD3 transformer with visual feature injection |
| SD3 pipeline | `sd3_pipeline/` | SD3 VAE + text encoders |

## Usage

### Installation

```bash
conda create -n uniar python=3.12 -y
conda activate uniar

git clone https://github.com/ShareLab-SII/UniAR.git
cd UniAR
pip install -e .            # inference dependencies
```

### Image Understanding

```python
import torch
from transformers import AutoProcessor
from uniar import UniARForConditionalGeneration

model_path = "ShareLab-SII/UniAR-RL"
model = UniARForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda().eval()
processor = AutoProcessor.from_pretrained(model_path)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"type": "text", "text": "Describe this image in detail."},
]}]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
inputs.pop("mm_token_type_ids", None)

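# Greedy decoding under bf16 autocast; no gradients are needed for inference.
with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
# The returned sequences start with the prompt, so drop the prompt tokens from each one.
output_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])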
```

### Image Generation

```python
import torch
from transformers import AutoProcessor
from uniar import UniARForConditionalGeneration, UniARVisualDecoder
from inference.visual_inputs import prepare_visual_inputs

model_path = "ShareLab-SII/UniAR-RL"
device = torch.device("cuda")

ar_model = UniARForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to(device).eval()
processor = AutoProcessor.from_pretrained(model_path, padding_side="left")
visual_decoder = UniARVisualDecoder.from_pretrained(model_path, device=device)

# prepare inputs
visual_inputs = prepare_visual_inputs(
    ["A cute anime girl."],
    ar_model,
    processor,
    ar_height=960,
    ar_width=960,
)

# autoregressively generate visual indices
indices = ar_model.generate_visual(
    **visual_inputs,
    temperature=1.0,
    cfg=1.5,
    show_progress=True,
)

# decode visual indices into image
images = visual_decoder.decode(
    indices,
    ar_height=960,
    ar_width=960,
    upsampling_ratio=1.067,
)

images[0].save("output.png")
```
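
The `cfg` and `temperature` arguments above control how visual tokens are sampled. As a rough illustration, the sketch below shows the common classifier-free guidance formulation on next-token logits, where a conditional and an unconditional prediction are blended before sampling. Whether `generate_visual` uses exactly this formula is an assumption; `apply_cfg` and `softmax` are hypothetical helpers written for this example.

```python
import math


def apply_cfg(cond_logits, uncond_logits, cfg):
    """Blend logits as uncond + cfg * (cond - uncond).

    cfg = 1.0 returns the conditional logits; larger values push the
    distribution further toward tokens favored by the prompt.
    """
    return [u + cfg * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits, temperature=1.0):
    """Turn logits into sampling probabilities at the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy two-token vocabulary: the prompt favors token 0.
cond = [2.0, 0.0]    # logits with the text prompt
uncond = [1.0, 0.0]  # logits with the prompt dropped
guided = apply_cfg(cond, uncond, cfg=1.5)
print(guided, softmax(guided, temperature=1.0))
```

In this formulation, raising `cfg` trades sample diversity for prompt adherence, while raising `temperature` flattens the distribution and does the opposite.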

## Citation

```bibtex
@inproceedings{peng2026uniar,
  title={Unified Multimodal Autoregressive Modeling with Shared Context --- Visual Tokenizer is Key to Unification},
  author={Peng, Wujian and Meng, Lingchen and Cai, Yuxuan and Zhuang, Xianwei and Yang, Yuhuan and Fang, Rongyao and Wu, Chenfei and Lin, Junyang and Wu, Zuxuan and Bai, Shuai},
  booktitle={ICML},
  year={2026}
}
```