license: apache-2.0
language:
- en
tags:
- image-generation
- image-understanding
- image-editing
- multimodal
- autoregressive
- text-to-image
- unified-model
pipeline_tag: image-to-text
base_model: ShareLab-SII/UniAR-SFT
UniAR: Unified Multimodal Autoregressive Modeling with Shared Context -- Visual Tokenizer is Key to Unification (ICML 2026)
UniAR is a unified autoregressive multimodal model for image understanding, image generation, and image editing in a single Transformer. UniAR-RL is obtained by reinforcement learning (GRPO) on top of UniAR-SFT, achieving state-of-the-art text rendering and instruction-following performance among unified models.
Model Description
UniAR uses a single discrete visual tokenizer (BSQ) as the key bridge between understanding and generation, enabling a shared context where the model can directly interpret its own generated visual tokens. Key components:
- Backbone: Qwen3-8B
- Visual Tokenizer: BSQ-quantized SigLIP2-So400M ViT with DeepStack connections
- Visual Decoder: SD3.5-Medium DiT with SigLIP feature injection
- Training: Pre-training (1T tokens) → SFT → RL (GRPO with multi-reward stack)
This checkpoint (UniAR-RL) is the final RL-finetuned model with improved generation quality.
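To make the tokenizer idea more concrete, here is a minimal sketch of binary spherical quantization (BSQ) as it is commonly described: a feature vector is projected onto the unit sphere, each coordinate is snapped to ±1/√d, and the sign pattern is read as a discrete token id. The function name `bsq_quantize`, the 4-dimensional toy input, and the bit-to-index convention are illustrative assumptions; the real UniAR tokenizer works on learned SigLIP2 features, and its code dimension and index layout are not stated in this card.

```python
import math


def bsq_quantize(vec):
    """Toy binary spherical quantization of a single vector (illustrative only)."""
    d = len(vec)
    # 1. project the vector onto the unit sphere (guard against a zero vector)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    unit = [x / norm for x in vec]
    # 2. keep only the sign of every coordinate
    bits = [1 if x > 0 else 0 for x in unit]
    # 3. snap to the hypercube corner +/- 1/sqrt(d), which also has unit norm
    scale = 1.0 / math.sqrt(d)
    quantized = [scale if b else -scale for b in bits]
    # 4. read the sign pattern as a d-bit integer (bit i <-> coordinate i; assumed convention)
    token_id = sum(b << i for i, b in enumerate(bits))
    return quantized, token_id


quantized, token_id = bsq_quantize([0.3, -1.2, 0.8, -0.1])
print(quantized, token_id)
```

Because the codebook is implicit (every sign pattern is a valid code), no embedding table lookup is needed to quantize, which is one reason this family of tokenizers scales to large vocabularies.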
Checkpoint Contents
This is a self-contained checkpoint with all components needed for both understanding and generation:
| Component | Path | Description |
|---|---|---|
| AR model | *.safetensors | Unified autoregressive model weights |
| BSQ encoder | bsq_encoder/ | BSQ quantized image tokenizer |
| SD3 transformer | sd3_transformer/ | SD3 transformer with visual feature injection |
| SD3 pipeline | sd3_pipeline/ | SD3 VAE + text encoders |
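If you work from a local copy of the checkpoint, a quick sanity check against the layout in the table can catch an incomplete download early. The sketch below uses only the standard library; `missing_components` and `EXPECTED_DIRS` are hypothetical names introduced here, and the check looks only for the paths listed above, not for individual files inside them.

```python
from pathlib import Path

# sub-directories listed in the table above
EXPECTED_DIRS = ("bsq_encoder", "sd3_transformer", "sd3_pipeline")


def missing_components(checkpoint_dir):
    """Return the expected checkpoint entries that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    # the AR model weights are stored as one or more *.safetensors files at the top level
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing
```

An empty list from `missing_components("/path/to/UniAR-RL")` means every component named in the table is present.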
Usage
Installation
conda create -n uniar python=3.12 -y
conda activate uniar
git clone https://github.com/ShareLab-SII/UniAR.git
cd UniAR
pip install -e . # inference dependencies
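The usage examples below pass `attn_implementation="flash_attention_2"`, which relies on the separate `flash_attn` package being installed. One possible fallback is sketched here, assuming the UniAR model class also accepts the standard Transformers `"sdpa"` option (this card does not confirm that); `pick_attn_implementation` is a hypothetical helper, not part of the UniAR codebase.

```python
import importlib.util


def pick_attn_implementation():
    """Prefer FlashAttention-2 when the flash_attn package is importable."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # assumption: the model also supports PyTorch scaled-dot-product attention
    return "sdpa"


print(pick_attn_implementation())
```

The returned string can then be passed as the `attn_implementation` argument of `from_pretrained`.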
Image Understanding
import torch
from transformers import AutoProcessor
from uniar import UniARForConditionalGeneration

model_path = "ShareLab-SII/UniAR-RL"

# load the unified AR model in bfloat16 with FlashAttention-2
model = UniARForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda().eval()
processor = AutoProcessor.from_pretrained(model_path)

# one user turn containing an image and a text instruction
messages = [{"role": "user", "content": [
    {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"type": "text", "text": "Describe this image in detail."},
]}]

# tokenize the chat and move the tensors to the model device
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
inputs.pop("mm_token_type_ids", None)

# greedy decoding
with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

# drop the prompt tokens so only the newly generated answer is decoded
output_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
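The last step of the example slices the prompt off each generated sequence, because `generate` on a decoder-only model returns the prompt followed by the continuation. The following toy illustration uses plain Python lists in place of tensors; the token ids are made up and `strip_prompt` is a hypothetical helper name.

```python
def strip_prompt(input_ids, output_ids):
    """Keep only the tokens that were generated after each prompt."""
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


# made-up token ids: each output starts with its own prompt
prompts = [[101, 7, 8], [101, 9, 10]]
generated = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 2]]
print(strip_prompt(prompts, generated))  # [[42, 43], [44, 2]]
```

In a padded batch all prompts share one length, so the same slice applies to every row.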
Image Generation
import torch
from transformers import AutoProcessor
from uniar import UniARForConditionalGeneration, UniARVisualDecoder
from inference.visual_inputs import prepare_visual_inputs

model_path = "ShareLab-SII/UniAR-RL"
device = torch.device("cuda")

# load the AR model, the processor, and the visual decoder
ar_model = UniARForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to(device).eval()
processor = AutoProcessor.from_pretrained(model_path, padding_side="left")
visual_decoder = UniARVisualDecoder.from_pretrained(model_path, device=device)

# prepare inputs
visual_inputs = prepare_visual_inputs(
    ["A cute anime girl."],
    ar_model,
    processor,
    ar_height=960,
    ar_width=960,
)

# autoregressively generate visual indices
indices = ar_model.generate_visual(
    **visual_inputs,
    temperature=1.0,
    cfg=1.5,
    show_progress=True,
)

# decode visual indices into image
images = visual_decoder.decode(
    indices,
    ar_height=960,
    ar_width=960,
    upsampling_ratio=1.067,
)
images[0].save("output.png")
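The `cfg=1.5` argument most likely refers to classifier-free guidance. In its usual formulation for autoregressive sampling, the logits of a prompt-conditioned pass are extrapolated away from those of an unconditional pass before the next token is sampled. The sketch below shows that standard formula on toy numbers; whether `generate_visual` applies exactly this rule is an assumption, and `apply_cfg` is a hypothetical helper.

```python
def apply_cfg(cond_logits, uncond_logits, cfg):
    """Standard classifier-free guidance: uncond + cfg * (cond - uncond)."""
    return [u + cfg * (c - u) for c, u in zip(cond_logits, uncond_logits)]


cond = [2.0, 0.0]    # toy logits from the prompt-conditioned pass
uncond = [1.0, 1.0]  # toy logits from the unconditional pass
print(apply_cfg(cond, uncond, 1.5))  # [2.5, -0.5]
```

With `cfg=1.0` the formula returns the conditional logits unchanged; larger values push sampling further toward tokens the prompt favors.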
Citation
@inproceedings{peng2026uniar,
title={Unified Multimodal Autoregressive Modeling with Shared Context --- Visual Tokenizer is Key to Unification},
author={Peng, Wujian and Meng, Lingchen and Cai, Yuxuan and Zhuang, Xianwei and Yang, Yuhuan and Fang, Rongyao and Wu, Chenfei and Lin, Junyang and Wu, Zuxuan and Bai, Shuai},
booktitle={ICML},
year={2026}
}