| license: apache-2.0 | |
| language: | |
| - en | |
| tags: | |
| - image-generation | |
| - image-understanding | |
| - image-editing | |
| - multimodal | |
| - autoregressive | |
| - text-to-image | |
| - unified-model | |
| pipeline_tag: image-to-text | |
| base_model: ShareLab-SII/UniAR-SFT | |
| # UniAR: Unified Multimodal Autoregressive Modeling with Shared Context -- Visual Tokenizer is Key to Unification (ICML 2026) | |
| **UniAR** is a unified autoregressive multimodal model for **image understanding**, **image generation**, and **image editing** in a single Transformer. UniAR-RL is obtained by reinforcement learning (GRPO) on top of [UniAR-SFT](https://huggingface.co/ShareLab-SII/UniAR-SFT), achieving state-of-the-art text rendering and instruction-following performance among unified models. | |
| [Paper (arXiv)](https://arxiv.org/abs/2606.18249) | |
| [Project page](https://sharelab-sii.github.io/uniar-web) | |
| [Code (GitHub)](https://github.com/ShareLab-SII/UniAR) | |
| ## Model Description | |
| UniAR uses a single discrete visual tokenizer (BSQ) as the key bridge between understanding and generation, enabling a shared context where the model can directly interpret its own generated visual tokens. Key components: | |
| - **Backbone:** Qwen3-8B | |
| - **Visual Tokenizer:** BSQ-quantized SigLIP2-So400M ViT with DeepStack connections | |
| - **Visual Decoder:** SD3.5-Medium DiT with SigLIP feature injection | |
| - **Training:** Pre-training (1T tokens) → SFT → RL (GRPO with multi-reward stack) | |
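To make the role of the BSQ tokenizer more concrete, here is a minimal sketch of binary spherical quantization on a single latent vector: normalize onto the unit sphere, keep one sign bit per dimension, and pack the bits into an integer token id. The function name `bsq_quantize`, the 4-dimensional input, and the bit-packing order are illustrative assumptions; the actual code dimension and any grouping used by UniAR's tokenizer are not stated here.

```python
import math

def bsq_quantize(u):
    """Binary spherical quantization of one latent vector (illustrative sketch)."""
    norm = math.sqrt(sum(x * x for x in u))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    unit = [x / norm for x in u]                      # project onto the unit sphere
    bits = [1 if x >= 0 else 0 for x in unit]         # one sign bit per dimension
    scale = 1.0 / math.sqrt(len(u))
    code = [scale if b else -scale for b in bits]     # quantized vector, still unit-norm
    index = sum(b << k for k, b in enumerate(bits))   # pack bits into an integer id
    return code, index

# Hypothetical 4-dimensional latent; real tokenizers use many more dimensions.
code, index = bsq_quantize([0.3, -1.2, 0.8, -0.1])
print(code, index)
```

Because every token is just a bit pattern, the same discrete ids can serve as input for understanding and as prediction targets for generation, which is the shared-context idea described above.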
| This checkpoint (`UniAR-RL`) is the final RL-finetuned model with improved generation quality. | |
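The RL stage is described as GRPO with a multi-reward stack. A minimal sketch of the group-relative advantage at the core of GRPO follows: several samples are drawn for one prompt, each gets a scalar reward, and rewards are standardized within the group. The reward names and weights are hypothetical placeholders, and the population standard deviation is one possible choice (implementations differ); none of this is taken from the UniAR training code.

```python
from statistics import mean, pstdev

def combine_rewards(reward_terms, weights):
    """Weighted sum of several reward terms for one sample (weights are hypothetical)."""
    return sum(weights[name] * value for name, value in reward_terms.items())

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward stack for four images sampled from one prompt.
weights = {"text_rendering": 0.5, "instruction_following": 0.5}
samples = [
    {"text_rendering": 0.9, "instruction_following": 0.8},
    {"text_rendering": 0.2, "instruction_following": 0.6},
    {"text_rendering": 0.5, "instruction_following": 0.5},
    {"text_rendering": 0.7, "instruction_following": 0.9},
]
rewards = [combine_rewards(s, weights) for s in samples]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples that score above the group mean receive a positive advantage and are reinforced; those below the mean are pushed down, with no separate value network required.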
| ## Checkpoint Contents | |
| This is a self-contained checkpoint with all components needed for both understanding and generation: | |
| | Component | Path | Description | | |
| |-----------|------|-------------| | |
| | AR model | `*.safetensors` | Unified autoregressive model weights | | |
| | BSQ encoder | `bsq_encoder/` | BSQ quantized image tokenizer | | |
| | SD3 transformer | `sd3_transformer/` | SD3 transformer with visual feature injection | | |
| | SD3 pipeline | `sd3_pipeline/` | SD3 VAE + text encoders | | |
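After downloading the checkpoint to a local directory, a quick layout check can catch incomplete downloads before any weights are loaded. The sketch below only inspects paths from the table above; the helper name `missing_components` is hypothetical, and the demo builds a fake directory rather than touching a real checkpoint.

```python
import tempfile
from pathlib import Path

# Sub-directories listed in the checkpoint contents table.
EXPECTED_DIRS = ("bsq_encoder", "sd3_transformer", "sd3_pipeline")

def missing_components(checkpoint_dir):
    """Return the expected components that are absent from a local checkpoint directory."""
    root = Path(checkpoint_dir)
    missing = [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing

# Demo on a fake layout that lacks the SD3 pipeline and the AR weights.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "bsq_encoder").mkdir()
    (Path(tmp) / "sd3_transformer").mkdir()
    report = missing_components(tmp)
print(report)
```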
| ## Usage | |
| ### Installation | |
| ```bash | |
| conda create -n uniar python=3.12 -y | |
| conda activate uniar | |
| git clone https://github.com/ShareLab-SII/UniAR.git | |
| cd UniAR | |
| pip install -e . # inference dependencies | |
| ``` | |
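The usage examples below request `attn_implementation="flash_attention_2"`, which requires the separate `flash_attn` package. A small helper like the following can choose a backend at runtime. It assumes that UniAR also runs with the `"sdpa"` backend of Transformers, which this model card does not confirm; the helper name is hypothetical.

```python
import importlib.util

def pick_attn_implementation():
    """Return "flash_attention_2" if flash_attn is importable, otherwise "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Assumption: the model also works with PyTorch scaled-dot-product attention.
    return "sdpa"

attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can then be passed as the `attn_implementation` argument of `from_pretrained`.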
| ### Image Understanding | |
| ```python | |
| import torch | |
| from transformers import AutoProcessor | |
| from uniar import UniARForConditionalGeneration | |
| model_path = "ShareLab-SII/UniAR-RL" | |
| # Load the unified AR model in bfloat16 on the GPU. | |
| model = UniARForConditionalGeneration.from_pretrained( | |
|     model_path, | |
|     torch_dtype=torch.bfloat16, | |
|     attn_implementation="flash_attention_2", | |
| ).cuda().eval() | |
| processor = AutoProcessor.from_pretrained(model_path) | |
| # One user turn containing an image and a text instruction. | |
| messages = [{"role": "user", "content": [ | |
|     {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}, | |
|     {"type": "text", "text": "Describe this image in detail."}, | |
| ]}] | |
| inputs = processor.apply_chat_template( | |
|     messages, | |
|     tokenize=True, | |
|     add_generation_prompt=True, | |
|     return_dict=True, | |
|     return_tensors="pt", | |
| ).to(model.device) | |
| # Remove mm_token_type_ids (if present) before calling generate(). | |
| inputs.pop("mm_token_type_ids", None) | |
| with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16): | |
|     output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False) | |
| # Strip the prompt tokens so that only the newly generated tokens are decoded. | |
| output_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)] | |
| print(processor.batch_decode(output_ids, skip_special_tokens=True)[0]) | |
| ``` | |
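The list comprehension near the end of the example above removes the prompt from each generated sequence, since `generate()` on a decoder-only model returns the prompt tokens followed by the new tokens. The same slicing on plain Python lists, with made-up token ids and a hypothetical helper name, looks like this:

```python
def strip_prompt(input_ids, output_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]

# Made-up token ids: two prompts of length 3, each followed by two new tokens.
prompts = [[101, 7, 8], [101, 9, 10]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 2]]
new_tokens = strip_prompt(prompts, outputs)
print(new_tokens)  # [[42, 43], [44, 2]]
```

In a real batch the prompts are padded to a common length, so slicing by the padded prompt length still leaves only the generated tokens.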
| ### Image Generation | |
| ```python | |
| import torch | |
| from transformers import AutoProcessor | |
| from uniar import UniARForConditionalGeneration, UniARVisualDecoder | |
| from inference.visual_inputs import prepare_visual_inputs | |
| model_path = "ShareLab-SII/UniAR-RL" | |
| device = torch.device("cuda") | |
| # Load the AR model, the processor, and the visual decoder from the same checkpoint. | |
| ar_model = UniARForConditionalGeneration.from_pretrained( | |
|     model_path, | |
|     torch_dtype=torch.bfloat16, | |
|     attn_implementation="flash_attention_2", | |
| ).to(device).eval() | |
| processor = AutoProcessor.from_pretrained(model_path, padding_side="left") | |
| visual_decoder = UniARVisualDecoder.from_pretrained(model_path, device=device) | |
| # Prepare inputs for a batch of text prompts. | |
| visual_inputs = prepare_visual_inputs( | |
|     ["A cute anime girl."], | |
|     ar_model, | |
|     processor, | |
|     ar_height=960, | |
|     ar_width=960, | |
| ) | |
| # Autoregressively generate visual token indices. | |
| indices = ar_model.generate_visual( | |
|     **visual_inputs, | |
|     temperature=1.0, | |
|     cfg=1.5, | |
|     show_progress=True, | |
| ) | |
| # Decode the visual indices into images. | |
| images = visual_decoder.decode( | |
|     indices, | |
|     ar_height=960, | |
|     ar_width=960, | |
|     upsampling_ratio=1.067, | |
| ) | |
| images[0].save("output.png") | |
| ``` | |
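The `cfg` and `temperature` arguments above are sampling controls. As a rough illustration, the sketch below shows the common form of classifier-free guidance on next-token logits, followed by a temperature softmax. Whether `generate_visual` combines conditional and unconditional logits in exactly this way is an assumption; the function names and the logit values are made up.

```python
import math

def apply_cfg(cond_logits, uncond_logits, cfg):
    """Push logits away from the unconditional prediction; cfg=1.0 returns cond_logits."""
    return [u + cfg * (c - u) for c, u in zip(cond_logits, uncond_logits)]

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits over a two-token vocabulary.
guided = apply_cfg([2.0, 0.0], [1.0, 0.0], cfg=1.5)
probs = softmax_with_temperature(guided, temperature=1.0)
print(guided, probs)
```

With `cfg` above 1.0, tokens favored by the prompt-conditioned prediction gain probability relative to the unconditional one, which typically trades diversity for prompt adherence.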
| ## Citation | |
| ```bibtex | |
| @inproceedings{peng2026uniar, | |
| title={Unified Multimodal Autoregressive Modeling with Shared Context --- Visual Tokenizer is Key to Unification}, | |
| author={Peng, Wujian and Meng, Lingchen and Cai, Yuxuan and Zhuang, Xianwei and Yang, Yuhuan and Fang, Rongyao and Wu, Chenfei and Lin, Junyang and Wu, Zuxuan and Bai, Shuai}, | |
| booktitle={ICML}, | |
| year={2026} | |
| } | |
| ``` | |