---
license: apache-2.0
language:
- en
tags:
- image-generation
- image-understanding
- image-editing
- multimodal
- autoregressive
- text-to-image
- unified-model
pipeline_tag: image-to-text
base_model: ShareLab-SII/UniAR-SFT
---

# UniAR: Unified Multimodal Autoregressive Modeling with Shared Context--Visual Tokenizer is Key to Unification (ICML 2026)

**UniAR** is a unified autoregressive multimodal model for **image understanding**, **image generation**, and **image editing** in a single Transformer. UniAR-RL is obtained by reinforcement learning (GRPO) on top of [UniAR-SFT](https://huggingface.co/ShareLab-SII/UniAR-SFT), achieving state-of-the-art text rendering and instruction-following performance among unified models.

[![arXiv](https://img.shields.io/badge/arXiv-2606.18249-b31b1b.svg)](https://arxiv.org/abs/2606.18249)
[![Project Page](https://img.shields.io/badge/Project-Page-blue.svg)](https://sharelab-sii.github.io/uniar-web)
[![Code](https://img.shields.io/badge/GitHub-Code-black.svg)](https://github.com/ShareLab-SII/UniAR)

## Model Description

UniAR uses a single discrete visual tokenizer (BSQ) as the key bridge between understanding and generation, enabling a shared context where the model can directly interpret its own generated visual tokens.

Key components:

- **Backbone:** Qwen3-8B
- **Visual Tokenizer:** BSQ-quantized SigLIP2-So400M ViT with DeepStack connections
- **Visual Decoder:** SD3.5-Medium DiT with SigLIP feature injection
- **Training:** Pre-training (1T tokens) → SFT → RL (GRPO with multi-reward stack)

This checkpoint (`UniAR-RL`) is the final RL-finetuned model with improved generation quality.
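To make the tokenizer's role more concrete, here is a minimal sketch of binary spherical quantization, assuming that is what BSQ refers to here. It illustrates the general technique only: a feature vector is normalized onto the unit sphere, each coordinate is binarized by sign, and the bits are packed into one discrete token id. The function names `bsq_quantize` and `bsq_dequantize` are hypothetical and are not part of the UniAR codebase; the encoder shipped in `bsq_encoder/` may differ in dimension, projection layers, and index layout.

```python
import math


def bsq_quantize(z):
    """Illustrative BSQ step (not UniAR's implementation).

    Returns the quantized unit vector, its sign bits, and an integer token id.
    """
    norm = math.sqrt(sum(x * x for x in z))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    u = [x / norm for x in z]                  # project onto the unit sphere
    bits = [1 if x > 0 else 0 for x in u]      # binarize each coordinate by sign
    scale = 1.0 / math.sqrt(len(z))            # keeps the quantized vector at unit norm
    q = [scale if b else -scale for b in bits]
    index = sum(b << i for i, b in enumerate(bits))  # pack bits into one token id
    return q, bits, index


def bsq_dequantize(index, dim):
    """Recover the quantized unit vector from an integer token id."""
    scale = 1.0 / math.sqrt(dim)
    return [scale if (index >> i) & 1 else -scale for i in range(dim)]


q, bits, index = bsq_quantize([0.5, -1.0, 2.0, -0.1])
print(bits, index)               # [1, 0, 1, 0] 5
print(bsq_dequantize(index, 4))  # [0.5, -0.5, 0.5, -0.5]
```

Because the codebook is implicit (every sign pattern is a valid code), no lookup table is needed to map between token ids and quantized vectors. This is one reason a discrete tokenizer of this kind can serve the same token sequence to both understanding and generation.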

## Checkpoint Contents

This is a self-contained checkpoint with all components needed for both understanding and generation:

| Component | Path | Description |
|-----------|------|-------------|
| AR model | `*.safetensors` | Unified autoregressive model weights |
| BSQ encoder | `bsq_encoder/` | BSQ quantized image tokenizer |
| SD3 transformer | `sd3_transformer/` | SD3 transformer with visual feature injection |
| SD3 pipeline | `sd3_pipeline/` | SD3 VAE + text encoders |

## Usage

### Installation

```bash
conda create -n uniar python=3.12 -y
conda activate uniar

git clone https://github.com/ShareLab-SII/UniAR.git
cd UniAR
pip install -e .  # inference dependencies
```

### Image Understanding

```python
import torch
from transformers import AutoProcessor
from uniar import UniARForConditionalGeneration

model_path = "ShareLab-SII/UniAR-RL"

# Load the unified AR model in bfloat16 and move it to the GPU.
model = UniARForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda().eval()
processor = AutoProcessor.from_pretrained(model_path)

# A single-turn chat with one image and one text instruction.
messages = [{"role": "user", "content": [
    {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"type": "text", "text": "Describe this image in detail."},
]}]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
# Drop this key if the processor produced it; it is not passed to generate().
inputs.pop("mm_token_type_ids", None)

with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

# Strip the prompt tokens so only the newly generated tokens are decoded.
output_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

### Image Generation

```python
import torch
from transformers import AutoProcessor
from uniar import UniARForConditionalGeneration, UniARVisualDecoder
from inference.visual_inputs import prepare_visual_inputs

model_path = "ShareLab-SII/UniAR-RL"
device = torch.device("cuda")

ar_model = UniARForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to(device).eval()
processor = AutoProcessor.from_pretrained(model_path, padding_side="left")
visual_decoder = UniARVisualDecoder.from_pretrained(model_path, device=device)

# Prepare inputs for a batch of text prompts.
visual_inputs = prepare_visual_inputs(
    ["A cute anime girl."],
    ar_model,
    processor,
    ar_height=960,
    ar_width=960,
)

# Autoregressively generate visual token indices.
indices = ar_model.generate_visual(
    **visual_inputs,
    temperature=1.0,
    cfg=1.5,
    show_progress=True,
)

# Decode the visual indices into images.
images = visual_decoder.decode(
    indices,
    ar_height=960,
    ar_width=960,
    upsampling_ratio=1.067,
)
images[0].save("output.png")
```

## Citation

```bibtex
@inproceedings{peng2026uniar,
  title={Unified Multimodal Autoregressive Modeling with Shared Context --- Visual Tokenizer is Key to Unification},
  author={Peng, Wujian and Meng, Lingchen and Cai, Yuxuan and Zhuang, Xianwei and Yang, Yuhuan and Fang, Rongyao and Wu, Chenfei and Lin, Junyang and Wu, Zuxuan and Bai, Shuai},
  booktitle={ICML},
  year={2026}
}
```