Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation
Yichen Zhang1*, Da Peng2*, Zonghao Guo1†, Zijian Zhang3, Xuesong Yang3,
Tong Sun3, Shichu Sun3, Yidan Zhang3, Yanghao Li1, Haiyan Zhao1, Wang Xu1,
Qi Shi1, Yangang Sun1, Chi Chen1, Shuo Wang1, Yukun Yan1, Xu Han1,
Qiang Ma1, Wei Ke2, Liang Wang3, Zhiyuan Liu1, Maosong Sun1
1Tsinghua University, 2Xi'an Jiaotong University, 3University of Chinese Academy of Sciences
* Equal contribution † Corresponding author
What is Cheers?
A recent frontier in multimodal modeling is unifying visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making them non-trivial to optimize jointly in a shared feature space. In this work, we present Cheers, a unified multimodal model (UMM) that decouples patch-level details from semantic representations, stabilizing semantics for multimodal understanding while improving fidelity for image generation via gated detail residuals. Cheers comprises three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning; (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation with diffusion decoding for image generation; and (iii) a cascaded flow-matching head that first decodes visual semantics and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Notably, Cheers outperforms Tar-1.5B on GenEval and MMBench while requiring only 20% of the training cost, indicating effective and efficient (4x token compression) unified multimodal modeling.
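The gated detail-residual injection in component (iii) can be sketched as follows. This is a minimal illustration of the general idea, not the released implementation: the projection layers, the sigmoid gate, and the tensor shapes are all assumptions.

```python
import torch
import torch.nn as nn

class GatedDetailFusion(nn.Module):
    """Illustrative sketch: inject patch-level detail residuals into
    decoded semantic features, gated by the semantics themselves.
    The layer choices here are assumptions, not the Cheers release."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)               # semantic-conditioned gate
        self.proj = nn.Linear(dim, dim, bias=False)   # maps detail tokens into the decoder space

    def forward(self, semantic: torch.Tensor, detail: torch.Tensor) -> torch.Tensor:
        # semantic: [B, N, D] features decoded from the LLM's semantic tokens
        # detail:   [B, N, D] patch-level residuals from the vision tokenizer
        g = torch.sigmoid(self.gate(semantic))  # per-channel gate in (0, 1)
        return semantic + g * self.proj(detail)

fusion = GatedDetailFusion(dim=64)
sem = torch.randn(2, 16, 64)
det = torch.randn(2, 16, 64)
out = fusion(sem, det)
print(out.shape)  # torch.Size([2, 16, 64])
```

The point of the gate is that the semantic stream decides, channel by channel, how much high-frequency detail to admit, so understanding-oriented semantics stay stable while generation can still recover fine texture.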
- Repository: [https://github.com/AI9Stars/Cheers]
- Paper: [https://arxiv.org/abs/2603.12793]
Model Architecture
Uses
Generation
import os
import torch
from torchvision.utils import save_image
from transformers import AutoModelForCausalLM, AutoProcessor
os.environ["CUDA_VISIBLE_DEVICES"] = "7"
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"
ckpt = "ckpt_path"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True)
model.to(device)
model = model.to(torch.bfloat16)
model.eval()
content = """
In the center of a bustling intersection, a large tree with a thick trunk and sprawling branches stands out amidst the concrete.
Its green leaves contrast sharply with the grey asphalt roads that converge around it. Traffic lights and street signs are positioned awkwardly around the tree's base, creating an unusual juxtaposition of nature and urban infrastructure.
"""
images_batch = [None]
messages_batch = [
[{"role": "user", "content": content}],
]
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in messages_batch]
inputs = processor(text=texts, images=images_batch, return_tensors="pt", add_im_start_id=True)
inputs = {k: (v.to(device=device) if isinstance(v, torch.Tensor) else v) for k, v in inputs.items()}
gen_config = {
"max_length": 300,
"cfg_scale": 9.5,
"temperature": 0.0,
"num_inference_steps": 80,
"alpha": 0.5,
"edit_image": False,
}
inputs.update(gen_config)
generated = model.generate(**inputs)
input_ids = generated["input_ids"]
images = generated["images"][0]
current_img = images[0].clamp(0.0, 1.0)  # map to the valid pixel range before saving
os.makedirs("outputs", exist_ok=True)
save_image(current_img, "outputs/case_.png")
print("Saved image: outputs/case_.png")
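The `cfg_scale` option suggests classifier-free guidance in the diffusion head. Below is a minimal sketch of the standard CFG combination rule; whether Cheers uses exactly this formulation is an assumption.

```python
import torch

def cfg_combine(v_uncond: torch.Tensor, v_cond: torch.Tensor, scale: float) -> torch.Tensor:
    """Standard classifier-free guidance: extrapolate from the
    unconditional prediction toward the conditional one.
    scale=1.0 recovers the purely conditional prediction;
    larger values push generations to follow the prompt more strongly."""
    return v_uncond + scale * (v_cond - v_uncond)

v_u = torch.zeros(4)   # prediction without the text condition
v_c = torch.ones(4)    # prediction with the text condition
print(cfg_combine(v_u, v_c, 9.5))  # tensor([9.5000, 9.5000, 9.5000, 9.5000])
```

At each denoising step the model is evaluated twice (with and without the prompt), so higher `num_inference_steps` roughly doubles that many forward passes through the head.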
Understanding
import os
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"
ckpt = "ckpt_path"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True)
model = model.to(torch.bfloat16)
model.to(device)
model.eval()
content = "<im_start><image><im_end>\n Describe this image."
img = Image.open("fig/logo.png")
images_batch = [img,]
messages_batch = [
[{"role": "user", "content": content}],
]
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in messages_batch]
inputs = processor(text=texts, images=images_batch, return_tensors="pt", add_im_start_id=False)
inputs = {k: (v.to(device=device) if isinstance(v, torch.Tensor) else v) for k, v in inputs.items()}
gen_config = {
"max_length": 150,
"temperature": 0.3,
}
inputs.update(gen_config)
generated = model.generate(**inputs)
input_ids = generated["input_ids"]
print(processor.tokenizer.batch_decode(input_ids, skip_special_tokens=True))
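The `temperature` option rescales the logits before sampling the next token. A generic sketch of that standard mechanism (not Cheers-specific; here temperature 0 is treated as greedy argmax decoding, which is a common convention but an assumption about this API):

```python
import torch

def sample_with_temperature(logits: torch.Tensor, temperature: float) -> int:
    """Divide logits by the temperature before softmax: values < 1 sharpen
    the distribution toward the argmax, values > 1 flatten it."""
    if temperature == 0.0:
        return int(logits.argmax())  # greedy decoding
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.tensor([2.0, 1.0, 0.1])
print(sample_with_temperature(logits, 0.3))
```

A low value like the `0.3` in the config above keeps captions close to the model's most confident reading of the image while leaving a little variability.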
Model Card Contact
For any questions or collaborations, feel free to contact us :)
MetaPDa@gmail.com | guozonghao96@outlook.com | yichen0zhang@gmail.com
Citation
If you find Cheers useful, please cite the Cheers technical report with the following BibTeX entry.
@article{zhang2026cheers,
title={Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation},
author={Zhang, Yichen and Peng, Da and Guo, Zonghao and Zhang, Zijian and Yang, Xuesong and Sun, Tong and Sun, Shichu and Zhang, Yidan and Li, Yanghao and Zhao, Haiyan and Xu, Wang and Shi, Qi and Sun, Yangang and Chen, Chi and Wang, Shuo and Yan, Yukun and Han, Xu and Ma, Qiang and Ke, Wei and Wang, Liang and Liu, Zhiyuan and Sun, Maosong},
journal={arXiv preprint arXiv:2603.12793},
year={2026}
}