UniCom:
Unified Multimodal Modeling via Compressed Continuous Semantic Representations
Zhejiang University · Tencent Hunyuan
## Abstract
Current unified multimodal models typically rely on discrete visual tokenizers to bridge the modality gap. However, discretization inevitably discards fine-grained semantic information, leading to suboptimal performance on visual understanding tasks. Conversely, directly modeling continuous semantic representations (e.g., CLIP, SigLIP) poses significant challenges for high-dimensional generative modeling, resulting in slow convergence and training instability. To resolve this dilemma, we introduce UniCom, a unified framework that harmonizes multimodal understanding and generation via compressed continuous representations. We empirically demonstrate that reducing the channel dimension is significantly more effective than spatial downsampling for both reconstruction and generation. Accordingly, we design an attention-based semantic compressor that distills dense features into a compact unified representation. Furthermore, we validate that the transfusion architecture surpasses query-based designs in convergence and consistency. Experiments demonstrate that UniCom achieves state-of-the-art generation performance among unified models. Notably, by preserving rich semantic priors, it delivers exceptional controllability in image editing and maintains image consistency even without relying on a VAE.
## Model
This repository contains the following UniCom components:
- `unicom_hf_model/`: the main UniCom checkpoint for unified multimodal generation and editing.
- `unicom_decoder_transformer.pt`: the decoder transformer checkpoint used to decode UniCom latent representations into images.
- `flux-vae/`: the decoder-side Flux VAE required at inference time.
- `siglip2-so400m-patch16-naflex/`: the SigLIP2 vision encoder required by the decoder for reconstruction and SigLIP2-based conditioning.
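As a minimal sketch, the components listed above can be fetched selectively from the Hugging Face Hub with `snapshot_download`. The repo id `tencent/UniCom` below is an assumption for illustration; substitute the actual repository id of this model card.

```python
# Sketch: download only the UniCom components listed above.
# NOTE: the repo_id is hypothetical -- replace it with this repository's id.
from huggingface_hub import snapshot_download

COMPONENTS = [
    "unicom_hf_model",                # main UniCom checkpoint (directory)
    "unicom_decoder_transformer.pt",  # decoder transformer checkpoint (file)
    "flux-vae",                       # decoder-side Flux VAE (directory)
    "siglip2-so400m-patch16-naflex",  # SigLIP2 vision encoder (directory)
]

def component_patterns(components):
    """Map component names to allow_patterns for snapshot_download:
    directories get a trailing '/*' glob, single files stay as-is."""
    return [c if c.endswith(".pt") else f"{c}/*" for c in components]

if __name__ == "__main__":
    local_dir = snapshot_download(
        repo_id="tencent/UniCom",  # hypothetical repo id
        allow_patterns=component_patterns(COMPONENTS),
    )
    print(f"Components downloaded to {local_dir}")
```

Using `allow_patterns` avoids pulling the full repository when only a subset of checkpoints is needed.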
## Quick Start
Please see the project resources for setup and sample usage:
- GitHub repository: https://github.com/Tencent-Hunyuan/UniCom
- Project page: https://miazhao7708.github.io/UniComPage/
- Paper: https://arxiv.org/abs/2603.10702
## Citation
If you find UniCom useful for your research, please cite:
```bibtex
@misc{zhao2026unicomunifiedmultimodalmodeling,
  title={UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations},
  author={Yaqi Zhao and Wang Lin and Zijian Zhang and Miles Yang and Jingyuan Chen and Wentao Zhang and Zhao Zhong and Liefeng Bo},
  year={2026},
  eprint={2603.10702},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.10702},
}
```
## License
UniCom is licensed under the License Terms of UniCom. See ./LICENSE.txt for more details.