UniCom:
Unified Multimodal Modeling via Compressed Continuous Semantic Representations
Zhejiang University · Tencent Hunyuan
## Abstract
Current unified multimodal models typically rely on discrete visual tokenizers to bridge the modality gap. However, discretization inevitably discards fine-grained semantic information, leading to suboptimal performance on visual understanding tasks. Conversely, directly modeling continuous semantic representations (e.g., CLIP, SigLIP) poses significant challenges for high-dimensional generative modeling, resulting in slow convergence and training instability. To resolve this dilemma, we introduce UniCom, a unified framework that harmonizes multimodal understanding and generation via compressed continuous representations. We empirically demonstrate that reducing the channel dimension is significantly more effective than spatial downsampling for both reconstruction and generation. Accordingly, we design an attention-based semantic compressor that distills dense features into a compact unified representation. Furthermore, we validate that the transfusion architecture surpasses query-based designs in convergence and consistency. Experiments demonstrate that UniCom achieves state-of-the-art generation performance among unified models. Notably, by preserving rich semantic priors, it delivers exceptional controllability in image editing and maintains image consistency even without relying on a VAE.
## Model
This repository contains the following UniCom components:
- `unicom_hf_model/`: the main UniCom checkpoint for unified multimodal generation and editing.
- `unicom_decoder_transformer.pt`: the decoder transformer checkpoint used to decode UniCom latent representations into images.
- `flux-vae/`: the decoder-side Flux VAE required at inference time.
- `siglip2-so400m-patch16-naflex/`: the SigLIP2 vision encoder required by the decoder for reconstruction and SigLIP2-based conditioning.
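As a minimal sketch, the components listed above can be fetched selectively from the Hugging Face Hub with `snapshot_download`. The repo id `tencent/UniCom` below is an assumption for illustration; substitute the actual repository id of this model card.

```python
# Sketch: download only the UniCom components listed above.
# NOTE: the repo_id is hypothetical -- replace it with this repository's id.
from huggingface_hub import snapshot_download

COMPONENTS = [
    "unicom_hf_model",                # main UniCom checkpoint (directory)
    "unicom_decoder_transformer.pt",  # decoder transformer checkpoint (file)
    "flux-vae",                       # decoder-side Flux VAE (directory)
    "siglip2-so400m-patch16-naflex",  # SigLIP2 vision encoder (directory)
]

def component_patterns(components):
    """Map component names to allow_patterns for snapshot_download:
    directories get a trailing '/*' glob, single files stay as-is."""
    return [c if c.endswith(".pt") else f"{c}/*" for c in components]

if __name__ == "__main__":
    local_dir = snapshot_download(
        repo_id="tencent/UniCom",  # hypothetical repo id
        allow_patterns=component_patterns(COMPONENTS),
    )
    print(f"Components downloaded to {local_dir}")
```

Using `allow_patterns` avoids pulling the full repository when only a subset of checkpoints is needed.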
## Quick Start
Please see the project resources for setup and sample usage:
- GitHub repository: https://github.com/Tencent-Hunyuan/UniCom
- Project page: https://miazhao7708.github.io/UniComPage/
- Paper: https://arxiv.org/abs/2603.10702
## Citation
If you find UniCom useful for your research, please cite:
```bibtex
@misc{zhao2026unicomunifiedmultimodalmodeling,
  title={UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations},
  author={Yaqi Zhao and Wang Lin and Zijian Zhang and Miles Yang and Jingyuan Chen and Wentao Zhang and Zhao Zhong and Liefeng Bo},
  year={2026},
  eprint={2603.10702},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.10702},
}
```
## License
UniCom is licensed under the License Terms of UniCom. See ./LICENSE.txt for more details.