---
license: apache-2.0
pipeline_tag: image-to-image
---

# LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

**LatentUM** unifies all modalities within a shared semantic latent space, enabling interleaved cross-modal reasoning without pixel-space mediation. Unlike existing unified models that require pixel decoding as a bridge between understanding and generation, LatentUM reasons directly over its own generated visual content.

This repository specifically contains the **Pixel Decoder**, an optional diffusion-based decoder (based on Stable Diffusion 3.5 Medium) designed to render pixel-space images from the shared semantic latents.

- **Paper:** [LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model](https://huggingface.co/papers/2604.02097)
- **Repository:** [https://github.com/SJTU-DENG-Lab/LatentUM](https://github.com/SJTU-DENG-Lab/LatentUM)

## Sample Usage

To use this model, please follow the installation instructions in the [official repository](https://github.com/SJTU-DENG-Lab/LatentUM).

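The usage snippets in this section import from `model.latentum` and `model.decoder`, which appear to be packages inside the official repository rather than installed libraries, so they presumably need to be run from a checkout of that repository. As a minimal sketch under that assumption, the following standard-library check reports which modules cannot be located before any weights are downloaded. The helper name `missing_modules` is hypothetical and not part of LatentUM.

```python
import importlib.util


def missing_modules(names):
    """Return the dotted module names from `names` that cannot be located."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # find_spec imports parent packages first, so a missing parent
            # package raises ModuleNotFoundError instead of returning None.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Module names taken from the import lines of the usage snippets.
# Assumption: `model` is a package at the root of the repository checkout.
required = ["torch", "model.latentum", "model.decoder"]
missing = missing_modules(required)
print("missing modules:", ", ".join(missing) if missing else "none")
```

If `model.latentum` or `model.decoder` is reported as missing, running the script from the repository root (or adding that directory to `PYTHONPATH`) is the likely fix.
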
### Image Understanding

```python
import torch

from model.latentum import LatentUMModel

# Use the GPU when one is available; otherwise fall back to the CPU.
dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",
    device=device,
    dtype=dtype,
)

# Ask a question about a local image file.
answer = model.answer(
    "asset/blue_apple.png",
    "Describe this image.",
)
print(answer)
```

### Image Generation

```python
import torch

from model.decoder import LatentUMDecoderModel
from model.latentum import LatentUMModel

# Use the GPU when one is available; otherwise fall back to the CPU.
dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",  # alternative: "SJTU-DENG-Lab/LatentUM-GenEval"
    device=device,
    dtype=dtype,
)

# The pixel decoder renders images from the model's semantic latents.
decoder = LatentUMDecoderModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Decoder",
    device=device,
    dtype=dtype,
)

images = model.generate_images(
    "a photo of a cute dog",
    decoder=decoder,
    show_progress=True,
)
images[0].save("generated.png")
```

## Citation

```bibtex
@article{jin2026latentum,
  title   = {LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model},
  author  = {Jiachun Jin and Zetong Zhou and Xiao Yang and Hao Zhang and Pengfei Liu and Jun Zhu and Zhijie Deng},
  journal = {arXiv preprint arXiv:2604.02097},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.02097}
}
```