---
license: apache-2.0
pipeline_tag: image-to-image
---

# LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

**LatentUM** unifies all modalities within a shared semantic latent space, enabling interleaved cross-modal reasoning without pixel-space mediation. Unlike existing unified models that require pixel decoding as a bridge between understanding and generation, LatentUM reasons directly over its own generated visual content.

This repository specifically contains the **Pixel Decoder**, an optional diffusion-based decoder (based on Stable Diffusion 3.5 Medium) designed to render pixel-space images from the shared semantic latents.

- **Paper:** [LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model](https://huggingface.co/papers/2604.02097)
- **Repository:** [https://github.com/SJTU-DENG-Lab/LatentUM](https://github.com/SJTU-DENG-Lab/LatentUM)

## Sample Usage

To use this model, please follow the installation instructions in the [official repository](https://github.com/SJTU-DENG-Lab/LatentUM).

### Image Understanding

```python
import torch

# `model.latentum` comes from the official repository (see installation above).
from model.latentum import LatentUMModel

# Use bfloat16 weights; run on CUDA when available, otherwise on CPU.
dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",
    device=device,
    dtype=dtype,
)

# Ask a question about a local image file.
answer = model.answer(
    "asset/blue_apple.png",
    "Describe this image.",
)
print(answer)
```
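
To ask the same question about several images, the call above can be wrapped in a small loop. The following is a minimal sketch, not part of the LatentUM repository: `describe_images` is a hypothetical helper, and it assumes only that `model.answer(image_path, question)` returns text, as the example above suggests.

```python
from typing import Callable, Dict, Iterable


def describe_images(
    answer_fn: Callable[[str, str], str],
    image_paths: Iterable[str],
    question: str = "Describe this image.",
) -> Dict[str, str]:
    """Ask the same question about each image and collect the answers.

    `answer_fn` is assumed to behave like `model.answer(image_path, question)`
    from the example above. This helper is hypothetical, not a LatentUM API.
    """
    answers = {}
    for path in image_paths:
        # One call per image; answers are keyed by the image path.
        answers[path] = answer_fn(path, question)
    return answers


# Usage with a loaded model (the path is the one from the example above):
# answers = describe_images(model.answer, ["asset/blue_apple.png"])
# for path, text in answers.items():
#     print(f"{path}: {text}")
```

Passing `model.answer` as a callable keeps the helper independent of how the model is loaded, so it can be exercised with a stub function before downloading any weights.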

### Image Generation

```python
import torch

# Both modules come from the official repository (see installation above).
from model.decoder import LatentUMDecoderModel
from model.latentum import LatentUMModel

# Use bfloat16 weights; run on CUDA when available, otherwise on CPU.
dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",  # alternative: "SJTU-DENG-Lab/LatentUM-GenEval"
    device=device,
    dtype=dtype,
)

# Pixel decoder from this repository; renders images from the semantic latents.
decoder = LatentUMDecoderModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Decoder",
    device=device,
    dtype=dtype,
)

images = model.generate_images(
    "a photo of a cute dog",
    decoder=decoder,
    show_progress=True,
)
images[0].save("generated.png")
```
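
The example above saves only the first image. If `generate_images` returns more than one, a small helper can write them all to disk. The following is a minimal sketch under stated assumptions: `save_images` is a hypothetical helper (not a LatentUM API), and each returned item is assumed to expose a PIL-style `save(path)` method, as `images[0].save("generated.png")` suggests.

```python
from pathlib import Path
from typing import Iterable, List


def save_images(
    images: Iterable,
    out_dir: str = "outputs",
    prefix: str = "generated",
) -> List[Path]:
    """Save every image as <out_dir>/<prefix>_<index>.png and return the paths.

    Assumes each item has a PIL-style `save(path)` method. This helper is
    hypothetical and only illustrates one way to handle multiple outputs.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the folder if missing

    paths = []
    for index, image in enumerate(images):
        path = target / f"{prefix}_{index:03d}.png"
        image.save(path)
        paths.append(path)
    return paths


# Usage with the `images` list from the example above:
# for path in save_images(images, out_dir="outputs", prefix="cute_dog"):
#     print(path)
```

Zero-padded indices keep the files in generation order when sorted by name.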

## Citation

```bibtex
@article{jin2026latentum,
  title   = {LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model},
  author  = {Jiachun Jin and Zetong Zhou and Xiao Yang and Hao Zhang and Pengfei Liu and Jun Zhu and Zhijie Deng},
  journal = {arXiv preprint arXiv:2604.02097},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.02097}
}
```