metadata
license: apache-2.0
pipeline_tag: image-to-image
LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
LatentUM unifies all modalities within a shared semantic latent space, enabling interleaved cross-modal reasoning without pixel-space mediation. Unlike existing unified models that require pixel decoding as a bridge between understanding and generation, LatentUM reasons directly over its own generated visual content.
This repository contains the Pixel Decoder, an optional diffusion decoder built on Stable Diffusion 3.5 Medium that renders pixel-space images from the shared semantic latents.
- Paper: LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model (https://arxiv.org/abs/2604.02097)
- Repository: https://github.com/SJTU-DENG-Lab/LatentUM
Sample Usage
To use this model, please follow the installation instructions in the official repository.
Image Understanding
import torch

from model.latentum import LatentUMModel

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the unified model.
model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",
    device=device,
    dtype=dtype,
)

# Ask a question about a local image file.
answer = model.answer(
    "asset/blue_apple.png",
    "Describe this image.",
)
print(answer)
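To ask several questions about the same image, the call above can be wrapped in a small helper. This is a minimal sketch, assuming only that `model.answer(image_path, question)` behaves as in the example above; `ask_many` is a hypothetical name and not part of the LatentUM repository.

```python
from typing import Dict, Iterable


def ask_many(model, image_path: str, questions: Iterable[str]) -> Dict[str, str]:
    """Ask each question about one image and collect the answers.

    Hypothetical helper: relies only on model.answer(image_path, question),
    the call shown in the example above.
    """
    answers = {}
    for question in questions:
        # One model.answer call per question, always with the same image path.
        answers[question] = model.answer(image_path, question)
    return answers
```

With the `model` loaded above, `ask_many(model, "asset/blue_apple.png", ["Describe this image.", "What color is the apple?"])` would return a dictionary keyed by question.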
Image Generation
import torch

from model.decoder import LatentUMDecoderModel
from model.latentum import LatentUMModel

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the unified model that produces the semantic latents.
model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",  # alternative: "SJTU-DENG-Lab/LatentUM-GenEval"
    device=device,
    dtype=dtype,
)

# Load the Pixel Decoder from this repository to render the latents as images.
decoder = LatentUMDecoderModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Decoder",
    device=device,
    dtype=dtype,
)

images = model.generate_images(
    "a photo of a cute dog",
    decoder=decoder,
    show_progress=True,
)
images[0].save("generated.png")
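The example above saves only the first image. If `generate_images` returns more than one, a helper along these lines could write them all to disk. This is a sketch under assumptions: it supposes each returned item exposes `.save(path)`, as `images[0].save("generated.png")` suggests, and `save_images`, `out_dir`, and `prefix` are hypothetical names introduced here for illustration.

```python
from pathlib import Path
from typing import List, Sequence


def save_images(images: Sequence, out_dir: str = "outputs", prefix: str = "generated") -> List[Path]:
    """Save every image in `images` as <out_dir>/<prefix>_<index>.png.

    Hypothetical helper: assumes each item exposes .save(path), as
    images[0].save("generated.png") in the example above suggests.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    paths = []
    for index, image in enumerate(images):
        # Zero-padded index keeps the files sorted in generation order.
        path = target / f"{prefix}_{index:03d}.png"
        image.save(str(path))
        paths.append(path)
    return paths
```

Calling `save_images(images)` after the generation step would then write `outputs/generated_000.png`, `outputs/generated_001.png`, and so on, one file per returned image.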
Citation
@article{jin2026latentum,
title = {LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model},
author = {Jiachun Jin and Zetong Zhou and Xiao Yang and Hao Zhang and Pengfei Liu and Jun Zhu and Zhijie Deng},
journal = {arXiv preprint arXiv:2604.02097},
year = {2026},
url = {https://arxiv.org/abs/2604.02097}
}