license: apache-2.0
pipeline_tag: image-to-image

LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

LatentUM unifies all modalities within a shared semantic latent space, enabling interleaved cross-modal reasoning without pixel-space mediation. Unlike existing unified models that require pixel decoding as a bridge between understanding and generation, LatentUM reasons directly over its own generated visual content.

This repository contains the Pixel Decoder, an optional diffusion decoder built on Stable Diffusion 3.5 Medium that renders pixel-space images from the shared semantic latents.

Paper: LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model (https://arxiv.org/abs/2604.02097)
Repository: https://github.com/SJTU-DENG-Lab/LatentUM

Sample Usage

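To use this model, follow the installation instructions in the official repository. The examples below import from the repository's model package, so they assume the repository code is importable.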

Image Understanding

import torch

from model.latentum import LatentUMModel

# Use bfloat16 weights and run on the GPU when one is available.
dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base LatentUM checkpoint.
model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",
    device=device,
    dtype=dtype,
)

# Ask a question about an image file.
answer = model.answer(
    "asset/blue_apple.png",
    "Describe this image.",
)
print(answer)
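
To ask several questions about the same image, the call above can be wrapped in a small loop. The sketch below is illustrative only: ask_many is a hypothetical helper, not part of the LatentUM repository, and it assumes nothing beyond the model.answer(image_path, question) call shown above.

```python
from typing import Any, Dict, Iterable


def ask_many(model: Any, image_path: str, questions: Iterable[str]) -> Dict[str, Any]:
    """Hypothetical helper: call model.answer once per question.

    Assumes only the model.answer(image_path, question) call from the
    example above; the return type of answer() is passed through unchanged.
    """
    answers: Dict[str, Any] = {}
    for question in questions:
        answers[question] = model.answer(image_path, question)
    return answers
```

With a loaded model, this could be used as ask_many(model, "asset/blue_apple.png", ["Describe this image.", "What color is the apple?"]), which returns a dictionary keyed by question.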

Image Generation

import torch

from model.decoder import LatentUMDecoderModel
from model.latentum import LatentUMModel

# Use bfloat16 weights and run on the GPU when one is available.
dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the unified model that produces the semantic latents.
model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",  # alternative: "SJTU-DENG-Lab/LatentUM-GenEval"
    device=device,
    dtype=dtype,
)

# Load the Pixel Decoder that renders latents into pixel-space images.
decoder = LatentUMDecoderModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Decoder",
    device=device,
    dtype=dtype,
)

# Generate from a text prompt and save the first returned image.
images = model.generate_images(
    "a photo of a cute dog",
    decoder=decoder,
    show_progress=True,
)
images[0].save("generated.png")
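
The example above saves only the first element of images. If generate_images returns more than one image (the source does not say how many), a small helper can write all of them to disk. This is a minimal sketch: save_all, the outputs directory, and the file naming scheme are hypothetical, and the only assumption about each image is that it has a save(path) method, as images[0] does above.

```python
from pathlib import Path
from typing import Any, List, Sequence


def save_all(images: Sequence[Any], out_dir: str = "outputs", stem: str = "generated") -> List[Path]:
    """Hypothetical helper: save each image as <stem>_<index>.png in out_dir.

    Assumes each item exposes save(path), like images[0] in the example above.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    paths: List[Path] = []
    for index, image in enumerate(images):
        path = target / f"{stem}_{index:03d}.png"
        image.save(str(path))
        paths.append(path)
    return paths
```

Calling save_all(images) after the generation step would then write outputs/generated_000.png, outputs/generated_001.png, and so on, one file per returned image.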

Citation

@article{jin2026latentum,
  title   = {LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model},
  author  = {Jiachun Jin and Zetong Zhou and Xiao Yang and Hao Zhang and Pengfei Liu and Jun Zhu and Zhijie Deng},
  journal = {arXiv preprint arXiv:2604.02097},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.02097}
}