license: apache-2.0
pipeline_tag: image-to-image

LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

LatentUM unifies all modalities within a shared semantic latent space, enabling interleaved cross-modal reasoning without pixel-space mediation. Unlike existing unified models that require pixel decoding as a bridge between understanding and generation, LatentUM reasons directly over its own generated visual content.

This repository contains the Pixel Decoder, an optional diffusion decoder built on Stable Diffusion 3.5 Medium that renders pixel-space images from the shared semantic latents.

Sample Usage

To use this model, please follow the installation instructions in the official repository.

Image Understanding

import torch

from model.latentum import LatentUMModel

# Use bfloat16, and run on the GPU when one is available.
dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base LatentUM model.
model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",
    device = device,
    dtype  = dtype,
)
# Ask a question about a local image.
answer = model.answer(
    "asset/blue_apple.png",
    "Describe this image.",
)
print(answer)
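To ask several questions about the same image, the `model.answer(image_path, question)` call shown above can be wrapped in a small helper. This is a minimal sketch, not part of the LatentUM codebase: `ask_all` and `SupportsAnswer` are hypothetical names, and the helper assumes only the two-argument `answer` call used in the snippet above.

```python
from typing import Any, Dict, Iterable, Protocol


class SupportsAnswer(Protocol):
    """Anything exposing answer(image_path, question), as in the snippet above."""

    def answer(self, image_path: str, question: str) -> Any: ...


def ask_all(
    model: SupportsAnswer,
    image_path: str,
    questions: Iterable[str],
) -> Dict[str, Any]:
    """Hypothetical helper: ask each question about one image.

    Returns a dict mapping each question to the model's answer,
    in the order the questions were given.
    """
    return {question: model.answer(image_path, question) for question in questions}
```

With the `model` loaded above, a call might look like `ask_all(model, "asset/blue_apple.png", ["Describe this image.", "What color is the apple?"])`.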

Image Generation

import torch

from model.decoder import LatentUMDecoderModel
from model.latentum import LatentUMModel

# Use bfloat16, and run on the GPU when one is available.
dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the unified model that produces the semantic latents.
model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base", # alternative: "SJTU-DENG-Lab/LatentUM-GenEval"
    device = device,
    dtype  = dtype,
)
# Load the Pixel Decoder from this repository.
decoder = LatentUMDecoderModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Decoder",
    device = device,
    dtype  = dtype,
)
# Generate from a text prompt; the decoder renders the result to pixels.
images = model.generate_images(
    "a photo of a cute dog",
    decoder       = decoder,
    show_progress = True,
)
# Save the first generated image.
images[0].save("generated.png")
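The snippet above saves only `images[0]`. To keep every returned image, a loop along the following lines may help. This is a sketch under assumptions: `save_all` is a hypothetical helper, not part of the LatentUM codebase, and it assumes only that `images` is an iterable of objects with a `save(path)` method, as the `images[0].save(...)` call suggests.

```python
from pathlib import Path
from typing import Any, Iterable, List


def save_all(
    images: Iterable[Any],
    out_dir: str = "outputs",
    stem: str = "generated",
) -> List[Path]:
    """Hypothetical helper: save each image as <stem>_<index>.png.

    Assumes every item has a save(path) method. Creates out_dir
    if needed and returns the written paths in order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths: List[Path] = []
    for index, image in enumerate(images):
        path = out / f"{stem}_{index:03d}.png"
        image.save(path)
        paths.append(path)
    return paths
```

With the `images` list from above, `save_all(images)` would write `outputs/generated_000.png`, `outputs/generated_001.png`, and so on.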

Citation

@article{jin2026latentum,
  title   = {LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model},
  author  = {Jiachun Jin and Zetong Zhou and Xiao Yang and Hao Zhang and Pengfei Liu and Jun Zhu and Zhijie Deng},
  journal = {arXiv preprint arXiv:2604.02097},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.02097}
}