	---
	license: apache-2.0
	pipeline_tag: image-to-image
	---

	# LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

	LatentUM unifies all modalities within a shared semantic latent space, enabling interleaved cross-modal reasoning without pixel-space mediation. Unlike existing unified models that require pixel decoding as a bridge between understanding and generation, LatentUM reasons directly over its own generated visual content.

	This repository contains the Pixel Decoder, an optional diffusion decoder built on Stable Diffusion 3.5 Medium that renders pixel-space images from the shared semantic latents.

	- Paper: [LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model](https://huggingface.co/papers/2604.02097)
	- Repository: [https://github.com/SJTU-DENG-Lab/LatentUM](https://github.com/SJTU-DENG-Lab/LatentUM)

	## Sample Usage

	To use this model, please follow the installation instructions in the [official repository](https://github.com/SJTU-DENG-Lab/LatentUM).

	### Image Understanding

	```python
	import torch

	from model.latentum import LatentUMModel

	dtype = torch.bfloat16
	device = "cuda" if torch.cuda.is_available() else "cpu"

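	# Load the unified model; no pixel decoder is loaded for image understanding.
	model = LatentUMModel.from_pretrained(
	    "SJTU-DENG-Lab/LatentUM-Base",
	    device=device,
	    dtype=dtype,
	)

	# Ask a question about a local image file.
	answer = model.answer(
	    "asset/blue_apple.png",
	    "Describe this image.",
	)
	print(answer)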
	```

	### Image Generation

	```python
	import torch

	from model.decoder import LatentUMDecoderModel
	from model.latentum import LatentUMModel

	dtype = torch.bfloat16
	device = "cuda" if torch.cuda.is_available() else "cpu"

	# Load the unified model that produces the shared semantic latents.
	model = LatentUMModel.from_pretrained(
	    "SJTU-DENG-Lab/LatentUM-Base",  # alternative: "SJTU-DENG-Lab/LatentUM-GenEval"
	    device=device,
	    dtype=dtype,
	)

	# Load the Pixel Decoder from this repository to render latents as images.
	decoder = LatentUMDecoderModel.from_pretrained(
	    "SJTU-DENG-Lab/LatentUM-Decoder",
	    device=device,
	    dtype=dtype,
	)

	images = model.generate_images(
	    "a photo of a cute dog",
	    decoder=decoder,
	    show_progress=True,
	)

	# Save the first generated image.
	images[0].save("generated.png")
	```
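
The generation example indexes `images[0]`, so `generate_images` returns a list. If a call returns more than one image (an assumption, since the example only shows the first), a small helper can derive numbered file names so that nothing is overwritten. The sketch below uses only the standard library; `numbered_paths` is a hypothetical helper and not part of the LatentUM code base.

```python
from pathlib import Path


def numbered_paths(base: str, count: int) -> list[Path]:
    """Return `count` output paths derived from `base`.

    A single image keeps the original name; several images get a
    zero-padded index inserted before the file extension.
    """
    path = Path(base)
    if count <= 0:
        return []
    if count == 1:
        return [path]
    # Pad indices to the width of the largest index so files sort correctly.
    width = len(str(count - 1))
    return [
        path.with_name(f"{path.stem}_{i:0{width}d}{path.suffix}")
        for i in range(count)
    ]


# Hypothetical usage with the list returned by `model.generate_images(...)`:
# for image, out_path in zip(images, numbered_paths("generated.png", len(images))):
#     image.save(out_path)
```

With a single image the helper returns the original file name unchanged, so the behaviour of the example above is preserved.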

	## Citation

	```bibtex
	@article{jin2026latentum,
	    title   = {LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model},
	    author  = {Jiachun Jin and Zetong Zhou and Xiao Yang and Hao Zhang and Pengfei Liu and Jun Zhu and Zhijie Deng},
	    journal = {arXiv preprint arXiv:2604.02097},
	    year    = {2026},
	    url     = {https://arxiv.org/abs/2604.02097}
	}
	```