Add model card and metadata
#1
by nielsr HF Staff - opened
README.md
ADDED
@@ -0,0 +1,89 @@
---
pipeline_tag: any-to-any
license: apache-2.0
---

# LatentUM

LatentUM is a unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation.

- **Paper:** [LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model](https://arxiv.org/abs/2604.02097)
- **Repository:** [https://github.com/SJTU-DENG-Lab/LatentUM](https://github.com/SJTU-DENG-Lab/LatentUM)

## Key Features

- **Shared Semantic Latent Space:** Text and visual tokens share the same space, enabling direct cross-modal reasoning over generated visual content.
- **MBAQ:** Visual tokenizer trained to preserve VLM understanding behavior rather than pixel reconstruction.
- **MoME:** Decoupled understanding/generation branches with shared self-attention for cross-modal interaction.
- **Decoupled Pixel Decoder:** Optional diffusion decoder for pixel rendering, trained independently to keep the latent space semantics-focused.

## Sample Usage

To use this model, first clone the [official repository](https://github.com/SJTU-DENG-Lab/LatentUM) and install the dependencies:

```bash
git clone https://github.com/SJTU-DENG-Lab/LatentUM.git
cd LatentUM
uv sync
```

### Image Understanding

```python
import torch
from model.latentum import LatentUMModel

# Use bfloat16 weights, on the GPU when one is available.
dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base checkpoint.
model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",
    device=device,
    dtype=dtype,
)
# Ask a question about an image file.
answer = model.answer(
    "asset/blue_apple.png",
    "Describe this image.",
)
print(answer)
```
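When the same question has to be asked about several images, the call above can be wrapped in a small loop. The following is a minimal sketch that assumes only the `model.answer(image_path, prompt)` call shown in the example; `answer_many` is a hypothetical helper and is not part of the LatentUM repository.

```python
from typing import Dict, Iterable


def answer_many(model, image_paths: Iterable[str], prompt: str) -> Dict[str, str]:
    """Ask the same question about each image and collect the answers.

    Hypothetical helper: it relies only on `model.answer(image_path, prompt)`,
    the call used in the example above.
    """
    answers: Dict[str, str] = {}
    for path in image_paths:
        # One `answer` call per image; any exception propagates to the caller.
        answers[path] = model.answer(path, prompt)
    return answers
```

With the model loaded as above, `answer_many(model, ["asset/blue_apple.png"], "Describe this image.")` would return a dictionary keyed by image path.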

### Image Generation

```python
import torch
from model.decoder import LatentUMDecoderModel
from model.latentum import LatentUMModel

# Use bfloat16 weights, on the GPU when one is available.
dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base checkpoint.
model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",
    device=device,
    dtype=dtype,
)
# Load the pixel decoder that renders images.
decoder = LatentUMDecoderModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Decoder",
    device=device,
    dtype=dtype,
)
# Generate images from a text prompt and save the first one.
images = model.generate_images(
    "a photo of a cute dog",
    decoder=decoder,
    show_progress=True,
)
images[0].save("generated.png")
```
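A generated image can also be handed back to `model.answer` as a quick sanity check of the output. The sketch below assumes only the `generate_images`, `save`, and `answer` calls used in the examples above; `generate_and_describe` and its default arguments are hypothetical. It saves the image to disk first because the examples only show `answer` taking a file path, so this round trip goes through pixel space and is a convenience check rather than the latent-space interleaved reasoning described in the paper.

```python
from pathlib import Path
from typing import Tuple


def generate_and_describe(
    model,
    decoder,
    prompt: str,
    out_path: str = "generated.png",
    question: str = "Describe this image.",
) -> Tuple[str, str]:
    """Generate one image, save it, then ask the model about the saved file.

    Hypothetical helper built only from calls shown in the examples above:
    `model.generate_images(prompt, decoder=...)`, `image.save(path)`, and
    `model.answer(image_path, question)`.
    """
    images = model.generate_images(prompt, decoder=decoder)
    if not images:
        raise RuntimeError("generate_images returned no images")
    path = Path(out_path)
    # Create the parent directory so save() does not fail on a new folder.
    path.parent.mkdir(parents=True, exist_ok=True)
    images[0].save(str(path))
    return str(path), model.answer(str(path), question)
```

Comparing the returned description with the original prompt gives a rough signal of whether the generation matched the request.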

## Citation

```bibtex
@article{jin2026latentum,
  title   = {LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model},
  author  = {Jiachun Jin and Zetong Zhou and Xiao Yang and Hao Zhang and Pengfei Liu and Jun Zhu and Zhijie Deng},
  journal = {arXiv preprint arXiv:2604.02097},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.02097}
}
```