Add model card and metadata

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +89 -0
README.md ADDED
@@ -0,0 +1,89 @@
+ ---
+ pipeline_tag: any-to-any
+ license: apache-2.0
+ ---
+
+ # LatentUM
+
+ LatentUM is a unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation.
+
+ - **Paper:** [LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model](https://arxiv.org/abs/2604.02097)
+ - **Repository:** [https://github.com/SJTU-DENG-Lab/LatentUM](https://github.com/SJTU-DENG-Lab/LatentUM)
+
+ ## Key Features
+
+ - **Shared Semantic Latent Space:** Text and visual tokens share the same space, enabling direct cross-modal reasoning over generated visual content.
+ - **MBAQ:** Visual tokenizer trained to preserve VLM understanding behavior rather than pixel reconstruction.
+ - **MoME:** Decoupled understanding/generation branches with shared self-attention for cross-modal interaction.
+ - **Decoupled Pixel Decoder:** Optional diffusion decoder for pixel rendering, trained independently to keep the latent space semantics-focused.
+
+ ## Sample Usage
+
+ To use this model, first clone the [official repository](https://github.com/SJTU-DENG-Lab/LatentUM) and install the dependencies:
+
+ ```bash
+ git clone https://github.com/SJTU-DENG-Lab/LatentUM.git
+ cd LatentUM
+ uv sync
+ ```
+
+ ### Image Understanding
+
+ ```python
+ import torch
+ from model.latentum import LatentUMModel
+
+ dtype = torch.bfloat16
+ # Use the GPU when one is available, otherwise fall back to CPU.
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ # Load the unified model checkpoint.
+ model = LatentUMModel.from_pretrained(
+     "SJTU-DENG-Lab/LatentUM-Base",
+     device=device,
+     dtype=dtype,
+ )
+ # Pass an image path and a text prompt; the result is the model's answer.
+ answer = model.answer(
+     "asset/blue_apple.png",
+     "Describe this image.",
+ )
+ print(answer)
+ ```
+
+ ### Image Generation
+
+ ```python
+ import torch
+ from model.decoder import LatentUMDecoderModel
+ from model.latentum import LatentUMModel
+
+ dtype = torch.bfloat16
+ # Use the GPU when one is available, otherwise fall back to CPU.
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ # Load the unified model checkpoint.
+ model = LatentUMModel.from_pretrained(
+     "SJTU-DENG-Lab/LatentUM-Base",
+     device=device,
+     dtype=dtype,
+ )
+ # Load the separate pixel decoder used to render images.
+ decoder = LatentUMDecoderModel.from_pretrained(
+     "SJTU-DENG-Lab/LatentUM-Decoder",
+     device=device,
+     dtype=dtype,
+ )
+ images = model.generate_images(
+     "a photo of a cute dog",
+     decoder=decoder,
+     show_progress=True,
+ )
+ # Save the first generated image to disk.
+ images[0].save("generated.png")
+ ```
+
+ ## Citation
+
+ ```bibtex
+ @article{jin2026latentum,
+   title = {LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model},
+   author = {Jiachun Jin and Zetong Zhou and Xiao Yang and Hao Zhang and Pengfei Liu and Jun Zhu and Zhijie Deng},
+   journal = {arXiv preprint arXiv:2604.02097},
+   year = {2026},
+   url = {https://arxiv.org/abs/2604.02097}
+ }
+ ```
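
The metadata block at the top of the added README (`pipeline_tag` and `license` between `---` delimiters) can be sanity-checked before merging. The following is a minimal sketch, not part of this PR or of any Hugging Face tooling: the helper names `parse_front_matter` and `missing_fields` are hypothetical, and the parser assumes flat `key: value` lines only rather than full YAML.

```python
# Hypothetical helpers for checking model card front matter.
# Assumption: the front matter contains only flat `key: value` lines.


def parse_front_matter(readme_text: str) -> dict[str, str]:
    """Return the key/value pairs from a leading `---` front-matter block."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    metadata: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            # Closing delimiter reached: the block is complete.
            return metadata
        key, sep, value = line.partition(":")
        if sep and key.strip():
            metadata[key.strip()] = value.strip()
    # No closing delimiter was found, so treat the block as missing.
    return {}


def missing_fields(readme_text, required=("pipeline_tag", "license")):
    """List the required metadata fields that are absent or empty."""
    metadata = parse_front_matter(readme_text)
    return [name for name in required if not metadata.get(name)]


# The first lines of the README added in this PR.
README_HEAD = """---
pipeline_tag: any-to-any
license: apache-2.0
---

# LatentUM
"""

if __name__ == "__main__":
    print(parse_front_matter(README_HEAD))
    print(missing_fields(README_HEAD))
```

Run against the README head shown in the diff, `missing_fields` returns an empty list, meaning both fields are present. A real check would use a proper YAML parser so that lists and nested values are handled.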