Rem520 committed · Commit d9d2239 (verified) · 1 parent: 0dad8aa

Add model card (PLUME-7B: latent-reasoning universal multimodal embedding)

Files changed (1)
  1. README.md +63 -0
README.md CHANGED
@@ -1,3 +1,66 @@
  ---
  license: apache-2.0
+ base_model:
+ - zhibinlan/UME-R1-7B
+ language:
+ - en
+ tags:
+ - multimodal-embedding
+ - universal-multimodal-embedding
+ - retrieval
+ - latent-reasoning
+ - mllm
+ - qwen2-vl
+ pipeline_tag: feature-extraction
+ library_name: transformers
  ---
+
+ # PLUME-7B
+
+ **PLUME** (Latent Reasoning Based Universal Multimodal Embedding) is a 7B universal multimodal embedding model that maps heterogeneous inputs (text, images, videos, and visual documents) into a single shared retrieval space.
+
+ Recent universal multimodal embedding (UME) methods improve retrieval by generating explicit chain-of-thought (CoT) rationales before extracting an embedding. This approach is effective but slow, and it forces rich multimodal evidence through a narrow textual bottleneck. PLUME instead replaces verbalized CoT with a **short autoregressive rollout of continuous latent states**, and uses a **semantic-anchor-guided transition adapter** to steer the latent computation along input-dependent reasoning trajectories under a fixed compute budget. The model is trained with a **progressive explicit-to-latent curriculum** that uses verbalized reasoning as a temporary training scaffold and gradually transfers it into hidden-state computation, so no explicit CoT is generated at inference.
+
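To make the idea of a fixed-budget latent rollout concrete, here is a toy sketch in plain Python. It is not PLUME's implementation: the `transition` function, the way the `anchor` vector is blended in, and all numbers are hypothetical stand-ins chosen for illustration. The only point it shows is the control flow: a continuous state is fed back in for a small, fixed number of steps, and no tokens are decoded along the way.

```python
import math

def transition(state, anchor):
    """Hypothetical stand-in for one latent step: blend the state with an anchor, then squash."""
    return [math.tanh(0.5 * s + 0.5 * a) for s, a in zip(state, anchor)]

def latent_rollout(initial_state, anchor, num_steps=8):
    """Feed each continuous state back in as the next input for a fixed number of steps."""
    state = list(initial_state)
    trajectory = [state]
    for _ in range(num_steps):
        state = transition(state, anchor)
        trajectory.append(state)
    return trajectory

def l2_normalize(vec):
    """Scale a vector to unit length so it can serve as a retrieval embedding."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Illustrative values only: a 3-dimensional state and anchor, 8 latent steps.
trajectory = latent_rollout([0.2, -0.1, 0.4], anchor=[1.0, 0.0, -1.0], num_steps=8)
embedding = l2_normalize(trajectory[-1])
```

The compute budget here is simply `num_steps`, which stays constant regardless of the input. For the real transition adapter and rollout, the official repository remains the reference.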
+ This checkpoint is built on the **UME-R1-7B** backbone (Qwen2-VL-7B architecture).
+
+ ## Highlights
+
+ - **Universal**: a single model for text / image / video / visual-document embeddings.
+ - **Latent reasoning**: fewer than 10 latent steps replace hundreds of generated CoT tokens, giving **>30× faster inference** than explicit-CoT UME at comparable or better quality.
+ - **Strong retrieval**: evaluated on the 78-task **MMEB-v2** benchmark, outperforming strong explicit-CoT UME baselines, especially where evidence is dense and structurally complex (video and visual-document retrieval).
+
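Because every modality is mapped into one shared space, retrieval reduces to nearest-neighbor search over embedding vectors. The sketch below shows that step with cosine similarity in plain Python. The vectors and candidate names are made up for illustration, and cosine similarity is assumed as the scoring function; the card itself does not state which similarity measure PLUME uses.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query, candidates):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical 3-dimensional embeddings; real embeddings are much larger.
query_embedding = [0.9, 0.1, 0.0]
candidate_embeddings = {
    "image_a": [0.8, 0.2, 0.1],
    "video_b": [0.0, 1.0, 0.0],
    "document_c": [-0.5, 0.1, 0.9],
}
ranking = rank(query_embedding, candidate_embeddings)
```

In this toy setup the query, an image, a video, and a document are all scored with the same function, which is what a single shared retrieval space allows.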
+ ## Model details
+
+ - **Backbone**: [`zhibinlan/UME-R1-7B`](https://huggingface.co/zhibinlan/UME-R1-7B) (Qwen2-VL-7B, `Qwen2VLForConditionalGeneration`)
+ - **Parameters**: ~7B, weights in half precision (4 safetensors shards, ~17 GB)
+ - **License**: Apache-2.0
+
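As a quick sanity check on the figures above, half-precision weights take 2 bytes per parameter, so checkpoint size and parameter count can be converted into each other. The helper names below are hypothetical, and the calculation ignores file metadata overhead.

```python
BYTES_PER_PARAM_HALF = 2  # fp16 and bf16 both store one weight in 2 bytes

def half_precision_size_gb(num_params):
    """Approximate on-disk size in GB (10^9 bytes) for half-precision weights."""
    return num_params * BYTES_PER_PARAM_HALF / 1e9

def params_from_size_gb(size_gb):
    """Approximate parameter count implied by a half-precision checkpoint size."""
    return size_gb * 1e9 / BYTES_PER_PARAM_HALF

nominal_gb = half_precision_size_gb(7e9)    # size if the checkpoint held exactly 7B weights
implied_params = params_from_size_gb(17.0)  # weight count implied by ~17 GB
```

Exactly 7B weights would come to about 14 GB, while ~17 GB corresponds to roughly 8.5B weights. One plausible reading, which is an assumption rather than something the card states, is that "7B" is the nominal size of the language model and the shards also hold other components such as the vision encoder.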
+ ## Usage
+
+ The weights load as a standard Qwen2-VL checkpoint:
+
+ ```python
+ from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
+
+ # Load the weights in the dtype stored in the checkpoint and place them on available devices.
+ model = Qwen2VLForConditionalGeneration.from_pretrained(
+     "Rem520/PLUME-7B", torch_dtype="auto", device_map="auto"
+ )
+ # The processor bundles the tokenizer and the image/video preprocessing.
+ processor = AutoProcessor.from_pretrained("Rem520/PLUME-7B")
+ ```
+
+ To use the full PLUME embedding pipeline (latent rollout + semantic-anchor-guided transition adapter), follow the official code: **https://github.com/haoxiangzhao12138/PLUME**
+
+ ## Citation
+
+ ```bibtex
+ @article{he2026plume,
+   title   = {PLUME: Latent Reasoning Based Universal Multimodal Embedding},
+   author  = {He, Chenwei and Hao, Xiangzhao and Yang, Tianyu and Ma, Yuxiang and
+              Jia, Yuheng and Wu, Lingxiang and Zhao, Chaoyang and Guo, Haiyun and Wang, Jinqiao},
+   journal = {arXiv preprint arXiv:2604.02073},
+   year    = {2026}
+ }
+ ```
+
+ - **Paper**: [arXiv:2604.02073](https://arxiv.org/abs/2604.02073)
+ - **Code**: [github.com/haoxiangzhao12138/PLUME](https://github.com/haoxiangzhao12138/PLUME)