---
license: apache-2.0
base_model:
- zhibinlan/UME-R1-7B
language:
- en
tags:
- multimodal-embedding
- universal-multimodal-embedding
- retrieval
- latent-reasoning
- mllm
- qwen2-vl
pipeline_tag: feature-extraction
library_name: transformers
---

# PLUME-7B

**PLUME** (Latent Reasoning Based Universal Multimodal Embedding) is a 7B universal multimodal embedding model that maps heterogeneous inputs — text, images, videos, and visual documents — into a single shared retrieval space.

Recent universal multimodal embedding (UME) methods improve retrieval by generating explicit chain-of-thought (CoT) rationales before extracting an embedding. This is effective but slow, and it forces rich multimodal evidence through a narrow textual bottleneck. PLUME instead replaces verbalized CoT with a **short autoregressive rollout of continuous latent states**, and uses a **semantic-anchor-guided transition adapter** to steer the latent computation along input-dependent reasoning trajectories under a fixed compute budget. The model is trained with a **progressive explicit-to-latent curriculum** that uses verbalized reasoning as a temporary training scaffold and gradually transfers it into hidden-state computation, eliminating explicit CoT at inference.
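The idea of a fixed-budget latent rollout can be illustrated with a toy example. This is only a conceptual sketch, not PLUME's architecture: the real model operates on transformer hidden states and uses the adapter described in the paper. Every name here (`latent_rollout`, `W_state`, `W_anchor`, `anchor`) and the `tanh` update rule are assumptions made for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def l2_normalize(x):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]


def latent_rollout(h0, anchor, W_state, W_anchor, num_steps=8):
    """Toy rollout: feed the previous latent state back in for a fixed
    number of steps, biased by an input-dependent anchor vector.
    No tokens are decoded at any step."""
    h = list(h0)
    anchor_term = matvec(W_anchor, anchor)  # computed once per input
    for _ in range(num_steps):
        pre = matvec(W_state, h)
        h = [math.tanh(p + a) for p, a in zip(pre, anchor_term)]
    return l2_normalize(h)  # final latent state used as the embedding


# Random placeholder weights and inputs (hypothetical, illustration only).
rng = random.Random(0)
dim = 4
W_state = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
W_anchor = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
h0 = [rng.uniform(-1, 1) for _ in range(dim)]
anchor = [rng.uniform(-1, 1) for _ in range(dim)]

embedding = latent_rollout(h0, anchor, W_state, W_anchor, num_steps=8)
```

The point of the sketch is the cost model: the loop runs a small, fixed number of steps regardless of input, whereas explicit CoT decodes a variable and typically much longer token sequence before the embedding is read out.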

This checkpoint is built on the **UME-R1-7B** backbone (Qwen2-VL-7B architecture).

## Highlights

- **Universal**: a single model for text / image / video / visual-document embeddings.
- **Latent reasoning**: fewer than 10 latent steps replace hundreds of generated CoT tokens, giving **>30× faster inference** than explicit-CoT UME at comparable or better quality.
- **Strong retrieval**: evaluated on the 78-task **MMEB-v2** benchmark, outperforming strong explicit-CoT UME baselines — especially where evidence is dense and structurally complex (video and visual-document retrieval).

## Model details

- **Backbone**: [`zhibinlan/UME-R1-7B`](https://huggingface.co/zhibinlan/UME-R1-7B) (Qwen2-VL-7B, `Qwen2VLForConditionalGeneration`)
- **Parameters**: ~7B, weights in half precision (4 safetensors shards, ~17 GB)
- **License**: Apache-2.0

## Usage

The weights load as a standard Qwen2-VL checkpoint:

```python
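from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# torch_dtype="auto" keeps the dtype stored in the checkpoint;
# device_map="auto" requires the `accelerate` package.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Rem520/PLUME-7B", torch_dtype="auto", device_map="auto"
)
# The processor handles tokenization and image/video preprocessing.
processor = AutoProcessor.from_pretrained("Rem520/PLUME-7B")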
```

To use the full PLUME embedding pipeline (latent rollout + semantic-anchor-guided transition adapter), follow the official code: **https://github.com/haoxiangzhao12138/PLUME**
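Once embeddings have been produced by the official pipeline, retrieval in a shared space usually reduces to ranking candidates by similarity to the query. The sketch below assumes cosine similarity, which this card does not confirm, and uses made-up three-dimensional vectors in place of real PLUME embeddings. `cosine` and `rank_by_cosine` are hypothetical helpers, not part of the PLUME code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Placeholder vectors standing in for embeddings of different modalities.
query = [0.9, 0.1, 0.0]
candidates = {
    "image_a": [0.8, 0.2, 0.1],
    "video_b": [0.0, 1.0, 0.0],
    "doc_c": [-0.5, 0.1, 0.9],
}

ranking = rank_by_cosine(query, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because all modalities share one space, the same ranking function applies whether the candidates are images, videos, or documents.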

## Citation

```bibtex
@article{he2026plume,
  title   = {PLUME: Latent Reasoning Based Universal Multimodal Embedding},
  author  = {He, Chenwei and Hao, Xiangzhao and Yang, Tianyu and Ma, Yuxiang and
             Jia, Yuheng and Wu, Lingxiang and Zhao, Chaoyang and Guo, Haiyun and Wang, Jinqiao},
  journal = {arXiv preprint arXiv:2604.02073},
  year    = {2026}
}
```

- **Paper**: [arXiv:2604.02073](https://arxiv.org/abs/2604.02073)
- **Code**: [github.com/haoxiangzhao12138/PLUME](https://github.com/haoxiangzhao12138/PLUME)