Add model card (PLUME-7B: latent-reasoning universal multimodal embedding)
README.md
CHANGED
@@ -1,3 +1,66 @@
 ---
 license: apache-2.0
+base_model:
+- zhibinlan/UME-R1-7B
+language:
+- en
+tags:
+- multimodal-embedding
+- universal-multimodal-embedding
+- retrieval
+- latent-reasoning
+- mllm
+- qwen2-vl
+pipeline_tag: feature-extraction
+library_name: transformers
 ---
+
+# PLUME-7B
+
+**PLUME** (Latent Reasoning Based Universal Multimodal Embedding) is a 7B universal multimodal embedding model that maps heterogeneous inputs (text, images, videos, and visual documents) into a single shared retrieval space.
+
+Recent universal multimodal embedding (UME) methods improve retrieval by generating explicit chain-of-thought (CoT) rationales before extracting an embedding. This is effective but slow, and it forces rich multimodal evidence through a narrow textual bottleneck. PLUME instead replaces verbalized CoT with a **short autoregressive rollout of continuous latent states**, and uses a **semantic-anchor-guided transition adapter** to steer the latent computation along input-dependent reasoning trajectories under a fixed compute budget. The model is trained with a **progressive explicit-to-latent curriculum** that uses verbalized reasoning as a temporary training scaffold and gradually transfers it into hidden-state computation, eliminating explicit CoT at inference.
+
+This checkpoint is built on the **UME-R1-7B** backbone (Qwen2-VL-7B architecture).
+
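Reviewer note: the latent-rollout idea in the paragraph above can be hard to picture from prose alone. The sketch below is a toy illustration of the control flow only, assuming a continuous state is fed back through a transition function for a fixed number of steps with no tokens decoded in between. The names `transition`, `latent_rollout`, `anchor`, and the mixing weights are hypothetical and are not taken from the PLUME code; the real adapter and backbone are in the official repository.

```python
import math

def transition(state, anchor):
    """Hypothetical stand-in for one latent step (NOT PLUME's adapter):
    mix the current state with a fixed anchor vector, then squash."""
    return [math.tanh(0.5 * s + 0.5 * a) for s, a in zip(state, anchor)]

def latent_rollout(initial_state, anchor, num_steps):
    """Feed a continuous state back through `transition` for a fixed
    number of steps, without decoding any tokens in between."""
    state = list(initial_state)
    for _ in range(num_steps):
        state = transition(state, anchor)
    return state

def l2_normalize(vec):
    """Scale a vector to unit length so it can be compared by cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

# Made-up toy vectors; a real model would use high-dimensional hidden states.
initial = [0.2, -0.1, 0.4]
anchor = [1.0, 0.0, -1.0]
embedding = l2_normalize(latent_rollout(initial, anchor, num_steps=8))
```

The step count is fixed up front, which is what keeps the compute budget constant regardless of the input.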
+## Highlights
+
+- **Universal**: a single model for text / image / video / visual-document embeddings.
+- **Latent reasoning**: fewer than 10 latent steps replace hundreds of generated CoT tokens, giving **>30× faster inference** than explicit-CoT UME at comparable or better quality.
+- **Strong retrieval**: evaluated on the 78-task **MMEB-v2** benchmark, outperforming strong explicit-CoT UME baselines, especially where evidence is dense and structurally complex (video and visual-document retrieval).
+
+## Model details
+
+- **Backbone**: [`zhibinlan/UME-R1-7B`](https://huggingface.co/zhibinlan/UME-R1-7B) (Qwen2-VL-7B, `Qwen2VLForConditionalGeneration`)
+- **Parameters**: ~7B, weights in half precision (4 safetensors shards, ~17 GB)
+- **License**: Apache-2.0
+
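Reviewer note: the "~7B" and "~17 GB" figures in the list above can be cross-checked with simple arithmetic. The sketch below assumes 2 bytes per parameter (half precision) and decimal gigabytes; the helper names are hypothetical. Under those assumptions a 17 GB checkpoint holds about 8.5 billion stored parameters, so "7B" most likely follows the backbone's naming rather than an exact count. The card itself does not explain the gap.

```python
BYTES_PER_PARAM_HALF = 2  # fp16 and bf16 both store 2 bytes per parameter

def checkpoint_size_gb(num_params, bytes_per_param=BYTES_PER_PARAM_HALF):
    """Approximate on-disk size in decimal GB for a given parameter count."""
    return num_params * bytes_per_param / 1e9

def implied_params(size_gb, bytes_per_param=BYTES_PER_PARAM_HALF):
    """Parameter count implied by a checkpoint size in decimal GB."""
    return size_gb * 1e9 / bytes_per_param

nominal_size_gb = checkpoint_size_gb(7e9)   # size if exactly 7B parameters
stored_params = implied_params(17)          # parameters implied by ~17 GB

print(nominal_size_gb)  # 14.0
print(stored_params)    # 8500000000.0
```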
+## Usage
+
+The weights load as a standard Qwen2-VL checkpoint:
+
+```python
+from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
+
+model = Qwen2VLForConditionalGeneration.from_pretrained(
+    "Rem520/PLUME-7B", torch_dtype="auto", device_map="auto"
+)
+processor = AutoProcessor.from_pretrained("Rem520/PLUME-7B")
+```
+
+To use the full PLUME embedding pipeline (latent rollout + semantic-anchor-guided transition adapter), follow the official code: **https://github.com/haoxiangzhao12138/PLUME**
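Reviewer note: the Usage section stops at loading the weights, so it may help to show what a downstream retrieval step looks like once embeddings exist. The sketch below assumes the query and candidate embeddings have already been produced by the official pipeline and that candidates are ranked by cosine similarity, which is a common choice for embedding retrieval but is not confirmed by the card. The vectors and the names `rank_candidates`, `doc_a`, `doc_b`, and `doc_c` are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_candidates(query, candidates):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Placeholder 3-dimensional vectors; real embeddings are much larger.
query_embedding = [0.9, 0.1, 0.0]
candidate_embeddings = {
    "doc_a": [0.8, 0.2, 0.1],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [-0.7, 0.1, 0.3],
}
ranking = rank_candidates(query_embedding, candidate_embeddings)
```

Because all modalities share one retrieval space, the same ranking function would apply whether the candidates are text, images, videos, or visual documents.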
+
+## Citation
+
+```bibtex
+@article{he2026plume,
+  title   = {PLUME: Latent Reasoning Based Universal Multimodal Embedding},
+  author  = {He, Chenwei and Hao, Xiangzhao and Yang, Tianyu and Ma, Yuxiang and
+             Jia, Yuheng and Wu, Lingxiang and Zhao, Chaoyang and Guo, Haiyun and Wang, Jinqiao},
+  journal = {arXiv preprint arXiv:2604.02073},
+  year    = {2026}
+}
+```
+
+- **Paper**: [arXiv:2604.02073](https://arxiv.org/abs/2604.02073)
+- **Code**: [github.com/haoxiangzhao12138/PLUME](https://github.com/haoxiangzhao12138/PLUME)