	---
	license: apache-2.0
	base_model:
	- zhibinlan/UME-R1-7B
	language:
	- en
	tags:
	- multimodal-embedding
	- universal-multimodal-embedding
	- retrieval
	- latent-reasoning
	- mllm
	- qwen2-vl
	pipeline_tag: feature-extraction
	library_name: transformers
	---

	# PLUME-7B

	PLUME (Latent Reasoning Based Universal Multimodal Embedding) is a 7B universal multimodal embedding model that maps heterogeneous inputs — text, images, videos, and visual documents — into a single shared retrieval space.

	Recent universal multimodal embedding (UME) methods improve retrieval by generating explicit chain-of-thought (CoT) rationales before extracting an embedding. This is effective but slow, and it forces rich multimodal evidence through a narrow textual bottleneck. PLUME instead replaces verbalized CoT with a short autoregressive rollout of continuous latent states, and uses a semantic-anchor-guided transition adapter to steer the latent computation along input-dependent reasoning trajectories under a fixed compute budget. The model is trained with a progressive explicit-to-latent curriculum that uses verbalized reasoning as a temporary training scaffold and gradually transfers it into hidden-state computation, eliminating explicit CoT at inference.
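
As a rough illustration of what "a short autoregressive rollout of continuous latent states" means in general, the toy sketch below feeds each step's output state back in as the next step's input for a fixed number of steps, without decoding any tokens in between. Everything in it (`latent_rollout`, `toy_step`, the step count, the halving rule) is hypothetical and is not PLUME's architecture; in the real model each step is a forward pass through the backbone steered by the semantic-anchor-guided transition adapter, as implemented in the official code.

```python
from typing import Callable, List

Vector = List[float]


def latent_rollout(
    step: Callable[[Vector], Vector], state: Vector, num_steps: int
) -> List[Vector]:
    """Apply `step` repeatedly, feeding each output state back in as the next input.

    Returns the whole trajectory, starting with the initial state.
    """
    trajectory = [state]
    for _ in range(num_steps):
        state = step(state)  # continuous state in, continuous state out
        trajectory.append(state)
    return trajectory


def toy_step(state: Vector) -> Vector:
    # Hypothetical stand-in for one model forward pass: halve every coordinate.
    return [0.5 * x for x in state]


# Fixed compute budget: exactly three latent steps, regardless of the input.
trajectory = latent_rollout(toy_step, [1.0, -2.0], num_steps=3)
final_state = trajectory[-1]
```

The point of the sketch is only the control flow: the number of steps is fixed in advance, and the final state (here `final_state`) is what an embedding would be read from.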

	This checkpoint is built on the UME-R1-7B backbone (Qwen2-VL-7B architecture).

	## Highlights

	- Universal: a single model for text / image / video / visual-document embeddings.
	- Latent reasoning: fewer than 10 latent steps replace hundreds of generated CoT tokens, giving >30× faster inference than explicit-CoT UME at comparable or better quality.
	- Strong retrieval: evaluated on the 78-task MMEB-v2 benchmark, outperforming strong explicit-CoT UME baselines — especially where evidence is dense and structurally complex (video and visual-document retrieval).

	## Model details

	- Backbone: [`zhibinlan/UME-R1-7B`](https://huggingface.co/zhibinlan/UME-R1-7B) (Qwen2-VL-7B, `Qwen2VLForConditionalGeneration`)
	- Parameters: ~7B, weights in half precision (4 safetensors shards, ~17 GB)
	- License: Apache-2.0

	## Usage

	The weights load as a standard Qwen2-VL checkpoint:

	```python
	from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

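	# Load the backbone weights. torch_dtype="auto" keeps the precision stored in the
	# checkpoint; device_map="auto" requires the `accelerate` package.
	model = Qwen2VLForConditionalGeneration.from_pretrained(
	    "Rem520/PLUME-7B", torch_dtype="auto", device_map="auto"
	)
	# The processor handles tokenization and image/video preprocessing.
	processor = AutoProcessor.from_pretrained("Rem520/PLUME-7B")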
	```

	To use the full PLUME embedding pipeline (latent rollout + semantic-anchor-guided transition adapter), follow the official code: https://github.com/haoxiangzhao12138/PLUME
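
Once query and candidate embeddings have been extracted with the official pipeline, retrieval in the shared space reduces to nearest-neighbour ranking. The sketch below assumes cosine similarity as the scoring function, which this card does not specify; the helper names (`cosine_similarity`, `rank_candidates`) and the 3-dimensional vectors are made up for illustration and stand in for real PLUME embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(
    query: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder vectors (assumption): real embeddings come from the PLUME pipeline.
query_embedding = [1.0, 0.0, 0.0]
candidate_embeddings = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query, different length
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]

ranking = rank_candidates(query_embedding, candidate_embeddings)
best_index = ranking[0][0]
```

Because all modalities share one embedding space, the same ranking function applies whether the candidates are text, images, videos, or visual documents.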

	## Citation

	```bibtex
	@article{he2026plume,
	  title   = {PLUME: Latent Reasoning Based Universal Multimodal Embedding},
	  author  = {He, Chenwei and Hao, Xiangzhao and Yang, Tianyu and Ma, Yuxiang and
	             Jia, Yuheng and Wu, Lingxiang and Zhao, Chaoyang and Guo, Haiyun and Wang, Jinqiao},
	  journal = {arXiv preprint arXiv:2604.02073},
	  year    = {2026}
	}
	```

	- Paper: [arXiv:2604.02073](https://arxiv.org/abs/2604.02073)
	- Code: [github.com/haoxiangzhao12138/PLUME](https://github.com/haoxiangzhao12138/PLUME)