| license: apache-2.0 | |
| base_model: | |
| - zhibinlan/UME-R1-7B | |
| language: | |
| - en | |
| tags: | |
| - multimodal-embedding | |
| - universal-multimodal-embedding | |
| - retrieval | |
| - latent-reasoning | |
| - mllm | |
| - qwen2-vl | |
| pipeline_tag: feature-extraction | |
| library_name: transformers | |
| # PLUME-7B | |
| **PLUME** (Latent Reasoning Based Universal Multimodal Embedding) is a 7B universal multimodal embedding model that maps heterogeneous inputs — text, images, videos, and visual documents — into a single shared retrieval space. | |
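Once every input lives in one shared space, retrieval typically reduces to nearest-neighbor search over vectors. The sketch below assumes cosine similarity as the scoring function, which this card does not state, and uses toy 3-dimensional vectors in place of real model outputs. The helper names `l2_normalize` and `rank_candidates` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_candidates(query_emb, candidate_embs):
    """Return (candidate_id, cosine_score) pairs sorted from best to worst match."""
    q = l2_normalize(query_emb)
    scored = []
    for cand_id, emb in candidate_embs.items():
        c = l2_normalize(emb)
        scored.append((cand_id, sum(a * b for a, b in zip(q, c))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings of different modalities (hypothetical values).
query = [0.9, 0.1, 0.0]  # e.g. a text query
candidates = {
    "image_a": [0.8, 0.2, 0.1],
    "video_b": [0.0, 1.0, 0.0],
    "doc_page_c": [0.1, 0.1, 0.9],
}

ranking = rank_candidates(query, candidates)
for cand_id, score in ranking:
    print(f"{cand_id}: {score:.3f}")
```

Because the candidates share one space, the same ranking function applies whether a candidate came from an image, a video, or a document page.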
| Recent universal multimodal embedding (UME) methods improve retrieval by generating explicit chain-of-thought (CoT) rationales before extracting an embedding. This is effective but slow, and it forces rich multimodal evidence through a narrow textual bottleneck. PLUME instead replaces verbalized CoT with a **short autoregressive rollout of continuous latent states**, and uses a **semantic-anchor-guided transition adapter** to steer the latent computation along input-dependent reasoning trajectories under a fixed compute budget. The model is trained with a **progressive explicit-to-latent curriculum** that uses verbalized reasoning as a temporary training scaffold and gradually transfers it into hidden-state computation, eliminating explicit CoT at inference. | |
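To make the idea of a latent rollout concrete, here is a toy sketch of the control flow only: a fixed number of steps in which each hidden state is fed back as the next input, with no token decoded in between. It is not the PLUME implementation. `toy_backbone_step` stands in for a decoder forward pass, and `toy_transition` is a hypothetical placeholder for the transition adapter, whose actual design is defined in the official code rather than here.

```python
import math


def toy_backbone_step(x, weights):
    """Stand-in for one decoder forward pass: tanh(W @ x). Not the real model."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def toy_transition(h, anchor, mix=0.5):
    """Hypothetical adapter: blend the current latent state with an anchor vector."""
    return [(1.0 - mix) * hi + mix * ai for hi, ai in zip(h, anchor)]


def latent_rollout(h0, anchor, weights, num_steps):
    """Run a fixed number of latent steps, feeding each state back as the next input.

    No token is sampled or decoded between steps; the loop stays in vector space.
    """
    h = list(h0)
    trajectory = [h]
    for _ in range(num_steps):
        x = toy_transition(h, anchor)      # steer the next input with the anchor
        h = toy_backbone_step(x, weights)  # one forward pass per latent step
        trajectory.append(h)
    return h, trajectory


# All values below are made up for illustration.
weights = [[0.5, -0.2, 0.1],
           [0.3, 0.8, -0.5],
           [-0.4, 0.1, 0.9]]
h0 = [0.2, -0.1, 0.4]      # hypothetical state after encoding the input
anchor = [1.0, 0.0, -1.0]  # hypothetical input-derived anchor

embedding, trajectory = latent_rollout(h0, anchor, weights, num_steps=8)
print(len(trajectory) - 1, "latent steps ->", [round(v, 3) for v in embedding])
```

The fixed `num_steps` is what bounds the compute budget: the cost is the same for every input, unlike a generated rationale whose length varies.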
| This checkpoint is built on the **UME-R1-7B** backbone (Qwen2-VL-7B architecture). | |
| ## Highlights | |
| - **Universal**: a single model for text / image / video / visual-document embeddings. | |
| - **Latent reasoning**: fewer than 10 latent steps replace hundreds of generated CoT tokens, giving **>30× faster inference** than explicit-CoT UME at comparable or better quality. | |
| - **Strong retrieval**: evaluated on the 78-task **MMEB-v2** benchmark, outperforming strong explicit-CoT UME baselines — especially where evidence is dense and structurally complex (video and visual-document retrieval). | |
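A back-of-envelope way to see where the speedup comes from is to count sequential generation steps. The numbers below are illustrative picks within the ranges quoted above ("hundreds" of CoT tokens, "fewer than 10" latent steps), not measurements, and the ratio ignores prefill cost and per-step overhead, so it is not a reproduction of the reported figure.

```python
def step_ratio(cot_tokens, latent_steps):
    """Ratio of sequential generation steps: explicit CoT versus latent rollout."""
    if latent_steps <= 0:
        raise ValueError("latent_steps must be positive")
    return cot_tokens / latent_steps


# Hypothetical example values, chosen only to illustrate the arithmetic.
cot_tokens = 300   # one decoder pass per generated rationale token
latent_steps = 8   # one decoder pass per latent step

ratio = step_ratio(cot_tokens, latent_steps)
print(f"{ratio:.1f}x fewer sequential steps")  # 37.5x fewer sequential steps
```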
| ## Model details | |
| - **Backbone**: [`zhibinlan/UME-R1-7B`](https://huggingface.co/zhibinlan/UME-R1-7B) (Qwen2-VL-7B, `Qwen2VLForConditionalGeneration`) | |
| - **Parameters**: ~7B, weights in half precision (4 safetensors shards, ~17 GB) | |
| - **License**: Apache-2.0 | |
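The checkpoint size follows from simple arithmetic: half precision stores each parameter in 2 bytes. The sketch below uses decimal gigabytes (1 GB = 1e9 bytes) and hypothetical helper names. The gap between the nominal "7B" and the parameter count implied by ~17 GB is plausibly explained by the vision encoder and embedding tables, which is an assumption rather than something this card states.

```python
def weight_size_gb(num_params, bytes_per_param=2):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def implied_params(size_gb, bytes_per_param=2):
    """Parameter count implied by a checkpoint size at a given precision."""
    return size_gb * 1e9 / bytes_per_param


nominal_gb = weight_size_gb(7e9)  # nominal 7B parameters at 2 bytes each
implied = implied_params(17)      # parameters implied by a ~17 GB checkpoint

print(f"7B params in half precision: {nominal_gb:.1f} GB")        # 14.0 GB
print(f"17 GB in half precision: {implied / 1e9:.1f}B params")    # 8.5B params
```

In practice, plan for more accelerator memory than the weights alone, since activations and long visual inputs add to the footprint.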
| ## Usage | |
| The weights load as a standard Qwen2-VL checkpoint: | |
| ```python | |
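| from transformers import AutoProcessor, Qwen2VLForConditionalGeneration | |
| # torch_dtype="auto" keeps the precision stored in the checkpoint; | |
| # device_map="auto" places the weights automatically and requires the accelerate package. | |
| model = Qwen2VLForConditionalGeneration.from_pretrained( | |
|     "Rem520/PLUME-7B", torch_dtype="auto", device_map="auto" | |
| ) | |
| # The processor bundles the tokenizer with the image and video preprocessing. | |
| processor = AutoProcessor.from_pretrained("Rem520/PLUME-7B") | |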
| ``` | |
| To use the full PLUME embedding pipeline (latent rollout + semantic-anchor-guided transition adapter), follow the official code: **https://github.com/haoxiangzhao12138/PLUME** | |
| ## Citation | |
| ```bibtex | |
| @article{he2026plume, | |
|   title   = {PLUME: Latent Reasoning Based Universal Multimodal Embedding}, | |
|   author  = {He, Chenwei and Hao, Xiangzhao and Yang, Tianyu and Ma, Yuxiang and | |
|              Jia, Yuheng and Wu, Lingxiang and Zhao, Chaoyang and Guo, Haiyun and Wang, Jinqiao}, | |
|   journal = {arXiv preprint arXiv:2604.02073}, | |
|   year    = {2026} | |
| } | |
| ``` | |
| - **Paper**: [arXiv:2604.02073](https://arxiv.org/abs/2604.02073) | |
| - **Code**: [github.com/haoxiangzhao12138/PLUME](https://github.com/haoxiangzhao12138/PLUME) | |