---
base_model:
  - Qwen/Qwen2-VL-2B-Instruct
datasets:
  - VLM2Vec/MMEB-V2
language:
  - en
library_name: transformers
pipeline_tag: feature-extraction
---

# PLUME-Qwen2-VL-2B

**PLUME: Latent Reasoning Based Universal Multimodal Embedding**

PLUME is a latent reasoning framework for universal multimodal embedding (UME). It replaces explicit chain-of-thought (CoT) generation with a short autoregressive rollout of continuous latent states, achieving stronger retrieval performance while delivering over 30x faster inference compared to explicit-CoT methods.
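The core idea can be sketched in a toy example. This is **not** the actual PLUME implementation — the state update and pooling below are hypothetical stand-ins — it only illustrates the mechanism: instead of decoding hundreds of explicit chain-of-thought tokens, the model autoregressively rolls out a small fixed number of continuous latent states and pools them into one embedding.

```python
NUM_LATENT_STEPS = 8  # PLUME replaces explicit CoT with 8 latent steps

def rollout_latents(h0, step_fn, num_steps=NUM_LATENT_STEPS):
    """Autoregressively apply step_fn to produce a short latent trajectory."""
    states = [h0]
    for _ in range(num_steps):
        states.append(step_fn(states[-1]))
    return states[1:]  # the rolled-out latent states

def pool(states):
    """Mean-pool the latent trajectory into a single embedding vector."""
    dim = len(states[0])
    return [sum(s[i] for s in states) / len(states) for i in range(dim)]

# Hypothetical 'model': a fixed affine update standing in for one decoder step.
step = lambda h: [0.5 * x + 0.1 for x in h]

h0 = [1.0, -1.0, 0.0]  # stand-in for the encoder's final hidden state
embedding = pool(rollout_latents(h0, step))
print(len(embedding))  # → 3: same dimensionality as the latent state
```

Because the rollout length is a small constant (8) rather than a variable-length generated text, inference cost is essentially fixed per sample.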

[Paper](https://arxiv.org/abs/2604.02073) | [Code](https://github.com/haoxiangzhao12138/PLUME)

## Highlights

- Replaces hundreds of explicit reasoning tokens with only 8 latent steps
- 30.3x faster inference than UME-R1 (298 ms vs. 9023 ms per sample)
- 61.6 overall on the 78-task MMEB-v2 benchmark, surpassing UME-R1 (60.1) and VLM2Vec-V2 (58.0)
- Particularly strong on Video (+1.9 vs. UME-R1) and Visual Document (+3.6 vs. UME-R1) retrieval
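The 30.3x speedup figure follows directly from the reported per-sample latencies, as a quick check shows:

```python
# Reported per-sample inference latencies (from the highlights above)
ume_r1_ms = 9023  # explicit-CoT baseline (UME-R1)
plume_ms = 298    # PLUME's latent-reasoning rollout

speedup = ume_r1_ms / plume_ms
print(round(speedup, 1))  # → 30.3
```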

## Results on MMEB-v2

| Model      | Image | Video | VisDoc | All  |
|------------|-------|-------|--------|------|
| VLM2Vec-V2 | 64.9  | 34.9  | 65.4   | 58.0 |
| UME-R1     | 66.6  | 42.2  | 63.9   | 60.1 |
| PLUME      | 66.3  | 44.1  | 67.5   | 61.6 |

## Usage

See the full training and evaluation pipeline at: https://github.com/haoxiangzhao12138/PLUME

## Download

```shell
# Option 1: huggingface-cli
huggingface-cli download CUDAOUTOFMEMORY/PLUME-Qwen2-VL-2B --local-dir /path/to/model

# Option 2: git clone (requires git-lfs)
git lfs install
git clone https://huggingface.co/CUDAOUTOFMEMORY/PLUME-Qwen2-VL-2B
```

## Citation

```bibtex
@misc{he2026plumelatentreasoningbased,
      title={PLUME: Latent Reasoning Based Universal Multimodal Embedding},
      author={Chenwei He and Xiangzhao Hao and Tianyu Yang and Yuxiang Ma and Yuheng Jia and Lingxiang Wu and Chaoyang Zhao and Haiyun Guo and Jinqiao Wang},
      year={2026},
      eprint={2604.02073},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.02073},
}
```