Full ColBERT & H-Pool — Qwen2.5-VL-3B (ViDoRe)
This checkpoint supports two inference modes from the same weights: (1) Full ColBERT (uncompressed), which uses all token-level vectors for late interaction, and (2) H-Pool, a parameter-free compression that applies Ward hierarchical clustering at inference time to reduce each document's vectors to a fixed budget (e.g., 64). No extra parameters are added; you switch behavior by setting `pooling="colbert"` or `pooling="hierarchical_clustering"`. Weights are initialized from Qwen2.5-VL-3B-Instruct and finetuned on the ColPali train set for text-to-visual-document retrieval with bidirectional attention.
Method Overview
Full ColBERT keeps the full multi-vector representation (all token embeddings) and scores with ColBERT-style MaxSim. H-Pool compresses document tokens to a fixed number of vectors via Ward hierarchical clustering (cosine similarity → distance, then cluster and average within clusters); queries stay uncompressed. Both use the same checkpoint; only the pooling option changes.
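The H-Pool step can be sketched in a few lines with SciPy. This is an illustrative re-implementation, not the checkpoint's code: the function name `h_pool` and its signature are ours, and note that SciPy's Ward linkage nominally assumes Euclidean distances, while here we feed it cosine distances as the method describes.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def h_pool(doc_vectors: np.ndarray, budget: int = 64) -> np.ndarray:
    """Compress (num_tokens, dim) token embeddings to at most (budget, dim)."""
    if doc_vectors.shape[0] <= budget:
        return doc_vectors
    # Cosine similarity -> distance between L2-normalized token embeddings.
    dists = pdist(doc_vectors, metric="cosine")
    # Ward hierarchical clustering, cutting the tree into `budget` clusters.
    labels = fcluster(linkage(dists, method="ward"), t=budget, criterion="maxclust")
    # Average the token embeddings within each cluster.
    pooled = np.stack([doc_vectors[labels == c].mean(axis=0) for c in np.unique(labels)])
    # Re-normalize so MaxSim scoring still operates on unit vectors.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```

Queries are never passed through this step; only document-side vectors are compressed, so the query's token-level granularity is preserved for late interaction.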
Results on ViDoRe v2
The Baseline (Ours, uncompressed) and H-Pool rows in the table below are produced from this checkpoint; the remaining rows are comparison systems.
| Method | Tokens | nDCG@5 (Avg) | Bio | Econ | ESG-R | ESG-H |
|---|---|---|---|---|---|---|
| ColPali | – | 53.3 | 56.5 | 49.9 | 55.7 | 51.1 |
| ColQwenOmni | – | 56.5 | 56.5 | 53.2 | 54.2 | 62.2 |
| MetaEmbed | 64 | 58.8 | 58.7 | 55.5 | 57.4 | 63.7 |
| Baseline (Ours, uncompressed) | 1297 | 60.0 | 61.4 | 53.9 | 57.0 | 67.6 |
| SeqResize | 64 | 51.7 | 54.7 | 53.5 | 45.2 | 53.5 |
| MemTok | 64 | 54.3 | 56.8 | 53.0 | 46.4 | 61.4 |
| H-Pool (this checkpoint) | 64 | 56.4 | 59.6 | 52.1 | 53.4 | 60.6 |
| AGC | 64 | 56.7 | 59.0 | 54.5 | 55.8 | 57.3 |
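The Avg column is the unweighted mean of the four per-domain nDCG@5 scores, which can be checked directly for the two rows produced by this checkpoint:

```python
rows = {
    "Baseline (Ours, uncompressed)": (61.4, 53.9, 57.0, 67.6),
    "H-Pool (this checkpoint)": (59.6, 52.1, 53.4, 60.6),
}
for name, scores in rows.items():
    # Mean of Bio, Econ, ESG-R, ESG-H, rounded to one decimal as in the table.
    print(name, round(sum(scores) / len(scores), 1))
```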
Model Details
| Field | Value |
|---|---|
| Initial weights | Qwen2.5-VL-3B-Instruct |
| Architecture | Qwen2.5-VL with bidirectional attention |
| Hidden dimension | 2048 |
| Pooling | `colbert` (full) or `hierarchical_clustering` (H-Pool) |
| Budget | H-Pool: 64 vectors per document |
| Scoring | ColBERT-style MaxSim (late interaction) |
| Normalization | L2-normalized embeddings |
| Query prefix | `"Query: "` |
| Passage prefix | `"Passage: "` |
| Precision | bfloat16 |
| Max image tokens | 1280 |
Usage
Use Full ColBERT (uncompressed) with pooling="colbert", or H-Pool with pooling="hierarchical_clustering" and num_repr_vectors=64. Same checkpoint; only the pooling argument changes.
```python
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info
from src.arguments import ModelArguments
from src.encoder.multivec_encoder import MultiVecEncoder
from src.models.qwen2_5_vl_embed.qwen2_5_vl_embed import Qwen2_5ForEmbedding

MODEL_ID = "hltcoe/ColBERT_qwen2.5-vl_colpali"
IMAGE_PATH = "PLACEHOLDER"

# Full (uncompressed) ColBERT:
# model_args = ModelArguments(model_name_or_path=MODEL_ID, pooling="colbert", normalize=True, attn_implementation="flash_attention_2")

# H-Pool (64 vectors):
model_args = ModelArguments(
    model_name_or_path=MODEL_ID,
    pooling="hierarchical_clustering",
    normalize=True,
    num_repr_vectors=64,
    attn_implementation="flash_attention_2",
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = MultiVecEncoder.load(
    Qwen2_5ForEmbedding,
    model_args,
    attn_implementation=model_args.attn_implementation,
    dtype=torch.bfloat16,
)
model = model.to("cuda").eval()
```
```python
# --- Encode an image document ---
passage_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Passage: "},
            {"type": "image", "image": IMAGE_PATH, "max_pixels": 1003520, "min_pixels": 614656},
        ],
    }
]
text = processor.apply_chat_template(passage_messages, tokenize=False, add_generation_prompt=False)
image_inputs, video_inputs = process_vision_info(passage_messages)
passage_inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt",
).to("cuda")

with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        doc_embeddings, doc_mask = model.encode(passage_inputs, is_query=False)

print(doc_embeddings.shape)
# colbert: (1, seq_len, 2048); hierarchical_clustering: (1, 64, 2048)
```
```python
# --- Encode a text query ---
query_messages = [{"role": "user", "content": [{"type": "text", "text": "Query: What types of tissues are unable to regenerate spontaneously?"}]}]
query_text = processor.apply_chat_template(query_messages, tokenize=False, add_generation_prompt=False)
query_inputs = processor(text=[query_text], padding=True, return_tensors="pt").to("cuda")

with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        query_embeddings, query_mask = model.encode(query_inputs, is_query=True)

print(query_embeddings.shape)

# --- ColBERT MaxSim scoring ---
score = model.compute_similarity(query_embeddings, doc_embeddings, query_mask, doc_mask)
print(f"Similarity score: {score.item():.4f}")
```
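Under the hood, MaxSim late interaction takes, for each query token vector, its best cosine similarity against all document vectors and sums over query tokens. A minimal standalone sketch (`maxsim` is our illustrative function for L2-normalized embeddings, not the repository's `compute_similarity`):

```python
import torch

def maxsim(q: torch.Tensor, d: torch.Tensor, q_mask: torch.Tensor) -> torch.Tensor:
    """q: (1, Lq, dim), d: (1, Ld, dim), q_mask: (1, Lq); vectors L2-normalized."""
    sim = q @ d.transpose(-1, -2)       # (1, Lq, Ld) token-level cosine similarities
    best = sim.max(dim=-1).values       # best-matching doc vector per query token
    return (best * q_mask).sum(dim=-1)  # sum over non-padding query tokens
```

Because H-Pool only shrinks the document side, this same scoring applies unchanged whether `d` holds the full token sequence or the 64 pooled vectors.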
Command line usage
For running inference and evaluation from the command line, see the Quick Start section.
Citation
```bibtex
@misc{qin2026multivectorindexcompressionmodality,
  title={Multi-Vector Index Compression in Any Modality},
  author={Hanxiang Qin and Alexander Martin and Rohan Jha and Chunsheng Zuo and Reno Kriz and Benjamin Van Durme},
  year={2026},
  eprint={2602.21202},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2602.21202},
}
```