---
pipeline_tag: text-classification
tags:
- vidore
- reranker
- qwen25_vl
language:
- multilingual
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
inference: false
library_name: transformers
---

# MultimodalQwenLogitReranker-3B

* Model Name: MultimodalQwenLogitReranker-3B
* Model Type: Multilingual Multimodal Reranker
* Base Model: Qwen/Qwen2.5-VL-3B-Instruct
* Architecture Modifications: LoRA fine-tuned classification on the "yes" vs. "no" token logits, scored with a sigmoid (inspired by the Qwen text reranker: https://qwenlm.github.io/blog/qwen3-embedding)
* Training Setup: Resource-constrained (single A100, batch size 2)

## Model Description

QwenLogitReranker is a multilingual reranking model trained with a simple but effective strategy inspired by the Alibaba Qwen text reranker. Instead of adding a classification head, it computes relevance scores by applying a sigmoid to the logit difference between the tokens "yes" and "no" (see the scoring sketch below). The model is designed to be lightweight, general-purpose, and compatible with the multimodal Qwen2.5-VL architecture.

## Training Details

* Training Dataset: DocVQA (2,000 randomly sampled training examples)
* Epochs: 1
* Batch Size: 2
* Negative Mining: In-batch hard negatives
* Loss Function: Binary classification (logit difference between "yes" and "no" passed through a sigmoid)
* Optimizer: AdamW
* Fine-Tuning Method: LoRA + the transformers Trainer (with a specific trick to handle Qwen2.5-VL's unbatched pixel_values)
* Hardware: Single A100 GPU

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66252d1725100e17022cc676/kHU-Rtd98W_ApLmdfjDkY.png)

## Evaluation Results (NDCG@5)

| Dataset | Jina Reranker m0 (Baseline) | QwenLogitReranker |
| ------- | --------------------------- | ----------------- |
| UlrickBL/vidore_benchmark_economics_reports_v2_reranker_adapted | 0.735 | **0.799** |
| UlrickBL/vidore_benchmark_2_biomedical_lectures_v2_reranker_adapted | **0.763** | 0.755 |
| UlrickBL/vidore_benchmark_2_esg_reports_human_labeled_v2_reranker_adapted | **0.851** | 0.820 |
| UlrickBL/vidore_benchmark_arxivqa_reranker_adapted | **0.767** | 0.747 |
| UlrickBL/vidore_benchmark_2_esg_reports_v2_reranker_adapted | **0.920** | 0.910 |
| Inference time (4898×2810 image, T4 GPU) | 2.212 s | **1.161 s** |

Note: Despite far less training data, data diversity, and compute, QwenLogitReranker shows competitive or superior performance, especially on the economics benchmark.
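## Scoring sketch

The scoring rule described above reduces to a sigmoid over the difference of two logits. Here is a minimal, self-contained sketch of that rule in plain PyTorch, using the same token ids as the reranker code below; the random logits and the vocabulary size are illustrative stand-ins for real model output:

```python
import torch

# Token ids used by the reranker below ("Yes" = 9454, "No" = 2753 in the Qwen2.5 tokenizer).
TOKEN_ID_YES = 9454
TOKEN_ID_NO = 2753

def relevance_score(last_token_logits: torch.Tensor) -> torch.Tensor:
    """Map the vocabulary logits of the final prompt token to a relevance score in (0, 1)."""
    logit_diff = last_token_logits[..., TOKEN_ID_YES] - last_token_logits[..., TOKEN_ID_NO]
    return torch.sigmoid(logit_diff)

# Illustration with random logits; the vocabulary size here is arbitrary but
# large enough to contain both token ids.
dummy_logits = torch.randn(2, 151936)  # [batch_size, vocab_size]
print(relevance_score(dummy_logits))   # two scores in (0, 1)
```

Because the score is a probability-like value in (0, 1), candidates can be ranked by sorting on it directly, with no extra calibration layer.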
## Limitations

* Trained only on a small subset (2,000 samples) of DocVQA
* One epoch of training; performance would likely improve with more compute/data
* Currently uses causal language model decoding to simulate classification, which is slower than embedding-based methods (making it bidirectional, as in collama, could improve performance but would need more compute)

## Load model

```python
import torch
from PIL import Image
from torch import nn
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor


class Qwen2_5Reranker(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model

    def forward(self, input_ids, pixel_values, attention_mask, image_grid_thw,
                original_length=None, labels=None):
        # Re-flatten pixel_values when they arrive batched (the training-time trick for
        # Qwen2.5-VL's normally unbatched pixel_values); single-image inference skips this.
        if len(pixel_values.shape) == 3:
            pixel_values = pixel_values.transpose(0, 1).reshape(-1, pixel_values.shape[-1])
            pixel_values = pixel_values[:original_length[0].item()]

        outputs = self.base_model.forward(
            input_ids=input_ids,
            pixel_values=pixel_values,
            image_grid_thw=image_grid_thw,
            attention_mask=attention_mask,
        )
        logits = outputs.logits

        # Select the logits of the last non-padding token of each sequence.
        batch_size = logits.size(0)
        batch_indices = torch.arange(batch_size, device=logits.device)
        lengths = attention_mask.sum(dim=1)
        token_pos = lengths - 1

        token_id_yes = 9454  # id of the "Yes" token in the Qwen2.5 tokenizer
        token_id_no = 2753   # id of the "No" token in the Qwen2.5 tokenizer

        selected_logits = logits[batch_indices, token_pos]
        yes_logits = selected_logits[:, token_id_yes]  # shape: [batch_size]
        no_logits = selected_logits[:, token_id_no]    # shape: [batch_size]

        # Relevance score: sigmoid of the "yes" vs. "no" logit difference.
        logit_diff = yes_logits - no_logits
        prob_yes = torch.sigmoid(logit_diff)
        return prob_yes


# Load the base model, processor, and LoRA adapter
max_pixels = 1080 * 28 * 28
model_qwen = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    output_hidden_states=True,
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", max_pixels=max_pixels)
base = PeftModel.from_pretrained(model_qwen, "UlrickBL/qwen_vl_reranker_adapter_V2")
model = Qwen2_5Reranker(base_model=base)
model = model.to("cuda")
model.eval()
```

## Inference code

```python
import time
import requests
from io import BytesIO

start_time = time.time()

url = "https://oto.hms.harvard.edu/sites/g/files/omnuum8391/files/2025-04/PowerPoint-Presentation-Graphic.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content))

# Prompt in the Qwen chat format: system instruction, image placeholder, then the query.
query = (
    "<|im_start|>system\nYou will be given an picture and a query. Answer 'Yes' if the answer "
    "to the query can be found in the picture, else 'No'<|im_end|>\n"
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Query : "
    + "What is the Harvard study department in the question ?"
    + " \nAre the picture and query related ?<|im_end|>\n<|im_start|>assistant\n"
)

inputs = processor(
    text=[query],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

with torch.no_grad():
    batch_scores = model(**inputs)

end_time = time.time()
print(f"Score : {batch_scores.item():.4f}")
print(f"Time taken : {end_time - start_time:.4f} seconds")
```

## Future Work

* Expand training with additional multilingual and domain-specific datasets
* Increase batch size and number of epochs
* Compare with a last-hidden-state + classification-layer variant
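## Reranking multiple candidates (sketch)

Building on the single-pair example above, a minimal sketch of reranking several candidate pages for one query. It assumes `model` and `processor` are already loaded as in the Load model section, reuses the same prompt format, and scores each page with an independent forward pass (sidestepping the unbatched pixel_values issue mentioned in Training Details); the URLs in the usage comment are hypothetical:

```python
import requests
import torch
from io import BytesIO
from PIL import Image

# Same prompt format as the inference example above; the query slot is filled per call.
PROMPT = (
    "<|im_start|>system\nYou will be given an picture and a query. Answer 'Yes' if the answer "
    "to the query can be found in the picture, else 'No'<|im_end|>\n"
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Query : {query} "
    "\nAre the picture and query related ?<|im_end|>\n<|im_start|>assistant\n"
)

def rerank(query: str, images: list, top_k: int = 5):
    """Score each candidate image independently; return (index, score) pairs, best first."""
    scores = []
    for image in images:
        inputs = processor(
            text=[PROMPT.format(query=query)],
            images=[image],
            padding=True,
            return_tensors="pt",
        ).to("cuda")
        with torch.no_grad():
            scores.append(model(**inputs).item())
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)[:top_k]

# Example usage with hypothetical page images:
# urls = ["https://example.com/page1.jpg", "https://example.com/page2.jpg"]
# images = [Image.open(BytesIO(requests.get(u).content)) for u in urls]
# print(rerank("What is the revenue growth in 2023 ?", images))
```

One forward pass per page keeps memory flat and matches the single-image path of the reranker's `forward`; batching the candidates would require the pixel_values re-flattening trick used during training.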