---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

<a href="https://github.com/vec-ai/lychee-rerank-mm">
<img src="https://img.shields.io/badge/GitHub-%23121011.svg?logo=github&logoColor=white">
</a>

# Lychee-rerank-mm

`Lychee-rerank-mm` is the latest generalist multimodal reranking model built on the `Qwen2.5-VL-Instruct` foundation model. It is designed for reranking tasks in image-text multimodal retrieval scenarios.

`Lychee-rerank-mm` is jointly developed by the NLP Team of Harbin Institute of Technology, Shenzhen, and the 7B-parameter version is released as open source.

**Lychee-rerank-mm**:

- Model Type: Multimodal Reranking
- Language Support: en
- Param Size: 7B
- Model Precision: BF16

For more details, please refer to our paper.

### Model List

| Model Type             | Models               | Size | Instruction Aware    |
|------------------------|----------------------|------|----------------------|
| Multimodal Reranking   | [lychee-rerank-mm](https://huggingface.co/vec-ai/lychee-rerank-mm) | 8.29B | Yes |

> **Note**:
> - `Instruction Aware` indicates whether the reranking model supports customizing the input instruction for different tasks.
> - As with most models, using an instruction typically improves results on most downstream tasks compared with not using one. We therefore recommend that developers write instructions tailored to their own tasks and scenarios.

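
As a minimal sketch of the recommendation above, one way to keep task-specific instructions organized is a small lookup table with a fallback. The helper name `get_instruction`, the task keys, and every instruction string except the web-search one (which is taken from the usage example in this README) are illustrative assumptions, not wording validated by the authors.

```python
# Hypothetical mapping from a task name to an instruction string.
# Only the "web_search" wording appears in this README; the others are assumed.
INSTRUCTIONS = {
    "web_search": "Given a web search query, retrieve relevant passages that answer the query",
    "image_search": "Given a text query, retrieve images that match the description",
    "visual_qa": "Given an image and a question about it, retrieve passages that answer the question",
}
DEFAULT_TASK = "web_search"


def get_instruction(task_name):
    """Return the instruction for task_name, falling back to the default task."""
    return INSTRUCTIONS.get(task_name, INSTRUCTIONS[DEFAULT_TASK])


print(get_instruction("image_search"))
```

The returned string would be passed as the `instruction` argument when building the reranker input; it is worth comparing a few phrasings on a held-out set, since the best wording may differ by task.
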
## Model Usage

📌 **Tips**: We recommend that developers customize the `instruction` according to their specific scenarios.

### Transformers Usage

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info


def format_content(text, image, prefix='Query:'):
    # Build one side of the pair (query or document):
    # the prefix first, then an optional image, then optional text.
    content = [{'type': 'text', 'text': prefix}]
    if image:
        content.append({'type': 'image', 'image': 'file://' + image})
    if text:
        content.append({'type': 'text', 'text': text})
    return content


def format_instruction(instruction, query_text, query_image_path, doc_text, doc_image_path):
    # Assemble the chat messages: a fixed system prompt plus one user turn
    # containing the instruction, the query, and the document.
    inputs = []
    inputs.append({
        "role": "system",
        "content": [{
            "type": "text",
            "text": "Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\"."
        }]
    })
    contents = []
    contents.append({
        "type": "text",
        "text": '<Instruct>: ' + instruction
    })
    query_content = format_content(query_text, query_image_path, prefix='<Query>:')
    contents.extend(query_content)
    doc_content = format_content(doc_text, doc_image_path, prefix='\n<Document>:')
    contents.extend(doc_content)
    inputs.append({
        "role": "user",
        "content": contents
    })
    return inputs


def process_inputs(pairs):
    # `tokenizer` is the AutoProcessor created below; it handles text and images.
    texts = [
        tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        for messages in pairs
    ]
    try:
        image_inputs, video_inputs = process_vision_info(pairs)
    except Exception as e:
        # Re-raise: continuing would leave image_inputs and video_inputs undefined.
        print(f'Failed to load image ({e}); consider removing it from the dataset')
        raise
    inputs = tokenizer(
        text=texts,
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
        truncation=False,
        max_length=3200
    )
    for key in inputs:
        inputs[key] = inputs[key].to(model.device)
    return inputs


@torch.no_grad()
def compute_logits(inputs, **kwargs):
    # Take the logits at the last position and compare "yes" against "no".
    batch_scores = model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    # Probability of "yes" is used as the relevance score.
    scores = batch_scores[:, 1].exp().tolist()
    return scores


model_name_or_path = "vec-ai/lychee-rerank-mm"
min_pixels = 4*28*28
max_pixels = 1280*28*28
tokenizer = AutoProcessor.from_pretrained(model_name_or_path, padding_side='left', min_pixels=min_pixels, max_pixels=max_pixels, trust_remote_code=True)
# We recommend enabling flash_attention_2 for better acceleration and memory saving.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_name_or_path, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2").cuda().eval()

token_false_id = tokenizer.tokenizer.get_vocab()["no"]
token_true_id = tokenizer.tokenizer.get_vocab()["yes"]

task = 'Given a web search query, retrieve relevant passages that answer the query'

# Example 1: text-only queries and documents
query_texts = [
    "What is the capital of China?",
    "Explain gravity",
]

query_images = [
    None,
    None,
]

doc_texts = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

doc_images = [
    None,
    None,
]

pairs = [format_instruction(task, query_text, query_image, doc_text, doc_image) for query_text, query_image, doc_text, doc_image in zip(query_texts, query_images, doc_texts, doc_images)]

# Tokenize the input texts
inputs = process_inputs(pairs)
scores = compute_logits(inputs)

print("scores: ", scores)

# Example 2: one image+text query scored against two image+text documents
query_text = "What breed is the cat in the image?"
query_image = "./images/Siamese_cat1.jpg"
doc_texts = [
    "The Siamese cat is one of the first distinctly recognised breeds of Asian cat. It derives from the Wichianmat landrace. The Siamese cat is one of several varieties of cats native to Thailand (known as Siam before 1939). The original Siamese became one of the most popular breeds in Europe and North America in the 19th century. Siamese cats have a distinctive colourpoint coat, resulting from a temperature-sensitive type of albinism.",
    "The Asian or Asian group, is a cat breed similar to the European Burmese, but comes in a range of different coat colours and patterns. Long-haired Asians of all varieties are called Tiffanies. Asians are grouped in section 5 (Burmese) by the Governing Council of the Cat Fancy (GCCF)."
]
doc_images = [
    "./images/Siamese_cat2.jpg",
    "./images/Asian_cat.jpg",
]
pairs = [format_instruction(task, query_text, query_image, doc_text, doc_image) for doc_text, doc_image in zip(doc_texts, doc_images)]
inputs = process_inputs(pairs)
scores = compute_logits(inputs)
print("scores: ", scores)
```

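
The score returned by `compute_logits` is a two-way softmax over the "yes" and "no" logits at the last position, which is equivalent to a sigmoid of their difference. The sketch below reproduces that arithmetic in plain Python and then sorts candidate documents by score. The helper names `yes_probability` and `rank_documents`, and the logit values used, are hypothetical and chosen only for illustration.

```python
import math


def yes_probability(logit_yes, logit_no):
    """Two-way softmax: probability of "yes" given the two raw logits."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def rank_documents(docs, scores):
    """Return (doc, score) pairs sorted from most to least relevant."""
    return sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)


# Made-up logits for two candidate documents.
scores = [yes_probability(0.5, 1.5), yes_probability(2.0, 0.0)]
ranked = rank_documents(["doc_a", "doc_b"], scores)
print(ranked[0][0])  # doc_b has the larger "yes" margin, so it ranks first
```

Because only the difference between the two logits matters, scores are comparable across documents scored against the same query and instruction.
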
## Evaluation

| Model | Param | ALL (40) | T→T (14) | I→I (1) | T→I (4) | T→VD (5) | I→T (5) | T→IT (2) | IT→T (4) | IT→I (2) | IT→IT (3) |
|-------------|-------|----------|----------|---------|---------|----------|---------|----------|----------|----------|-----------|
| GME-2B | 2.21B | 52.54 | 49.59 | 30.75 | 48.46 | 66.39 | 52.62 | 77.02 | 39.88 | 36.70 | 66.89 |
||
| Qwen3-Reranker | 4.02B | -- | 60.49 | -- | -- | -- | -- | -- | -- | -- | -- |
| Jina-rerank-m0 | 2.21B | 54.36 | 55.36 | 27.50 | 59.46 | 73.13 | 55.43 | 74.95 | 27.82 | 37.65 | 51.54 |
| MonoQwen2-VL-v0.1 | 2.21B | 44.20 | 48.89 | 12.59 | 58.73 | 71.29 | 19.62 | 76.46 | 14.35 | 31.75 | 35.83 |
||
| **lychee-rerank-mm-3B** | 3.75B | 61.40 | 59.22 | 29.76 | 58.85 | 72.38 | 63.06 | 81.96 | 48.81 | 43.97 | 79.08 |
| **lychee-rerank-mm-7B** | 8.29B | 63.85 | 61.08 | 32.83 | 61.18 | 72.94 | 66.61 | 84.55 | 53.29 | 47.39 | 82.19 |

For more details, please refer to our paper.

## Citation

If you find our work helpful, please consider citing it.

```
@misc{dai2025supervisedfinetuningcontrastivelearning,
      title={Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking},
      author={Ziqi Dai and Xin Zhang and Mingxin Li and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Meishan Zhang and Wenjie Li and Min Zhang},
      year={2025},
      eprint={2510.14824},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.14824},
}
```