---
license: apache-2.0
datasets:
- xlangai/BRIGHT
- liuwenhan/reasonrank_data_13k
language:
- en
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-ranking
library_name: transformers
tags:
- passage ranking
- text-ranking
- reasoning
- Information-Retrieval
---
## Introduction
This is the model trained in our paper: GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning ([📝arXiv](https://arxiv.org/pdf/2511.11653)).
Please refer to our [🧩github repository](https://github.com/AQ-MedAI/Diver) for the usage of GroupRank-32B.
## Highlights
GroupRank is a reinforcement-learning-powered reranker that breaks the traditional **“pointwise vs. listwise”** trade-off: it feeds the query together with a small group of candidates to an LLM, lets the model perform within-group comparisons, and outputs individual relevance scores—combining the flexibility of pointwise scoring with the contextual awareness of listwise ranking.
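The groupwise idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: `groupwise_rerank`, `score_group`, and the `group_size` default are hypothetical names and values chosen for illustration, and a toy word-overlap scorer stands in for the LLM call.

```python
from typing import Callable, List, Sequence, Tuple


def groupwise_rerank(
    query: str,
    docs: Sequence[str],
    score_group: Callable[[str, Sequence[str]], List[float]],
    group_size: int = 20,  # hypothetical default, not taken from the paper
) -> List[Tuple[int, float]]:
    """Score documents group by group, then sort all of them by individual score."""
    scored: List[Tuple[int, float]] = []
    for start in range(0, len(docs), group_size):
        group = docs[start:start + group_size]
        # One call sees the whole group, so scores can reflect within-group comparison.
        scores = score_group(query, group)
        if len(scores) != len(group):
            raise ValueError("score_group must return one score per document")
        scored.extend((start + offset, s) for offset, s in enumerate(scores))
    # Python's sort is stable, so ties keep the original retrieval order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-in for the LLM: score = number of query words found in the document.
def toy_score_group(query: str, group: Sequence[str]) -> List[float]:
    words = set(query.lower().split())
    return [float(len(words & set(doc.lower().split()))) for doc in group]


ranking = groupwise_rerank(
    "reinforcement learning reranker",
    ["a cooking recipe", "a reranker trained with reinforcement learning", "learning to cook"],
    toy_score_group,
    group_size=2,
)
print(ranking)
```

Because every document receives its own score, results from different groups can be merged with a plain sort, which is only meaningful if scores are comparable across groups.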
Training is driven by GRPO and a heterogeneous reward (NDCG + distributional alignment) that keeps scores consistent across groups, while a synthetic data pipeline eliminates the need for large human-labeled sets.
On the reasoning-heavy BRIGHT and R2MED benchmarks, GroupRank (7B / 32B) sets a new state of the art in **NDCG@10 (46.8 / 52.3)**, with strong generalization to classic retrieval tasks.
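NDCG appears above both as part of the training reward and as the reported metric. For readers unfamiliar with it, here is a minimal sketch of the standard NDCG@k with linear gain. The exact variant used in the paper is not specified in this card, and the function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(predicted_scores: Sequence[float], true_relevances: Sequence[float], k: int = 10) -> float:
    """NDCG@k of the ranking induced by predicted_scores against true_relevances."""
    # Rank documents by predicted score, highest first.
    order = sorted(range(len(predicted_scores)), key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_relevances[i] for i in order]
    # The ideal ranking sorts documents by their true relevance.
    ideal_dcg = dcg_at_k(sorted(true_relevances, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


perfect = ndcg_at_k([8, 3, 5], [1, 0, 0])   # relevant document ranked first
worst = ndcg_at_k([3, 8, 5], [1, 0, 0])     # relevant document ranked last of three
print(perfect, worst)
```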
## Model Performance
<!-- <p align="center">
<img width="90%" alt="image" src="https://8421bcd.oss-cn-beijing.aliyuncs.com/img/image-20250810163757771.png" />
</p> -->
**BRIGHT benchmark results (NDCG@10)**
| Models | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Leet. | Pony | AoPS | TheoQ. | TheoT. | Avg. |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Diver-Retriever-4B** | 52.52 | 53.59 | 33.75 | 45.16 | 28.38 | 30.39 | 35.00 | 12.97 | 14.74 | 9.83 | 42.51 | 36.32 | 32.93 |
| **_Non-reasoning reranker_** | | | | | | | | | | | | | |
| rankT5 (3B) | 33.01 | 22.80 | 18.96 | 8.61 | 2.28 | 10.07 | 23.94 | 11.97 | 34.93 | 8.96 | 19.69 | 11.70 | 17.24 |
| RankZephyr (7B) | 42.60 | 19.51 | 18.79 | 29.85 | 13.98 | 13.33 | 29.29 | 12.97 | 31.37 | 7.44 | 31.19 | 31.67 | 23.50 |
| **_Reasoning reranker_** | | | | | | | | | | | | | |
| Rank-R1 (7B) | 40.85 | 31.38 | 23.17 | 32.04 | 19.82 | 11.23 | 35.53 | 3.77 | 6.60 | 4.99 | 17.20 | 30.10 | 21.39 |
| Rank-R1 (14B) | 49.57 | 41.15 | 27.56 | 40.04 | 28.46 | 28.19 | 43.76 | 6.96 | 18.45 | 7.88 | 34.80 | 43.11 | 30.83 |
| Rank-K (32B) | 51.06 | 42.34 | 32.95 | 44.52 | 33.07 | 28.44 | 41.57 | 12.82 | **21.62** | 8.48 | 39.31 | 43.15 | 33.28 |
| ReasonRank (7B)-w20s10 | 51.63 | 43.43 | 32.40 | 43.99 | 31.02 | 25.63 | 39.81 | 15.38 | 20.07 | 6.95 | 38.91 | 40.69 | 32.49 |
| ReasonRank (32B)-w20s10 | 53.89 | 47.59 | 36.34 | **52.64** | 36.48 | 34.17 | **44.47** | 15.21 | 14.81 | 5.47 | 40.63 | 45.29 | 35.58 |
| GroupRank-7B | 56.85 | 53.06 | 35.94 | 48.75 | 36.65 | 35.05 | 43.99 | 15.48 | 15.53 | 10.64 | 41.45 | 46.40 | 36.65 |
| GroupRank-32B | **59.48** | **56.49** | **40.12** | 50.46 | **38.36** | **39.16** | 43.32 | **19.48** | 16.32 | **13.34** | **46.39** | **47.91** | **39.24** |
### Inference
```python
from vllm import LLM, SamplingParams
from collections import defaultdict
from transformers import AutoTokenizer
import random
import torch
random.seed(666)
sys_prompt = '''Your task is to evaluate and rank documents based on how well they help answer the given query. Follow this evaluation priority:
1. PRIMARY: Usefulness & Helpfulness - Does the document provide actionable information, solutions, or direct answers that help address the user's needs?
2. SECONDARY: Relevance - Does the document contain information related to the query topic?
Evaluation Process:
1. First, identify the user's core intent and what kind of help they need from the query
2. For each document, assess:
- How directly it addresses the user's intent
- What actionable information or answers it provides
- How much it helps solve the user's problem or need
3. Compare documents against each other to ensure proper ranking
4. Assign scores that reflect the relative usefulness ranking
Scoring Scale (0-10):
- 9-10: Extremely helpful, directly answers the query with actionable information
- 7-8: Very helpful, provides substantial useful information for the query
- 5-6: Moderately helpful, contains some useful information but incomplete
- 3-4: Minimally helpful, limited useful information despite topic relevance
- 1-2: Barely helpful, mentions related topics but provides little useful information
- 0: Not helpful at all, cannot assist with answering the query
'''
user_prompt = '''I will provide you {TOPK} documents, each indicated by a numerical identifier []. Score these documents based on their Usefulness and Relevance to the query.
Query:
{QUERY}
Documents:
{PASSAGES}
## Final Output Format
You must structure your response in exactly two parts: provide your brief reasoning process first, then output final scores in JSON format like below, with document IDs as string keys and integer scores as values for all {TOPK} documents.
The reasoning process and answer are enclosed within <reason> </reason> and <answer> </answer> tags, respectively. Do NOT output anything outside the specified tags. Follow this exact format:
<reason>
[Analyze each document's usefulness and relevance to the query, explaining your scoring rationale]
</reason>
<answer>
\```json
{{"[1]": 5, "[2]": 3, "[3]": 8}}
\```
</answer>
'''
class GroupReranker:
    def __init__(self, model_path, sys_prompt, user_prompt) -> None:
        # vLLM offline inference, sharded across all visible GPUs
        self.llm = LLM(
            model=model_path,
            dtype="bfloat16",
            gpu_memory_utilization=0.9,
            tensor_parallel_size=torch.cuda.device_count(),
            max_model_len=32000,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.sampling_params = SamplingParams(temperature=0.3, top_p=0.8, max_tokens=8000, logprobs=10)
        self.group_system_prompt = sys_prompt
        self.group_user_prompt = user_prompt

    def rerank(self, query, doc_list):
        # Number the documents "[1]. ...", "[2]. ..." to match the IDs the prompt asks for
        docs_str = ''.join(["[{}]. {}\n\n".format(idx+1, doc_text) for idx, doc_text in enumerate(doc_list)])
        group_texts = self.group_user_prompt.format(QUERY=query, PASSAGES=docs_str, TOPK=len(doc_list))
        message = self.tokenizer.apply_chat_template(
            [{'role': 'system', 'content': self.group_system_prompt},
             {'role': 'user', 'content': group_texts}],
            tokenize=False, add_generation_prompt=True)
        # Returns vLLM's raw output; the scores are inside the generated <answer> block
        output = self.llm.generate(message, self.sampling_params, use_tqdm=True)
        return output
```
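`rerank` returns the raw vLLM output, so the scores still have to be extracted from the generated text (available as `output[0].outputs[0].text`). The following is a minimal sketch of one way to do that, assuming the model follows the `<answer>` JSON format requested in the prompt. `parse_group_scores` and its fallback score of 0 are illustrative choices, not part of the released code.

```python
import json
import re
from typing import List

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_group_scores(text: str, num_docs: int, default: int = 0) -> List[int]:
    """Return one score per document; position i holds the score for "[i+1]"."""
    scores = [default] * num_docs
    match = ANSWER_RE.search(text)
    if match is None:
        return scores  # malformed generation: fall back to the default score
    body = match.group(1)
    # Keep only the outermost {...} span, which skips the optional json fence.
    start, end = body.find("{"), body.rfind("}")
    if start == -1 or end <= start:
        return scores
    try:
        raw = json.loads(body[start:end + 1])
    except json.JSONDecodeError:
        return scores
    for key, value in raw.items():
        digits = re.sub(r"\D", "", key)  # "[3]" -> "3"
        if digits and isinstance(value, (int, float)):
            idx = int(digits) - 1
            if 0 <= idx < num_docs:
                scores[idx] = int(value)
    return scores


fence = "`" * 3  # built at runtime to avoid a literal fence inside this example
sample = (
    "<reason>Document 3 answers the query directly.</reason>\n"
    "<answer>\n" + fence + "json\n"
    + '{"[1]": 5, "[2]": 3, "[3]": 8}'
    + "\n" + fence + "\n</answer>"
)
scores = parse_group_scores(sample, num_docs=3)
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, order)
```

Falling back to a default score keeps the pipeline running when a generation is truncated or malformed; depending on the application, retrying the request may be preferable.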
🌹 If you use this model, please ✨star our <a href="https://github.com/AQ-MedAI/Diver" target="_blank">GitHub repository</a> to support us.
Your star means a lot!