---
license: apache-2.0
datasets:
- xlangai/BRIGHT
- liuwenhan/reasonrank_data_13k
language:
- en
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-ranking
library_name: transformers
tags:
- passage ranking
- text-ranking
- reasoning
- Information-Retrieval
---


## Introduction

This is the model trained in our paper: GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning ([📝arXiv](https://arxiv.org/pdf/2511.11653)).  
Please refer to our [🧩GitHub repository](https://github.com/AQ-MedAI/Diver) for the usage of GroupRank-32B.


## Highlights

GroupRank is a reinforcement-learning-powered reranker that breaks the traditional **“pointwise vs. listwise”** trade-off: it feeds the query together with a small group of candidates to an LLM, lets the model perform within-group comparisons, and outputs an individual relevance score for each candidate. This combines the flexibility of pointwise scoring with the contextual awareness of listwise ranking.  
Training is driven by GRPO and a heterogeneous reward (NDCG + distributional alignment) that keeps scores consistent across groups, while a synthetic data pipeline eliminates the need for large human-labeled sets.  
On the reasoning-heavy BRIGHT and R2MED benchmarks, GroupRank (7B / 32B) sets new SOTA **NDCG@10 (46.8 / 52.3)**, with strong generalization to classic retrieval tasks.
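
For readers unfamiliar with the metric, the snippet below is a minimal sketch of how NDCG@k can be computed from predicted scores and ground-truth relevance labels. It assumes the linear-gain formulation (`rel / log2(rank + 1)`); the function names are illustrative, and the exact reward used during training is defined in the paper, not here.

```python
import math


def dcg_at_k(relevances, k):
    # Discounted cumulative gain over the top-k items (linear gain, log2 discount).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(predicted_scores, true_relevances, k=10):
    # Order documents by predicted score, highest first.
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_relevances[i] for i in order]
    # The ideal ranking sorts documents by their true relevance.
    ideal_dcg = dcg_at_k(sorted(true_relevances, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that places every relevant document first scores 1.0, and a query with no relevant documents is scored 0.0 by convention in this sketch.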

## Model Performance


**BRIGHT benchmark results (NDCG@10)**

| Models | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Leet. | Pony | AoPS | TheoQ. | TheoT. | Avg. |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Diver-Retriever-4B** | 52.52 | 53.59 | 33.75 | 45.16 | 28.38 | 30.39 | 35.00 | 12.97 | 14.74 | 9.83 | 42.51 | 36.32 | 32.93 |
| **_Non-reasoning reranker_** | | | | | | | | | | | | | |
| rankT5 (3B) | 33.01 | 22.80 | 18.96 | 8.61 | 2.28 | 10.07 | 23.94 | 11.97 | 34.93 | 8.96 | 19.69 | 11.70 | 17.24 |
| RankZephyr (7B) | 42.60 | 19.51 | 18.79 | 29.85 | 13.98 | 13.33 | 29.29 | 12.97 | 31.37 | 7.44 | 31.19 | 31.67 | 23.50 |
| **_Reasoning reranker_** | | | | | | | | | | | | | |
| Rank-R1 (7B) | 40.85 | 31.38 | 23.17 | 32.04 | 19.82 | 11.23 | 35.53 | 3.77 | 6.60 | 4.99 | 17.20 | 30.10 | 21.39 |
| Rank-R1 (14B) | 49.57 | 41.15 | 27.56 | 40.04 | 28.46 | 28.19 | 43.76 | 6.96 | 18.45 | 7.88 | 34.80 | 43.11 | 30.83 |
| Rank-K (32B) | 51.06 | 42.34 | 32.95 | 44.52 | 33.07 | 28.44 | 41.57 | 12.82 | **21.62** | 8.48 | 39.31 | 43.15 | 33.28 |
| ReasonRank (7B)-w20s10 | 51.63 | 43.43 | 32.40 | 43.99 | 31.02 | 25.63 | 39.81 | 15.38 | 20.07 | 6.95 | 38.91 | 40.69 | 32.49 |
| ReasonRank (32B)-w20s10 | 53.89 | 47.59 | 36.34 | **52.64** | 36.48 | 34.17 | **44.47** | 15.21 | 14.81 | 5.47 | 40.63 | 45.29 | 35.58 |
| GroupRank-7B | 56.85 | 53.06 | 35.94 | 48.75 | 36.65 | 35.05 | 43.99 | 15.48 | 15.53 | 10.64 | 41.45 | 46.40 | 36.65 |
| GroupRank-32B | **59.48** | **56.49** | **40.12** | 50.46 | **38.36** | **39.16** | 43.32 | **19.48** | 16.32 | **13.34** | **46.39** | **47.91** | **39.24** |


### Inference

```python
from vllm import LLM, SamplingParams
from collections import defaultdict
from transformers import AutoTokenizer
import random
import torch

random.seed(666)

sys_prompt = '''Your task is to evaluate and rank documents based on how well they help answer the given query. Follow this evaluation priority:
1. PRIMARY: Usefulness & Helpfulness - Does the document provide actionable information, solutions, or direct answers that help address the user's needs?
2. SECONDARY: Relevance - Does the document contain information related to the query topic?

Evaluation Process:
1. First, identify the user's core intent and what kind of help they need from the query
2. For each document, assess:
   - How directly it addresses the user's intent
   - What actionable information or answers it provides
   - How much it helps solve the user's problem or need
3. Compare documents against each other to ensure proper ranking
4. Assign scores that reflect the relative usefulness ranking

Scoring Scale (0-10):
- 9-10: Extremely helpful, directly answers the query with actionable information
- 7-8: Very helpful, provides substantial useful information for the query
- 5-6: Moderately helpful, contains some useful information but incomplete
- 3-4: Minimally helpful, limited useful information despite topic relevance
- 1-2: Barely helpful, mentions related topics but provides little useful information
- 0: Not helpful at all, cannot assist with answering the query
'''

user_prompt = '''I will provide you {TOPK} documents, each indicated by a numerical identifier []. Score these documents based on their Usefulness and Relevance to the query.
Query:
{QUERY}

Documents:
{PASSAGES}

## Final Output Format
You must structure your response in exactly two parts: provide your brief reasoning process first, then output final scores in JSON format like below, with document IDs as string keys and integer scores as values for all {TOPK} documents. 
The reasoning process and answer are enclosed within <reason> </reason> and <answer> </answer> tags, respectively. Do NOT output anything outside the specified tags. Follow this exact format:
<reason> 
[Analyze each document's usefulness and relevance to the query, explaining your scoring rationale]
</reason>
<answer> 
\```json
{{"[1]": 5, "[2]": 3, "[3]": 8}}
\``` 
</answer>
'''

class GroupReranker:
    """Scores a group of candidate documents for a query in a single LLM call."""

    def __init__(self, model_path, sys_prompt, user_prompt) -> None:
        # vLLM offline inference, sharded across all visible GPUs
        self.llm = LLM(
            model=model_path,
            dtype="bfloat16",
            gpu_memory_utilization=0.9,
            tensor_parallel_size=torch.cuda.device_count(),
            max_model_len=32000,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.sampling_params = SamplingParams(temperature=0.3, top_p=0.8, max_tokens=8000, logprobs=10)

        self.group_system_prompt = sys_prompt
        self.group_user_prompt = user_prompt

    def rerank(self, query, doc_list):
        # Number the documents as "[1]. text", "[2]. text", ... to match the IDs used in the prompt
        docs_str = ''.join(["[{}]. {}\n\n".format(idx+1, doc_text) for idx, doc_text in enumerate(doc_list)])

        group_texts = self.group_user_prompt.format(QUERY=query, PASSAGES=docs_str, TOPK=len(doc_list))

        # Render the system and user turns with the model's chat template
        message = self.tokenizer.apply_chat_template(
            [{'role': 'system', 'content': self.group_system_prompt},
             {'role': 'user', 'content': group_texts}],
            tokenize=False, add_generation_prompt=True)

        # Returns the raw vLLM outputs; the scores still have to be parsed from the <answer> block
        output = self.llm.generate(message, self.sampling_params, use_tqdm=True)

        return output

```
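
`rerank` returns the raw vLLM output, so the scores still need to be extracted from the generated text. The following is a minimal sketch of one way to do that, assuming the model follows the `<answer>` format requested in the prompt. `parse_group_scores` and `rank_by_score` are hypothetical helper names, not part of the released code, and the fallback of scoring unparseable responses as 0 is an assumption.

```python
import json
import re


def parse_group_scores(response_text, num_docs):
    """Return one score per document, read from the JSON inside <answer>...</answer>."""
    fallback = [0] * num_docs  # assumption: treat unparseable output as "not helpful"
    answer = re.search(r"<answer>(.*?)</answer>", response_text, re.DOTALL)
    if answer is None:
        return fallback
    # Grab the JSON object, ignoring any code-fence markers around it.
    obj = re.search(r"\{.*\}", answer.group(1), re.DOTALL)
    if obj is None:
        return fallback
    try:
        raw = json.loads(obj.group(0))
    except json.JSONDecodeError:
        return fallback
    scores = []
    for i in range(1, num_docs + 1):
        value = raw.get(f"[{i}]", 0)  # keys look like "[1]", "[2]", ...
        scores.append(value if isinstance(value, (int, float)) else 0)
    return scores


def rank_by_score(doc_list, scores):
    """Sort documents by score, highest first; ties keep their original order."""
    order = sorted(range(len(doc_list)), key=lambda i: scores[i], reverse=True)
    return [doc_list[i] for i in order]
```

With vLLM, the generated text of the first request is available as `output[0].outputs[0].text`, so a typical call would be `scores = parse_group_scores(output[0].outputs[0].text, len(doc_list))` followed by `rank_by_score(doc_list, scores)`.
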
🌹 If you use this model, please ✨star our <a href="https://github.com/AQ-MedAI/Diver" target="_blank">GitHub repository</a> to support us. 
Your star means a lot!