---
license: apache-2.0
datasets:
- xlangai/BRIGHT
- liuwenhan/reasonrank_data_13k
language:
- en
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-ranking
library_name: transformers
tags:
- passage ranking
- text-ranking
- reasoning
- Information-Retrieval
---

## Introduction

This is the model trained in our paper: *GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning* ([📝arXiv](https://arxiv.org/pdf/2511.11653)).
Please refer to our [🧩GitHub repository](https://github.com/AQ-MedAI/Diver) for usage instructions for GroupRank-32B.

## Highlights

GroupRank is a reinforcement-learning-trained reranker that breaks the traditional **“pointwise vs. listwise”** trade-off: it feeds the query together with a small group of candidates to an LLM, lets the model perform within-group comparisons, and outputs an individual relevance score for each candidate, combining the flexibility of pointwise scoring with the contextual awareness of listwise ranking (see the sketch below).
Training is driven by GRPO with a heterogeneous reward (NDCG plus distributional alignment) that keeps scores consistent across groups, while a synthetic data pipeline removes the need for large human-labeled training sets.
On the reasoning-heavy BRIGHT and R2MED benchmarks, GroupRank (7B / 32B) sets new SOTA **NDCG@10 (46.8 / 52.3)**, with strong generalization to classic retrieval tasks.
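
For intuition, here is a minimal sketch of the groupwise scoring loop. This is our illustration rather than the paper's exact implementation; `score_group` is a stand-in for one LLM call that scores a whole group of candidates jointly:

```python
# Illustrative sketch of groupwise reranking. `score_group(query, group)` is a
# caller-supplied function representing a single LLM call that sees the whole
# group at once and returns one relevance score per document.
def groupwise_rerank(query, docs, score_group, group_size=10):
    scores = {}
    for start in range(0, len(docs), group_size):
        group = docs[start:start + group_size]
        # One joint LLM call per group: within-group comparison, per-doc scores.
        for offset, s in enumerate(score_group(query, group)):
            scores[start + offset] = s
    # Sort document indices by score, descending (higher = more relevant).
    return sorted(range(len(docs)), key=lambda i: -scores[i])
```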

## Model Performance

**BRIGHT Benchmark Results (NDCG@10)**

| Models | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Leet. | Pony | AoPS | TheoQ. | TheoT. | Avg. |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Diver-Retriever-4B** | 52.52 | 53.59 | 33.75 | 45.16 | 28.38 | 30.39 | 35.00 | 12.97 | 14.74 | 9.83 | 42.51 | 36.32 | 32.93 |
| **_Non-reasoning reranker_** | | | | | | | | | | | | | |
| rankT5 (3B) | 33.01 | 22.80 | 18.96 | 8.61 | 2.28 | 10.07 | 23.94 | 11.97 | 34.93 | 8.96 | 19.69 | 11.70 | 17.24 |
| RankZephyr (7B) | 42.60 | 19.51 | 18.79 | 29.85 | 13.98 | 13.33 | 29.29 | 12.97 | 31.37 | 7.44 | 31.19 | 31.67 | 23.50 |
| **_Reasoning reranker_** | | | | | | | | | | | | | |
| Rank-R1 (7B) | 40.85 | 31.38 | 23.17 | 32.04 | 19.82 | 11.23 | 35.53 | 3.77 | 6.60 | 4.99 | 17.20 | 30.10 | 21.39 |
| Rank-R1 (14B) | 49.57 | 41.15 | 27.56 | 40.04 | 28.46 | 28.19 | 43.76 | 6.96 | 18.45 | 7.88 | 34.80 | 43.11 | 30.83 |
| Rank-K (32B) | 51.06 | 42.34 | 32.95 | 44.52 | 33.07 | 28.44 | 41.57 | 12.82 | **21.62** | 8.48 | 39.31 | 43.15 | 33.28 |
| ReasonRank (7B)-w20s10 | 51.63 | 43.43 | 32.40 | 43.99 | 31.02 | 25.63 | 39.81 | 15.38 | 20.07 | 6.95 | 38.91 | 40.69 | 32.49 |
| ReasonRank (32B)-w20s10 | 53.89 | 47.59 | 36.34 | **52.64** | 36.48 | 34.17 | **44.47** | 15.21 | 14.81 | 5.47 | 40.63 | 45.29 | 35.58 |
| GroupRank-7B | 56.85 | 53.06 | 35.94 | 48.75 | 36.65 | 35.05 | 43.99 | 15.48 | 15.53 | 10.64 | 41.45 | 46.40 | 36.65 |
| GroupRank-32B | **59.48** | **56.49** | **40.12** | 50.46 | **38.36** | **39.16** | 43.32 | **19.48** | 16.32 | **13.34** | **46.39** | **47.91** | **39.24** |
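
For reference, the NDCG@10 values above follow the standard definition of the metric (stated here as a reminder, not as a method of the paper):

$$
\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{2^{rel_i} - 1}{\log_2(i+1)}, \qquad \mathrm{NDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}}
$$

where \\(rel_i\\) is the graded relevance of the document at rank \\(i\\) and IDCG@10 is the DCG@10 of the ideal ordering.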

### Inference

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import torch

sys_prompt = '''Your task is to evaluate and rank documents based on how well they help answer the given query. Follow this evaluation priority:
1. PRIMARY: Usefulness & Helpfulness - Does the document provide actionable information, solutions, or direct answers that help address the user's needs?
2. SECONDARY: Relevance - Does the document contain information related to the query topic?

Evaluation Process:
1. First, identify the user's core intent and what kind of help they need from the query
2. For each document, assess:
   - How directly it addresses the user's intent
   - What actionable information or answers it provides
   - How much it helps solve the user's problem or need
3. Compare documents against each other to ensure proper ranking
4. Assign scores that reflect the relative usefulness ranking

Scoring Scale (0-10):
- 9-10: Extremely helpful, directly answers the query with actionable information
- 7-8: Very helpful, provides substantial useful information for the query
- 5-6: Moderately helpful, contains some useful information but incomplete
- 3-4: Minimally helpful, limited useful information despite topic relevance
- 1-2: Barely helpful, mentions related topics but provides little useful information
- 0: Not helpful at all, cannot assist with answering the query
'''

user_prompt = '''I will provide you {TOPK} documents, each indicated by a numerical identifier []. Score these documents based on their Usefulness and Relevance to the query.
Query:
{QUERY}

Documents:
{PASSAGES}

## Final Output Format
You must structure your response in exactly two parts: provide your brief reasoning process first, then output final scores in JSON format like below, with document IDs as string keys and integer scores as values for all {TOPK} documents.
The reasoning process and answer are enclosed within <reason> </reason> and <answer> </answer> tags, respectively. Do NOT output anything outside the specified tags. Follow this exact format:
<reason>
[Analyze each document's usefulness and relevance to the query, explaining your scoring rationale]
</reason>
<answer>
\```json
{{"[1]": 5, "[2]": 3, "[3]": 8}}
\```
</answer>
'''

class GroupReranker:
    def __init__(self, model_path, sys_prompt, user_prompt) -> None:
        # Offline vLLM inference, sharded across all visible GPUs.
        self.llm = LLM(
            model=model_path,
            dtype="bfloat16",
            gpu_memory_utilization=0.9,
            tensor_parallel_size=torch.cuda.device_count(),
            max_model_len=32000,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.sampling_params = SamplingParams(temperature=0.3, top_p=0.8, max_tokens=8000, logprobs=10)

        self.group_system_prompt = sys_prompt
        self.group_user_prompt = user_prompt

    def rerank(self, query, doc_list):
        # Number each document with the "[i]" identifier referenced in the prompt.
        docs_str = ''.join("[{}]. {}\n\n".format(idx + 1, doc_text) for idx, doc_text in enumerate(doc_list))

        group_texts = self.group_user_prompt.format(QUERY=query, PASSAGES=docs_str, TOPK=len(doc_list))

        # Build the chat-formatted prompt string for vLLM.
        message = self.tokenizer.apply_chat_template(
            [{'role': 'system', 'content': self.group_system_prompt},
             {'role': 'user', 'content': group_texts}],
            tokenize=False, add_generation_prompt=True)

        # Returns a list of vLLM RequestOutput objects; the generated text
        # (reasoning plus the <answer> JSON scores) is in output[0].outputs[0].text.
        output = self.llm.generate(message, self.sampling_params, use_tqdm=True)

        return output
```
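
`rerank` returns raw vLLM `RequestOutput` objects, so the `<answer>` JSON still needs to be parsed out. Below is a minimal usage sketch; the model path is a placeholder, and the regex-based parsing is our own illustration rather than the paper's exact post-processing:

```python
import json
import re

# Placeholder checkpoint path; point this at the downloaded GroupRank weights.
reranker = GroupReranker("path/to/GroupRank-32B", sys_prompt, user_prompt)

query = "How do plants defend themselves against herbivores?"
docs = [
    "Plants synthesize secondary metabolites such as alkaloids that deter feeding.",
    "The stock market closed higher today after strong earnings reports.",
    "Thorns, spines, and trichomes are common physical defenses in plants.",
]

output = reranker.rerank(query, docs)
text = output[0].outputs[0].text

# Pull the flat JSON score object out of the <answer> block and sort by score.
match = re.search(r"<answer>.*?(\{.*?\}).*?</answer>", text, re.DOTALL)
if match:
    scores = json.loads(match.group(1))  # e.g. {"[1]": 8, "[2]": 0, "[3]": 7}
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(ranking)  # document identifiers, most relevant first
```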
🌹 If you use this model, please ✨star our <a href="https://github.com/AQ-MedAI/Diver" target="_blank">GitHub repository</a> to support us.
Your star means a lot!