---
license: apache-2.0
datasets:
- xlangai/BRIGHT
- liuwenhan/reasonrank_data_13k
language:
- en
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-ranking
library_name: transformers
tags:
- passage ranking
- text-ranking
- reasoning
- Information-Retrieval
---

## Introduction

This is the model trained in our paper: *GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning* ([📝arXiv](https://arxiv.org/pdf/2511.11653)).
Please refer to our [🧩GitHub repository](https://github.com/AQ-MedAI/Diver) for usage instructions for GroupRank-32B.

## Highlights

GroupRank is a reinforcement-learning-trained reranker that breaks the traditional **“pointwise vs. listwise”** trade-off: it feeds the query together with a small group of candidates to an LLM, lets the model perform within-group comparisons, and outputs an individual relevance score for each candidate, combining the flexibility of pointwise scoring with the contextual awareness of listwise ranking (see the sketch below).
Training is driven by GRPO with a heterogeneous reward (NDCG plus distributional alignment) that keeps scores consistent across groups, while a synthetic data pipeline removes the need for large human-labeled training sets.
On the reasoning-heavy BRIGHT and R2MED benchmarks, GroupRank (7B / 32B) sets new SOTA **NDCG@10 (46.8 / 52.3)**, with strong generalization to classic retrieval tasks.
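
For intuition, here is a minimal sketch of the groupwise scoring loop. This is our illustration rather than the paper's exact implementation; `score_group` is a stand-in for one LLM call that scores a whole group of candidates jointly:

```python
# Illustrative sketch of groupwise reranking. `score_group(query, group)` is a
# caller-supplied function representing a single LLM call that sees the whole
# group at once and returns one relevance score per document.
def groupwise_rerank(query, docs, score_group, group_size=10):
    scores = {}
    for start in range(0, len(docs), group_size):
        group = docs[start:start + group_size]
        # One joint LLM call per group: within-group comparison, per-doc scores.
        for offset, s in enumerate(score_group(query, group)):
            scores[start + offset] = s
    # Sort document indices by score, descending (higher = more relevant).
    return sorted(range(len(docs)), key=lambda i: -scores[i])
```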

## Model Performance

**BRIGHT Benchmark Results (NDCG@10)**

| Models | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Leet. | Pony | AoPS | TheoQ. | TheoT. | Avg. |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Diver-Retriever-4B** | 52.52 | 53.59 | 33.75 | 45.16 | 28.38 | 30.39 | 35.00 | 12.97 | 14.74 | 9.83 | 42.51 | 36.32 | 32.93 |
| **_Non-reasoning reranker_** | | | | | | | | | | | | | |
| rankT5 (3B) | 33.01 | 22.80 | 18.96 | 8.61 | 2.28 | 10.07 | 23.94 | 11.97 | 34.93 | 8.96 | 19.69 | 11.70 | 17.24 |
| RankZephyr (7B) | 42.60 | 19.51 | 18.79 | 29.85 | 13.98 | 13.33 | 29.29 | 12.97 | 31.37 | 7.44 | 31.19 | 31.67 | 23.50 |
| **_Reasoning reranker_** | | | | | | | | | | | | | |
| Rank-R1 (7B) | 40.85 | 31.38 | 23.17 | 32.04 | 19.82 | 11.23 | 35.53 | 3.77 | 6.60 | 4.99 | 17.20 | 30.10 | 21.39 |
| Rank-R1 (14B) | 49.57 | 41.15 | 27.56 | 40.04 | 28.46 | 28.19 | 43.76 | 6.96 | 18.45 | 7.88 | 34.80 | 43.11 | 30.83 |
| Rank-K (32B) | 51.06 | 42.34 | 32.95 | 44.52 | 33.07 | 28.44 | 41.57 | 12.82 | **21.62** | 8.48 | 39.31 | 43.15 | 33.28 |
| ReasonRank (7B)-w20s10 | 51.63 | 43.43 | 32.40 | 43.99 | 31.02 | 25.63 | 39.81 | 15.38 | 20.07 | 6.95 | 38.91 | 40.69 | 32.49 |
| ReasonRank (32B)-w20s10 | 53.89 | 47.59 | 36.34 | **52.64** | 36.48 | 34.17 | **44.47** | 15.21 | 14.81 | 5.47 | 40.63 | 45.29 | 35.58 |
| GroupRank-7B | 56.85 | 53.06 | 35.94 | 48.75 | 36.65 | 35.05 | 43.99 | 15.48 | 15.53 | 10.64 | 41.45 | 46.40 | 36.65 |
| GroupRank-32B | **59.48** | **56.49** | **40.12** | 50.46 | **38.36** | **39.16** | 43.32 | **19.48** | 16.32 | **13.34** | **46.39** | **47.91** | **39.24** |
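
For reference, the NDCG@10 values above follow the standard definition of the metric (stated here as a reminder, not as a method of the paper):

$$
\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{2^{rel_i} - 1}{\log_2(i+1)}, \qquad \mathrm{NDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}}
$$

where \\(rel_i\\) is the graded relevance of the document at rank \\(i\\) and IDCG@10 is the DCG@10 of the ideal ordering.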

### Inference

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import torch

sys_prompt = '''Your task is to evaluate and rank documents based on how well they help answer the given query. Follow this evaluation priority:
1. PRIMARY: Usefulness & Helpfulness - Does the document provide actionable information, solutions, or direct answers that help address the user's needs?
2. SECONDARY: Relevance - Does the document contain information related to the query topic?

Evaluation Process:
1. First, identify the user's core intent and what kind of help they need from the query
2. For each document, assess:
   - How directly it addresses the user's intent
   - What actionable information or answers it provides
   - How much it helps solve the user's problem or need
3. Compare documents against each other to ensure proper ranking
4. Assign scores that reflect the relative usefulness ranking

Scoring Scale (0-10):
- 9-10: Extremely helpful, directly answers the query with actionable information
- 7-8: Very helpful, provides substantial useful information for the query
- 5-6: Moderately helpful, contains some useful information but incomplete
- 3-4: Minimally helpful, limited useful information despite topic relevance
- 1-2: Barely helpful, mentions related topics but provides little useful information
- 0: Not helpful at all, cannot assist with answering the query
'''

user_prompt = '''I will provide you {TOPK} documents, each indicated by a numerical identifier []. Score these documents based on their Usefulness and Relevance to the query.
Query:
{QUERY}

Documents:
{PASSAGES}

## Final Output Format
You must structure your response in exactly two parts: provide your brief reasoning process first, then output final scores in JSON format like below, with document IDs as string keys and integer scores as values for all {TOPK} documents.
The reasoning process and answer are enclosed within <reason> </reason> and <answer> </answer> tags, respectively. Do NOT output anything outside the specified tags. Follow this exact format:
<reason>
[Analyze each document's usefulness and relevance to the query, explaining your scoring rationale]
</reason>
<answer>
\```json
{{"[1]": 5, "[2]": 3, "[3]": 8}}
\```
</answer>
'''

class GroupReranker:
    def __init__(self, model_path, sys_prompt, user_prompt) -> None:
        # Offline vLLM inference, sharded across all visible GPUs.
        self.llm = LLM(
            model=model_path,
            dtype="bfloat16",
            gpu_memory_utilization=0.9,
            tensor_parallel_size=torch.cuda.device_count(),
            max_model_len=32000,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.sampling_params = SamplingParams(temperature=0.3, top_p=0.8, max_tokens=8000, logprobs=10)

        self.group_system_prompt = sys_prompt
        self.group_user_prompt = user_prompt

    def rerank(self, query, doc_list):
        # Number each document with the "[i]" identifier referenced in the prompt.
        docs_str = ''.join("[{}]. {}\n\n".format(idx + 1, doc_text) for idx, doc_text in enumerate(doc_list))

        group_texts = self.group_user_prompt.format(QUERY=query, PASSAGES=docs_str, TOPK=len(doc_list))

        # Build the chat-formatted prompt string for vLLM.
        message = self.tokenizer.apply_chat_template(
            [{'role': 'system', 'content': self.group_system_prompt},
             {'role': 'user', 'content': group_texts}],
            tokenize=False, add_generation_prompt=True)

        # Returns a list of vLLM RequestOutput objects; the generated text
        # (reasoning plus the <answer> JSON scores) is in output[0].outputs[0].text.
        output = self.llm.generate(message, self.sampling_params, use_tqdm=True)

        return output
```
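
`rerank` returns raw vLLM `RequestOutput` objects, so the `<answer>` JSON still needs to be parsed out. Below is a minimal usage sketch; the model path is a placeholder, and the regex-based parsing is our own illustration rather than the paper's exact post-processing:

```python
import json
import re

# Placeholder checkpoint path; point this at the downloaded GroupRank weights.
reranker = GroupReranker("path/to/GroupRank-32B", sys_prompt, user_prompt)

query = "How do plants defend themselves against herbivores?"
docs = [
    "Plants synthesize secondary metabolites such as alkaloids that deter feeding.",
    "The stock market closed higher today after strong earnings reports.",
    "Thorns, spines, and trichomes are common physical defenses in plants.",
]

output = reranker.rerank(query, docs)
text = output[0].outputs[0].text

# Pull the flat JSON score object out of the <answer> block and sort by score.
match = re.search(r"<answer>.*?(\{.*?\}).*?</answer>", text, re.DOTALL)
if match:
    scores = json.loads(match.group(1))  # e.g. {"[1]": 8, "[2]": 0, "[3]": 7}
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(ranking)  # document identifiers, most relevant first
```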
🌹 If you use this model, please ✨star our <a href="https://github.com/AQ-MedAI/Diver" target="_blank">GitHub repository</a> to support us.
Your star means a lot!