---
license: apache-2.0
datasets:
- hotpotqa/hotpot_qa
- dgslibisey/MuSiQue
- Aman279/Locomo
- Phospheneser/DetectiveQA
language:
- en
- zh
metrics:
- accuracy
- exact_match
- f1
- recall
base_model:
- Qwen/Qwen3-4B-Instruct-2507
pipeline_tag: text-ranking
tags:
- Rerank
- Memory
---
# QRRanker: Query-focused and Memory-aware Reranker for Long Context Processing

<p align="center">
  <a href="https://qdcassie-li.github.io/QRRanker/"><b>🌐 Project Page</b></a> |
  <a href="https://arxiv.org/abs/2602.12192"><b>📄 Paper</b></a> |
  <a href="https://huggingface.co/MindscapeRAG/QRRanker"><b>🤗 Models</b></a>
</p>

QRRanker is a lightweight reranking framework that leverages **Query-focused Retrieval (QR) heads** to produce continuous relevance scores, enabling effective listwise reranking with small-scale models.


## Model Description

Building on existing analyses of retrieval heads in large language models, QRRanker trains the model to estimate passage–query relevance from the attention scores of selected **Query-focused Retrieval (QR) heads**. These heads are identified by computing QR scores on seed data and are particularly effective at capturing query–document relevance signals.

Our approach provides a **listwise solution** that leverages the holistic information within the entire candidate shortlist during ranking. It naturally produces **continuous relevance scores**, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision.
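
Concretely, in a schematic form consistent with the score computation shown later in this card, the relevance score of a candidate chunk $c$ is the attention mass that the selected QR heads $\mathcal{H}$ place on its tokens, averaged over the query token positions $Q$:

$$
s(c) \;=\; \sum_{(l,h)\in\mathcal{H}} \frac{1}{|Q|} \sum_{q\in Q} \sum_{t\in c} A^{(l,h)}_{q,t},
$$

where $A^{(l,h)}$ denotes the softmax attention matrix of head $h$ in layer $l$.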

### Key Features

- **Listwise Reranking**: Leverages holistic information within the entire candidate shortlist during ranking
- **Continuous Relevance Scores**: Enables training on arbitrary retrieval datasets without requiring Likert-scale supervision
- **Selective Head Usage**: Focuses on top-performing QR attention heads
- **Memory Enhancement**: Optional contextual summaries for improved accuracy on long narratives and dialogues

## Quick Start


### Basic Usage

```python
import torch
from transformers import AutoModel, AutoConfig, AutoTokenizer

# Load model
config = AutoConfig.from_pretrained("MindscapeRAG/QRRanker", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "MindscapeRAG/QRRanker",
    config=config,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("MindscapeRAG/QRRanker", trust_remote_code=True)
```
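
If the remote code exposes the QRRanker-specific fields described under "Model Configuration" below, they can be inspected on the loaded config. Treat this as an assumption about the remote code rather than a guaranteed API:

```python
# Inspect QRRanker-specific config fields, if present (an assumption:
# qr_head_list / qr_start_layer / qr_end_layer are described in the
# "Model Configuration" section of this card).
print(getattr(config, "qr_head_list", None))
print(getattr(config, "qr_start_layer", None), getattr(config, "qr_end_layer", None))
```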

## Input Data Format

Input data should be in JSON format. Each sample contains the following fields:

```json
{
    "id": "sample_001",
    "question": "What is the capital of France?",
    "answer": "Paris",
    "paragraphs": [
        {
            "idx": 0,
            "title": "France",
            "paragraph_text": "Paris is the capital and largest city of France...",
            "is_supporting": true
        },
        {
            "idx": 1,
            "title": "Germany",
            "paragraph_text": "Berlin is the capital of Germany...",
            "is_supporting": false
        }
    ],
    "summary": "Optional summary text..."
}
```

### Field Description

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `id` | string | Yes | Unique sample identifier |
| `question` | string | Yes | User query/question |
| `answer` | string | No | Ground truth answer (for evaluation) |
| `paragraphs` | list | Yes | List of candidate paragraphs |
| `paragraphs[].idx` | int | Yes | Paragraph index |
| `paragraphs[].title` | string | No | Paragraph title |
| `paragraphs[].paragraph_text` | string | Yes | Paragraph content |
| `paragraphs[].is_supporting` | bool | No | Whether it's a supporting paragraph (for evaluation) |
| `summary` | string | No | Optional summary information |
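
As a minimal sketch (assuming a local `data.json` in the format above), loading a sample and pulling out the fields the reranker needs looks like:

```python
import json

# Load samples in the JSON format described above (hypothetical path).
with open("data.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

sample = samples[0]
question = sample["question"]
paragraphs = sample["paragraphs"]   # each has 'idx' and 'paragraph_text'
summary = sample.get("summary")     # optional memory text, may be absent
```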

## Core Algorithm

### 0. DynamicCacheWithQuery (Custom Cache Class)

This custom cache class is essential for QRRanker. It extends the standard `DynamicCache` to also store query states at specified positions.


```python
from typing import Any, Dict, Optional, Tuple
from transformers.cache_utils import DynamicCache
import torch


class DynamicCacheWithQuery(DynamicCache):
    """
    Custom cache class for QRRanker that stores both key/value states and query states.
    The query states are extracted at specified token positions for attention computation.
    """

    def __init__(self, query_indices=None) -> None:
        super().__init__()
        # Token indices where query states should be saved
        # (use None instead of a mutable default argument)
        self._query_indices = query_indices if query_indices is not None else []
        self.query_cache = []

    def update(
            self,
            key_states: torch.Tensor,
            value_states: torch.Tensor,
            layer_idx: int,
            cache_kwargs: Optional[Dict[str, Any]] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Updates the cache with new key_states, value_states, and optionally query_states.

        Parameters:
            key_states: New key states to cache [batch, num_kv_heads, seq_len, head_dim]
            value_states: New value states to cache [batch, num_kv_heads, seq_len, head_dim]
            layer_idx: Index of the layer
            cache_kwargs: Optional dict containing 'query_states' to cache

        Returns:
            Tuple of (updated_key_states, updated_value_states)
        """
        # Update seen tokens count
        if layer_idx == 0:
            self._seen_tokens += key_states.shape[-2]

        # Update key/value cache
        if key_states is not None:
            if len(self.key_cache) <= layer_idx:
                for _ in range(len(self.key_cache), layer_idx):
                    self.key_cache.append(torch.tensor([]))
                    self.value_cache.append(torch.tensor([]))
                self.key_cache.append(key_states)
                self.value_cache.append(value_states)
            elif not self.key_cache[layer_idx].numel():
                self.key_cache[layer_idx] = key_states
                self.value_cache[layer_idx] = value_states
            else:
                self.key_cache[layer_idx] = torch.cat(
                    [self.key_cache[layer_idx], key_states], dim=-2
                )
                self.value_cache[layer_idx] = torch.cat(
                    [self.value_cache[layer_idx], value_states], dim=-2
                )

        # Update query cache if query_states provided
        if cache_kwargs is not None:
            query_states = cache_kwargs.get("query_states", None)
        else:
            query_states = None
            
        if query_states is not None:
            if len(self.query_cache) <= layer_idx:
                self.query_cache.append(query_states)
            else:
                self.query_cache[layer_idx] = torch.cat(
                    [self.query_cache[layer_idx], query_states], dim=-2
                )
                
        return self.key_cache[layer_idx], self.value_cache[layer_idx]
```
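
A minimal sketch of how the cache is exercised, assuming the model's (patched) attention layers pass the query states at the saved positions through `cache_kwargs`:

```python
# Toy shapes: batch=1, 8 KV heads, 16 cached tokens, head_dim=64,
# 32 query heads, 3 saved query positions.
cache = DynamicCacheWithQuery(query_indices=[10, 11, 12])
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)
q = torch.randn(1, 32, 3, 64)  # query states at the saved positions

keys, values = cache.update(k, v, layer_idx=0, cache_kwargs={"query_states": q})
print(keys.shape, cache.query_cache[0].shape)
# torch.Size([1, 8, 16, 64]) torch.Size([1, 32, 3, 64])
```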

### 1. Attention Weight Computation

```python
import math
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand key/value states to match the number of query heads."""
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_key_value_heads, n_rep, slen, head_dim
    )
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)


def get_causal_mask(attn_weights):
    """Generate causal attention mask."""
    query_len, seq_len = attn_weights.size(-2), attn_weights.size(-1)
    causal_mask = torch.ones_like(attn_weights.transpose(-1, -2).squeeze(0))
    causal_mask = torch.triu(causal_mask, diagonal=-(seq_len - query_len))
    causal_mask = causal_mask.transpose(-1, -2)
    causal_mask = (1 - causal_mask) * torch.finfo(causal_mask.dtype).min
    return causal_mask


def get_attn_weights(key_states, query_states):
    """Compute attention weights between query and key states."""
    bsz, num_heads, q_len, head_dim = query_states.size()
    num_key_value_heads = key_states.size(1)
    num_key_value_groups = num_heads // num_key_value_heads
    kv_seq_len = key_states.size(-2)

    # Expand key states to match query heads
    key_states = repeat_kv(key_states, num_key_value_groups)

    # Scaled dot-product attention
    scale = 1.0 / math.sqrt(head_dim)
    scaled_queries = query_states * scale
    attn_weights = torch.matmul(scaled_queries, key_states.transpose(2, 3))

    # Apply causal mask
    causal_mask = get_causal_mask(attn_weights).to(attn_weights.device)
    attn_weights += causal_mask.unsqueeze(0)
    
    # Softmax normalization
    attn_lses = torch.logsumexp(attn_weights, dim=-1, keepdim=True)
    attn_weights = torch.exp(attn_weights - attn_lses)
    
    return attn_weights
```
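
A quick shape and sanity check on random tensors (hypothetical sizes; 4 KV heads grouped under 8 query heads):

```python
key_states = torch.randn(1, 4, 16, 64)   # [batch, kv_heads, kv_seq_len, head_dim]
query_states = torch.randn(1, 8, 3, 64)  # [batch, heads, q_len, head_dim]

w = get_attn_weights(key_states, query_states)
print(w.shape)        # torch.Size([1, 8, 3, 16])
print(w.sum(dim=-1))  # every row sums to ~1 after the softmax
```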

### 2. QRRanker Score Computation

```python
def compute_qr_scores(
    query_cache,
    key_cache,
    qr_head_list,
    chunk_ranges,
    query_upper_bound,
):
    """
    Compute QRRanker attention scores for document chunks.
    
    Args:
        query_cache: List of query states from each layer
        key_cache: List of key states from each layer
        qr_head_list: String of QR heads, e.g., "20-15,21-11,17-27,..."
        chunk_ranges: List of [start, end] token positions for each chunk
        query_upper_bound: Upper bound token position for query
    
    Returns:
        scores: Tensor of shape [num_chunks] with relevance scores
    """
    all_head_scores = []
    
    for key_state, query_state in zip(key_cache, query_cache):
        # Compute attention weights
        attn_weights = get_attn_weights(
            key_state[:, :, :query_upper_bound, :], 
            query_state
        )
        # Average over query positions
        attn_weights = attn_weights.mean(dim=-2)
        
        # Aggregate scores for each chunk
        chunk_scores = []
        for start, end in chunk_ranges:
            chunk_scores.append(attn_weights[:, :, start:end].sum(dim=-1))
        chunk_scores = torch.stack(chunk_scores, dim=2)
        all_head_scores.append(chunk_scores)
    
    # Stack all layers: [batch, num_layers, num_heads, num_chunks]
    all_head_scores = torch.stack(all_head_scores, dim=1).float()
    
    # Select specific QR heads
    if qr_head_list is not None:
        head_set = [tuple(map(int, h.split('-'))) for h in qr_head_list.split(',')]
        indices = torch.tensor(head_set).to(all_head_scores.device)
        layers, heads = indices[:, 0], indices[:, 1]
        all_head_scores = all_head_scores[:, layers, heads, :]
    
    # Sum over selected heads
    scores = all_head_scores.sum(dim=1).squeeze(0)
    
    return scores
```

### 3. Complete Inference Pipeline

```python
# DynamicCacheWithQuery is the cache class from section 0; this import
# assumes it has been saved locally as custom_cache_new.py
from custom_cache_new import DynamicCacheWithQuery

def rerank_documents(model, tokenizer, question, paragraphs, qr_head_list, device):
    """
    Rerank documents based on QRRanker scores.
    
    Args:
        model: QRRanker model
        tokenizer: Tokenizer
        question: Query string
        paragraphs: List of paragraph dicts with 'idx' and 'paragraph_text'
        qr_head_list: QR head list string (e.g., "20-15,21-11,17-27,...")
        device: torch device
    
    Returns:
        ranked_ids: List of paragraph IDs sorted by relevance
        scores: Corresponding relevance scores
    """
    # Build input sequence
    prompt_prefix = '<|im_start|>user\n'
    retrieval_instruction = "Here are some retrieved chunks:\n\n"
    
    chunk_part = prompt_prefix + retrieval_instruction
    chunk_ranges = []
    
    for i, p in enumerate(paragraphs):
        text = p.get('title', '') + ': ' + p['paragraph_text']
        chunk_part += f"[{i+1}]"
        start = len(chunk_part)
        chunk_part += ' ' + text.strip()
        end = len(chunk_part)
        chunk_ranges.append([start, end])
        chunk_part += '\n\n'
    
    query_part = f"Use the retrieved chunks to answer the user's query.\n\nQuery: {question}"
    full_seq = chunk_part + query_part
    
    # Tokenize with offset mapping
    inputs = tokenizer(
        full_seq, 
        max_length=262144, 
        truncation=True, 
        return_tensors='pt',
        return_offsets_mapping=True,
        add_special_tokens=False
    )
    
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    offset_mapping = inputs['offset_mapping'][0]
    
    # Build character-to-token mapping
    char_to_token = {}
    for i, (s, e) in enumerate(offset_mapping):
        for j in range(s, e):
            char_to_token[j] = i
    
    # Map chunk character ranges to token ranges
    token_chunk_ranges = []
    for start, end in chunk_ranges:
        token_start = char_to_token.get(start, 0)
        token_end = char_to_token.get(end - 1, 0) + 1
        token_chunk_ranges.append([token_start, token_end])
    
    # Get query token positions (search after the chunk section so a verbatim
    # occurrence of the question inside a chunk is not matched)
    query_start_char = full_seq.index(question, len(chunk_part))
    query_end_char = query_start_char + len(question) - 1
    query_positions = list(range(
        char_to_token[query_start_char],
        char_to_token[query_end_char] + 1
    ))
    query_upper_bound = query_positions[-1] + 1
    
    # Forward pass with custom cache
    with torch.no_grad():
        # Initialize cache with query token positions
        past_kv = DynamicCacheWithQuery(query_indices=query_positions)
        
        # Run model forward pass
        output = model(input_ids, attention_mask, past_key_values=past_kv)
        
        # Extract query and key states from cache
        query_cache = output.past_key_values.query_cache
        key_cache = output.past_key_values.key_cache
        
        # Compute relevance scores
        scores = compute_qr_scores(
            query_cache, key_cache,
            qr_head_list, token_chunk_ranges, query_upper_bound
        )
    
    # Sort by scores (descending)
    sorted_indices = torch.argsort(scores, descending=True).cpu().tolist()
    ranked_ids = [paragraphs[i]['idx'] for i in sorted_indices]
    ranked_scores = [float(scores[i]) for i in sorted_indices]
    
    return ranked_ids, ranked_scores
```
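
Putting the pieces together, an end-to-end call might look like the following sketch. The question and paragraphs are toy data in the input format above, and the head string is the default top-16 list from the next section:

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

paragraphs = [
    {"idx": 0, "title": "France", "paragraph_text": "Paris is the capital and largest city of France..."},
    {"idx": 1, "title": "Germany", "paragraph_text": "Berlin is the capital of Germany..."},
]
qr_head_list = ("20-15,21-11,17-27,23-10,22-4,21-10,21-8,21-18,"
                "18-15,18-19,17-25,17-17,24-13,17-4,19-12,21-31")

ranked_ids, scores = rerank_documents(
    model, tokenizer, "What is the capital of France?",
    paragraphs, qr_head_list, device,
)
print(ranked_ids, scores)  # e.g. [0, 1], with the France paragraph scored highest
```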

## Model Configuration

The model configuration includes the following QRRanker-specific parameters:

| Parameter | Description |
|-----------|-------------|
| `qr_start_layer` | Starting layer index for QR heads |
| `qr_end_layer` | Ending layer index for QR heads |
| `qr_head_list` | List of (layer, head) tuples for top QR heads |

### Default Top-16 QR Heads

```
20-15, 21-11, 17-27, 23-10, 22-4, 21-10, 21-8, 21-18,
18-15, 18-19, 17-25, 17-17, 24-13, 17-4, 19-12, 21-31
```
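
For reference, this display string maps to (layer, head) tuples exactly as `compute_qr_scores` parses it:

```python
qr_head_list = ("20-15,21-11,17-27,23-10,22-4,21-10,21-8,21-18,"
                "18-15,18-19,17-25,17-17,24-13,17-4,19-12,21-31")
head_set = [tuple(map(int, h.split('-'))) for h in qr_head_list.split(',')]
print(head_set[:3])  # [(20, 15), (21, 11), (17, 27)]
```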

## Command Line Usage

```bash
# Basic inference
python qr_ranker_inference.py \
    --base_model MindscapeRAG/QRRanker \
    --data_path /path/to/data.json \
    --mode top16

# With summary
python qr_ranker_inference.py \
    --base_model MindscapeRAG/QRRanker \
    --data_path /path/to/data.json \
    --mode top16 \
    --use_summary
```

### Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--base_model` | str | required | Path to QRRanker model |
| `--data_path` | str | required | Path to input data file |
| `--output_dir` | str | `./outputs` | Output directory |
| `--mode` | str | `top16` | Mode: `full` (all heads) or `top16` (selected heads) |
| `--qr_head_list` | str | None | Custom QR head list |
| `--use_summary` | flag | False | Use summary field in data |



## Citation

If you use QRRanker, please cite:

```bibtex
@misc{li2026queryfocusedmemoryawarererankerlong,
      title={Query-focused and Memory-aware Reranker for Long Context Processing}, 
      author={Yuqing Li and Jiangnan Li and Mo Yu and Guoxuan Ding and Zheng Lin and Weiping Wang and Jie Zhou},
      year={2026},
      eprint={2602.12192},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.12192}, 
}
```

## License

This project is licensed under the Apache 2.0 License.