---
title: "LastingBench: Defend Benchmarks Against Knowledge Leakage"
tags:
  - paper
  - benchmark
license: cc-by-4.0
---

# 📄 Paper

<iframe
  src="https://huggingface.co/kixx/LastingBench/resolve/main/paper.pdf#toolbar=0"
  width="100%"
  height="900"
  style="border:none;">
</iframe>

<!-- Fallback link in case the embed does not load: -->
<p><a href="https://huggingface.co/kixx/LastingBench/resolve/main/paper.pdf">📥 Download the PDF</a></p>


# LastingBench: Defend Benchmarks Against Knowledge Leakage

Welcome to the repository for the research paper "LastingBench: Defend Benchmarks Against Knowledge Leakage." The project addresses a growing concern: large language models (LLMs) can "cheat" on standard question-answering (QA) benchmarks by memorizing task-specific data. When that happens, benchmark scores no longer reflect genuine model capabilities but rather the effects of data leakage, undermining the validity of the evaluation.

## Project Overview

![Overview](./assets/overview.png)

LastingBench introduces a novel framework designed to continuously reinforce and safeguard existing benchmarks against knowledge leakage. The project aims to:
- **Detect knowledge leakage** through context and question perturbation techniques
- **Rewrite leaked content** to counterfactual alternatives that disrupt memorization while preserving the benchmark's original evaluative intent  
- **Evaluate model responses** to contextual evidence and reasoning patterns
- **Provide practical solutions** to ensure benchmark robustness over time, promoting fairer and more interpretable evaluations of LLMs


## Installation

1. Clone the repository:
```bash
git clone https://github.com/Seriousss/lastingbench
```

2. Create and activate conda environment:
```bash
conda create -n lastingbench python=3.12
conda activate lastingbench
```

3. Install dependencies:
```bash
pip install -r requirements.txt
```

4. Set up environment variables:
```bash
export OPENAI_BASE_URL="your-api-base-url"
export OPENAI_API_KEY="your-api-key"
export CUDA_VISIBLE_DEVICES="0,1,2,3"  # Adjust based on your GPU setup
```
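For the API-based scripts, you can quickly confirm that the endpoint and key are picked up. Below is a minimal sketch (not part of the repository) using the official `openai` Python client, which reads `OPENAI_API_KEY` and `OPENAI_BASE_URL` from the environment:
```python
# Sketch: confirm the OpenAI-compatible endpoint is reachable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment

resp = client.chat.completions.create(
    model="gpt-4o",  # any model name your endpoint actually serves
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)
```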

## Usage

LastingBench provides three main functionalities: **Detection**, **Rewrite**, and **Training Comparison**.

### 🔍 Detection

Detect knowledge leakage through various perturbation techniques.

#### 1. Context Leakage Detection
Evaluate models using exact-match scoring on benchmark datasets:
```bash
# Using vLLM for most models
python -m detect.contextleakage --hf_model "Qwen/Qwen2.5-7B-Instruct" \
    --dataset_subset "hotpotqa" --cuda_devices "0,1"

# Using Transformers for Qwen3 models  
python -m detect.contextleakage --hf_model "Qwen/Qwen3-8B" \
    --is_qwen3 --max_new_tokens 30

python -m detect.contextleakage_api --model "deepseek-r1" --dataset_subset "hotpotqa"
```
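The underlying check is straightforward: if a model reproduces the gold answer even when the supporting context is withheld, the item has likely leaked into its training data. Here is a minimal sketch of that loop, assuming a hypothetical `query_model` helper and JSONL items with `question`/`answers` fields (the repository's exact schema and scoring may differ):
```python
import json

def query_model(prompt: str) -> str:
    """Hypothetical helper: send the prompt to the model under test and return its answer."""
    raise NotImplementedError  # wire this up to vLLM, Transformers, or an API client

def detect_context_leakage(jsonl_path: str):
    """Flag items the model answers correctly without seeing their gold context."""
    leaked = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            pred = query_model(f"Answer concisely: {item['question']}")  # context withheld
            if any(pred.strip().lower() == a.strip().lower() for a in item["answers"]):
                leaked.append(item)
    return leaked
```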


#### 2. Question Perturbation Detection
Rephrase questions to opposite meanings and test model consistency:
```bash
# Using OpenAI API
python -m detect.question_rephrase_answer_api \
    --model_name "gpt-4o" --dataset_subset "2wikimqa" \
    --rephrase_type "opposite" --sample_count 100

# Using local vLLM models
python -m detect.question_rephrase_answer_vllm \
    --model_name "Qwen/Qwen2.5-7B-Instruct" --dataset_subset "hotpotqa" --rephrase_type "similar"

# Using Qwen3 with Transformers
python -m detect.question_rephrase_answer_qwen3 \
    --model_name "Qwen/Qwen3-8B" --dataset_subset "2wikimqa"
```
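Conceptually, this test rewrites a question into its opposite meaning and checks whether the model still returns the original gold answer; if it does, the answer is likely memorized rather than derived from the evidence. A hedged sketch using the OpenAI chat API (prompt wording and helper names are illustrative, not the repository's exact implementation):
```python
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY / OPENAI_BASE_URL

def ask(prompt: str, model: str = "gpt-4o") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip()

def suspicious_consistency(question: str, gold_answer: str) -> bool:
    """True if the model gives the original gold answer even to the opposite question."""
    opposite = ask(f"Rewrite this question so that it asks for the opposite:\n{question}")
    answer = ask(f"Answer concisely: {opposite}")
    return gold_answer.lower() in answer.lower()
```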


### ✏️ Rewrite

Generate counterfactual answers and rewrite leaked evidence to create robust benchmarks.

#### 1. Evidence Finding and Counterfactual Rewriting Pipeline
Run the complete evidence-finding and counterfactual-rewriting pipeline:
```bash

# Specify custom output file and dataset
python main_gpu.py --output custom_output.jsonl \
    --dataset_subset "hotpotqa" --start_idx 0 --max_samples 100

```

Convert and merge JSONL files with question-answer mappings:
```bash
# Merge single mapping file with original dataset
python utils/convert.py original.jsonl revised.jsonl custom_output.jsonl

```
The original and revised datasets can be found under the **data** folder.
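The merge pairs revised question-answer records with the original dataset by a shared key. A minimal sketch of that kind of merge, keyed on the question text (the actual `utils/convert.py` may use a different key or schema):
```python
import json
import sys

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def merge(original_path, revised_path, output_path):
    revised_by_q = {r["question"]: r for r in load_jsonl(revised_path)}
    with open(output_path, "w", encoding="utf-8") as out:
        for item in load_jsonl(original_path):
            item.update(revised_by_q.get(item["question"], {}))  # overlay revised fields
            out.write(json.dumps(item, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    merge(*sys.argv[1:4])
```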

#### 2. Random Answer Rewriting
Create random alternatives to disrupt memorization:
```bash
# Specify custom output file and dataset
python random_alternative_answer.py --output random_hotpot.jsonl \
    --dataset_subset "hotpotqa" --start_idx 0 --max_samples 50

```
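The idea is to swap each gold answer for a randomly chosen alternative (for example, an answer drawn from a different item), so that a memorized answer no longer matches. A sketch of the idea, assuming items carry an `answer` field:
```python
import random

def randomize_answers(items, seed=0):
    """Replace each item's answer with one sampled from a different item."""
    rng = random.Random(seed)
    pool = [it["answer"] for it in items]
    out = []
    for it in items:
        candidates = [a for a in pool if a != it["answer"]]
        alt = rng.choice(candidates) if candidates else it["answer"]
        out.append({**it, "answer": alt})
    return out
```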


### 🚀 Dataset Evaluations on Model Inference and Training


#### 1. Model Inference Evaluation
Comprehensive evaluation on original and revised benchmarks:
```bash
# Transformers-based evaluation
python -m eval.evaluation -i data/hotpotqa.jsonl -model "Qwen/Qwen3-8B" -k 40 -t 0.5

# API-based evaluation  
python -m eval.eval_with_api --input data/hotpotqa_antifact.jsonl \
    --model "deepseek-r1" --max_tokens 30 --temperature 0.5
```

#### 2. Model Training Evaluation
Compare training dynamics between original and rewritten datasets:

The training loss data can be found under **training_result**. 

To reproduce the figure in our paper:
```bash
python utils/draw.py training_result/training_loss_qwen38.csv training_result/training_loss_antifact_qwen38.csv \
    --title "Original vs Rewritten Training Loss"
```
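If you want to inspect the curves without the repository script, a minimal matplotlib sketch also works (assuming each CSV exposes `step` and `loss` columns; the actual column names under **training_result** may differ):
```python
import pandas as pd
import matplotlib.pyplot as plt

orig = pd.read_csv("training_result/training_loss_qwen38.csv")
rewritten = pd.read_csv("training_result/training_loss_antifact_qwen38.csv")

plt.plot(orig["step"], orig["loss"], label="Original")
plt.plot(rewritten["step"], rewritten["loss"], label="Rewritten")
plt.xlabel("Training step")
plt.ylabel("Loss")
plt.title("Original vs Rewritten Training Loss")
plt.legend()
plt.savefig("training_loss_comparison.png", dpi=200)
```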



### 📊 Utility Functions

Additional tools for analysis and metrics:

- **Metrics Calculation**: F1 scores, EM scores, and custom evaluation metrics (see the sketch after this list)
- **Document Retrieval**: BM25-based retrieval for evidence analysis
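For reference, SQuAD-style EM and token-level F1 can be sketched as follows; this mirrors the standard definitions, and the repository's own metric code may differ in normalization details:
```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def em_score(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```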

All scripts support various parameters for customization. Use `--help` with any script to see available options.