---
title: "LastingBench: Defend Benchmarks Against Knowledge Leakage"
tags:
- paper
- benchmark
license: cc-by-4.0
---
# 📄 Paper
<iframe
src="https://huggingface.co/kixx/LastingBench/resolve/main/paper.pdf#toolbar=0"
width="100%"
height="900"
style="border:none;">
</iframe>
<!-- Fallback link: -->
<p><a href="https://huggingface.co/kixx/LastingBench/resolve/main/paper.pdf">📥 Download the PDF</a></p>
# LastingBench: Defend Benchmarks Against Knowledge Leakage
Welcome to the repository for the research paper "LastingBench: Defend Benchmarks Against Knowledge Leakage." The project addresses a growing concern: large language models (LLMs) can "cheat" on standard question-answering (QA) benchmarks by memorizing task-specific data. When that happens, benchmark scores no longer reflect genuine model capability but rather the effects of data leakage.
## Project Overview

LastingBench introduces a novel framework designed to continuously reinforce and safeguard existing benchmarks against knowledge leakage. The project aims to:
- **Detect knowledge leakage** through context and question perturbation techniques
- **Rewrite leaked content** to counterfactual alternatives that disrupt memorization while preserving the benchmark's original evaluative intent
- **Evaluate model responses** to contextual evidence and reasoning patterns
- **Provide practical solutions** to ensure benchmark robustness over time, promoting fairer and more interpretable evaluations of LLMs
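To make the detection idea above concrete, here is a minimal sketch (not the paper's exact procedure): benchmark questions such as HotpotQA's are written to require the given context, so a model that answers correctly *without* the context is likely relying on memorized benchmark data.
```python
# Minimal leakage probe; a sketch, not the repository's actual pipeline.
def probe_leakage(ask, question: str, gold: str) -> bool:
    """`ask` is any callable mapping a prompt string to an answer string."""
    answer = ask(f"Question: {question}\nAnswer:")  # context deliberately omitted
    return gold.lower() in answer.lower()           # True -> suspected leakage

# Toy stand-in for a real LLM call, using a well-known HotpotQA example.
suspicious = probe_leakage(
    lambda prompt: "Arthur's Magazine",
    "Which magazine was started first, Arthur's Magazine or First for Women?",
    "Arthur's Magazine",
)
print(suspicious)  # True: the "model" answered correctly without the context
```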
## Installation
1. Clone the repository:
```bash
git clone https://github.com/Seriousss/lastingbench
```
2. Create and activate conda environment:
```bash
conda create -n lastingbench python=3.12
conda activate lastingbench
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Set up environment variables:
```bash
export OPENAI_BASE_URL="your-api-base-url"
export OPENAI_API_KEY="your-api-key"
export CUDA_VISIBLE_DEVICES="0,1,2,3" # Adjust based on your GPU setup
```
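For the API-based scripts, these exports are typically all that is needed: the official `openai` Python SDK (v1+) reads `OPENAI_API_KEY` and `OPENAI_BASE_URL` from the environment automatically. A minimal sanity check:
```python
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY / OPENAI_BASE_URL automatically
resp = client.chat.completions.create(
    model="gpt-4o",  # any model name your endpoint serves
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```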
## Usage
LastingBench provides three main functionalities: **Detection**, **Rewrite**, and **Evaluation** (model inference and training comparison).
### 🔍 Detection
Detect knowledge leakage through various perturbation techniques.
#### 1. Context Leakage Detection
Evaluate models using exact-match scoring on benchmark datasets:
```bash
# Using vLLM for most models
python -m detect.contextleakage --hf_model "Qwen/Qwen2.5-7B-Instruct" \
--dataset_subset "hotpotqa" --cuda_devices "0,1"
# Using Transformers for Qwen3 models
python -m detect.contextleakage --hf_model "Qwen/Qwen3-8B" \
--is_qwen3 --max_new_tokens 30
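# Using an OpenAI-compatible API for hosted models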
python -m detect.contextleakage_api --model "deepseek-r1" --dataset_subset "hotpotqa"
```
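The exact-match scoring referenced above is conventionally computed on normalized strings. A hedged sketch of SQuAD-style EM (the repository's implementation may differ in details):
```python
import re
import string

def normalize(s: str) -> str:
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)  # drop English articles
    return " ".join(s.split())             # collapse whitespace

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

assert exact_match("The Eiffel Tower!", "eiffel tower")
```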
#### 2. Question Perturbation Detection
Rephrase questions to opposite meanings and test model consistency:
```bash
# Using OpenAI API
python -m detect.question_rephrase_answer_api \
--model_name "gpt-4o" --dataset_subset "2wikimqa" \
--rephrase_type "opposite" --sample_count 100
# Using local vLLM models
python -m detect.question_rephrase_answer_vllm \
--model_name "Qwen/Qwen2.5-7B-Instruct" --dataset_subset "hotpotqa" --rephrase_type "similar"
# Using Qwen3 with Transformers
python -m detect.question_rephrase_answer_qwen3 \
--model_name "Qwen/Qwen3-8B" --dataset_subset "2wikimqa"
```
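The consistency test behind this detection mode can be sketched as follows; the prompts, model name, and raw string comparison are illustrative simplifications, not the repository's exact logic. A context-grounded reader should flip its answer when the question's meaning is flipped; a memorizing model often does not.
```python
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY / OPENAI_BASE_URL from the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()

question = "Which of the two films was released earlier?"
opposite = ask(f"Rewrite this question so it asks the opposite: {question}")

# Identical answers to opposite questions hint the answer is memorized.
if ask(question) == ask(opposite):
    print("Suspected leakage: answer unchanged under opposite rephrasing")
```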
### ✏️ Rewrite
Generate counterfactual answers and rewrite leaked evidence to create robust benchmarks.
#### 1. Evidence Finding and Counterfactual Rewriting Pipeline
Run the complete finding and rewriting pipeline:
```bash
# Specify custom output file and dataset
python main_gpu.py --output custom_output.jsonl \
--dataset_subset "hotpotqa" --start_idx 0 --max_samples 100
```
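Conceptually, the rewriting step replaces the gold answer inside the supporting evidence with a counterfactual alternative, so a model reciting the memorized answer is contradicted by the new context. A toy sketch (not `main_gpu.py`'s actual pipeline, which first locates the evidence spans):
```python
import re

def counterfactual_rewrite(evidence: str, gold: str, alternative: str) -> str:
    # Case-insensitive swap of the gold answer for the counterfactual one.
    return re.sub(re.escape(gold), alternative, evidence, flags=re.IGNORECASE)

evidence = "Arthur's Magazine was started in 1844, before First for Women."
print(counterfactual_rewrite(evidence, "1844", "1989"))
# -> "Arthur's Magazine was started in 1989, before First for Women."
```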
Convert and merge JSONL files with question-answer mappings:
```bash
# Merge single mapping file with original dataset
python utils/convert.py original.jsonl revised.jsonl custom_output.jsonl
```
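A rough sketch of this merge step, assuming one JSON object per line keyed by the question text; the field name (`input`) and the whole-record overlay are assumptions, not the actual schema used by `utils/convert.py`:
```python
import json

def merge_jsonl(original_path: str, revised_path: str, output_path: str,
                key: str = "input") -> None:
    revised = {}
    with open(revised_path) as f:
        for line in f:
            record = json.loads(line)
            revised[record[key]] = record
    with open(original_path) as fin, open(output_path, "w") as fout:
        for line in fin:
            record = json.loads(line)
            record.update(revised.get(record[key], {}))  # overlay revised fields
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")

merge_jsonl("original.jsonl", "revised.jsonl", "custom_output.jsonl")
```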
The original and revised datasets can be found under the **data** folder.
#### 2. Random Answer Rewriting
Create random alternatives to disrupt memorization:
```bash
# Specify custom output file and dataset
python random_alternative_answer.py --output random_hotpot.jsonl \
--dataset_subset "hotpotqa" --start_idx 0 --max_samples 50
```
### 🚀 Evaluation
#### 1. Model Inference Evaluation
Comprehensive evaluation on original and revised benchmarks:
```bash
# Transformers-based evaluation
python -m eval.evaluation -i data/hotpotqa.jsonl -model "Qwen/Qwen3-8B" -k 40 -t 0.5
# API-based evaluation
python -m eval.eval_with_api --input data/hotpotqa_antifact.jsonl \
--model "deepseek-r1" --max_tokens 30 --temperature 0.5
```
#### 2. Model Training Evaluation
Compare training dynamics between original and rewritten datasets. The training loss data can be found under **training_result**. To reproduce the figure in our paper:
```bash
python utils/draw.py training_result/training_loss_qwen38.csv training_result/training_loss_antifact_qwen38.csv \
--title "Original vs Rewritten Training Loss"
```
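For reference, the plot can be approximated with a few lines of pandas/matplotlib; the CSV column names (`step`, `loss`) are assumptions about the files under **training_result**:
```python
import pandas as pd
import matplotlib.pyplot as plt

orig = pd.read_csv("training_result/training_loss_qwen38.csv")
rewr = pd.read_csv("training_result/training_loss_antifact_qwen38.csv")

plt.plot(orig["step"], orig["loss"], label="Original")
plt.plot(rewr["step"], rewr["loss"], label="Rewritten")
plt.xlabel("Training step")
plt.ylabel("Loss")
plt.title("Original vs Rewritten Training Loss")
plt.legend()
plt.savefig("training_loss.png")
```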
### 📊 Utility Functions
Additional tools for analysis and metrics:
- **Metrics Calculation**: F1 scores, EM scores, and custom evaluation metrics
- **Document Retrieval**: BM25-based retrieval for evidence analysis
All scripts support various parameters for customization. Use `--help` with any script to see available options.
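As an illustration of the metrics listed above, token-level F1 is conventionally computed from bag-of-words overlap between the prediction and the gold answer (the repository's normalization may differ):
```python
from collections import Counter

def f1(prediction: str, gold: str) -> float:
    p, g = prediction.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())  # shared token count
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(f1("the eiffel tower", "eiffel tower"))  # 0.8
```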