---
license: apache-2.0
language:
- zh
- en
base_model:
- meta-llama/Meta-Llama-3-8B
tags:
- ADG
- SFT
---

<div align="center">
  
<h1>Instruction Data Selection via Answer Divergence</h1>
  <p>
  <strong>English</strong> | <a href="https://huggingface.co/WisdomShell/ADG-Alpaca-GPT4-LLaMa3-8B/blob/main/README_zh.md">简体中文</a>
</p>

<a href="https://wisdomshell.github.io/ADG/"><img src="https://img.shields.io/badge/Project-Page-green?logo=githubpages&logoColor=white" /></a>
<a href="https://arxiv.org/abs/2604.10448"><img src="https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=white" /></a>
<a href="https://2026.aclweb.org/"><img src="https://img.shields.io/badge/Venue-ACL%202026-blue" /></a>
<a href="#overview"><img src="https://img.shields.io/badge/Task-Data%20Selection-purple.svg" /></a>
<img src="https://img.shields.io/badge/Python-3.10%2B-3776AB?logo=python&logoColor=white" />

**ACL 2026 Main Conference**

<a href="https://deepblue666.github.io/">Bo Li</a>, Mingda Wang, Shikun Zhang, Wei Ye

</div>

This repository releases the core pipeline of **Answer Divergence-Guided Selection (ADG)** for instruction data selection. ADG scores each instruction by the geometric structure of multiple sampled answers, rather than relying on a single reference response. In the paper, ADG consistently improves instruction tuning under a fixed 10K budget across two backbones, three public instruction pools, and six benchmarks spanning reasoning, knowledge, and coding. The method combines **dispersion magnitude** and **shape anisotropy**, then performs **bin-wise selection** for semantic coverage. 

---

## 🌟 Overview

Instruction tuning quality depends heavily on which examples are selected under a fixed data budget. ADG addresses this by examining how a base model responds to the same instruction under stochastic decoding.

For each instruction, ADG:

1. samples multiple answers with relatively high-temperature decoding,
2. maps answers into a representation space,
3. computes geometry-aware scores from the sampled answers,
4. ranks examples by the combined score,
5. performs proportional selection within semantic bins.
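As an illustration of steps 3 and 4, the sketch below computes two geometry-aware scores from a matrix of sampled-answer embeddings. The exact formulas here (mean distance to centroid for dispersion, top-eigenvalue variance fraction for anisotropy) are simplified stand-ins for the paper's definitions, not the released implementation:

```python
import numpy as np

def adg_scores(answer_embeddings: np.ndarray) -> tuple[float, float]:
    """Illustrative geometry scores for one instruction's sampled answers.

    `answer_embeddings` has shape (n_answers, dim). These formulas are a
    simplified stand-in for the paper's dispersion and anisotropy terms.
    """
    centroid = answer_embeddings.mean(axis=0)
    # Dispersion magnitude: average distance of answers to their centroid.
    dispersion = float(np.linalg.norm(answer_embeddings - centroid, axis=1).mean())
    # Shape anisotropy: fraction of total variance along the top principal axis.
    cov = np.cov(answer_embeddings, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)  # ascending order
    anisotropy = float(eigvals[-1] / max(eigvals.sum(), 1e-12))
    return dispersion, anisotropy

# Five sampled answers embedded in a 4-dimensional space (random stand-ins).
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 4))
d, a = adg_scores(emb)
```

Higher dispersion and anisotropy indicate that the base model's answers to the instruction spread widely and unevenly in embedding space.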

This repository provides the practical pipeline for:
- multi-sample answer generation,
- instruction embedding and clustering,
- ADG scoring and subset selection,
- model training,
- benchmark evaluation,
- optional task-type analysis.


## ⚙️ Installation

We recommend Python 3.10 or above.

Example:

```bash
git clone https://github.com/WisdomShell/ADG.git
conda create -n adg python=3.12.9
conda activate adg
pip install -r requirements.txt
```

Depending on your environment, you may also need to install GPU-specific packages separately.

---

## 🧾 Data Format

ADG expects instruction datasets in JSON or JSONL format. Each example should follow the schema below:

```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "input": "",
  "output": "Transformers are neural networks based on self-attention..."
}
```

Notes:
- `id` should uniquely identify each example.
- `instruction` is required.
- `input` is optional and can be empty or omitted.
- `output` is the reference response in the original instruction dataset.
- Other instruction datasets can be used as long as they are converted into this format.
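As a concrete example of such a conversion, the snippet below maps records from a hypothetical source dataset into the required schema (the source keys `question`, `context`, and `answer` are placeholders; substitute whatever fields your dataset actually uses):

```python
import json

def to_adg_format(records):
    """Map source records to the ADG schema.

    The source keys `question`, `context`, and `answer` are hypothetical
    examples; substitute the fields your dataset actually provides.
    """
    converted = []
    for i, rec in enumerate(records):
        converted.append({
            "id": i,                              # unique per example
            "instruction": rec["question"],
            "input": rec.get("context", ""),      # optional, may be empty
            "output": rec["answer"],
        })
    return converted

source = [{"question": "Define attention.", "answer": "Attention weights..."}]
with open("instruction_data.json", "w", encoding="utf-8") as f:
    json.dump(to_adg_format(source), f, ensure_ascii=False, indent=2)
```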

After answer generation, the intermediate JSONL file contains records like:

```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "output": "Transformers are neural networks based on self-attention...",
  "generated_answers": [
    "...",
    "...",
    "...",
    "...",
    "..."
  ]
}
```

---

## 🚀 Quick Start

### Step 1. Prepare the instruction pool

Download and preprocess your instruction dataset, such as Alpaca-GPT4, WizardLM, or CoT, into the required format.

### Step 2. Generate multiple answers per instruction

Before running, update the following variables in `generation/generation.py`:
- `MODEL_NAME`
- `OUTPUT_DIR`
- `OUTPUT_FILE`

Then run:

```bash
cd generation
torchrun --nproc_per_node=4 --master_port=29500 generation.py \
  --input_file /path/to/your/instruction_data.json \
  --batch_size 32
```

### Step 3. Build instruction embeddings and clustering results

Before running, update the following variables in `generation/embedding/embed.py`:
- `MODEL_NAME`
- `INPUT_JSONL`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`

Then run:

```bash
torchrun --nproc_per_node=4 --master_port=29501 generation/embedding/embed.py
```
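For intuition, the clustering stage reduces to k-means over the instruction embeddings. The sketch below assumes embeddings are already computed (the actual `embed.py` produces them with the configured `MODEL_NAME`) and uses random stand-ins; `K_CLUSTERS` mirrors the variable of the same name:

```python
import numpy as np
from sklearn.cluster import KMeans

K_CLUSTERS = 4  # mirrors K_CLUSTERS in generation/embedding/embed.py

# Stand-in for embeddings loaded from EMBEDDINGS_PATH,
# shaped (n_instructions, dim).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 16))

# Assign each instruction to one of K_CLUSTERS semantic bins.
kmeans = KMeans(n_clusters=K_CLUSTERS, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)  # one cluster id per instruction
```

The resulting cluster assignments are what the selection step later uses as semantic bins.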

### Step 4. Run ADG scoring and selection

Choose the scoring script that matches your backbone.

For LLaMA, configure these variables in `ADG/ADG_llama.py`:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`
- `FINAL_SELECT_COUNT`

Then run:

```bash
python ADG/ADG_llama.py
```

For Qwen, configure these variables in `ADG/ADG_qwen.py`:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `CHECKPOINT_DIR`
- `FINAL_SELECT_COUNT`

Then run:

```bash
python ADG/ADG_qwen.py
```

The selector saves:
- `top.json`
- `middle.json`
- `bottom.json`

under the configured `OUTPUT_DIR`.
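The bin-wise selection these scripts perform can be sketched as follows: allocate the budget across clusters in proportion to their size, then keep the highest-scoring examples within each cluster. This is a simplified illustration; the released scripts may allocate and tie-break differently:

```python
import numpy as np

def binwise_select(scores, labels, budget):
    """Pick `budget` example indices: per-cluster quotas proportional to
    cluster size, top-scoring examples within each cluster.

    Simplified sketch of bin-wise selection, not the exact released logic.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    n = len(scores)
    selected = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        quota = max(1, round(budget * len(idx) / n))  # proportional slots
        # Highest-scoring members of this cluster, up to the quota.
        top = idx[np.argsort(scores[idx])[::-1][:quota]]
        selected.extend(top.tolist())
    return sorted(selected[:budget])

scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]
labels = [0, 0, 0, 1, 1, 1]
picked = binwise_select(scores, labels, budget=4)  # -> [0, 2, 4, 5]
```

Each cluster of three contributes its two best-scoring examples, so both semantic bins stay represented under the fixed budget.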

### Step 5. Train the backbone model

Use the selected subset, typically `top.json`, for instruction tuning.

For LLaMA:

```bash
cd train
bash train_llama.sh
```

For Qwen:

```bash
cd train
bash train_qwen.sh
```

Before running, update paths such as:
- `--model_name_or_path`
- `--data_path`
- `--output_dir`

### Step 6. Evaluate the trained checkpoint

This repository uses `lm-evaluation-harness` for benchmark evaluation.

Install it first if needed:

```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```

Then configure `MODEL_PATH` and output paths in `eval/eval.sh`, and run:

```bash
cd eval
bash eval.sh
```

The evaluation script currently includes:
- BBH
- GSM8K
- MMLU
- TruthfulQA
- MBPP
- HumanEval

---

## 📖 Citation

```bibtex
@article{li2026instruction,
  title={Instruction Data Selection via Answer Divergence},
  author={Li, Bo and Wang, Mingda and Zhang, Shikun and Ye, Wei},
  journal={arXiv preprint arXiv:2604.10448},
  year={2026}
}
```

---