---
license: apache-2.0
language:
- zh
- en
base_model:
- Qwen/Qwen2.5-7B
tags:
- ADG
- SFT
---

<div align="center">
  
<h1>Instruction Data Selection via Answer Divergence</h1>
  <p>
  <strong>English</strong> | <a href="https://huggingface.co/WisdomShell/ADG-WizardLM-Qwen2.5-7B/blob/main/README_zh.md">简体中文</a>
</p>

<a href="https://wisdomshell.github.io/ADG/"><img src="https://img.shields.io/badge/Project-Page-green?logo=githubpages&logoColor=white" /></a>
<a href="https://arxiv.org/abs/2604.07892"><img src="https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=white" /></a>
<a href="https://2026.aclweb.org/"><img src="https://img.shields.io/badge/Venue-ACL%202026-blue" /></a>
[![Task](https://img.shields.io/badge/Task-Data%20Selection-purple.svg)](#overview)
<img src="https://img.shields.io/badge/Python-3.10%2B-3776AB?logo=python&logoColor=white" />

**ACL 2026 Main Conference**

<a href="https://deepblue666.github.io/">Bo Li</a>, Mingda Wang, Shikun Zhang, Wei Ye

</div>

This repository releases the core pipeline of **Answer Divergence-Guided Selection (ADG)** for instruction data selection. ADG scores each instruction by the geometric structure of multiple sampled answers, rather than relying on a single reference response. In the paper, ADG consistently improves instruction tuning under a fixed 10K budget across two backbones, three public instruction pools, and six benchmarks spanning reasoning, knowledge, and coding. The method combines **dispersion magnitude** and **shape anisotropy**, then performs **bin-wise selection** for semantic coverage. 

---

## 🌟 Overview

Instruction tuning quality depends heavily on which examples are selected under a fixed data budget. ADG addresses this by examining how a base model responds to the same instruction under stochastic decoding.

For each instruction, ADG:

1. samples multiple answers with relatively high-temperature decoding,
2. maps answers into a representation space,
3. computes geometry-aware scores from the sampled answers,
4. ranks examples by the combined score,
5. performs proportional selection within semantic bins.

This repository provides the practical pipeline for:
- multi-sample answer generation,
- instruction embedding and clustering,
- ADG scoring and subset selection,
- model training,
- benchmark evaluation,
- optional task-type analysis.

---

To use this model, first clone the following repository:

```bash
git clone https://github.com/WisdomShell/ADG.git
```

## 📦 What Is Released

This repository includes the following components:

### Core selection code
- `ADG/ADG_llama.py`  
  ADG scoring and selection for the LLaMA backbone.

- `ADG/ADG_qwen.py`  
  ADG scoring and selection for the Qwen backbone.

### Answer generation and instruction embedding
- `generation/generation.py`  
  Generates multiple sampled answers for each instruction.

- `generation/embedding/embed.py`  
  Builds instruction embeddings and performs clustering for bin-wise selection.

### Training and evaluation
- `train/train_llama.sh`  
  Training entry script for LLaMA.

- `train/train_qwen.sh`  
  Training entry script for Qwen.

- `train/training/stanford_alpaca/`  
  Training utilities and backbone-specific training scripts.

- `eval/eval.sh`  
  Evaluation script based on `lm-evaluation-harness`.

### Analysis
- `analysis/analyse.py`  
  Optional task-type classification script for analyzing selected data.

### Environment
- `requirements.txt`  
  Required Python packages for this repository.

---

## 🗂️ Repository Structure

```text
.
├── README.md
├── README_zh.md
├── requirements.txt
├── ADG/
│   ├── ADG_llama.py
│   └── ADG_qwen.py
├── generation/
│   ├── generation.py
│   └── embedding/
│       └── embed.py
├── analysis/
│   └── analyse.py
├── eval/
│   └── eval.sh
└── train/
    ├── train_llama.sh
    ├── train_qwen.sh
    └── training/
        └── stanford_alpaca/
            ├── train_llama.py
            ├── train_qwen.py
            ├── utils.py
            └── configs/
```

---

## ⚙️ Installation

We recommend Python 3.10 or above.

Example:

```bash
conda create -n adg python=3.12.9
conda activate adg
pip install -r requirements.txt
```

Depending on your environment, you may also need to install GPU-specific packages separately.

---

## 🧾 Data Format

ADG expects instruction datasets in JSON or JSONL format. Each example should follow the schema below:

```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "input": "",
  "output": "Transformers are neural networks based on self-attention..."
}
```

Notes:
- `id` should uniquely identify each example.
- `instruction` is required.
- `input` is optional and can be empty or omitted.
- `output` is the reference response in the original instruction dataset.
- Other instruction datasets can be used as long as they are converted into this format.
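
As a rough illustration of that conversion, the sketch below maps raw examples onto the four fields described above. The helper names `normalize_record` and `normalize_pool` are hypothetical and not part of this repository; the sketch assumes missing `id` values can be filled from the record's position and that a missing `input` should become an empty string.

```python
import json


def normalize_record(raw: dict, fallback_id: int) -> dict:
    """Map one raw example onto the id/instruction/input/output schema."""
    instruction = str(raw.get("instruction", "")).strip()
    if not instruction:
        # `instruction` is the only field the schema marks as required.
        raise ValueError(f"record {fallback_id} has no instruction")
    return {
        "id": raw.get("id", fallback_id),   # assumption: position as fallback id
        "instruction": instruction,
        "input": raw.get("input") or "",    # optional, defaults to empty
        "output": raw.get("output", ""),
    }


def normalize_pool(records: list[dict]) -> list[dict]:
    """Normalize every record and verify that ids are unique."""
    normalized = [normalize_record(r, i) for i, r in enumerate(records)]
    ids = [r["id"] for r in normalized]
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    return normalized


pool = normalize_pool(
    [{"instruction": "Write a short explanation of transformers.", "output": "..."}]
)
print(json.dumps(pool[0], ensure_ascii=False))
```

Checking uniqueness once over the whole pool, rather than per record, catches collisions between explicit ids and position-based fallbacks.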

After answer generation, the intermediate JSONL file contains records like:

```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "output": "Transformers are neural networks based on self-attention...",
  "generated_answers": [
    "...",
    "...",
    "...",
    "...",
    "..."
  ]
}
```
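
A minimal sketch for reading such a file is shown below. It is not the loader used by the ADG scripts: `load_generated` and its `min_answers` parameter are hypothetical, and the threshold of two reflects only the observation that a spread cannot be measured from a single answer.

```python
import json
from pathlib import Path


def load_generated(path, min_answers=2):
    """Yield records that carry at least `min_answers` non-empty sampled answers."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            answers = [
                a for a in record.get("generated_answers", [])
                if isinstance(a, str) and a.strip()
            ]
            if len(answers) < min_answers:
                continue  # too few usable answers to compare
            record["generated_answers"] = answers
            yield record
```

Because the function is a generator, a large answer file can be streamed record by record instead of being loaded into memory at once.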

---

## 🔄 Pipeline

The practical workflow is:

```text
instruction pool
    -> generation/generation.py
    -> multi-sample answer JSONL
    -> generation/embedding/embed.py
    -> instruction embeddings + cluster labels
    -> ADG/ADG_llama.py or ADG/ADG_qwen.py
    -> top / middle / bottom selected subsets
    -> train/train_*.sh
    -> finetuned checkpoints
    -> eval/eval.sh
```

---

## 🚀 Quick Start

### Step 1. Prepare the instruction pool

Download and preprocess your instruction dataset, such as Alpaca-GPT4, WizardLM, or CoT, into the required format.

### Step 2. Generate multiple answers per instruction

Before running, update the following variables in `generation/generation.py`:
- `MODEL_NAME`
- `OUTPUT_DIR`
- `OUTPUT_FILE`

Then run:

```bash
cd generation
torchrun --nproc_per_node=4 --master_port=29500 generation.py \
  --input_file /path/to/your/instruction_data.json \
  --batch_size 32
```

### Step 3. Build instruction embeddings and clustering results

Before running, update the following variables in `generation/embedding/embed.py`:
- `MODEL_NAME`
- `INPUT_JSONL`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`

Then, from the repository root, run:

```bash
torchrun --nproc_per_node=4 --master_port=29501 generation/embedding/embed.py
```

### Step 4. Run ADG scoring and selection

Choose the scoring script that matches your backbone.

For LLaMA, configure these variables in `ADG/ADG_llama.py`:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`
- `FINAL_SELECT_COUNT`

Then run:

```bash
python ADG/ADG_llama.py
```

For Qwen, configure these variables in `ADG/ADG_qwen.py`:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `CHECKPOINT_DIR`
- `FINAL_SELECT_COUNT`

Then run:

```bash
python ADG/ADG_qwen.py
```

The selector saves:
- `top.json`
- `middle.json`
- `bottom.json`

under the configured `OUTPUT_DIR`.

### Step 5. Train the backbone model

Use the selected subset, typically `top.json`, for instruction tuning.

For LLaMA:

```bash
cd train
bash train_llama.sh
```

For Qwen:

```bash
cd train
bash train_qwen.sh
```

Before running, update paths such as:
- `--model_name_or_path`
- `--data_path`
- `--output_dir`

### Step 6. Evaluate the trained checkpoint

This repository uses `lm-evaluation-harness` for benchmark evaluation.

Install it first if needed:

```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```

Then configure `MODEL_PATH` and output paths in `eval/eval.sh`, and run:

```bash
cd eval
bash eval.sh
```

The evaluation script currently includes:
- BBH
- GSM8K
- MMLU
- TruthfulQA
- MBPP
- HumanEval

---

## 📊 ADG Scoring Intuition

ADG is built around two complementary signals derived from multiple sampled answers:

- **Dispersion magnitude**  
  Measures how widely the sampled answers spread in representation space.

- **Shape anisotropy**  
  Measures whether the spread is multi-directional rather than dominated by a single direction.

The final ADG score combines these two parts, and the selected subset is obtained through semantic bin-wise ranking. This design helps avoid collapsing selection into only a few dense instruction regions.
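
To make the two signals concrete, here is a minimal sketch that computes one plausible version of each from a handful of answer embeddings. It is an illustration only, not the formulas used in `ADG/ADG_llama.py` or `ADG/ADG_qwen.py`: it assumes dispersion is the mean squared distance to the centroid, and it uses the participation ratio of the covariance spectrum as a stand-in for a multi-directionality measure. The function name `answer_geometry` is hypothetical.

```python
def answer_geometry(embeddings):
    """Return (dispersion, participation_ratio) for one instruction's answers.

    embeddings: list of k equal-length float vectors, one per sampled answer.
    """
    k = len(embeddings)
    if k < 2:
        raise ValueError("need at least two sampled answers")
    dim = len(embeddings[0])
    centroid = [sum(vec[j] for vec in embeddings) / k for j in range(dim)]
    centered = [[vec[j] - centroid[j] for j in range(dim)] for vec in embeddings]

    # Gram matrix of the centered answers, scaled by 1/k. It shares its
    # non-zero eigenvalues with the covariance matrix but is only k x k.
    gram = [[sum(a * b for a, b in zip(u, v)) / k for v in centered] for u in centered]

    trace = sum(gram[i][i] for i in range(k))           # sum of eigenvalues
    frob_sq = sum(x * x for row in gram for x in row)   # sum of squared eigenvalues

    dispersion = trace  # mean squared distance to the centroid
    # Participation ratio: 1 when one direction dominates, larger when the
    # spread is shared across several directions. Zero if all answers coincide.
    participation_ratio = (trace * trace / frob_sq) if frob_sq > 0 else 0.0
    return dispersion, participation_ratio


line = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]                 # spread along one axis
cross = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]  # spread along two axes
print(answer_geometry(line))
print(answer_geometry(cross))
```

In this toy example the collinear answers yield a participation ratio of about 1, while the cross-shaped set yields 2, which matches the intuition that the second spread is multi-directional. How the two signals are weighted into the final score is defined by the ADG scripts and is not reproduced here.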

---

## 🛠️ Script Notes

### `generation/generation.py`
Main functionality:
- load the base model,
- sample multiple answers for each instruction,
- save generated answers in JSONL format,
- support distributed generation.

### `generation/embedding/embed.py`
Main functionality:
- build instruction embeddings,
- run clustering,
- save instruction embeddings and cluster labels,
- provide the semantic bins used by ADG selection.

### `ADG/ADG_llama.py`
Main functionality:
- read the generated-answer JSONL file,
- compute answer-geometry metrics,
- combine metrics into the ADG score,
- perform proportional cluster-based selection,
- save `top.json`, `middle.json`, and `bottom.json`.
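
The proportional cluster-based selection step can be sketched as follows. This is a simplified illustration rather than the script's actual logic: it covers only a top-scoring subset, assumes each record carries hypothetical `cluster` and `score` keys, and uses largest-remainder rounding to split the budget, which is one common choice among several.

```python
from collections import defaultdict


def proportional_quotas(cluster_sizes, budget):
    """Split `budget` across clusters in proportion to their size."""
    total = sum(cluster_sizes.values())
    if budget > total:
        raise ValueError("budget exceeds pool size")
    exact = {c: budget * n / total for c, n in cluster_sizes.items()}
    quotas = {c: int(x) for c, x in exact.items()}
    leftover = budget - sum(quotas.values())
    # Hand the remaining slots to the clusters with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - quotas[c], reverse=True)
    for c in by_remainder[:leftover]:
        quotas[c] += 1
    return quotas


def select_top(records, budget):
    """Pick the highest-scoring records within each cluster's quota."""
    bins = defaultdict(list)
    for rec in records:
        bins[rec["cluster"]].append(rec)
    quotas = proportional_quotas({c: len(v) for c, v in bins.items()}, budget)
    selected = []
    for c, members in bins.items():
        members.sort(key=lambda r: r["score"], reverse=True)
        selected.extend(members[: quotas[c]])
    return selected


demo = [
    {"id": i, "cluster": c, "score": s}
    for i, (c, s) in enumerate(
        [(0, 0.1), (0, 0.9), (0, 0.5), (0, 0.7), (1, 0.2), (1, 0.8)]
    )
]
print(sorted(r["id"] for r in select_top(demo, budget=3)))
```

Allocating quotas before ranking means a large, dense cluster cannot absorb the entire budget even if it holds most of the highest-scoring examples.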

### `ADG/ADG_qwen.py`
Main functionality:
- compute ADG metrics for Qwen-generated answers,
- support checkpoint-based resumption,
- perform the same top / middle / bottom selection pipeline.

### `analysis/analyse.py`
Main functionality:
- classify instructions into coarse task categories,
- support optional data-level analysis of selected subsets.

### `train/train_llama.sh` and `train/train_qwen.sh`
Main functionality:
- launch distributed full fine-tuning,
- use the selected subset for instruction tuning.

### `eval/eval.sh`
Main functionality:
- run benchmark evaluation with `lm-evaluation-harness`,
- support reasoning, knowledge, and coding tasks.

---

## ❓ Common Issues

### 1. Path configuration is not updated
Most scripts use placeholder paths. Update all required paths before running.

### 2. Inconsistent model and intermediate files
Make sure the generation backbone, embedding backbone, ADG scoring script, and training script are aligned.

### 3. Missing intermediate files
The selector depends on:
- generated answer JSONL,
- instruction embeddings,
- clustering results.

Run the previous stages before starting ADG selection. 
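
A small pre-flight check along these lines can surface missing inputs before a long scoring run starts. The helper `missing_inputs` is hypothetical, and the paths are placeholders to be replaced with the values configured in the ADG script.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the labels whose path does not exist on disk."""
    return [label for label, path in paths.items() if not Path(path).exists()]


# Placeholder paths: substitute the values set in ADG_llama.py or ADG_qwen.py.
required = {
    "INPUT_JSONL": "/path/to/generated_answers.jsonl",
    "EMBEDDINGS_PATH": "/path/to/instruction_embeddings",
    "CLUSTERS_PATH": "/path/to/cluster_labels",
}
print("Missing:", missing_inputs(required))
```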

### 4. GPU memory pressure
Generation, embedding, and scoring all use hidden-state-based processing. You may need to reduce batch size or adjust GPU allocation depending on your hardware.

### 5. Evaluation dependency is not installed
`eval/eval.sh` depends on `lm-evaluation-harness`. Install it separately before running evaluation.

---

## 📖 Citation

If you use this repository, please cite the paper.

---