WarrenWang01 committed on
Commit db01840 · verified · 1 Parent(s): 1187670

Update README.md

Files changed (1)
  1. README.md +425 -3
README.md CHANGED
@@ -1,3 +1,425 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - zh
5
+ - en
6
+ base_model:
7
+ - meta-llama/Meta-Llama-3-8B
8
+ tags:
9
+ - ADG
10
+ - SFT
11
+ ---
12
+
13
+ <div align="center">
14
+
15
+ <h1>Instruction Data Selection via Answer Divergence</h1>
16
+ <p>
17
+ <strong>English</strong> | <a href="https://huggingface.co/WisdomShell/GRIP-Llama-3-8B/blob/main/README_zh.md">简体中文</a>
18
+ </p>
19
+
20
+ <a href="https://wisdomshell.github.io/ADG/"><img src="https://img.shields.io/badge/Project-Page-green?logo=githubpages&logoColor=white" /></a>
21
+ <a href="https://arxiv.org/abs/2604.07892"><img src="https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=white" /></a>
22
+ <a href="https://2026.aclweb.org/"><img src="https://img.shields.io/badge/Venue-ACL%202026-blue" /></a>
23
+ [![Task](https://img.shields.io/badge/Task-Data%20Selection-purple.svg)](#overview)
24
+ <img src="https://img.shields.io/badge/Python-3.10%2B-3776AB?logo=python&logoColor=white" />
25
+
26
+ **ACL 2026 Main Conference**
27
+
28
+ <a href="https://deepblue666.github.io/">Bo Li</a>, Mingda Wang, Shikun Zhang, Wei Ye
29
+
30
+ </div>
31
+
32
+ This repository releases the core pipeline of **Answer Divergence-Guided Selection (ADG)** for instruction data selection. ADG scores each instruction by the geometric structure of multiple sampled answers, rather than relying on a single reference response. In the paper, ADG consistently improves instruction tuning under a fixed 10K budget across two backbones, three public instruction pools, and six benchmarks spanning reasoning, knowledge, and coding. The method combines **dispersion magnitude** and **shape anisotropy**, then performs **bin-wise selection** for semantic coverage.
33
+
34
+ ---
35
+
36
+ ## 🌟 Overview
37
+
38
+ Instruction tuning quality depends heavily on which examples are selected under a fixed data budget. ADG addresses this by examining how a base model responds to the same instruction under stochastic decoding.
39
+
40
+ For each instruction, ADG:
41
+
42
+ 1. samples multiple answers with relatively high-temperature decoding,
43
+ 2. maps answers into a representation space,
44
+ 3. computes geometry-aware scores from the sampled answers,
45
+ 4. ranks examples by the combined score,
46
+ 5. performs proportional selection within semantic bins.
47
+
48
+ This repository provides the practical pipeline for:
49
+ - multi-sample answer generation,
50
+ - instruction embedding and clustering,
51
+ - ADG scoring and subset selection,
52
+ - model training,
53
+ - benchmark evaluation,
54
+ - optional task-type analysis.
55
+
56
+ ---
57
+ To use this model, first clone the following repository:
58
+ ```bash
59
+ git clone https://github.com/WisdomShell/ADG.git
60
+ ```
61
+
62
+ ## 📦 What Is Released
63
+
64
+ This repository includes the following components:
65
+
66
+ ### Core selection code
67
+ - `ADG/ADG_llama.py`
68
+ ADG scoring and selection for the LLaMA backbone.
69
+
70
+ - `ADG/ADG_qwen.py`
71
+ ADG scoring and selection for the Qwen backbone.
72
+
73
+ ### Answer generation and instruction embedding
74
+ - `generation/generation.py`
75
+ Generates multiple sampled answers for each instruction.
76
+
77
+ - `generation/embedding/embed.py`
78
+ Builds instruction embeddings and performs clustering for bin-wise selection.
79
+
80
+ ### Training and evaluation
81
+ - `train/train_llama.sh`
82
+ Training entry script for LLaMA.
83
+
84
+ - `train/train_qwen.sh`
85
+ Training entry script for Qwen.
86
+
87
+ - `train/training/stanford_alpaca/`
88
+ Training utilities and backbone-specific training scripts.
89
+
90
+ - `eval/eval.sh`
91
+ Evaluation script based on `lm-evaluation-harness`.
92
+
93
+ ### Analysis
94
+ - `analysis/analyse.py`
95
+ Optional task-type classification script for analyzing selected data.
96
+
97
+ ### Environment
98
+ - `requirements.txt`
99
+ Required Python packages for this repository.
100
+
101
+ ---
102
+
103
+ ## 🗂️ Repository Structure
104
+
105
+ ```text
106
+ .
107
+ ├── README.md
108
+ ├── README_zh.md
109
+ ├── requirements.txt
110
+ ├── ADG/
111
+ │ ├── ADG_llama.py
112
+ │ └── ADG_qwen.py
113
+ ├── generation/
114
+ │ ├── generation.py
115
+ │ └── embedding/
116
+ │ └── embed.py
117
+ ├── analysis/
118
+ │ └── analyse.py
119
+ ├── eval/
120
+ │ └── eval.sh
121
+ └── train/
122
+ ├── train_llama.sh
123
+ ├── train_qwen.sh
124
+ └── training/
125
+ └── stanford_alpaca/
126
+ ├── train_llama.py
127
+ ├── train_qwen.py
128
+ ├── utils.py
129
+ └── configs/
130
+ ```
131
+
132
+ ---
133
+
134
+ ## ⚙️ Installation
135
+
136
+ We recommend Python 3.10 or above.
137
+
138
+ Example:
139
+
140
+ ```bash
141
+ conda create -n adg python=3.12.9
142
+ conda activate adg
143
+ pip install -r requirements.txt
144
+ ```
145
+
146
+ Depending on your environment, you may also need to install GPU-specific packages separately.
147
+
148
+ ---
149
+
150
+ ## 🧾 Data Format
151
+
152
+ ADG expects instruction datasets in JSON or JSONL format. Each example should follow the schema below:
153
+
154
+ ```json
155
+ {
156
+ "id": 0,
157
+ "instruction": "Write a short explanation of transformers.",
158
+ "input": "",
159
+ "output": "Transformers are neural networks based on self-attention..."
160
+ }
161
+ ```
162
+
163
+ Notes:
164
+ - `id` should uniquely identify each example.
165
+ - `instruction` is required.
166
+ - `input` is optional and can be empty or omitted.
167
+ - `output` is the reference response in the original instruction dataset.
168
+ - Other instruction datasets can be used as long as they are converted into this format.
169
+
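As a rough illustration of the conversion step, the sketch below normalizes arbitrary records into the four-field schema described above. The helper names `normalize_record` and `normalize_pool` are hypothetical and not part of this repository, and using the list position as a fallback `id` is an assumption made only for this example.

```python
import json

def normalize_record(raw, fallback_id):
    """Return a dict with the id/instruction/input/output fields (hypothetical helper)."""
    instruction = str(raw.get("instruction", "")).strip()
    if not instruction:
        raise ValueError("`instruction` is required and must be non-empty")
    return {
        # Assumption: fall back to the list position when no id is given.
        "id": raw.get("id", fallback_id),
        "instruction": instruction,
        # `input` is optional: a missing or None value becomes an empty string.
        "input": raw.get("input") or "",
        "output": raw.get("output", ""),
    }

def normalize_pool(records):
    """Normalize a list of raw records and check that ids are unique."""
    pool = [normalize_record(r, i) for i, r in enumerate(records)]
    ids = [r["id"] for r in pool]
    if len(set(ids)) != len(ids):
        raise ValueError("`id` values must be unique")
    return pool

raw_pool = [
    {"instruction": "Write a short explanation of transformers.",
     "output": "Transformers are neural networks based on self-attention..."},
    {"id": 7, "instruction": "Translate 'hello' to French.",
     "input": None, "output": "bonjour"},
]
pool = normalize_pool(raw_pool)
print(json.dumps(pool[0], ensure_ascii=False))
```

The normalized list can then be written out with `json.dump` and passed to the generation step as the instruction pool.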
170
+ After answer generation, the intermediate JSONL file contains records like:
171
+
172
+ ```json
173
+ {
174
+ "id": 0,
175
+ "instruction": "Write a short explanation of transformers.",
176
+ "output": "Transformers are neural networks based on self-attention...",
177
+ "generated_answers": [
178
+ "...",
179
+ "...",
180
+ "...",
181
+ "...",
182
+ "..."
183
+ ]
184
+ }
185
+ ```
186
+
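Before scoring, it can help to check the intermediate file quickly. The following is a minimal sketch, assuming one JSON object per line with a non-empty `generated_answers` list as in the record above; `parse_generated_jsonl` and its `expected_samples` parameter are hypothetical names, not functions provided by this repository.

```python
import json

def parse_generated_jsonl(text, expected_samples=None):
    """Parse JSONL text and verify each record carries sampled answers."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        answers = record.get("generated_answers")
        if not isinstance(answers, list) or not answers:
            raise ValueError(f"line {line_no}: missing `generated_answers`")
        if expected_samples is not None and len(answers) != expected_samples:
            raise ValueError(
                f"line {line_no}: expected {expected_samples} answers, got {len(answers)}"
            )
        records.append(record)
    return records

sample_text = "\n".join([
    json.dumps({"id": 0, "instruction": "Explain transformers.",
                "output": "ref", "generated_answers": ["a", "b", "c"]}),
    json.dumps({"id": 1, "instruction": "Add 2 and 3.",
                "output": "5", "generated_answers": ["5", "five", "2+3=5"]}),
])
records = parse_generated_jsonl(sample_text, expected_samples=3)
print(len(records), "records parsed")
```

For a real file, the text can be obtained with `Path(path).read_text(encoding="utf-8")`, or the loop can be adapted to stream line by line for large pools.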
187
+ ---
188
+
189
+ ## 🔄 Pipeline
190
+
191
+ The practical workflow is:
192
+
193
+ ```text
194
+ instruction pool
195
+ -> generation/generation.py
196
+ -> multi-sample answer JSONL
197
+ -> generation/embedding/embed.py
198
+ -> instruction embeddings + cluster labels
199
+ -> ADG/ADG_llama.py or ADG/ADG_qwen.py
200
+ -> top / middle / bottom selected subsets
201
+ -> train/train_*.sh
202
+ -> finetuned checkpoints
203
+ -> eval/eval.sh
204
+ ```
205
+
206
+ ---
207
+
208
+ ## 🚀 Quick Start
209
+
210
+ ### Step 1. Prepare the instruction pool
211
+
212
+ Download and preprocess your instruction dataset, such as Alpaca-GPT4, WizardLM, or CoT, into the required format.
213
+
214
+ ### Step 2. Generate multiple answers per instruction
215
+
216
+ Before running, update the following variables in `generation/generation.py`:
217
+ - `MODEL_NAME`
218
+ - `OUTPUT_DIR`
219
+ - `OUTPUT_FILE`
220
+
221
+ Then run:
222
+
223
+ ```bash
224
+ cd generation
225
+ torchrun --nproc_per_node=4 --master_port=29500 generation.py --input_file /path/to/your/instruction_data.json --batch_size 32
226
+ ```
227
+
228
+ ### Step 3. Build instruction embeddings and clustering results
229
+
230
+ Before running, update the following variables in `generation/embedding/embed.py`:
231
+ - `MODEL_NAME`
232
+ - `INPUT_JSONL`
233
+ - `EMBEDDINGS_PATH`
234
+ - `CLUSTERS_PATH`
235
+ - `K_CLUSTERS`
236
+
237
+ Then run from the repository root (the script path below is relative to it):
238
+
239
+ ```bash
240
+ torchrun --nproc_per_node=4 --master_port=29501 generation/embedding/embed.py
241
+ ```
242
+
243
+ ### Step 4. Run ADG scoring and selection
244
+
245
+ Choose the scoring script that matches your backbone.
246
+
247
+ For LLaMA, configure these variables in `ADG/ADG_llama.py`:
248
+ - `model_name`
249
+ - `INPUT_JSONL`
250
+ - `OUTPUT_DIR`
251
+ - `EMBEDDINGS_PATH`
252
+ - `CLUSTERS_PATH`
253
+ - `K_CLUSTERS`
254
+ - `FINAL_SELECT_COUNT`
255
+
256
+ Then run:
257
+
258
+ ```bash
259
+ python ADG/ADG_llama.py
260
+ ```
261
+
262
+ For Qwen, configure these variables in `ADG/ADG_qwen.py`:
263
+ - `model_name`
264
+ - `INPUT_JSONL`
265
+ - `OUTPUT_DIR`
266
+ - `EMBEDDINGS_PATH`
267
+ - `CLUSTERS_PATH`
268
+ - `CHECKPOINT_DIR`
269
+ - `FINAL_SELECT_COUNT`
270
+
271
+ Then run:
272
+
273
+ ```bash
274
+ python ADG/ADG_qwen.py
275
+ ```
276
+
277
+ The selector saves:
278
+ - `top.json`
279
+ - `middle.json`
280
+ - `bottom.json`
281
+
282
+ under the configured `OUTPUT_DIR`.
283
+
284
+ ### Step 5. Train the backbone model
285
+
286
+ Use the selected subset, typically `top.json`, for instruction tuning.
287
+
288
+ For LLaMA:
289
+
290
+ ```bash
291
+ cd train
292
+ bash train_llama.sh
293
+ ```
294
+
295
+ For Qwen:
296
+
297
+ ```bash
298
+ cd train
299
+ bash train_qwen.sh
300
+ ```
301
+
302
+ Before running, update paths such as:
303
+ - `--model_name_or_path`
304
+ - `--data_path`
305
+ - `--output_dir`
306
+
307
+ ### Step 6. Evaluate the trained checkpoint
308
+
309
+ This repository uses `lm-evaluation-harness` for benchmark evaluation.
310
+
311
+ Install it first if needed:
312
+
313
+ ```bash
314
+ git clone https://github.com/EleutherAI/lm-evaluation-harness.git
315
+ cd lm-evaluation-harness
316
+ pip install -e .
317
+ ```
318
+
319
+ Then configure `MODEL_PATH` and output paths in `eval/eval.sh`, and run:
320
+
321
+ ```bash
322
+ cd eval
323
+ bash eval.sh
324
+ ```
325
+
326
+ The evaluation script currently includes:
327
+ - BBH
328
+ - GSM8K
329
+ - MMLU
330
+ - TruthfulQA
331
+ - MBPP
332
+ - HumanEval
333
+
334
+ ---
335
+
336
+ ## 📊 ADG Scoring Intuition
337
+
338
+ ADG is built around two complementary signals derived from multiple sampled answers:
339
+
340
+ - **Dispersion magnitude**
341
+ Measures how widely the sampled answers spread in representation space.
342
+
343
+ - **Shape anisotropy**
344
+ Measures whether the spread is multi-directional rather than dominated by a single direction.
345
+
346
+ The final ADG score combines these two parts, and the selected subset is obtained through semantic bin-wise ranking. This design helps avoid collapsing selection into only a few dense instruction regions.
347
+
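To make the two signals more concrete, here is a small pure-Python sketch of one possible formalization. It is an illustration under stated assumptions, not the formulas used in the paper or in `ADG/ADG_llama.py`: dispersion is taken as the mean squared distance of the answer vectors to their centroid, and the directional spread is taken as the participation ratio of the covariance spectrum, computed from the centered Gram matrix. The function names are hypothetical, and how the two signals are combined is deliberately left out because this README does not specify it.

```python
def center(vectors):
    """Subtract the centroid from every answer vector."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
    return [[v[k] - mean[k] for k in range(dim)] for v in vectors]

def gram(centered):
    """Pairwise inner products of the centered vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in centered] for u in centered]

def dispersion_magnitude(vectors):
    """Mean squared distance to the centroid (trace of the covariance)."""
    g = gram(center(vectors))
    return sum(g[i][i] for i in range(len(g))) / len(g)

def directional_spread(vectors):
    """Participation ratio (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    Equals 1.0 when all spread lies along one direction and grows as the
    spread is shared across more directions. Returns 0.0 for identical answers.
    """
    g = gram(center(vectors))
    trace = sum(g[i][i] for i in range(len(g)))
    squared = sum(x * x for row in g for x in row)
    return 0.0 if squared == 0 else trace * trace / squared

# Toy "answer embeddings": one set spread along a line, one spread in two directions.
line_like = [[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
cross_like = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(dispersion_magnitude(line_like), directional_spread(line_like))
print(dispersion_magnitude(cross_like), directional_spread(cross_like))
```

In this toy example the line-shaped set has a directional spread of 1.0 and the cross-shaped set has 2.0, which matches the intuition that the second set varies in more than one direction.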
348
+ ---
349
+
350
+ ## 🛠️ Script Notes
351
+
352
+ ### `generation/generation.py`
353
+ Main functionality:
354
+ - load the base model,
355
+ - sample multiple answers for each instruction,
356
+ - save generated answers in JSONL format,
357
+ - support distributed generation.
358
+
359
+ ### `generation/embedding/embed.py`
360
+ Main functionality:
361
+ - build instruction embeddings,
362
+ - run clustering,
363
+ - save instruction embeddings and cluster labels,
364
+ - provide the semantic bins used by ADG selection.
365
+
366
+ ### `ADG/ADG_llama.py`
367
+ Main functionality:
368
+ - read the generated-answer JSONL file,
369
+ - compute answer-geometry metrics,
370
+ - combine metrics into the ADG score,
371
+ - perform proportional cluster-based selection,
372
+ - save `top.json`, `middle.json`, and `bottom.json`.
373
+
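The proportional cluster-based selection step can be sketched as follows. This is a simplified illustration, assuming each example already has a score and a cluster label; `select_by_bins` is a hypothetical helper, the largest-remainder rounding is an assumption, and only the top-scoring subset is produced here (the middle and bottom subsets are not covered).

```python
from collections import defaultdict

def select_by_bins(scores, clusters, budget):
    """Pick `budget` indices, giving each cluster a share proportional to its size."""
    if len(scores) != len(clusters):
        raise ValueError("scores and clusters must have the same length")
    n = len(scores)
    if n == 0:
        return []
    budget = min(budget, n)

    # Group example indices by cluster label.
    bins = defaultdict(list)
    for idx, label in enumerate(clusters):
        bins[label].append(idx)

    # Integer share per cluster, then hand out leftovers by largest remainder.
    quotas = {label: budget * len(members) // n for label, members in bins.items()}
    leftover = budget - sum(quotas.values())
    by_remainder = sorted(bins, key=lambda lab: (-(budget * len(bins[lab]) % n), lab))
    for label in by_remainder[:leftover]:
        quotas[label] += 1

    # Within each cluster, keep the highest-scoring examples.
    selected = []
    for label, members in bins.items():
        ranked = sorted(members, key=lambda i: scores[i], reverse=True)
        selected.extend(ranked[:quotas[label]])
    return sorted(selected)

scores = [0.9, 0.1, 0.5, 0.8, 0.3, 0.7]
clusters = [0, 0, 0, 0, 1, 1]
print(select_by_bins(scores, clusters, budget=3))
```

Allocating quotas per bin before ranking is what keeps a single dense cluster from absorbing the whole budget.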
374
+ ### `ADG/ADG_qwen.py`
375
+ Main functionality:
376
+ - compute ADG metrics for Qwen-generated answers,
377
+ - support checkpoint-based resumption,
378
+ - perform the same top / middle / bottom selection pipeline.
379
+
380
+ ### `analysis/analyse.py`
381
+ Main functionality:
382
+ - classify instructions into coarse task categories,
383
+ - support optional data-level analysis of selected subsets.
384
+
385
+ ### `train/train_llama.sh` and `train/train_qwen.sh`
386
+ Main functionality:
387
+ - launch distributed full fine-tuning,
388
+ - use the selected subset for instruction tuning.
389
+
390
+ ### `eval/eval.sh`
391
+ Main functionality:
392
+ - run benchmark evaluation with `lm-evaluation-harness`,
393
+ - support reasoning, knowledge, and coding tasks.
394
+
395
+ ---
396
+
397
+ ## ❓ Common Issues
398
+
399
+ ### 1. Path configuration is not updated
400
+ Most scripts use placeholder paths. Update all required paths before running.
401
+
402
+ ### 2. Inconsistent model and intermediate files
403
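+ Make sure the generation backbone, embedding backbone, ADG scoring script, and training script all refer to the same backbone (for example, all LLaMA or all Qwen), and that the intermediate files passed between stages were produced with that backbone.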
404
+
405
+ ### 3. Missing intermediate files
406
+ The selector depends on:
407
+ - generated answer JSONL,
408
+ - instruction embeddings,
409
+ - clustering results.
410
+
411
+ Run the previous stages before starting ADG selection.
412
+
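A quick pre-flight check can catch this before a long scoring run. The sketch below is a hypothetical helper, not part of this repository; the dictionary keys reuse the variable names from the scoring scripts, and the path values are placeholders to be replaced with your own.

```python
from pathlib import Path

def missing_inputs(paths):
    """Return the names whose path does not exist or points to an empty file."""
    missing = []
    for name, path in paths.items():
        p = Path(path)
        if not p.exists() or (p.is_file() and p.stat().st_size == 0):
            missing.append(name)
    return missing

# Placeholder paths: replace with the values configured in the scoring script.
required = {
    "INPUT_JSONL": "/path/to/generated_answers.jsonl",
    "EMBEDDINGS_PATH": "/path/to/instruction_embeddings",
    "CLUSTERS_PATH": "/path/to/cluster_labels",
}
problems = missing_inputs(required)
if problems:
    print("Missing or empty inputs:", ", ".join(problems))
else:
    print("All intermediate files are present.")
```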
413
+ ### 4. GPU memory pressure
414
+ Generation, embedding, and scoring all use hidden-state-based processing. You may need to reduce batch size or adjust GPU allocation depending on your hardware.
415
+
416
+ ### 5. Evaluation dependency is not installed
417
+ `eval/eval.sh` depends on `lm-evaluation-harness`. Install it separately before running evaluation.
418
+
419
+ ---
420
+
421
+ ## 📖 Citation
422
+
423
+ If you use this repository, please cite the paper.
424
+
425
+ ---