WarrenWang01 committed
Commit 59d574e · verified · 1 Parent(s): a297daa

Update README.md

Files changed (1)
  1. README.md +10 -183
README.md CHANGED
@@ -14,7 +14,7 @@ tags:
14
 
15
  <h1>Instruction Data Selection via Answer Divergence</h1>
16
  <p>
17
- <strong>English</strong> | <a href="https://huggingface.co/WisdomShell/ADG-WizardLM-LLaMa3-8B/blob/main/README_zh.md">简体中文</a>
18
  </p>
19
 
20
  <a href="https://wisdomshell.github.io/ADG/"><img src="https://img.shields.io/badge/Project-Page-green?logo=githubpages&logoColor=white" /></a>
@@ -53,83 +53,6 @@ This repository provides the practical pipeline for:
53
  - benchmark evaluation,
54
  - optional task-type analysis.
55
 
56
- ---
57
- To use this model, first clone the following repository:
58
- ```bash
59
- git clone https://github.com/WisdomShell/ADG.git
60
- ```
61
-
62
- ## 📦 What Is Released
63
-
64
- This repository includes the following components:
65
-
66
- ### Core selection code
67
- - `ADG/ADG_llama.py`
68
- ADG scoring and selection for the LLaMA backbone.
69
-
70
- - `ADG/ADG_qwen.py`
71
- ADG scoring and selection for the Qwen backbone.
72
-
73
- ### Answer generation and instruction embedding
74
- - `generation/generation.py`
75
- Generates multiple sampled answers for each instruction.
76
-
77
- - `generation/embedding/embed.py`
78
- Builds instruction embeddings and performs clustering for bin-wise selection.
79
-
80
- ### Training and evaluation
81
- - `train/train_llama.sh`
82
- Training entry script for LLaMA.
83
-
84
- - `train/train_qwen.sh`
85
- Training entry script for Qwen.
86
-
87
- - `train/training/stanford_alpaca/`
88
- Training utilities and backbone-specific training scripts.
89
-
90
- - `eval/eval.sh`
91
- Evaluation script based on `lm-evaluation-harness`.
92
-
93
- ### Analysis
94
- - `analysis/analyse.py`
95
- Optional task-type classification script for analyzing selected data.
96
-
97
- ### Environment
98
- - `requirements.txt`
99
- Required Python packages for this repository.
100
-
101
- ---
102
-
103
- ## 🗂️ Repository Structure
104
-
105
- ```text
106
- .
107
- ├── README.md
108
- ├── README_zh.md
109
- ├── requirements.txt
110
- ├── ADG/
111
- │ ├── ADG_llama.py
112
- │ └── ADG_qwen.py
113
- ├── generation/
114
- │ ├── generation.py
115
- │ └── embedding/
116
- │ └── embed.py
117
- ├── analysis/
118
- │ └── analyse.py
119
- ├── eval/
120
- │ └── eval.sh
121
- └── train/
122
- ├── train_llama.sh
123
- ├── train_qwen.sh
124
- └── training/
125
- └── stanford_alpaca/
126
- ├── train_llama.py
127
- ├── train_qwen.py
128
- ├── utils.py
129
- └── configs/
130
- ```
131
-
132
- ---
133
 
134
  ## ⚙️ Installation
135
 
@@ -138,6 +61,7 @@ We recommend Python 3.10 or above.
138
  Example:
139
 
140
  ```bash
141
  conda create -n adg python=3.12.9
142
  conda activate adg
143
  pip install -r requirements.txt
@@ -186,25 +110,6 @@ After answer generation, the intermediate JSONL file contains records like:
186
 
187
  ---
188
 
189
- ## 🔄 Pipeline
190
-
191
- The practical workflow is:
192
-
193
- ```text
194
- instruction pool
195
- -> generation/generation.py
196
- -> multi-sample answer JSONL
197
- -> generation/embedding/embed.py
198
- -> instruction embeddings + cluster labels
199
- -> ADG/ADG_llama.py or ADG/ADG_qwen.py
200
- -> top / middle / bottom selected subsets
201
- -> train/train_*.sh
202
- -> finetuned checkpoints
203
- -> eval/eval.sh
204
- ```
205
-
206
- ---
207
-
208
  ## 🚀 Quick Start
209
 
210
  ### Step 1. Prepare the instruction pool
@@ -333,93 +238,15 @@ The evaluation script currently includes:
333
 
334
  ---
335
 
336
- ## 📊 ADG Scoring Intuition
337
-
338
- ADG is built around two complementary signals derived from multiple sampled answers:
339
-
340
- - **Dispersion magnitude**
341
- Measures how widely the sampled answers spread in representation space.
342
-
343
- - **Shape anisotropy**
344
- Measures whether the spread is multi-directional rather than dominated by a single direction.
345
-
346
- The final ADG score combines these two parts, and the selected subset is obtained through semantic bin-wise ranking. This design helps avoid collapsing selection into only a few dense instruction regions.
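The two signals described above can be pictured with a small numeric sketch. This is illustrative only: the repository's actual metrics live in `ADG/ADG_llama.py` and `ADG/ADG_qwen.py`, and the function names and exact formulas here are assumptions, not the released implementation.

```python
import numpy as np

def dispersion_magnitude(answers: np.ndarray) -> float:
    """Mean distance of sampled-answer embeddings to their centroid.

    `answers` is an (n_samples, dim) array of answer embeddings
    for one instruction; larger values mean wider spread.
    """
    centroid = answers.mean(axis=0)
    return float(np.linalg.norm(answers - centroid, axis=1).mean())

def shape_anisotropy(answers: np.ndarray) -> float:
    """Share of variance NOT captured by the top principal direction.

    Near 0 when the spread is dominated by a single direction,
    larger when it is multi-directional. Expects at least two
    samples in at least two dimensions.
    """
    centered = answers - answers.mean(axis=0)
    # Eigenvalues of the sample covariance, sorted largest first.
    eigvals = np.linalg.eigvalsh(np.cov(centered, rowvar=False))[::-1]
    total = eigvals.sum()
    return float(1.0 - eigvals[0] / total) if total > 0 else 0.0
```

A spread lying entirely along one axis scores near zero anisotropy, while an isotropic spread scores higher, matching the "multi-directional rather than single-direction" intuition above.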
347
-
348
- ---
349
-
350
- ## 🛠️ Script Notes
351
-
352
- ### `generation/generation.py`
353
- Main functionality:
354
- - load the base model,
355
- - sample multiple answers for each instruction,
356
- - save generated answers in JSONL format,
357
- - support distributed generation.
358
-
359
- ### `generation/embedding/embed.py`
360
- Main functionality:
361
- - build instruction embeddings,
362
- - run clustering,
363
- - save instruction embeddings and cluster labels,
364
- - provide the semantic bins used by ADG selection.
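The semantic bins can be pictured as a plain k-means pass over instruction embeddings. This is a self-contained sketch, not the released script: `embed.py`'s embedding model, clustering algorithm, and hyperparameters are defined in the repository, and `cluster_bins` is a hypothetical name.

```python
import numpy as np

def cluster_bins(embeddings: np.ndarray, k: int,
                 iters: int = 20, seed: int = 0) -> np.ndarray:
    """Assign each instruction embedding to one of k semantic bins
    using a minimal k-means loop (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct embeddings.
    centers = embeddings[rng.choice(len(embeddings), size=k, replace=False)]
    labels = np.zeros(len(embeddings), dtype=int)
    for _ in range(iters):
        # Distance from every embedding to every center, then nearest-center labels.
        dists = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned embeddings.
        for c in range(k):
            if (labels == c).any():
                centers[c] = embeddings[labels == c].mean(axis=0)
    return labels
```

The returned labels play the role of the cluster assignments that the ADG selector consumes for bin-wise ranking.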
365
-
366
- ### `ADG/ADG_llama.py`
367
- Main functionality:
368
- - read the generated-answer JSONL file,
369
- - compute answer-geometry metrics,
370
- - combine metrics into the ADG score,
371
- - perform proportional cluster-based selection,
372
- - save `top.json`, `middle.json`, and `bottom.json`.
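The proportional cluster-based selection step can be sketched as below. This is a simplified stand-in for the logic in `ADG/ADG_llama.py`: the real script's split ratios, tie-breaking, and output format may differ, and `split_by_cluster` is a name invented for this sketch.

```python
from collections import defaultdict

def split_by_cluster(scores, labels, frac=1 / 3):
    """Rank items by ADG score within each cluster, then slice each
    cluster into top / middle / bottom, so the selected subset is
    spread across semantic bins instead of concentrating in one
    dense cluster. Returns three lists of item indices."""
    by_cluster = defaultdict(list)
    for idx, (score, cluster) in enumerate(zip(scores, labels)):
        by_cluster[cluster].append((score, idx))
    top, middle, bottom = [], [], []
    for items in by_cluster.values():
        items.sort(reverse=True)               # highest ADG score first
        cut = max(1, round(len(items) * frac))
        top += [i for _, i in items[:cut]]
        middle += [i for _, i in items[cut:len(items) - cut]]
        bottom += [i for _, i in items[len(items) - cut:]]
    return top, middle, bottom
```

Writing the three index lists out as `top.json`, `middle.json`, and `bottom.json` would then mirror the outputs the README describes.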
373
-
374
- ### `ADG/ADG_qwen.py`
375
- Main functionality:
376
- - compute ADG metrics for Qwen-generated answers,
377
- - support checkpoint-based resumption,
378
- - perform the same top / middle / bottom selection pipeline.
379
-
380
- ### `analysis/analyse.py`
381
- Main functionality:
382
- - classify instructions into coarse task categories,
383
- - support optional data-level analysis of selected subsets.
384
-
385
- ### `train/train_llama.sh` and `train/train_qwen.sh`
386
- Main functionality:
387
- - launch distributed full fine-tuning,
388
- - use the selected subset for instruction tuning.
389
-
390
- ### `eval/eval.sh`
391
- Main functionality:
392
- - run benchmark evaluation with `lm-evaluation-harness`,
393
- - support reasoning, knowledge, and coding tasks.
394
-
395
- ---
396
-
397
- ## ❓ Common Issues
398
-
399
- ### 1. Path configuration is not updated
400
- Most scripts use placeholder paths. Update all required paths before running.
401
-
402
- ### 2. Inconsistent model and intermediate files
403
- Make sure the generation backbone, embedding backbone, ADG scoring script, and training script are aligned.
404
-
405
- ### 3. Missing intermediate files
406
- The selector depends on:
407
- - generated answer JSONL,
408
- - instruction embeddings,
409
- - clustering results.
410
-
411
- Run the previous stages before starting ADG selection.
412
-
413
- ### 4. GPU memory pressure
414
- Generation, embedding, and scoring all use hidden-state-based processing. You may need to reduce batch size or adjust GPU allocation depending on your hardware.
415
-
416
- ### 5. Evaluation dependency is not installed
417
- `eval/eval.sh` depends on `lm-evaluation-harness`. Install it separately before running evaluation.
418
-
419
- ---
420
-
421
  ## 📖 Citation
422
 
423
- If you use this repository, please cite the paper.
424
 
425
  ---
 
14
 
15
  <h1>Instruction Data Selection via Answer Divergence</h1>
16
  <p>
17
+ <strong>English</strong> | <a href="https://huggingface.co/WisdomShell/ADG-Alpaca-GPT4-LLaMa3-8B/blob/main/README_zh.md">简体中文</a>
18
  </p>
19
 
20
  <a href="https://wisdomshell.github.io/ADG/"><img src="https://img.shields.io/badge/Project-Page-green?logo=githubpages&logoColor=white" /></a>
 
53
  - benchmark evaluation,
54
  - optional task-type analysis.
55
 
56
 
57
  ## ⚙️ Installation
58
 
 
61
  Example:
62
 
63
  ```bash
64
+ git clone https://github.com/WisdomShell/ADG.git
65
  conda create -n adg python=3.12.9
66
  conda activate adg
67
  pip install -r requirements.txt
 
110
 
111
  ---
112
 
113
  ## 🚀 Quick Start
114
 
115
  ### Step 1. Prepare the instruction pool
 
238
 
239
  ---
240
 
241
  ## 📖 Citation
242
 
243
+ ```bibtex
244
+ @article{li2026instruction,
245
+ title={Instruction Data Selection via Answer Divergence},
246
+ author={Li, Bo and Wang, Mingda and Zhang, Shikun and Ye, Wei},
247
+ journal={arXiv preprint arXiv:2604.10448},
248
+ year={2026}
249
+ }
250
+ ```
251
 
252
  ---