Update README.md
README.md
CHANGED
|
@@ -14,7 +14,7 @@ tags:
|
|
| 14 |
|
| 15 |
<h1>Instruction Data Selection via Answer Divergence<h1>
|
| 16 |
<p>
|
| 17 |
-
<strong>English</strong> | <a href="https://huggingface.co/WisdomShell/ADG-
|
| 18 |
</p>
|
| 19 |
|
| 20 |
<a href="https://wisdomshell.github.io/ADG/"><img src="https://img.shields.io/badge/Project-Page-green?logo=githubpages&logoColor=white" /></a>
|
|
@@ -53,83 +53,6 @@ This repository provides the practical pipeline for:
|
|
| 53 |
- benchmark evaluation,
|
| 54 |
- optional task-type analysis.
|
| 55 |
|
| 56 |
-
---
|
| 57 |
-
Use this model ,you need clone follow repository
|
| 58 |
-
```python
|
| 59 |
-
git clone https://github.com/WisdomShell/ADG.git
|
| 60 |
-
```
|
| 61 |
-
|
| 62 |
-
## 📦 What Is Released
|
| 63 |
-
|
| 64 |
-
This repository includes the following components:
|
| 65 |
-
|
| 66 |
-
### Core selection code
|
| 67 |
-
- `ADG/ADG_llama.py`
|
| 68 |
-
ADG scoring and selection for the LLaMA backbone.
|
| 69 |
-
|
| 70 |
-
- `ADG/ADG_qwen.py`
|
| 71 |
-
ADG scoring and selection for the Qwen backbone.
|
| 72 |
-
|
| 73 |
-
### Answer generation and instruction embedding
|
| 74 |
-
- `generation/generation.py`
|
| 75 |
-
Generates multiple sampled answers for each instruction.
|
| 76 |
-
|
| 77 |
-
- `generation/embedding/embed.py`
|
| 78 |
-
Builds instruction embeddings and performs clustering for bin-wise selection.
|
| 79 |
-
|
| 80 |
-
### Training and evaluation
|
| 81 |
-
- `train/train_llama.sh`
|
| 82 |
-
Training entry script for LLaMA.
|
| 83 |
-
|
| 84 |
-
- `train/train_qwen.sh`
|
| 85 |
-
Training entry script for Qwen.
|
| 86 |
-
|
| 87 |
-
- `train/training/stanford_alpaca/`
|
| 88 |
-
Training utilities and backbone-specific training scripts.
|
| 89 |
-
|
| 90 |
-
- `eval/eval.sh`
|
| 91 |
-
Evaluation script based on `lm-evaluation-harness`.
|
| 92 |
-
|
| 93 |
-
### Analysis
|
| 94 |
-
- `analysis/analyse.py`
|
| 95 |
-
Optional task-type classification script for analyzing selected data.
|
| 96 |
-
|
| 97 |
-
### Environment
|
| 98 |
-
- `requirements.txt`
|
| 99 |
-
Required Python packages for this repository.
|
| 100 |
-
|
| 101 |
-
---
|
| 102 |
-
|
| 103 |
-
## 🗂️ Repository Structure
|
| 104 |
-
|
| 105 |
-
```text
|
| 106 |
-
.
|
| 107 |
-
├── README.md
|
| 108 |
-
├── README_zh.md
|
| 109 |
-
├── requirements.txt
|
| 110 |
-
├── ADG/
|
| 111 |
-
│ ├── ADG_llama.py
|
| 112 |
-
│ └── ADG_qwen.py
|
| 113 |
-
├── generation/
|
| 114 |
-
│ ├── generation.py
|
| 115 |
-
│ └── embedding/
|
| 116 |
-
│ └── embed.py
|
| 117 |
-
├── analysis/
|
| 118 |
-
│ └── analyse.py
|
| 119 |
-
├── eval/
|
| 120 |
-
│ └── eval.sh
|
| 121 |
-
└── train/
|
| 122 |
-
├── train_llama.sh
|
| 123 |
-
├── train_qwen.sh
|
| 124 |
-
└── training/
|
| 125 |
-
└── stanford_alpaca/
|
| 126 |
-
├── train_llama.py
|
| 127 |
-
├── train_qwen.py
|
| 128 |
-
├── utils.py
|
| 129 |
-
└── configs/
|
| 130 |
-
```
|
| 131 |
-
|
| 132 |
-
---
|
| 133 |
|
| 134 |
## ⚙️ Installation
|
| 135 |
|
|
@@ -138,6 +61,7 @@ We recommend Python 3.10 or above.
|
|
| 138 |
Example:
|
| 139 |
|
| 140 |
```bash
|
| 141 |
conda create -n adg python=3.12.9
|
| 142 |
conda activate adg
|
| 143 |
pip install -r requirements.txt
|
|
@@ -186,25 +110,6 @@ After answer generation, the intermediate JSONL file contains records like:
|
|
| 186 |
|
| 187 |
---
|
| 188 |
|
| 189 |
-
## 🔄 Pipeline
|
| 190 |
-
|
| 191 |
-
The practical workflow is:
|
| 192 |
-
|
| 193 |
-
```text
|
| 194 |
-
instruction pool
|
| 195 |
-
-> generation/generation.py
|
| 196 |
-
-> multi-sample answer JSONL
|
| 197 |
-
-> generation/embedding/embed.py
|
| 198 |
-
-> instruction embeddings + cluster labels
|
| 199 |
-
-> ADG/ADG_llama.py or ADG/ADG_qwen.py
|
| 200 |
-
-> top / middle / bottom selected subsets
|
| 201 |
-
-> train/train_*.sh
|
| 202 |
-
-> finetuned checkpoints
|
| 203 |
-
-> eval/eval.sh
|
| 204 |
-
```
|
| 205 |
-
|
| 206 |
-
---
|
| 207 |
-
|
| 208 |
## 🚀 Quick Start
|
| 209 |
|
| 210 |
### Step 1. Prepare the instruction pool
|
|
@@ -333,93 +238,15 @@ The evaluation script currently includes:
|
|
| 333 |
|
| 334 |
---
|
| 335 |
|
| 336 |
-
## 📊 ADG Scoring Intuition
|
| 337 |
-
|
| 338 |
-
ADG is built around two complementary signals derived from multiple sampled answers:
|
| 339 |
-
|
| 340 |
-
- **Dispersion magnitude**
|
| 341 |
-
Measures how widely the sampled answers spread in representation space.
|
| 342 |
-
|
| 343 |
-
- **Shape anisotropy**
|
| 344 |
-
Measures whether the spread is multi-directional rather than dominated by a single direction.
|
| 345 |
-
|
| 346 |
-
The final ADG score combines these two parts, and the selected subset is obtained through semantic bin-wise ranking. This design helps avoid collapsing selection into only a few dense instruction regions.
|
| 347 |
-
|
| 348 |
-
---
|
| 349 |
-
|
| 350 |
-
## 🛠️ Script Notes
|
| 351 |
-
|
| 352 |
-
### `generation/generation.py`
|
| 353 |
-
Main functionality:
|
| 354 |
-
- load the base model,
|
| 355 |
-
- sample multiple answers for each instruction,
|
| 356 |
-
- save generated answers in JSONL format,
|
| 357 |
-
- support distributed generation.
|
| 358 |
-
|
| 359 |
-
### `generation/embedding/embed.py`
|
| 360 |
-
Main functionality:
|
| 361 |
-
- build instruction embeddings,
|
| 362 |
-
- run clustering,
|
| 363 |
-
- save instruction embeddings and cluster labels,
|
| 364 |
-
- provide the semantic bins used by ADG selection.
|
| 365 |
-
|
| 366 |
-
### `ADG/ADG_llama.py`
|
| 367 |
-
Main functionality:
|
| 368 |
-
- read the generated-answer JSONL file,
|
| 369 |
-
- compute answer-geometry metrics,
|
| 370 |
-
- combine metrics into the ADG score,
|
| 371 |
-
- perform proportional cluster-based selection,
|
| 372 |
-
- save `top.json`, `middle.json`, and `bottom.json`.
|
| 373 |
-
|
| 374 |
-
### `ADG/ADG_qwen.py`
|
| 375 |
-
Main functionality:
|
| 376 |
-
- compute ADG metrics for Qwen-generated answers,
|
| 377 |
-
- support checkpoint-based resumption,
|
| 378 |
-
- perform the same top / middle / bottom selection pipeline.
|
| 379 |
-
|
| 380 |
-
### `analysis/analyse.py`
|
| 381 |
-
Main functionality:
|
| 382 |
-
- classify instructions into coarse task categories,
|
| 383 |
-
- support optional data-level analysis of selected subsets.
|
| 384 |
-
|
| 385 |
-
### `train/train_llama.sh` and `train/train_qwen.sh`
|
| 386 |
-
Main functionality:
|
| 387 |
-
- launch distributed full fine-tuning,
|
| 388 |
-
- use the selected subset for instruction tuning.
|
| 389 |
-
|
| 390 |
-
### `eval/eval.sh`
|
| 391 |
-
Main functionality:
|
| 392 |
-
- run benchmark evaluation with `lm-evaluation-harness`,
|
| 393 |
-
- support reasoning, knowledge, and coding tasks.
|
| 394 |
-
|
| 395 |
-
---
|
| 396 |
-
|
| 397 |
-
## ❓ Common Issues
|
| 398 |
-
|
| 399 |
-
### 1. Path configuration is not updated
|
| 400 |
-
Most scripts use placeholder paths. Update all required paths before running.
|
| 401 |
-
|
| 402 |
-
### 2. Inconsistent model and intermediate files
|
| 403 |
-
Make sure the generation backbone, embedding backbone, ADG scoring script, and training script are aligned.
|
| 404 |
-
|
| 405 |
-
### 3. Missing intermediate files
|
| 406 |
-
The selector depends on:
|
| 407 |
-
- generated answer JSONL,
|
| 408 |
-
- instruction embeddings,
|
| 409 |
-
- clustering results.
|
| 410 |
-
|
| 411 |
-
Run the previous stages before starting ADG selection.
|
| 412 |
-
|
| 413 |
-
### 4. GPU memory pressure
|
| 414 |
-
Generation, embedding, and scoring all use hidden-state-based processing. You may need to reduce batch size or adjust GPU allocation depending on your hardware.
|
| 415 |
-
|
| 416 |
-
### 5. Evaluation dependency is not installed
|
| 417 |
-
`eval/eval.sh` depends on `lm-evaluation-harness`. Install it separately before running evaluation.
|
| 418 |
-
|
| 419 |
-
---
|
| 420 |
-
|
| 421 |
## 📖 Citation
|
| 422 |
|
| 423 |
-
|
| 424 |
|
| 425 |
---
|
|
|
|
| 14 |
|
| 15 |
<h1>Instruction Data Selection via Answer Divergence<h1>
|
| 16 |
<p>
|
| 17 |
+
<strong>English</strong> | <a href="https://huggingface.co/WisdomShell/ADG-Alpaca-GPT4-LLaMa3-8B/blob/main/README_zh.md">简体中文</a>
|
| 18 |
</p>
|
| 19 |
|
| 20 |
<a href="https://wisdomshell.github.io/ADG/"><img src="https://img.shields.io/badge/Project-Page-green?logo=githubpages&logoColor=white" /></a>
|
|
|
|
| 53 |
- benchmark evaluation,
|
| 54 |
- optional task-type analysis.
|
| 55 |
|
| 56 |
|
| 57 |
## ⚙️ Installation
|
| 58 |
|
|
|
|
| 61 |
Example:
|
| 62 |
|
| 63 |
```bash
|
| 64 |
+
git clone https://github.com/WisdomShell/ADG.git
|
| 65 |
conda create -n adg python=3.12.9
|
| 66 |
conda activate adg
|
| 67 |
pip install -r requirements.txt
|
|
|
|
| 110 |
|
| 111 |
---
|
| 112 |
|
| 113 |
## 🚀 Quick Start
|
| 114 |
|
| 115 |
### Step 1. Prepare the instruction pool
|
|
|
|
| 238 |
|
| 239 |
---
|
| 240 |
|
| 241 |
## 📖 Citation
|
| 242 |
|
| 243 |
+
```bibtex
|
| 244 |
+
@article{li2026instruction,
|
| 245 |
+
title={Instruction Data Selection via Answer Divergence},
|
| 246 |
+
author={Li, Bo and Wang, Mingda and Zhang, Shikun and Ye, Wei},
|
| 247 |
+
journal={arXiv preprint arXiv:2604.10448},
|
| 248 |
+
year={2026}
|
| 249 |
+
}
|
| 250 |
+
```
|
| 251 |
|
| 252 |
---
|
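The "ADG Scoring Intuition" text removed by this change names two signals computed over multiple sampled answers: dispersion magnitude and shape anisotropy. The sketch below illustrates one plausible reading of those two signals and is not taken from `ADG/ADG_llama.py` or `ADG/ADG_qwen.py`. It assumes each sampled answer has already been mapped to a fixed-length vector, takes dispersion as the mean squared distance to the centroid, and uses the participation ratio of the covariance matrix as a stand-in for "multi-directional spread". The function names are hypothetical, and how the repository combines the two signals into the final score is not stated in the README, so no combination is shown.

```python
from typing import List, Sequence

Vector = Sequence[float]


def covariance(points: Sequence[Vector]) -> List[List[float]]:
    """Population covariance matrix of a set of answer vectors."""
    n = len(points)
    dim = len(points[0])
    # Centroid of the sampled answers.
    mu = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - mu[j] for j in range(dim)] for p in points]
    return [
        [sum(c[a] * c[b] for c in centered) / n for b in range(dim)]
        for a in range(dim)
    ]


def dispersion_magnitude(points: Sequence[Vector]) -> float:
    """Assumed definition: mean squared distance to the centroid.

    This equals the trace of the covariance matrix.
    """
    cov = covariance(points)
    return sum(cov[j][j] for j in range(len(cov)))


def multi_directionality(points: Sequence[Vector]) -> float:
    """Assumed stand-in for shape anisotropy: the participation ratio.

    Computed as trace(C)^2 / trace(C^2). It is 1 when all spread lies
    along one direction and grows toward the dimension as the spread
    becomes more evenly multi-directional. Returns 0.0 if there is no
    spread at all.
    """
    cov = covariance(points)
    trace = sum(cov[j][j] for j in range(len(cov)))
    # For a symmetric matrix, trace(C^2) is the sum of squared entries.
    trace_sq = sum(v * v for row in cov for v in row)
    if trace_sq == 0.0:
        return 0.0
    return trace * trace / trace_sq


if __name__ == "__main__":
    # Toy 2-D "answer embeddings": one set on a line, one spread evenly.
    on_a_line = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
    spread_out = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    for name, pts in [("line", on_a_line), ("cross", spread_out)]:
        print(name, dispersion_magnitude(pts), multi_directionality(pts))
```

In this toy case the collinear set has the larger dispersion but a participation ratio of 1, while the evenly spread set reaches the maximum of 2 for two dimensions, which is the distinction the removed paragraph draws between the two signals.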