---
license: apache-2.0
language:
- zh
- en
base_model:
- meta-llama/Meta-Llama-3-8B
tags:
- ADG
- SFT
---
<div align="center">
<h1>Instruction Data Selection via Answer Divergence</h1>
<p>
<strong>English</strong> | <a href="https://huggingface.co/WisdomShell/ADG-Alpaca-GPT4-LLaMa3-8B/blob/main/README_zh.md">简体中文</a>
</p>
<a href="https://wisdomshell.github.io/ADG/"><img src="https://img.shields.io/badge/Project-Page-green?logo=githubpages&logoColor=white" /></a>
<a href="https://arxiv.org/abs/2604.10448"><img src="https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=white" /></a>
<a href="https://2026.aclweb.org/"><img src="https://img.shields.io/badge/Venue-ACL%202026-blue" /></a>
<img src="https://img.shields.io/badge/Python-3.10%2B-3776AB?logo=python&logoColor=white" />
**ACL 2026 Main Conference**
<a href="https://deepblue666.github.io/">Bo Li</a>, Mingda Wang, Shikun Zhang, Wei Ye
</div>
This repository releases the core pipeline of **Answer Divergence-Guided Selection (ADG)** for instruction data selection. ADG scores each instruction by the geometric structure of multiple sampled answers, rather than relying on a single reference response. In the paper, ADG consistently improves instruction tuning under a fixed 10K budget across two backbones, three public instruction pools, and six benchmarks spanning reasoning, knowledge, and coding. The method combines **dispersion magnitude** and **shape anisotropy**, then performs **bin-wise selection** for semantic coverage.
---
## 🌟 Overview
Instruction tuning quality depends heavily on which examples are selected under a fixed data budget. ADG addresses this by examining how a base model responds to the same instruction under stochastic decoding.
For each instruction, ADG:
1. samples multiple answers with relatively high-temperature decoding,
2. maps answers into a representation space,
3. computes geometry-aware scores from the sampled answers,
4. ranks examples by the combined score,
5. performs proportional selection within semantic bins.
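As a rough illustration of step 3 (not the released implementation), the two geometry signals can be sketched in NumPy: dispersion magnitude as the mean distance of sampled-answer embeddings to their centroid, and shape anisotropy as the share of variance along the top principal direction. The function name `adg_geometry_scores` and the exact score definitions below are illustrative assumptions; see the paper for the precise formulation.

```python
import numpy as np

def adg_geometry_scores(answer_embs: np.ndarray) -> tuple[float, float]:
    """Illustrative geometry scores for one instruction's sampled answers.

    answer_embs: (n_samples, dim) embeddings of the sampled answers.
    Returns (dispersion, anisotropy). The exact definitions in the paper
    may differ; this is a sketch of the idea.
    """
    centroid = answer_embs.mean(axis=0)
    # Dispersion magnitude: average distance of samples to their centroid.
    dispersion = float(np.linalg.norm(answer_embs - centroid, axis=1).mean())
    # Shape anisotropy: fraction of total variance captured by the top
    # principal direction of the sample cloud.
    cov = np.cov(answer_embs, rowvar=False)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    anisotropy = float(eigvals[-1] / (eigvals.sum() + 1e-12))
    return dispersion, anisotropy

# Example: answers spread along a single line are highly anisotropic.
rng = np.random.default_rng(0)
line = np.outer(rng.normal(size=8), np.array([1.0, 0.0, 0.0]))
d, a = adg_geometry_scores(line)
```

Intuitively, a high-dispersion, high-anisotropy instruction is one where the base model's answers both vary a lot and vary along a dominant direction.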
This repository provides the practical pipeline for:
- multi-sample answer generation,
- instruction embedding and clustering,
- ADG scoring and subset selection,
- model training,
- benchmark evaluation,
- optional task-type analysis.
## ⚙️ Installation
We recommend Python 3.10 or above.
Example:
```bash
git clone https://github.com/WisdomShell/ADG.git
cd ADG
conda create -n adg python=3.12.9
conda activate adg
pip install -r requirements.txt
```
Depending on your environment, you may also need to install GPU-specific packages separately.
---
## 🧾 Data Format
ADG expects instruction datasets in JSON or JSONL format. Each example should follow the schema below:
```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "input": "",
  "output": "Transformers are neural networks based on self-attention..."
}
```
Notes:
- `id` should uniquely identify each example.
- `instruction` is required.
- `input` is optional and can be empty or omitted.
- `output` is the reference response in the original instruction dataset.
- Other instruction datasets can be used as long as they are converted into this format.
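A quick way to check that a converted dataset matches this schema before running the pipeline (the helper name `validate_example` is just for illustration):

```python
REQUIRED = ("id", "instruction", "output")

def validate_example(ex: dict) -> list[str]:
    """Return a list of schema problems for one example (empty list = OK)."""
    problems = []
    for key in REQUIRED:
        if key not in ex:
            problems.append(f"missing field: {key}")
    # `instruction` must be a non-empty string.
    if not isinstance(ex.get("instruction", ""), str) or not ex.get("instruction"):
        problems.append("instruction must be a non-empty string")
    # `input` is optional; if present it should be a string (possibly empty).
    if "input" in ex and not isinstance(ex["input"], str):
        problems.append("input must be a string when present")
    return problems

ok = {"id": 0, "instruction": "Explain transformers.", "input": "", "output": "..."}
bad = {"instruction": ""}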
After answer generation, the intermediate JSONL file contains records like:
```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "output": "Transformers are neural networks based on self-attention...",
  "generated_answers": [
    "...",
    "...",
    "...",
    "...",
    "..."
  ]
}
```
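A small sanity check on the generated file helps catch records where decoding produced the wrong number of samples. The helper name and the expected count of five answers are assumptions here; the in-memory `StringIO` stands in for the real JSONL file:

```python
import io
import json

def check_generation_file(fh, expected_answers: int = 5):
    """Yield ids of records whose `generated_answers` has the wrong count."""
    for line in fh:
        rec = json.loads(line)
        answers = rec.get("generated_answers", [])
        if len(answers) != expected_answers:
            yield rec["id"]

# In-memory example standing in for the real generated JSONL file.
good = {"id": 0, "instruction": "q", "output": "a", "generated_answers": ["x"] * 5}
bad = {"id": 1, "instruction": "q", "output": "a", "generated_answers": ["x"] * 3}
fh = io.StringIO(json.dumps(good) + "\n" + json.dumps(bad) + "\n")
bad_ids = list(check_generation_file(fh))
```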
---
## 🚀 Quick Start
### Step 1. Prepare the instruction pool
Download and preprocess your instruction dataset, such as Alpaca-GPT4, WizardLM, or CoT, into the required format.
### Step 2. Generate multiple answers per instruction
Before running, update the following variables in `generation/generation.py`:
- `MODEL_NAME`
- `OUTPUT_DIR`
- `OUTPUT_FILE`
Then run:
```bash
cd generation
torchrun --nproc_per_node=4 --master_port=29500 generation.py --input_file /path/to/your/instruction_data.json --batch_size 32
```
### Step 3. Build instruction embeddings and clustering results
Before running, update the following variables in `generation/embedding/embed.py`:
- `MODEL_NAME`
- `INPUT_JSONL`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`
Then run:
```bash
torchrun --nproc_per_node=4 --master_port=29501 generation/embedding/embed.py
```
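The clustering step can be sketched without the real embedding model. Below, random blobs stand in for instruction embeddings, and a tiny NumPy k-means plays the role of whatever clustering `embed.py` actually uses; the deterministic initialization (k points spread across the dataset) is a simplification of a real k-means++ style init:

```python
import numpy as np

def kmeans_lite(X: np.ndarray, k: int, iters: int = 20) -> np.ndarray:
    """Tiny k-means sketch; returns a cluster label per row of X."""
    # Deterministic init: k points spread across the dataset
    # (a real run would use k-means++ or similar).
    centers = X[np.linspace(0, len(X) - 1, k, dtype=int)].copy()
    for _ in range(iters):
        # Assign each embedding to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster emptied out.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Stand-in for instruction embeddings: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(5, 0.1, (20, 8))])
labels = kmeans_lite(X, k=2)
```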
### Step 4. Run ADG scoring and selection
Choose the scoring script that matches your backbone.
For LLaMA, configure these variables in `ADG/ADG_llama.py`:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`
- `FINAL_SELECT_COUNT`
Then run:
```bash
python ADG/ADG_llama.py
```
For Qwen, configure these variables in `ADG/ADG_qwen.py`:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `CHECKPOINT_DIR`
- `FINAL_SELECT_COUNT`
Then run:
```bash
python ADG/ADG_qwen.py
```
The selector saves:
- `top.json`
- `middle.json`
- `bottom.json`
under the configured `OUTPUT_DIR`.
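The bin-wise selection idea can be illustrated as follows: within each semantic cluster ("bin"), examples are ranked by their ADG score, and each bin receives a quota proportional to its size, so the selected subset keeps semantic coverage. The function name and the exact rounding policy are illustrative assumptions:

```python
from collections import defaultdict

def binwise_select(ids, scores, bins, budget):
    """Pick `budget` ids total, proportionally per bin, highest score first."""
    by_bin = defaultdict(list)
    for i, s, b in zip(ids, scores, bins):
        by_bin[b].append((s, i))
    total = len(ids)
    selected = []
    for b, items in sorted(by_bin.items()):
        # Per-bin quota proportional to bin size.
        quota = round(budget * len(items) / total)
        items.sort(reverse=True)  # highest ADG score first
        selected.extend(i for _, i in items[:quota])
    return selected[:budget]

ids = list(range(10))
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 0.0]
bins = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # bin 0 has 6 items, bin 1 has 4
picked = binwise_select(ids, scores, bins, budget=5)
```

With a budget of 5, bin 0 contributes its top 3 examples and bin 1 its top 2, so lower-scoring bins are still represented rather than being crowded out by a single dense cluster.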
### Step 5. Train the backbone model
Use the selected subset, typically `top.json`, for instruction tuning.
For LLaMA:
```bash
cd train
bash train_llama.sh
```
For Qwen:
```bash
cd train
bash train_qwen.sh
```
Before running, update paths such as:
- `--model_name_or_path`
- `--data_path`
- `--output_dir`
### Step 6. Evaluate the trained checkpoint
This repository uses `lm-evaluation-harness` for benchmark evaluation.
Install it first if needed:
```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```
Then configure `MODEL_PATH` and output paths in `eval/eval.sh`, and run:
```bash
cd eval
bash eval.sh
```
The evaluation script currently includes:
- BBH
- GSM8K
- MMLU
- TruthfulQA
- MBPP
- HumanEval
---
## 📖 Citation
```bibtex
@article{li2026instruction,
title={Instruction Data Selection via Answer Divergence},
author={Li, Bo and Wang, Mingda and Zhang, Shikun and Ye, Wei},
journal={arXiv preprint arXiv:2604.10448},
year={2026}
}
```
---