---
license: apache-2.0
language:
- zh
- en
base_model:
- Qwen/Qwen2.5-7B
tags:
- ADG
- SFT
---
<div align="center">
<h1>Instruction Data Selection via Answer Divergence</h1>
<p>
<strong>English</strong> | <a href="https://huggingface.co/WisdomShell/ADG-CoT-Qwen2.5-7B/blob/main/README_zh.md">简体中文</a>
</p>
<a href="https://wisdomshell.github.io/ADG/"><img src="https://img.shields.io/badge/Project-Page-green?logo=githubpages&logoColor=white" /></a>
<a href="https://arxiv.org/abs/2604.07892"><img src="https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=white" /></a>
<a href="https://2026.aclweb.org/"><img src="https://img.shields.io/badge/Venue-ACL%202026-blue" /></a>
<a href="#overview"><img src="https://img.shields.io/badge/Task-Data%20Selection-purple.svg" /></a>
<img src="https://img.shields.io/badge/Python-3.10%2B-3776AB?logo=python&logoColor=white" />
**ACL 2026 Main Conference**
<a href="https://deepblue666.github.io/">Bo Li</a>, Mingda Wang, Shikun Zhang, Wei Ye
</div>
This repository releases the core pipeline of **Answer Divergence-Guided Selection (ADG)** for instruction data selection. ADG scores each instruction by the geometric structure of multiple sampled answers, rather than relying on a single reference response. In the paper, ADG consistently improves instruction tuning under a fixed 10K budget across two backbones, three public instruction pools, and six benchmarks spanning reasoning, knowledge, and coding. The method combines **dispersion magnitude** and **shape anisotropy**, then performs **bin-wise selection** for semantic coverage.
---
## 🌟 Overview
Instruction tuning quality depends heavily on which examples are selected under a fixed data budget. ADG addresses this by examining how a base model responds to the same instruction under stochastic decoding.
For each instruction, ADG:
1. samples multiple answers with relatively high-temperature decoding,
2. maps answers into a representation space,
3. computes geometry-aware scores from the sampled answers,
4. ranks examples by the combined score,
5. performs proportional selection within semantic bins.
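Step 5 can be sketched as follows. This is an illustrative sketch only, assuming each example arrives as an `(id, cluster_label, score)` tuple; the exact quota and tie-breaking rules live in the released `ADG_*.py` scripts and may differ.

```python
from collections import defaultdict


def binwise_select(scored, budget):
    """Illustrative sketch of bin-wise proportional selection.

    scored: list of (example_id, cluster_label, adg_score) tuples.
    budget: total number of examples to keep.
    Each semantic bin (cluster) receives a quota proportional to its size,
    and the highest-scoring examples within each bin fill that quota.
    """
    bins = defaultdict(list)
    for ex_id, label, score in scored:
        bins[label].append((ex_id, score))

    total = len(scored)
    selected = []
    for label, items in bins.items():
        # Proportional quota for this bin; rounding policy is a choice.
        quota = round(budget * len(items) / total)
        items.sort(key=lambda x: x[1], reverse=True)
        selected.extend(ex_id for ex_id, _ in items[:quota])
    return selected
```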
This repository provides the practical pipeline for:
- multi-sample answer generation,
- instruction embedding and clustering,
- ADG scoring and subset selection,
- model training,
- benchmark evaluation,
- optional task-type analysis.
---
To use this model, first clone the following repository:
```bash
git clone https://github.com/WisdomShell/ADG.git
```
## 📦 What Is Released
This repository includes the following components:
### Core selection code
- `ADG/ADG_llama.py`
ADG scoring and selection for the LLaMA backbone.
- `ADG/ADG_qwen.py`
ADG scoring and selection for the Qwen backbone.
### Answer generation and instruction embedding
- `generation/generation.py`
Generates multiple sampled answers for each instruction.
- `generation/embedding/embed.py`
Builds instruction embeddings and performs clustering for bin-wise selection.
### Training and evaluation
- `train/train_llama.sh`
Training entry script for LLaMA.
- `train/train_qwen.sh`
Training entry script for Qwen.
- `train/training/stanford_alpaca/`
Training utilities and backbone-specific training scripts.
- `eval/eval.sh`
Evaluation script based on `lm-evaluation-harness`.
### Analysis
- `analysis/analyse.py`
Optional task-type classification script for analyzing selected data.
### Environment
- `requirements.txt`
Required Python packages for this repository.
---
## 🗂️ Repository Structure
```text
.
├── README.md
├── README_zh.md
├── requirements.txt
├── ADG/
│   ├── ADG_llama.py
│   └── ADG_qwen.py
├── generation/
│   ├── generation.py
│   └── embedding/
│       └── embed.py
├── analysis/
│   └── analyse.py
├── eval/
│   └── eval.sh
└── train/
    ├── train_llama.sh
    ├── train_qwen.sh
    └── training/
        └── stanford_alpaca/
            ├── train_llama.py
            ├── train_qwen.py
            ├── utils.py
            └── configs/
```
---
## ⚙️ Installation
We recommend Python 3.10 or above.
Example:
```bash
conda create -n adg python=3.12.9
conda activate adg
pip install -r requirements.txt
```
Depending on your environment, you may also need to install GPU-specific packages separately.
---
## 🧾 Data Format
ADG expects instruction datasets in JSON or JSONL format. Each example should follow the schema below:
```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "input": "",
  "output": "Transformers are neural networks based on self-attention..."
}
```
Notes:
- `id` should uniquely identify each example.
- `instruction` is required.
- `input` is optional and can be empty or omitted.
- `output` is the reference response in the original instruction dataset.
- Other instruction datasets can be used as long as they are converted into this format.
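A minimal conversion helper might look like the following sketch. The raw field names (`instruction`, `input`, `output`) are assumptions about your source dataset; rename the keys as needed.

```python
import json


def to_adg_format(records, out_path):
    """Convert raw instruction records into the schema ADG expects.

    records: iterable of dicts; assumed to carry instruction/input/output-
    like fields (adapt the key lookups to your source dataset).
    Writes a JSON array with sequential integer ids.
    """
    converted = []
    for i, rec in enumerate(records):
        converted.append({
            "id": i,
            "instruction": rec["instruction"],
            "input": rec.get("input", ""),      # optional, may be empty
            "output": rec.get("output", ""),    # reference response
        })
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(converted, f, ensure_ascii=False, indent=2)
```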
After answer generation, the intermediate JSONL file contains records like:
```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "output": "Transformers are neural networks based on self-attention...",
  "generated_answers": [
    "...",
    "...",
    "...",
    "...",
    "..."
  ]
}
```
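Before running the selector, it can help to sanity-check this intermediate file. A small validator, assuming the sample count you configured in `generation.py` (5 in the example above):

```python
import json


def validate_generated(path, expected_samples=5):
    """Return ids of malformed records in the generated-answer JSONL.

    A record is flagged if it lacks an instruction or has fewer sampled
    answers than expected (the sample count is set in generation.py).
    """
    bad = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            answers = rec.get("generated_answers", [])
            if not rec.get("instruction") or len(answers) < expected_samples:
                bad.append(rec.get("id"))
    return bad
```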
---
## 🔄 Pipeline
The practical workflow is:
```text
instruction pool
-> generation/generation.py
-> multi-sample answer JSONL
-> generation/embedding/embed.py
-> instruction embeddings + cluster labels
-> ADG/ADG_llama.py or ADG/ADG_qwen.py
-> top / middle / bottom selected subsets
-> train/train_*.sh
-> finetuned checkpoints
-> eval/eval.sh
```
---
## 🚀 Quick Start
### Step 1. Prepare the instruction pool
Download and preprocess your instruction dataset, such as Alpaca-GPT4, WizardLM, or CoT, into the required format.
### Step 2. Generate multiple answers per instruction
Before running, update the following variables in `generation/generation.py`:
- `MODEL_NAME`
- `OUTPUT_DIR`
- `OUTPUT_FILE`
Then run:
```bash
cd generation
torchrun --nproc_per_node=4 --master_port=29500 generation.py --input_file /path/to/your/instruction_data.json --batch_size 32
```
### Step 3. Build instruction embeddings and clustering results
Before running, update the following variables in `generation/embedding/embed.py`:
- `MODEL_NAME`
- `INPUT_JSONL`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`
Then run:
```bash
torchrun --nproc_per_node=4 --master_port=29501 generation/embedding/embed.py
```
### Step 4. Run ADG scoring and selection
Choose the scoring script that matches your backbone.
For LLaMA, configure these variables in `ADG/ADG_llama.py`:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`
- `FINAL_SELECT_COUNT`
Then run:
```bash
python ADG/ADG_llama.py
```
For Qwen, configure these variables in `ADG/ADG_qwen.py`:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `CHECKPOINT_DIR`
- `FINAL_SELECT_COUNT`
Then run:
```bash
python ADG/ADG_qwen.py
```
The selector saves:
- `top.json`
- `middle.json`
- `bottom.json`
under the configured `OUTPUT_DIR`.
### Step 5. Train the backbone model
Use the selected subset, typically `top.json`, for instruction tuning.
For LLaMA:
```bash
cd train
bash train_llama.sh
```
For Qwen:
```bash
cd train
bash train_qwen.sh
```
Before running, update paths such as:
- `--model_name_or_path`
- `--data_path`
- `--output_dir`
### Step 6. Evaluate the trained checkpoint
This repository uses `lm-evaluation-harness` for benchmark evaluation.
Install it first if needed:
```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```
Then configure `MODEL_PATH` and output paths in `eval/eval.sh`, and run:
```bash
cd eval
bash eval.sh
```
The evaluation script currently includes:
- BBH
- GSM8K
- MMLU
- TruthfulQA
- MBPP
- HumanEval
---
## 📊 ADG Scoring Intuition
ADG is built around two complementary signals derived from multiple sampled answers:
- **Dispersion magnitude**
Measures how widely the sampled answers spread in representation space.
- **Shape anisotropy**
Measures whether the spread is multi-directional rather than dominated by a single direction.
The final ADG score combines these two parts, and the selected subset is obtained through semantic bin-wise ranking. This design helps avoid collapsing selection into only a few dense instruction regions.
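The two signals can be sketched numerically as follows. This is an illustrative sketch, not the paper's exact formulas: it uses mean distance-to-centroid for dispersion and an eigenvalue ratio of the sample covariance for anisotropy (scoring multi-directional spread higher), combined with a hypothetical weight `alpha`.

```python
import numpy as np


def adg_score_sketch(answer_embeddings: np.ndarray, alpha: float = 0.5) -> float:
    """Illustrative ADG-style score for one instruction.

    answer_embeddings: (n_samples, dim) array of sampled-answer
    representations. The released scripts define the real formulas; here:
      - dispersion magnitude: mean distance of samples to their centroid,
      - shape anisotropy: 1 - (top eigenvalue / eigenvalue sum) of the
        sample covariance, so multi-directional spread scores higher.
    """
    centroid = answer_embeddings.mean(axis=0)
    deviations = answer_embeddings - centroid
    dispersion = np.linalg.norm(deviations, axis=1).mean()

    # Sample covariance of the deviations; eigvalsh returns ascending eigenvalues.
    cov = deviations.T @ deviations / max(len(answer_embeddings) - 1, 1)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    total = eigvals.sum()
    anisotropy = 1.0 - eigvals[-1] / total if total > 0 else 0.0

    return alpha * dispersion + (1.0 - alpha) * anisotropy
```

Instructions whose sampled answers spread widely and in many directions score highest under this sketch, matching the intuition above.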
---
## 🛠️ Script Notes
### `generation/generation.py`
Main functionality:
- load the base model,
- sample multiple answers for each instruction,
- save generated answers in JSONL format,
- support distributed generation.
### `generation/embedding/embed.py`
Main functionality:
- build instruction embeddings,
- run clustering,
- save instruction embeddings and cluster labels,
- provide the semantic bins used by ADG selection.
### `ADG/ADG_llama.py`
Main functionality:
- read the generated-answer JSONL file,
- compute answer-geometry metrics,
- combine metrics into the ADG score,
- perform proportional cluster-based selection,
- save `top.json`, `middle.json`, and `bottom.json`.
### `ADG/ADG_qwen.py`
Main functionality:
- compute ADG metrics for Qwen-generated answers,
- support checkpoint-based resumption,
- perform the same top / middle / bottom selection pipeline.
### `analysis/analyse.py`
Main functionality:
- classify instructions into coarse task categories,
- support optional data-level analysis of selected subsets.
### `train/train_llama.sh` and `train/train_qwen.sh`
Main functionality:
- launch distributed full fine-tuning,
- use the selected subset for instruction tuning.
### `eval/eval.sh`
Main functionality:
- run benchmark evaluation with `lm-evaluation-harness`,
- support reasoning, knowledge, and coding tasks.
---
## ❓ Common Issues
### 1. Path configuration is not updated
Most scripts use placeholder paths. Update all required paths before running.
### 2. Inconsistent model and intermediate files
Make sure the generation backbone, embedding backbone, ADG scoring script, and training script are aligned.
### 3. Missing intermediate files
The selector depends on:
- generated answer JSONL,
- instruction embeddings,
- clustering results.
Run the previous stages before starting ADG selection.
### 4. GPU memory pressure
Generation, embedding, and scoring all use hidden-state-based processing. You may need to reduce batch size or adjust GPU allocation depending on your hardware.
### 5. Evaluation dependency is not installed
`eval/eval.sh` depends on `lm-evaluation-harness`. Install it separately before running evaluation.
---
## 📖 Citation
If you use this repository, please cite the paper.
---