---
license: apache-2.0
language:
- zh
- en
base_model:
- meta-llama/Meta-Llama-3-8B
tags:
- ADG
- SFT
---

<div align="center">

<h1>Instruction Data Selection via Answer Divergence</h1>
<p>
<strong>English</strong> | <a href="https://huggingface.co/WisdomShell/ADG-Alpaca-GPT4-LLaMa3-8B/blob/main/README_zh.md">简体中文</a>
</p>
|
|
<a href="https://wisdomshell.github.io/ADG/"><img src="https://img.shields.io/badge/Project-Page-green?logo=githubpages&logoColor=white" /></a>
<a href="https://arxiv.org/abs/2604.10448"><img src="https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=white" /></a>
<a href="https://2026.aclweb.org/"><img src="https://img.shields.io/badge/Venue-ACL%202026-blue" /></a>
<img src="https://img.shields.io/badge/Python-3.10%2B-3776AB?logo=python&logoColor=white" />
|
|
**ACL 2026 Main Conference**

<a href="https://deepblue666.github.io/">Bo Li</a>, Mingda Wang, Shikun Zhang, Wei Ye

</div>

This repository releases the core pipeline of **Answer Divergence-Guided Selection (ADG)** for instruction data selection. ADG scores each instruction by the geometric structure of multiple sampled answers, rather than relying on a single reference response. In the paper, ADG consistently improves instruction tuning under a fixed 10K budget across two backbones, three public instruction pools, and six benchmarks spanning reasoning, knowledge, and coding. The method combines **dispersion magnitude** and **shape anisotropy**, then performs **bin-wise selection** for semantic coverage.
|
|
---

## 🌟 Overview

Instruction tuning quality depends heavily on which examples are selected under a fixed data budget. ADG addresses this by examining how a base model responds to the same instruction under stochastic decoding.

For each instruction, ADG:

1. samples multiple answers with relatively high-temperature decoding,
2. maps answers into a representation space,
3. computes geometry-aware scores from the sampled answers,
4. ranks examples by the combined score,
5. performs proportional selection within semantic bins.
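The geometry in steps 2–4 can be sketched numerically. The function below is our toy illustration, not the repository's code: it takes an `(n_samples, dim)` array of answer embeddings, measures dispersion magnitude as total variance around the centroid, measures shape anisotropy as the leading-eigenvalue share of that variance, and multiplies the two. The paper's exact score definitions and combination rule may differ.

```python
import numpy as np

def adg_score(answer_embeddings: np.ndarray) -> float:
    """Toy ADG-style score for one instruction.

    answer_embeddings: (n_samples, dim) array of embedded sampled answers.
    Illustrative only: combines total variance (dispersion magnitude) with
    the leading-eigenvalue share of that variance (shape anisotropy).
    """
    centered = answer_embeddings - answer_embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(answer_embeddings) - 1, 1)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # covariance spectrum
    dispersion = eigvals.sum()                             # total variance
    anisotropy = eigvals.max() / (eigvals.sum() + 1e-12)   # leading-eigenvalue share
    return float(dispersion * anisotropy)

rng = np.random.default_rng(0)
tight = rng.normal(scale=0.01, size=(5, 8))   # near-identical sampled answers
spread = rng.normal(scale=1.0, size=(5, 8))   # divergent sampled answers
assert adg_score(spread) > adg_score(tight)
```

Intuitively, near-identical answers collapse to a point (low score), while divergent answers spread out (high score); the anisotropy term distinguishes how that spread is shaped.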
|
|
This repository provides the practical pipeline for:
- multi-sample answer generation,
- instruction embedding and clustering,
- ADG scoring and subset selection,
- model training,
- benchmark evaluation,
- optional task-type analysis.
|
|
|
|
## ⚙️ Installation

We recommend Python 3.10 or above.

Example:

```bash
git clone https://github.com/WisdomShell/ADG.git
conda create -n adg python=3.12.9
conda activate adg
pip install -r requirements.txt
```

Depending on your environment, you may also need to install GPU-specific packages separately.

---
|
|
## 🧾 Data Format

ADG expects instruction datasets in JSON or JSONL format. Each example should follow the schema below:

```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "input": "",
  "output": "Transformers are neural networks based on self-attention..."
}
```

Notes:
- `id` should uniquely identify each example.
- `instruction` is required.
- `input` is optional and can be empty or omitted.
- `output` is the reference response in the original instruction dataset.
- Other instruction datasets can be used as long as they are converted into this format.
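A small converter is often enough to adapt other datasets to this schema. The sketch below is ours (the function name `to_adg_format` and its keyword parameters are not part of the repository); it maps arbitrarily keyed records into the expected fields and enforces the non-empty `instruction` requirement:

```python
import json

def to_adg_format(records, instruction_key="instruction",
                  input_key="input", output_key="output"):
    """Convert raw records into the JSON schema expected by ADG.

    Key names are configurable because source datasets differ.
    """
    converted = []
    for i, rec in enumerate(records):
        if not rec.get(instruction_key):
            raise ValueError(f"record {i} is missing a non-empty instruction")
        converted.append({
            "id": i,                            # sequential unique id
            "instruction": rec[instruction_key],
            "input": rec.get(input_key, ""),    # optional, defaults to empty
            "output": rec.get(output_key, ""),  # reference response
        })
    return converted

raw = [{"instruction": "Define attention.", "output": "Attention weights..."}]
print(json.dumps(to_adg_format(raw), indent=2))
```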
|
|
After answer generation, the intermediate JSONL file contains records like:

```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "output": "Transformers are neural networks based on self-attention...",
  "generated_answers": [
    "...",
    "...",
    "...",
    "...",
    "..."
  ]
}
```
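Because the file is JSONL, each line is an independent JSON object and can be sanity-checked with the standard library alone. In this sketch an `io.StringIO` stands in for the real file handle:

```python
import io
import json

# Stand-in for open("generated.jsonl"): one record with 5 sampled answers.
jsonl = io.StringIO(
    '{"id": 0, "instruction": "Write a short explanation of transformers.", '
    '"output": "...", "generated_answers": ["a1", "a2", "a3", "a4", "a5"]}\n'
)
records = [json.loads(line) for line in jsonl if line.strip()]
# Verify every record carries the full set of sampled answers.
assert all(len(r["generated_answers"]) == 5 for r in records)
print(f"loaded {len(records)} record(s)")
```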
|
|
---

## 🚀 Quick Start

### Step 1. Prepare the instruction pool

Download and preprocess your instruction dataset, such as Alpaca-GPT4, WizardLM, or CoT, into the required format.

### Step 2. Generate multiple answers per instruction

Before running, update the following variables in `generation/generation.py`:
- `MODEL_NAME`
- `OUTPUT_DIR`
- `OUTPUT_FILE`

Then run:

```bash
cd generation
torchrun --nproc_per_node=4 --master_port=29500 generation.py --input_file /path/to/your/instruction_data.json --batch_size 32
```
|
|
### Step 3. Build instruction embeddings and clustering results

Before running, update the following variables in `generation/embedding/embed.py`:
- `MODEL_NAME`
- `INPUT_JSONL`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`

Then run:

```bash
torchrun --nproc_per_node=4 --master_port=29501 generation/embedding/embed.py
```
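Conceptually, this step embeds each instruction and partitions the pool into `K_CLUSTERS` semantic bins. The repository's `embed.py` handles real embedding models and distributed execution; the NumPy-only sketch below (our code, using toy vectors in place of real embeddings) only illustrates the k-means clustering idea:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means: returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy "instruction embeddings": two well-separated semantic groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (10, 4)), rng.normal(5, 0.1, (10, 4))])
labels = kmeans(X, k=2)
# Each group should land entirely in one cluster.
assert len(set(labels[:10].tolist())) == 1
assert len(set(labels[10:].tolist())) == 1
```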
|
|
### Step 4. Run ADG scoring and selection

Choose the scoring script that matches your backbone.

For LLaMA, configure these variables in `ADG/ADG_llama.py`:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`
- `FINAL_SELECT_COUNT`

Then run:

```bash
python ADG/ADG_llama.py
```

For Qwen, configure these variables in `ADG/ADG_qwen.py`:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `CHECKPOINT_DIR`
- `FINAL_SELECT_COUNT`

Then run:

```bash
python ADG/ADG_qwen.py
```

The selector saves:
- `top.json`
- `middle.json`
- `bottom.json`

under the configured `OUTPUT_DIR`.
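The bin-wise selection described above allocates the final budget (`FINAL_SELECT_COUNT`) across semantic clusters in proportion to cluster size, then keeps the top-scored examples within each bin. The sketch below is our illustration of that idea (the function name, the rounding-based quota rule, and the tie-breaking are ours; the repository's exact rule may differ):

```python
from collections import defaultdict

def binwise_select(scores, clusters, budget):
    """Select `budget` example ids, proportionally per cluster, highest score first.

    scores:   {example_id: adg_score}
    clusters: {example_id: cluster_id}
    """
    bins = defaultdict(list)
    for ex_id, cluster_id in clusters.items():
        bins[cluster_id].append(ex_id)
    total = len(clusters)
    selected = []
    for cluster_id, members in bins.items():
        # Proportional quota, at least one example per bin.
        quota = max(1, round(budget * len(members) / total))
        members.sort(key=lambda i: scores[i], reverse=True)
        selected.extend(members[:quota])
    return selected[:budget]

scores = {0: 0.9, 1: 0.1, 2: 0.8, 3: 0.7, 4: 0.2, 5: 0.6}
clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
picked = binwise_select(scores, clusters, budget=4)
assert len(picked) == 4
assert 0 in picked and 3 in picked  # top-scored item of each bin is kept
```

Binning preserves semantic coverage: a purely global top-k by score could concentrate the whole budget in a few clusters.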
|
|
### Step 5. Train the backbone model

Use the selected subset, typically `top.json`, for instruction tuning. Before running, update paths in the training script such as:
- `--model_name_or_path`
- `--data_path`
- `--output_dir`

For LLaMA:

```bash
cd train
bash train_llama.sh
```

For Qwen:

```bash
cd train
bash train_qwen.sh
```
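Before launching a training run, it can save time to verify that the file passed as `--data_path` still conforms to the data format. The helper below is a hypothetical convenience of ours, not part of the repository:

```python
import json
import tempfile

def check_subset(path):
    """Sanity-check a selected subset (e.g. top.json) before training."""
    with open(path) as f:
        data = json.load(f)
    assert isinstance(data, list) and data, "expected a non-empty JSON list"
    for rec in data:
        assert rec.get("instruction"), "every record needs a non-empty instruction"
    return len(data)

# Demo on a temporary file standing in for OUTPUT_DIR/top.json.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump([{"id": 0, "instruction": "Explain attention.", "output": "..."}], f)
print(check_subset(f.name))  # number of training examples
```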
|
|
### Step 6. Evaluate the trained checkpoint

This repository uses `lm-evaluation-harness` for benchmark evaluation.

Install it first if needed:

```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```

Then configure `MODEL_PATH` and output paths in `eval/eval.sh`, and run:

```bash
cd eval
bash eval.sh
```

The evaluation script currently includes:
- BBH
- GSM8K
- MMLU
- TruthfulQA
- MBPP
- HumanEval
|
---

## 📖 Citation

```bibtex
@article{li2026instruction,
  title={Instruction Data Selection via Answer Divergence},
  author={Li, Bo and Wang, Mingda and Zhang, Shikun and Ye, Wei},
  journal={arXiv preprint arXiv:2604.10448},
  year={2026}
}
```

---