---
license: apache-2.0
language:
- zh
- en
base_model:
- Qwen/Qwen2.5-7B
tags:
- ADG
- SFT
---

<div align="center">
  
<h1>Instruction Data Selection via Answer Divergence</h1>
  <p>
  <strong>English</strong> | <a href="https://huggingface.co/WisdomShell/ADG-CoT-Qwen2.5-7B/blob/main/README_zh.md">简体中文</a>
</p>

<a href="https://wisdomshell.github.io/ADG/"><img src="https://img.shields.io/badge/Project-Page-green?logo=githubpages&logoColor=white" /></a>
<a href="https://arxiv.org/abs/2604.07892"><img src="https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=white" /></a>
<a href="https://2026.aclweb.org/"><img src="https://img.shields.io/badge/Venue-ACL%202026-blue" /></a>
[![Task](https://img.shields.io/badge/Task-Data%20Selection-purple.svg)](#overview)
<img src="https://img.shields.io/badge/Python-3.10%2B-3776AB?logo=python&logoColor=white" />

**ACL 2026 Main Conference**

<a href="https://deepblue666.github.io/">Bo Li</a>, Mingda Wang, Shikun Zhang, Wei Ye

</div>

This repository releases the core pipeline of **Answer Divergence-Guided Selection (ADG)** for instruction data selection. ADG scores each instruction by the geometric structure of multiple sampled answers, rather than relying on a single reference response. In the paper, ADG consistently improves instruction tuning under a fixed 10K budget across two backbones, three public instruction pools, and six benchmarks spanning reasoning, knowledge, and coding. The method combines **dispersion magnitude** and **shape anisotropy**, then performs **bin-wise selection** for semantic coverage. 

---

## 🌟 Overview

Instruction tuning quality depends heavily on which examples are selected under a fixed data budget. ADG addresses this by examining how a base model responds to the same instruction under stochastic decoding.

For each instruction, ADG:

1. samples multiple answers with relatively high-temperature decoding,
2. maps answers into a representation space,
3. computes geometry-aware scores from the sampled answers,
4. ranks examples by the combined score,
5. performs proportional selection within semantic bins.
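Step 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `ADG/ADG_llama.py` or `ADG/ADG_qwen.py`: the function name `select_by_bins` and its dictionary inputs are hypothetical, a higher score is assumed to be better, and each bin's quota is simply rounded from its share of the pool.

```python
from collections import defaultdict


def select_by_bins(scores, bins, budget):
    """Pick roughly `budget` ids, giving each bin a share proportional to its size.

    scores: dict mapping example id -> combined score (assumed higher is better)
    bins:   dict mapping example id -> semantic bin label
    """
    groups = defaultdict(list)
    for ex_id, score in scores.items():
        groups[bins[ex_id]].append((score, ex_id))

    total = len(scores)
    selected = []
    for label in sorted(groups):
        # Rank the members of this bin from highest to lowest score.
        members = sorted(groups[label], reverse=True)
        # Quota proportional to the bin size; rounding means the final
        # count can differ slightly from `budget`.
        quota = round(budget * len(members) / total)
        selected.extend(ex_id for _, ex_id in members[:quota])
    return selected


scores = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.7, 4: 0.2, 5: 0.8}
bins = {0: "a", 1: "a", 2: "a", 3: "a", 4: "b", 5: "b"}
print(select_by_bins(scores, bins, budget=3))
```

Because every bin contributes in proportion to its size, a small bin still receives some of the budget even when its scores are lower than those in a large bin.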

This repository provides the practical pipeline for:
- multi-sample answer generation,
- instruction embedding and clustering,
- ADG scoring and subset selection,
- model training,
- benchmark evaluation,
- optional task-type analysis.

---

To use this model, first clone the following repository:

```bash
git clone https://github.com/WisdomShell/ADG.git
```

## 📦 What Is Released

This repository includes the following components:

### Core selection code
- `ADG/ADG_llama.py`  
  ADG scoring and selection for the LLaMA backbone.

- `ADG/ADG_qwen.py`  
  ADG scoring and selection for the Qwen backbone.

### Answer generation and instruction embedding
- `generation/generation.py`  
  Generates multiple sampled answers for each instruction.

- `generation/embedding/embed.py`  
  Builds instruction embeddings and performs clustering for bin-wise selection.

### Training and evaluation
- `train/train_llama.sh`  
  Training entry script for LLaMA.

- `train/train_qwen.sh`  
  Training entry script for Qwen.

- `train/training/stanford_alpaca/`  
  Training utilities and backbone-specific training scripts.

- `eval/eval.sh`  
  Evaluation script based on `lm-evaluation-harness`.

### Analysis
- `analysis/analyse.py`  
  Optional task-type classification script for analyzing selected data.

### Environment
- `requirements.txt`  
  Required Python packages for this repository.

---

## 🗂️ Repository Structure

```text
.
├── README.md
├── README_zh.md
├── requirements.txt
├── ADG/
│   ├── ADG_llama.py
│   └── ADG_qwen.py
├── generation/
│   ├── generation.py
│   └── embedding/
│       └── embed.py
├── analysis/
│   └── analyse.py
├── eval/
│   └── eval.sh
└── train/
    ├── train_llama.sh
    ├── train_qwen.sh
    └── training/
        └── stanford_alpaca/
            ├── train_llama.py
            ├── train_qwen.py
            ├── utils.py
            └── configs/
```

---

## ⚙️ Installation

We recommend Python 3.10 or above.

Example:

```bash
conda create -n adg python=3.12.9
conda activate adg
pip install -r requirements.txt
```

Depending on your environment, you may also need to install GPU-specific packages separately.

---

## 🧾 Data Format

ADG expects instruction datasets in JSON or JSONL format. Each example should follow the schema below:

```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "input": "",
  "output": "Transformers are neural networks based on self-attention..."
}
```

Notes:
- `id` should uniquely identify each example.
- `instruction` is required.
- `input` is optional and can be empty or omitted.
- `output` is the reference response in the original instruction dataset.
- Other instruction datasets can be used as long as they are converted into this format.
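For datasets that do not already match this schema, a small converter along the following lines may help. It is a sketch only: the helper names `to_adg_record`, `convert`, and `to_jsonl` are hypothetical and not part of this repository, and it assumes the raw examples already use `instruction`, `input`, and `output` keys.

```python
import json


def to_adg_record(idx, raw):
    """Map one raw example onto the id / instruction / input / output schema."""
    instruction = raw.get("instruction")
    if not isinstance(instruction, str) or not instruction.strip():
        raise ValueError(f"example {idx} has no instruction")
    return {
        "id": raw.get("id", idx),          # fall back to the position in the list
        "instruction": instruction,
        "input": raw.get("input") or "",   # optional field, empty when missing
        "output": raw.get("output") or "",
    }


def convert(raw_examples):
    """Convert a list of raw examples and check that ids are unique."""
    records = [to_adg_record(i, ex) for i, ex in enumerate(raw_examples)]
    ids = [r["id"] for r in records]
    if len(set(ids)) != len(ids):
        raise ValueError("ids are not unique")
    return records


def to_jsonl(records):
    """Serialize records as JSONL text, one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


raw = [
    {"instruction": "Say hi", "output": "hi"},
    {"id": 7, "instruction": "Add", "input": "1+1", "output": "2"},
]
print(to_jsonl(convert(raw)))
```

`ensure_ascii=False` keeps non-ASCII text (for example Chinese instructions) readable in the output file.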

After answer generation, the intermediate JSONL file contains records like:

```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "output": "Transformers are neural networks based on self-attention...",
  "generated_answers": [
    "...",
    "...",
    "...",
    "...",
    "..."
  ]
}
```
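Before moving on to scoring, it can be useful to confirm that every record actually carries several sampled answers. The check below is a sketch: the function `check_generated` and the `min_answers` threshold are hypothetical, and it only assumes the `id` and `generated_answers` fields shown above.

```python
import json


def check_generated(lines, min_answers=2):
    """Return ids of records with fewer than `min_answers` non-empty answers."""
    short = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL file
        rec = json.loads(line)
        answers = [a for a in rec.get("generated_answers", []) if a.strip()]
        if len(answers) < min_answers:
            short.append(rec["id"])
    return short


sample = [
    '{"id": 0, "generated_answers": ["a", "b", "c"]}',
    '{"id": 1, "generated_answers": ["a", " "]}',
    '{"id": 2}',
]
print(check_generated(sample))
```

With a real file, pass the open file handle instead of a list, for example `check_generated(open(path, encoding="utf-8"))`.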

---

## 🔄 Pipeline

The practical workflow is:

```text
instruction pool
    -> generation/generation.py
    -> multi-sample answer JSONL
    -> generation/embedding/embed.py
    -> instruction embeddings + cluster labels
    -> ADG/ADG_llama.py or ADG/ADG_qwen.py
    -> top / middle / bottom selected subsets
    -> train/train_*.sh
    -> finetuned checkpoints
    -> eval/eval.sh
```

---

## 🚀 Quick Start

### Step 1. Prepare the instruction pool

Download and preprocess your instruction dataset, such as Alpaca-GPT4, WizardLM, or CoT, into the required format.

### Step 2. Generate multiple answers per instruction

Before running, update the following variables in `generation/generation.py`:
- `MODEL_NAME`
- `OUTPUT_DIR`
- `OUTPUT_FILE`

Then run:

```bash
cd generation
torchrun --nproc_per_node=4 --master_port=29500 generation.py \
  --input_file /path/to/your/instruction_data.json \
  --batch_size 32
```

### Step 3. Build instruction embeddings and clustering results

Before running, update the following variables in `generation/embedding/embed.py`:
- `MODEL_NAME`
- `INPUT_JSONL`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`

Then run from the repository root (not from `generation/`):

```bash
torchrun --nproc_per_node=4 --master_port=29501 generation/embedding/embed.py
```

### Step 4. Run ADG scoring and selection

Choose the scoring script that matches your backbone.

For LLaMA, configure these variables in `ADG/ADG_llama.py`:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`
- `FINAL_SELECT_COUNT`

Then run:

```bash
python ADG/ADG_llama.py
```

For Qwen, configure these variables in `ADG/ADG_qwen.py`:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `CHECKPOINT_DIR`
- `FINAL_SELECT_COUNT`

Then run:

```bash
python ADG/ADG_qwen.py
```

The selector saves:
- `top.json`
- `middle.json`
- `bottom.json`

under the configured `OUTPUT_DIR`.

### Step 5. Train the backbone model

Use the selected subset, typically `top.json`, for instruction tuning.

For LLaMA:

```bash
cd train
bash train_llama.sh
```

For Qwen:

```bash
cd train
bash train_qwen.sh
```

Before running, update paths such as:
- `--model_name_or_path`
- `--data_path`
- `--output_dir`

### Step 6. Evaluate the trained checkpoint

This repository uses `lm-evaluation-harness` for benchmark evaluation.

Install it first if needed:

```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```

Then configure `MODEL_PATH` and output paths in `eval/eval.sh`, and run:

```bash
cd eval
bash eval.sh
```

The evaluation script currently includes:
- BBH
- GSM8K
- MMLU
- TruthfulQA
- MBPP
- HumanEval

---

## 📊 ADG Scoring Intuition

ADG is built around two complementary signals derived from multiple sampled answers:

- **Dispersion magnitude**  
  Measures how widely the sampled answers spread in representation space.

- **Shape anisotropy**  
  Measures whether the spread is multi-directional rather than dominated by a single direction.

The final ADG score combines these two parts, and the selected subset is obtained through semantic bin-wise ranking. This design helps avoid collapsing selection into only a few dense instruction regions.
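To make the two signals concrete, the sketch below computes one simple proxy for each from a handful of answer embeddings. These are illustrative stand-ins, not the formulas implemented in `ADG/ADG_llama.py` or `ADG/ADG_qwen.py`: dispersion is approximated by the mean squared distance to the centroid, and the shape signal by the participation ratio of the spread (an "effective number of directions"). The function names are hypothetical, and the code uses only the standard library for readability.

```python
def centered(vectors):
    """Subtract the centroid from every answer embedding."""
    k, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / k for j in range(d)]
    return [[v[j] - mean[j] for j in range(d)] for v in vectors]


def dispersion_proxy(vectors):
    """Mean squared distance of the answers to their centroid."""
    xc = centered(vectors)
    return sum(x * x for row in xc for x in row) / len(xc)


def effective_directions(vectors):
    """Participation ratio (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    Close to 1 when a single direction dominates the spread, larger when
    the spread is shared across several directions.
    """
    xc = centered(vectors)
    # The Gram matrix has the same non-zero eigenvalues as the scatter matrix,
    # so its trace and squared Frobenius norm give the two sums directly.
    gram = [[sum(a * b for a, b in zip(r, s)) for s in xc] for r in xc]
    trace = sum(gram[i][i] for i in range(len(gram)))
    frob_sq = sum(g * g for row in gram for g in row)
    return trace * trace / frob_sq if frob_sq > 0 else 0.0


# Answers spread along one line versus in two orthogonal directions.
line = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
cross = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(dispersion_proxy(line), effective_directions(line))
print(dispersion_proxy(cross), effective_directions(cross))
```

In this toy case the two sets have similar dispersion, but only the second spreads in more than one direction, which is the distinction the shape signal is meant to capture.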

---

## 🛠️ Script Notes

### `generation/generation.py`
Main functionality:
- load the base model,
- sample multiple answers for each instruction,
- save generated answers in JSONL format,
- support distributed generation.

### `generation/embedding/embed.py`
Main functionality:
- build instruction embeddings,
- run clustering,
- save instruction embeddings and cluster labels,
- provide the semantic bins used by ADG selection.

### `ADG/ADG_llama.py`
Main functionality:
- read the generated-answer JSONL file,
- compute answer-geometry metrics,
- combine metrics into the ADG score,
- perform proportional cluster-based selection,
- save `top.json`, `middle.json`, and `bottom.json`.

### `ADG/ADG_qwen.py`
Main functionality:
- compute ADG metrics for Qwen-generated answers,
- support checkpoint-based resumption,
- perform the same top / middle / bottom selection pipeline.

### `analysis/analyse.py`
Main functionality:
- classify instructions into coarse task categories,
- support optional data-level analysis of selected subsets.

### `train/train_llama.sh` and `train/train_qwen.sh`
Main functionality:
- launch distributed full fine-tuning,
- use the selected subset for instruction tuning.

### `eval/eval.sh`
Main functionality:
- run benchmark evaluation with `lm-evaluation-harness`,
- support reasoning, knowledge, and coding tasks.

---

## ❓ Common Issues

### 1. Path configuration is not updated
Most scripts use placeholder paths. Update all required paths before running.

### 2. Inconsistent model and intermediate files
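Make sure the generation backbone, embedding backbone, ADG scoring script, and training script all target the same model family. For example, answers generated with a Qwen model should be scored with `ADG/ADG_qwen.py`, not `ADG/ADG_llama.py`.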

### 3. Missing intermediate files
The selector depends on:
- generated answer JSONL,
- instruction embeddings,
- clustering results.

Run the previous stages before starting ADG selection. 
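A quick pre-flight check can catch missing files before a long scoring run. The snippet below is a sketch: `missing_inputs` is a hypothetical helper, and the paths are placeholders to be replaced with the values configured for `INPUT_JSONL`, `EMBEDDINGS_PATH`, and `CLUSTERS_PATH`.

```python
import os


def missing_inputs(paths):
    """Return the names of configured inputs whose files do not exist."""
    return [name for name, path in paths.items() if not os.path.exists(path)]


if __name__ == "__main__":
    # Placeholder paths; substitute the values set in the ADG scoring script.
    configured = {
        "INPUT_JSONL": "/path/to/generated_answers.jsonl",
        "EMBEDDINGS_PATH": "/path/to/instruction_embeddings",
        "CLUSTERS_PATH": "/path/to/cluster_labels",
    }
    for name in missing_inputs(configured):
        print(f"missing: {name} -> {configured[name]}")
```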

### 4. GPU memory pressure
Generation, embedding, and scoring all use hidden-state-based processing. You may need to reduce batch size or adjust GPU allocation depending on your hardware.

### 5. Evaluation dependency is not installed
`eval/eval.sh` depends on `lm-evaluation-harness`. Install it separately before running evaluation.

---

## 📖 Citation

If you use this repository, please cite the paper ([arXiv:2604.07892](https://arxiv.org/abs/2604.07892)).

---