---
license: apache-2.0
language:
- zh
- en
base_model:
- meta-llama/Meta-Llama-3-8B
tags:
- ADG
- SFT
---
<div align="center">
<h1>Instruction Data Selection via Answer Divergence</h1>
<p>
<strong>English</strong> | <a href="https://huggingface.co/WisdomShell/ADG-Alpaca-GPT4-LLaMa3-8B/blob/main/README_zh.md">简体中文</a>
</p>
<a href="https://wisdomshell.github.io/ADG/"><img src="https://img.shields.io/badge/Project-Page-green?logo=githubpages&logoColor=white" /></a>
<a href="https://arxiv.org/abs/2604.10448"><img src="https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=white" /></a>
<a href="https://2026.aclweb.org/"><img src="https://img.shields.io/badge/Venue-ACL%202026-blue" /></a>
[![Task](https://img.shields.io/badge/Task-Data%20Selection-purple.svg)](#overview)
<img src="https://img.shields.io/badge/Python-3.10%2B-3776AB?logo=python&logoColor=white" />
**ACL 2026 Main Conference**
<a href="https://deepblue666.github.io/">Bo Li</a>, Mingda Wang, Shikun Zhang, Wei Ye
</div>
This repository releases the core pipeline of **Answer Divergence-Guided Selection (ADG)** for instruction data selection. ADG scores each instruction by the geometric structure of multiple sampled answers, rather than relying on a single reference response. In the paper, ADG consistently improves instruction tuning under a fixed 10K budget across two backbones, three public instruction pools, and six benchmarks spanning reasoning, knowledge, and coding. The method combines **dispersion magnitude** and **shape anisotropy**, then performs **bin-wise selection** for semantic coverage.
---
## 🌟 Overview
Instruction tuning quality depends heavily on which examples are selected under a fixed data budget. ADG addresses this by examining how a base model responds to the same instruction under stochastic decoding.
For each instruction, ADG:
1. samples multiple answers with relatively high-temperature decoding,
2. maps answers into a representation space,
3. computes geometry-aware scores from the sampled answers,
4. ranks examples by the combined score,
5. performs proportional selection within semantic bins.
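To make step 3 concrete, the sketch below computes two illustrative geometry statistics from the embeddings of one instruction's sampled answers. It is not the paper's definition: the function names `dispersion_magnitude` and `shape_anisotropy`, and the specific formulas (mean squared distance to the centroid, and the ratio tr(C²)/tr(C)² of the answer covariance C), are assumptions chosen for illustration. How ADG combines the two quantities into a single score is specified in the paper and is not reproduced here.

```python
# Illustrative only: formulas and names are assumptions, not the ADG implementation.

def center(vectors):
    """Subtract the centroid from every answer embedding."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(dim)]
    return [[v[j] - mean[j] for j in range(dim)] for v in vectors]


def dispersion_magnitude(vectors):
    """Mean squared distance of the answers to their centroid."""
    centered = center(vectors)
    return sum(x * x for row in centered for x in row) / len(centered)


def shape_anisotropy(vectors):
    """tr(C^2) / tr(C)^2 of the covariance C, computed from the Gram matrix.

    Equals 1.0 when all spread lies along a single direction and shrinks as
    the spread becomes more evenly distributed. Returns 0.0 if all answers
    coincide.
    """
    centered = center(vectors)
    gram = [[sum(a * b for a, b in zip(u, v)) for v in centered] for u in centered]
    trace = sum(gram[i][i] for i in range(len(gram)))
    if trace == 0:
        return 0.0
    frob_sq = sum(x * x for row in gram for x in row)
    return frob_sq / (trace * trace)


# Toy 2-D "embeddings" for the sampled answers of two instructions.
on_a_line = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
spread_out = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

print(shape_anisotropy(on_a_line))   # 1.0: answers vary along one direction
print(shape_anisotropy(spread_out))  # 0.5: variance split evenly over two directions
```

Using the Gram matrix keeps the computation in the space of the sampled answers, which is small (a handful of answers per instruction), regardless of the embedding dimension.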
This repository provides the practical pipeline for:
- multi-sample answer generation,
- instruction embedding and clustering,
- ADG scoring and subset selection,
- model training,
- benchmark evaluation,
- optional task-type analysis.
## ⚙️ Installation
We recommend Python 3.10 or above.
Example:
```bash
git clone https://github.com/WisdomShell/ADG.git
cd ADG
conda create -n adg python=3.12.9
conda activate adg
pip install -r requirements.txt
```
Depending on your environment, you may also need to install GPU-specific packages separately.
---
## 🧾 Data Format
ADG expects instruction datasets in JSON or JSONL format. Each example should follow the schema below:
```json
{
"id": 0,
"instruction": "Write a short explanation of transformers.",
"input": "",
"output": "Transformers are neural networks based on self-attention..."
}
```
Notes:
- `id` should uniquely identify each example.
- `instruction` is required.
- `input` is optional and can be empty or omitted.
- `output` is the reference response in the original instruction dataset.
- Other instruction datasets can be used as long as they are converted into this format.
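As a rough illustration of the conversion, the following sketch parses a JSON array or JSONL text and normalizes each record to the four fields above. The helper names `parse_pool` and `normalize` are hypothetical and are not part of this repository; the checks simply mirror the notes (unique `id`, required `instruction`, optional `input`).

```python
import json


def parse_pool(text):
    """Parse the contents of a JSON array file or a JSONL file into a list of dicts."""
    text = text.strip()
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def normalize(records):
    """Keep only the expected fields and enforce the constraints listed above."""
    seen_ids = set()
    cleaned = []
    for rec in records:
        if "id" not in rec:
            raise ValueError("record is missing `id`")
        if rec["id"] in seen_ids:
            raise ValueError(f"duplicate id: {rec['id']!r}")
        if not rec.get("instruction"):
            raise ValueError(f"record {rec['id']!r} has no instruction")
        seen_ids.add(rec["id"])
        cleaned.append({
            "id": rec["id"],
            "instruction": rec["instruction"],
            "input": rec.get("input", ""),    # optional, defaults to empty
            "output": rec.get("output", ""),  # reference response, if present
        })
    return cleaned


raw = '{"id": 0, "instruction": "Write a short explanation of transformers."}'
print(normalize(parse_pool(raw)))
```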
After answer generation, the intermediate JSONL file contains records like:
```json
{
"id": 0,
"instruction": "Write a short explanation of transformers.",
"output": "Transformers are neural networks based on self-attention...",
"generated_answers": [
"...",
"...",
"...",
"...",
"..."
]
}
```
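Before scoring, it can be useful to confirm that every record actually carries a full set of sampled answers. The check below is a minimal sketch; `incomplete_ids` is a hypothetical helper, and the default of five answers is an assumption taken from the example record above, not a documented setting.

```python
import json


def incomplete_ids(jsonl_lines, expected=5):
    """Return ids whose `generated_answers` is missing, too short, or contains blanks."""
    bad = []
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines between records
        rec = json.loads(line)
        answers = rec.get("generated_answers") or []
        if len(answers) < expected or any(not str(a).strip() for a in answers):
            bad.append(rec["id"])
    return bad


lines = [
    '{"id": 0, "generated_answers": ["a", "b"]}',
    '{"id": 1, "generated_answers": ["a", "b", "c", "d", "e"]}',
]
print(incomplete_ids(lines))  # [0]
```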
---
## 🚀 Quick Start
### Step 1. Prepare the instruction pool
Download and preprocess your instruction dataset, such as Alpaca-GPT4, WizardLM, or CoT, into the required format.
### Step 2. Generate multiple answers per instruction
Before running, update the following variables in `generation/generation.py`:
- `MODEL_NAME`
- `OUTPUT_DIR`
- `OUTPUT_FILE`
Then run:
```bash
cd generation
torchrun --nproc_per_node=4 --master_port=29500 generation.py --input_file /path/to/your/instruction_data.json --batch_size 32
```
### Step 3. Build instruction embeddings and clustering results
Before running, update the following variables in `generation/embedding/embed.py`:
- `MODEL_NAME`
- `INPUT_JSONL`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`
Then run from the repository root:
```bash
torchrun --nproc_per_node=4 --master_port=29501 generation/embedding/embed.py
```
### Step 4. Run ADG scoring and selection
Choose the scoring script that matches your backbone.
For LLaMA, configure these variables in `ADG/ADG_llama.py`:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`
- `FINAL_SELECT_COUNT`
Then run:
```bash
python ADG/ADG_llama.py
```
For Qwen, configure these variables in `ADG/ADG_qwen.py`:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `CHECKPOINT_DIR`
- `FINAL_SELECT_COUNT`
Then run:
```bash
python ADG/ADG_qwen.py
```
The selector saves the following files under the configured `OUTPUT_DIR`:
- `top.json`
- `middle.json`
- `bottom.json`
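The bin-wise step can be pictured as follows: each semantic cluster receives a share of the budget proportional to its size, and the highest-scoring examples inside each cluster fill that share. This is a sketch under stated assumptions, not the code in `ADG/ADG_llama.py` or `ADG/ADG_qwen.py`: the function names, the `cluster` and `score` field names, and the largest-remainder rounding are all illustrative choices.

```python
from collections import defaultdict


def proportional_quotas(sizes, budget):
    """Split `budget` across bins in proportion to bin size (largest-remainder rounding)."""
    total = sum(sizes.values())
    budget = min(budget, total)
    exact = {b: budget * n / total for b, n in sizes.items()}
    quotas = {b: int(x) for b, x in exact.items()}
    leftover = budget - sum(quotas.values())
    # Hand the remaining slots to the bins with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda b: exact[b] - quotas[b], reverse=True)
    for b in by_fraction[:leftover]:
        quotas[b] += 1
    return quotas


def select_top(examples, budget):
    """Pick the highest-scoring examples per cluster, respecting proportional quotas.

    `examples` is a list of dicts with (assumed) keys `id`, `cluster`, `score`.
    """
    bins = defaultdict(list)
    for ex in examples:
        bins[ex["cluster"]].append(ex)
    quotas = proportional_quotas({b: len(v) for b, v in bins.items()}, budget)
    selected = []
    for b, members in bins.items():
        members.sort(key=lambda ex: ex["score"], reverse=True)
        selected.extend(ex["id"] for ex in members[: quotas[b]])
    return selected


pool = [
    {"id": 0, "cluster": 0, "score": 0.1},
    {"id": 1, "cluster": 0, "score": 0.9},
    {"id": 2, "cluster": 0, "score": 0.5},
    {"id": 3, "cluster": 0, "score": 0.3},
    {"id": 4, "cluster": 1, "score": 0.2},
    {"id": 5, "cluster": 1, "score": 0.8},
]
print(select_top(pool, budget=3))  # [1, 2, 5]
```

Allocating quotas per cluster before ranking keeps small clusters represented in the subset, which is the coverage property that a single global top-k ranking would not guarantee.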
### Step 5. Train the backbone model
Use the selected subset, typically `top.json`, for instruction tuning.
For LLaMA:
```bash
cd train
bash train_llama.sh
```
For Qwen:
```bash
cd train
bash train_qwen.sh
```
Before running either script, update the paths it passes, such as:
- `--model_name_or_path`
- `--data_path`
- `--output_dir`
### Step 6. Evaluate the trained checkpoint
This repository uses `lm-evaluation-harness` for benchmark evaluation.
Install it first if needed:
```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```
Then configure `MODEL_PATH` and output paths in `eval/eval.sh`, and run:
```bash
cd eval
bash eval.sh
```
The evaluation script currently includes:
- BBH
- GSM8K
- MMLU
- TruthfulQA
- MBPP
- HumanEval
---
## 📖 Citation
```bibtex
@article{li2026instruction,
title={Instruction Data Selection via Answer Divergence},
author={Li, Bo and Wang, Mingda and Zhang, Shikun and Ye, Wei},
journal={arXiv preprint arXiv:2604.10448},
year={2026}
}
```
---