---
license: apache-2.0
language:
- zh
- en
base_model:
- meta-llama/Meta-Llama-3-8B
tags:
- ADG
- SFT
---

<div align="center">

<h1>Instruction Data Selection via Answer Divergence</h1>
<p>
<strong>English</strong> | <a href="https://huggingface.co/WisdomShell/ADG-Alpaca-GPT4-LLaMa3-8B/blob/main/README_zh.md">简体中文</a>
</p>
|
|
<a href="https://wisdomshell.github.io/ADG/"><img src="https://img.shields.io/badge/Project-Page-green?logo=githubpages&logoColor=white" /></a>
<a href="https://arxiv.org/abs/2604.10448"><img src="https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=white" /></a>
<a href="https://2026.aclweb.org/"><img src="https://img.shields.io/badge/Venue-ACL%202026-blue" /></a>
<img src="https://img.shields.io/badge/Python-3.10%2B-3776AB?logo=python&logoColor=white" />
|
|
**ACL 2026 Main Conference**

<a href="https://deepblue666.github.io/">Bo Li</a>, Mingda Wang, Shikun Zhang, Wei Ye

</div>

This repository releases the core pipeline of **Answer Divergence-Guided Selection (ADG)** for instruction data selection. ADG scores each instruction by the geometric structure of multiple sampled answers, rather than relying on a single reference response. In the paper, ADG consistently improves instruction tuning under a fixed 10K budget across two backbones, three public instruction pools, and six benchmarks spanning reasoning, knowledge, and coding. The method combines **dispersion magnitude** and **shape anisotropy**, then performs **bin-wise selection** for semantic coverage.
|
|
---

## 🌟 Overview

Instruction tuning quality depends heavily on which examples are selected under a fixed data budget. ADG addresses this by examining how a base model responds to the same instruction under stochastic decoding.

For each instruction, ADG:

1. samples multiple answers with relatively high-temperature decoding,
2. maps answers into a representation space,
3. computes geometry-aware scores from the sampled answers,
4. ranks examples by the combined score,
5. performs proportional selection within semantic bins.
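The geometry in steps 2–4 can be sketched numerically. The function below is our toy illustration, not the repository's code: it takes an `(n_samples, dim)` array of answer embeddings, measures dispersion magnitude as total variance around the centroid, measures shape anisotropy as the leading-eigenvalue share of that variance, and multiplies the two. The paper's exact score definitions and combination rule may differ.

```python
import numpy as np

def adg_score(answer_embeddings: np.ndarray) -> float:
    """Toy ADG-style score for one instruction.

    answer_embeddings: (n_samples, dim) array of embedded sampled answers.
    Illustrative only: combines total variance (dispersion magnitude) with
    the leading-eigenvalue share of that variance (shape anisotropy).
    """
    centered = answer_embeddings - answer_embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(answer_embeddings) - 1, 1)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # covariance spectrum
    dispersion = eigvals.sum()                             # total variance
    anisotropy = eigvals.max() / (eigvals.sum() + 1e-12)   # leading-eigenvalue share
    return float(dispersion * anisotropy)

rng = np.random.default_rng(0)
tight = rng.normal(scale=0.01, size=(5, 8))   # near-identical sampled answers
spread = rng.normal(scale=1.0, size=(5, 8))   # divergent sampled answers
assert adg_score(spread) > adg_score(tight)
```

Intuitively, near-identical answers collapse to a point (low score), while divergent answers spread out (high score); the anisotropy term distinguishes how that spread is shaped.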
|
|
This repository provides the practical pipeline for:
- multi-sample answer generation,
- instruction embedding and clustering,
- ADG scoring and subset selection,
- model training,
- benchmark evaluation,
- optional task-type analysis.
|
|
|
|
## ⚙️ Installation

We recommend Python 3.10 or above.

Example:

```bash
git clone https://github.com/WisdomShell/ADG.git
conda create -n adg python=3.12.9
conda activate adg
pip install -r requirements.txt
```

Depending on your environment, you may also need to install GPU-specific packages separately.

---
|
|
## 🧾 Data Format

ADG expects instruction datasets in JSON or JSONL format. Each example should follow the schema below:

```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "input": "",
  "output": "Transformers are neural networks based on self-attention..."
}
```

Notes:
- `id` should uniquely identify each example.
- `instruction` is required.
- `input` is optional and can be empty or omitted.
- `output` is the reference response in the original instruction dataset.
- Other instruction datasets can be used as long as they are converted into this format.
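A small converter is often enough to adapt other datasets to this schema. The sketch below is ours (the function name `to_adg_format` and its keyword parameters are not part of the repository); it maps arbitrarily keyed records into the expected fields and enforces the non-empty `instruction` requirement:

```python
import json

def to_adg_format(records, instruction_key="instruction",
                  input_key="input", output_key="output"):
    """Convert raw records into the JSON schema expected by ADG.

    Key names are configurable because source datasets differ.
    """
    converted = []
    for i, rec in enumerate(records):
        if not rec.get(instruction_key):
            raise ValueError(f"record {i} is missing a non-empty instruction")
        converted.append({
            "id": i,                            # sequential unique id
            "instruction": rec[instruction_key],
            "input": rec.get(input_key, ""),    # optional, defaults to empty
            "output": rec.get(output_key, ""),  # reference response
        })
    return converted

raw = [{"instruction": "Define attention.", "output": "Attention weights..."}]
print(json.dumps(to_adg_format(raw), indent=2))
```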
|
|
After answer generation, the intermediate JSONL file contains records like:

```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "output": "Transformers are neural networks based on self-attention...",
  "generated_answers": [
    "...",
    "...",
    "...",
    "...",
    "..."
  ]
}
```
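Because the file is JSONL, each line is an independent JSON object and can be sanity-checked with the standard library alone. In this sketch an `io.StringIO` stands in for the real file handle:

```python
import io
import json

# Stand-in for open("generated.jsonl"): one record with 5 sampled answers.
jsonl = io.StringIO(
    '{"id": 0, "instruction": "Write a short explanation of transformers.", '
    '"output": "...", "generated_answers": ["a1", "a2", "a3", "a4", "a5"]}\n'
)
records = [json.loads(line) for line in jsonl if line.strip()]
# Verify every record carries the full set of sampled answers.
assert all(len(r["generated_answers"]) == 5 for r in records)
print(f"loaded {len(records)} record(s)")
```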
|
|
---

## 🚀 Quick Start

### Step 1. Prepare the instruction pool

Download and preprocess your instruction dataset, such as Alpaca-GPT4, WizardLM, or CoT, into the required format.

### Step 2. Generate multiple answers per instruction

Before running, update the following variables in `generation/generation.py`:
- `MODEL_NAME`
- `OUTPUT_DIR`
- `OUTPUT_FILE`

Then run:

```bash
cd generation
torchrun --nproc_per_node=4 --master_port=29500 generation.py --input_file /path/to/your/instruction_data.json --batch_size 32
```
|
|
### Step 3. Build instruction embeddings and clustering results

Before running, update the following variables in `generation/embedding/embed.py`:
- `MODEL_NAME`
- `INPUT_JSONL`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`

Then run:

```bash
torchrun --nproc_per_node=4 --master_port=29501 generation/embedding/embed.py
```
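Conceptually, this step embeds each instruction and partitions the pool into `K_CLUSTERS` semantic bins. The repository's `embed.py` handles real embedding models and distributed execution; the NumPy-only sketch below (our code, using toy vectors in place of real embeddings) only illustrates the k-means clustering idea:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means: returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy "instruction embeddings": two well-separated semantic groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (10, 4)), rng.normal(5, 0.1, (10, 4))])
labels = kmeans(X, k=2)
# Each group should land entirely in one cluster.
assert len(set(labels[:10].tolist())) == 1
assert len(set(labels[10:].tolist())) == 1
```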
|
|
### Step 4. Run ADG scoring and selection

Choose the scoring script that matches your backbone.

For LLaMA, configure these variables in `ADG/ADG_llama.py`:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`
- `FINAL_SELECT_COUNT`

Then run:

```bash
python ADG/ADG_llama.py
```

For Qwen, configure these variables in `ADG/ADG_qwen.py`:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `CHECKPOINT_DIR`
- `FINAL_SELECT_COUNT`

Then run:

```bash
python ADG/ADG_qwen.py
```

The selector saves:
- `top.json`
- `middle.json`
- `bottom.json`

under the configured `OUTPUT_DIR`.
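The bin-wise selection described above allocates the final budget (`FINAL_SELECT_COUNT`) across semantic clusters in proportion to cluster size, then keeps the top-scored examples within each bin. The sketch below is our illustration of that idea (the function name, the rounding-based quota rule, and the tie-breaking are ours; the repository's exact rule may differ):

```python
from collections import defaultdict

def binwise_select(scores, clusters, budget):
    """Select `budget` example ids, proportionally per cluster, highest score first.

    scores:   {example_id: adg_score}
    clusters: {example_id: cluster_id}
    """
    bins = defaultdict(list)
    for ex_id, cluster_id in clusters.items():
        bins[cluster_id].append(ex_id)
    total = len(clusters)
    selected = []
    for cluster_id, members in bins.items():
        # Proportional quota, at least one example per bin.
        quota = max(1, round(budget * len(members) / total))
        members.sort(key=lambda i: scores[i], reverse=True)
        selected.extend(members[:quota])
    return selected[:budget]

scores = {0: 0.9, 1: 0.1, 2: 0.8, 3: 0.7, 4: 0.2, 5: 0.6}
clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
picked = binwise_select(scores, clusters, budget=4)
assert len(picked) == 4
assert 0 in picked and 3 in picked  # top-scored item of each bin is kept
```

Binning preserves semantic coverage: a purely global top-k by score could concentrate the whole budget in a few clusters.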
|
|
### Step 5. Train the backbone model

Use the selected subset, typically `top.json`, for instruction tuning. Before running, update paths in the training script such as:
- `--model_name_or_path`
- `--data_path`
- `--output_dir`

For LLaMA:

```bash
cd train
bash train_llama.sh
```

For Qwen:

```bash
cd train
bash train_qwen.sh
```
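Before launching a training run, it can save time to verify that the file passed as `--data_path` still conforms to the data format. The helper below is a hypothetical convenience of ours, not part of the repository:

```python
import json
import tempfile

def check_subset(path):
    """Sanity-check a selected subset (e.g. top.json) before training."""
    with open(path) as f:
        data = json.load(f)
    assert isinstance(data, list) and data, "expected a non-empty JSON list"
    for rec in data:
        assert rec.get("instruction"), "every record needs a non-empty instruction"
    return len(data)

# Demo on a temporary file standing in for OUTPUT_DIR/top.json.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump([{"id": 0, "instruction": "Explain attention.", "output": "..."}], f)
print(check_subset(f.name))  # number of training examples
```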
|
|
### Step 6. Evaluate the trained checkpoint

This repository uses `lm-evaluation-harness` for benchmark evaluation.

Install it first if needed:

```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```

Then configure `MODEL_PATH` and output paths in `eval/eval.sh`, and run:

```bash
cd eval
bash eval.sh
```

The evaluation script currently includes:
- BBH
- GSM8K
- MMLU
- TruthfulQA
- MBPP
- HumanEval
|
---

## 📖 Citation

```bibtex
@article{li2026instruction,
  title={Instruction Data Selection via Answer Divergence},
  author={Li, Bo and Wang, Mingda and Zhang, Shikun and Ye, Wei},
  journal={arXiv preprint arXiv:2604.10448},
  year={2026}
}
```

---