---
license: apache-2.0
language:
- zh
- en
base_model:
- meta-llama/Meta-Llama-3-8B
tags:
- ADG
- SFT
---

# Instruction Data Selection via Answer Divergence

English | ็ฎ€ไฝ“ไธญๆ–‡

[![Task](https://img.shields.io/badge/Task-Data%20Selection-purple.svg)](#overview)

**ACL 2026 Main Conference**

Bo Li, Mingda Wang, Shikun Zhang, Wei Ye

This repository releases the core pipeline of **Answer Divergence-Guided Selection (ADG)** for instruction data selection. ADG scores each instruction by the geometric structure of multiple sampled answers, rather than relying on a single reference response. In the paper, ADG consistently improves instruction tuning under a fixed 10K budget across two backbones, three public instruction pools, and six benchmarks spanning reasoning, knowledge, and coding. The method combines **dispersion magnitude** and **shape anisotropy**, then performs **bin-wise selection** for semantic coverage.

---

## 🌟 Overview

Instruction tuning quality depends heavily on which examples are selected under a fixed data budget. ADG addresses this by examining how a base model responds to the same instruction under stochastic decoding. For each instruction, ADG:

1. samples multiple answers with relatively high-temperature decoding,
2. maps answers into a representation space,
3. computes geometry-aware scores from the sampled answers,
4. ranks examples by the combined score,
5. performs proportional selection within semantic bins.

This repository provides the practical pipeline for:

- multi-sample answer generation,
- instruction embedding and clustering,
- ADG scoring and subset selection,
- model training,
- benchmark evaluation,
- optional task-type analysis.

## ⚙️ Installation

We recommend Python 3.10 or above. Example:

```bash
git clone https://github.com/WisdomShell/ADG.git
conda create -n adg python=3.12.9
conda activate adg
pip install -r requirements.txt
```

Depending on your environment, you may also need to install GPU-specific packages separately.

---

## 🧾 Data Format

ADG expects instruction datasets in JSON or JSONL format. Each example should follow the schema below:

```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "input": "",
  "output": "Transformers are neural networks based on self-attention..."
}
```

Notes:

- `id` should uniquely identify each example.
- `instruction` is required.
- `input` is optional and can be empty or omitted.
- `output` is the reference response in the original instruction dataset.
- Other instruction datasets can be used as long as they are converted into this format.

After answer generation, the intermediate JSONL file contains records like:

```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "output": "Transformers are neural networks based on self-attention...",
  "generated_answers": ["...", "...", "...", "...", "..."]
}
```

---

## 🚀 Quick Start

### Step 1. Prepare the instruction pool

Download and preprocess your instruction dataset, such as Alpaca-GPT4, WizardLM, or CoT, into the required format.

### Step 2. Generate multiple answers per instruction

Before running, update the following variables in `generation/generation.py`:

- `MODEL_NAME`
- `OUTPUT_DIR`
- `OUTPUT_FILE`

Then run:

```bash
cd generation
torchrun --nproc_per_node=4 --master_port=29500 generation.py \
  --input_file /path/to/your/instruction_data.json \
  --batch_size 32
```

### Step 3. Build instruction embeddings and clustering results

Before running, update the following variables in `generation/embedding/embed.py`:

- `MODEL_NAME`
- `INPUT_JSONL`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`

Then run:

```bash
torchrun --nproc_per_node=4 --master_port=29501 generation/embedding/embed.py
```

### Step 4. Run ADG scoring and selection

Choose the scoring script that matches your backbone.
For LLaMA, configure these variables in `ADG/ADG_llama.py`:

- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`
- `FINAL_SELECT_COUNT`

Then run:

```bash
python ADG/ADG_llama.py
```

For Qwen, configure these variables in `ADG/ADG_qwen.py`:

- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `CHECKPOINT_DIR`
- `FINAL_SELECT_COUNT`

Then run:

```bash
python ADG/ADG_qwen.py
```

The selector saves:

- `top.json`
- `middle.json`
- `bottom.json`

under the configured `OUTPUT_DIR`.

### Step 5. Train the backbone model

Use the selected subset, typically `top.json`, for instruction tuning.

For LLaMA:

```bash
cd train
bash train_llama.sh
```

For Qwen:

```bash
cd train
bash train_qwen.sh
```

Before running, update paths such as:

- `--model_name_or_path`
- `--data_path`
- `--output_dir`

### Step 6. Evaluate the trained checkpoint

This repository uses `lm-evaluation-harness` for benchmark evaluation. Install it first if needed:

```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```

Then configure `MODEL_PATH` and output paths in `eval/eval.sh`, and run:

```bash
cd eval
bash eval.sh
```

The evaluation script currently includes:

- BBH
- GSM8K
- MMLU
- TruthfulQA
- MBPP
- HumanEval

---

## 📖 Citation

```bibtex
@article{li2026instruction,
  title={Instruction Data Selection via Answer Divergence},
  author={Li, Bo and Wang, Mingda and Zhang, Shikun and Ye, Wei},
  journal={arXiv preprint arXiv:2604.10448},
  year={2026}
}
```

---