---
license: apache-2.0
language:
- zh
- en
base_model:
- Qwen/Qwen2.5-7B
tags:
- ADG
- SFT
---

# Instruction Data Selection via Answer Divergence

English | 简体中文

[![Task](https://img.shields.io/badge/Task-Data%20Selection-purple.svg)](#overview)

**ACL 2026 Main Conference**

Bo Li, Mingda Wang, Shikun Zhang, Wei Ye

This repository releases the core pipeline of **Answer Divergence-Guided Selection (ADG)** for instruction data selection. ADG scores each instruction by the geometric structure of multiple sampled answers, rather than relying on a single reference response. In the paper, ADG consistently improves instruction tuning under a fixed 10K budget across two backbones, three public instruction pools, and six benchmarks spanning reasoning, knowledge, and coding. The method combines **dispersion magnitude** and **shape anisotropy**, then performs **bin-wise selection** for semantic coverage.

---

## 🌟 Overview

Instruction tuning quality depends heavily on which examples are selected under a fixed data budget. ADG addresses this by examining how a base model responds to the same instruction under stochastic decoding.

For each instruction, ADG:

1. samples multiple answers with relatively high-temperature decoding,
2. maps answers into a representation space,
3. computes geometry-aware scores from the sampled answers,
4. ranks examples by the combined score,
5. performs proportional selection within semantic bins.

This repository provides the practical pipeline for:

- multi-sample answer generation,
- instruction embedding and clustering,
- ADG scoring and subset selection,
- model training,
- benchmark evaluation,
- optional task-type analysis.

---

To use this model, first clone the following repository:

```bash
git clone https://github.com/WisdomShell/ADG.git
```

## 📦 What Is Released

This repository includes the following components:

### Core selection code

- `ADG/ADG_llama.py`
  ADG scoring and selection for the LLaMA backbone.
- `ADG/ADG_qwen.py`
  ADG scoring and selection for the Qwen backbone.

### Answer generation and instruction embedding

- `generation/generation.py`
  Generates multiple sampled answers for each instruction.
- `generation/embedding/embed.py`
  Builds instruction embeddings and performs clustering for bin-wise selection.

### Training and evaluation

- `train/train_llama.sh`
  Training entry script for LLaMA.
- `train/train_qwen.sh`
  Training entry script for Qwen.
- `train/training/stanford_alpaca/`
  Training utilities and backbone-specific training scripts.
- `eval/eval.sh`
  Evaluation script based on `lm-evaluation-harness`.

### Analysis

- `analysis/analyse.py`
  Optional task-type classification script for analyzing selected data.

### Environment

- `requirements.txt`
  Required Python packages for this repository.

---

## 🗂️ Repository Structure

```text
.
├── README.md
├── README_zh.md
├── requirements.txt
├── ADG/
│   ├── ADG_llama.py
│   └── ADG_qwen.py
├── generation/
│   ├── generation.py
│   └── embedding/
│       └── embed.py
├── analysis/
│   └── analyse.py
├── eval/
│   └── eval.sh
└── train/
    ├── train_llama.sh
    ├── train_qwen.sh
    └── training/
        └── stanford_alpaca/
            ├── train_llama.py
            ├── train_qwen.py
            ├── utils.py
            └── configs/
```

---

## ⚙️ Installation

We recommend Python 3.10 or above.

Example:

```bash
conda create -n adg python=3.12.9
conda activate adg
pip install -r requirements.txt
```

Depending on your environment, you may also need to install GPU-specific packages separately.

---

## 🧾 Data Format

ADG expects instruction datasets in JSON or JSONL format. Each example should follow the schema below:

```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "input": "",
  "output": "Transformers are neural networks based on self-attention..."
}
```

Notes:

- `id` should uniquely identify each example.
- `instruction` is required.
- `input` is optional and can be empty or omitted.
- `output` is the reference response in the original instruction dataset.
- Other instruction datasets can be used as long as they are converted into this format.

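
The schema notes above can be checked mechanically before running the pipeline. The following is a minimal sketch of such a conversion step, not code from this repository: the helper names `normalize_record`, `normalize_pool`, and `write_jsonl` are hypothetical, and the choice to fill a missing `id` with the record's position is an assumption made for illustration.

```python
import json
from typing import Any, Dict, Iterable, List


def normalize_record(raw: Dict[str, Any], fallback_id: int) -> Dict[str, Any]:
    """Coerce one raw example into the id / instruction / input / output schema."""
    instruction = str(raw.get("instruction", "")).strip()
    if not instruction:
        # `instruction` is the only field the schema marks as required.
        raise ValueError(f"record {fallback_id} has no instruction")
    return {
        # Assumption: use the record's position when no id is provided.
        "id": raw.get("id", fallback_id),
        "instruction": instruction,
        # `input` is optional, so a missing or null value becomes "".
        "input": raw.get("input") or "",
        "output": raw.get("output", ""),
    }


def normalize_pool(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Normalize every record and check that ids are unique."""
    pool = [normalize_record(raw, i) for i, raw in enumerate(records)]
    ids = [record["id"] for record in pool]
    if len(set(ids)) != len(ids):
        raise ValueError("ids must uniquely identify each example")
    return pool


def write_jsonl(pool: List[Dict[str, Any]], path: str) -> None:
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in pool:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A typical use would be `write_jsonl(normalize_pool(raw_records), "pool.jsonl")`, where `raw_records` is a list of dictionaries loaded from the original dataset and the output path is a placeholder.
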

After answer generation, the intermediate JSONL file contains records like:

```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "output": "Transformers are neural networks based on self-attention...",
  "generated_answers": [
    "...",
    "...",
    "...",
    "...",
    "..."
  ]
}
```

---

## 🔄 Pipeline

The practical workflow is:

```text
instruction pool
    -> generation/generation.py
    -> multi-sample answer JSONL
    -> generation/embedding/embed.py
    -> instruction embeddings + cluster labels
    -> ADG/ADG_llama.py or ADG/ADG_qwen.py
    -> top / middle / bottom selected subsets
    -> train/train_*.sh
    -> finetuned checkpoints
    -> eval/eval.sh
```

---

## 🚀 Quick Start

### Step 1. Prepare the instruction pool

Download and preprocess your instruction dataset, such as Alpaca-GPT4, WizardLM, or CoT, into the required format.

### Step 2. Generate multiple answers per instruction

Before running, update the following variables in `generation/generation.py`:

- `MODEL_NAME`
- `OUTPUT_DIR`
- `OUTPUT_FILE`

Then run:

```bash
cd generation
torchrun --nproc_per_node=4 --master_port=29500 generation.py --input_file /path/to/your/instruction_data.json --batch_size 32
```

### Step 3. Build instruction embeddings and clustering results

Before running, update the following variables in `generation/embedding/embed.py`:

- `MODEL_NAME`
- `INPUT_JSONL`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`

Then run:

```bash
torchrun --nproc_per_node=4 --master_port=29501 generation/embedding/embed.py
```

### Step 4. Run ADG scoring and selection

Choose the scoring script that matches your backbone.

For LLaMA, configure these variables in `ADG/ADG_llama.py`:

- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`
- `FINAL_SELECT_COUNT`

Then run:

```bash
python ADG/ADG_llama.py
```

For Qwen, configure these variables in `ADG/ADG_qwen.py`:

- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `CHECKPOINT_DIR`
- `FINAL_SELECT_COUNT`

Then run:

```bash
python ADG/ADG_qwen.py
```

The selector saves:

- `top.json`
- `middle.json`
- `bottom.json`

under the configured `OUTPUT_DIR`.

### Step 5. Train the backbone model

Use the selected subset, typically `top.json`, for instruction tuning.

For LLaMA:

```bash
cd train
bash train_llama.sh
```

For Qwen:

```bash
cd train
bash train_qwen.sh
```

Before running, update paths such as:

- `--model_name_or_path`
- `--data_path`
- `--output_dir`

### Step 6. Evaluate the trained checkpoint

This repository uses `lm-evaluation-harness` for benchmark evaluation.

Install it first if needed:

```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```

Then configure `MODEL_PATH` and output paths in `eval/eval.sh`, and run:

```bash
cd eval
bash eval.sh
```

The evaluation script currently includes:

- BBH
- GSM8K
- MMLU
- TruthfulQA
- MBPP
- HumanEval

---

## 📊 ADG Scoring Intuition

ADG is built around two complementary signals derived from multiple sampled answers:

- **Dispersion magnitude**
  Measures how widely the sampled answers spread in representation space.
- **Shape anisotropy**
  Measures whether the spread is multi-directional rather than dominated by a single direction.

The final ADG score combines these two parts, and the selected subset is obtained through semantic bin-wise ranking. This design helps avoid collapsing selection into only a few dense instruction regions.

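
To make the two signals and the bin-wise step more concrete, here is a rough sketch using NumPy. It is not the formulation used in `ADG/ADG_llama.py` or `ADG/ADG_qwen.py`: the specific choices below are assumptions made for illustration (mean distance to the centroid as the dispersion measure, a normalized participation ratio of the covariance spectrum as a stand-in for the shape signal, a sum of z-scores as the combination, and rounded proportional quotas per bin), and all function names are hypothetical. Each input array holds one instruction's answer vectors with shape `(num_answers, dim)`.

```python
import numpy as np


def dispersion_magnitude(answers: np.ndarray) -> float:
    """Mean Euclidean distance of the answer vectors from their centroid."""
    centered = answers - answers.mean(axis=0, keepdims=True)
    return float(np.linalg.norm(centered, axis=1).mean())


def shape_score(answers: np.ndarray, eps: float = 1e-12) -> float:
    """Normalized participation ratio of the spread, in (0, 1].

    Values near 1 mean the spread is shared across many directions;
    low values mean it is dominated by a single direction.
    """
    centered = answers - answers.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to covariance eigenvalues.
    eig = np.linalg.svd(centered, compute_uv=False) ** 2
    total = eig.sum()
    max_rank = min(answers.shape[0] - 1, answers.shape[1])
    if total < eps or max_rank <= 0:
        return 0.0
    effective_dims = total ** 2 / (eig ** 2).sum()
    return float(effective_dims / max_rank)


def combined_scores(answer_sets) -> np.ndarray:
    """Assumed combination: sum of z-scored dispersion and shape scores."""
    disp = np.array([dispersion_magnitude(a) for a in answer_sets])
    shape = np.array([shape_score(a) for a in answer_sets])

    def zscore(x: np.ndarray) -> np.ndarray:
        return (x - x.mean()) / (x.std() + 1e-12)

    return zscore(disp) + zscore(shape)


def select_by_bins(scores, bins, budget: int) -> list:
    """Take the top-scoring examples in each bin, proportional to bin size.

    Quotas are rounded per bin, so the total can differ slightly from
    `budget` when bin sizes do not divide evenly.
    """
    scores = np.asarray(scores, dtype=float)
    bins = np.asarray(bins)
    selected = []
    for bin_id in np.unique(bins):
        idx = np.flatnonzero(bins == bin_id)
        quota = int(round(budget * len(idx) / len(scores)))
        best = idx[np.argsort(-scores[idx])[:quota]]
        selected.extend(best.tolist())
    return sorted(selected)
```

In this sketch, `bins` would be the cluster labels produced for the instruction embeddings, so a large cluster contributes more examples than a small one, but every cluster with a non-zero quota is represented.
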

---

## 🛠️ Script Notes

### `generation/generation.py`

Main functionality:

- load the base model,
- sample multiple answers for each instruction,
- save generated answers in JSONL format,
- support distributed generation.

### `generation/embedding/embed.py`

Main functionality:

- build instruction embeddings,
- run clustering,
- save instruction embeddings and cluster labels,
- provide the semantic bins used by ADG selection.

### `ADG/ADG_llama.py`

Main functionality:

- read the generated-answer JSONL file,
- compute answer-geometry metrics,
- combine metrics into the ADG score,
- perform proportional cluster-based selection,
- save `top.json`, `middle.json`, and `bottom.json`.

### `ADG/ADG_qwen.py`

Main functionality:

- compute ADG metrics for Qwen-generated answers,
- support checkpoint-based resumption,
- perform the same top / middle / bottom selection pipeline.

### `analysis/analyse.py`

Main functionality:

- classify instructions into coarse task categories,
- support optional data-level analysis of selected subsets.

### `train/train_llama.sh` and `train/train_qwen.sh`

Main functionality:

- launch distributed full fine-tuning,
- use the selected subset for instruction tuning.

### `eval/eval.sh`

Main functionality:

- run benchmark evaluation with `lm-evaluation-harness`,
- support reasoning, knowledge, and coding tasks.

---

## ❓ Common Issues

### 1. Path configuration is not updated

Most scripts use placeholder paths. Update all required paths before running.

### 2. Inconsistent model and intermediate files

Make sure the generation backbone, embedding backbone, ADG scoring script, and training script are aligned.

### 3. Missing intermediate files

The selector depends on:

- generated answer JSONL,
- instruction embeddings,
- clustering results.

Run the previous stages before starting ADG selection.

### 4. GPU memory pressure

Generation, embedding, and scoring all use hidden-state-based processing.
You may need to reduce batch size or adjust GPU allocation depending on your hardware.

### 5. Evaluation dependency is not installed

`eval/eval.sh` depends on `lm-evaluation-harness`. Install it separately before running evaluation.

---

## 📖 Citation

If you use this repository, please cite the paper.

---