WarrenWang01 committed
Commit 59d574e · verified · 1 Parent(s): a297daa

Update README.md

Files changed (1)
  1. README.md +10 -183
README.md CHANGED
@@ -14,7 +14,7 @@ tags:
14
 
15
  <h1>Instruction Data Selection via Answer Divergence</h1>
16
  <p>
17
- <strong>English</strong> | <a href="https://huggingface.co/WisdomShell/ADG-WizardLM-LLaMa3-8B/blob/main/README_zh.md">简体中文</a>
18
  </p>
19
 
20
  <a href="https://wisdomshell.github.io/ADG/"><img src="https://img.shields.io/badge/Project-Page-green?logo=githubpages&logoColor=white" /></a>
@@ -53,83 +53,6 @@ This repository provides the practical pipeline for:
53
  - benchmark evaluation,
54
  - optional task-type analysis.
55
 
56
- ---
57
- To use this model, first clone the following repository:
58
- ```bash
59
- git clone https://github.com/WisdomShell/ADG.git
60
- ```
61
-
62
- ## 📦 What Is Released
63
-
64
- This repository includes the following components:
65
-
66
- ### Core selection code
67
- - `ADG/ADG_llama.py`
68
- ADG scoring and selection for the LLaMA backbone.
69
-
70
- - `ADG/ADG_qwen.py`
71
- ADG scoring and selection for the Qwen backbone.
72
-
73
- ### Answer generation and instruction embedding
74
- - `generation/generation.py`
75
- Generates multiple sampled answers for each instruction.
76
-
77
- - `generation/embedding/embed.py`
78
- Builds instruction embeddings and performs clustering for bin-wise selection.
79
-
80
- ### Training and evaluation
81
- - `train/train_llama.sh`
82
- Training entry script for LLaMA.
83
-
84
- - `train/train_qwen.sh`
85
- Training entry script for Qwen.
86
-
87
- - `train/training/stanford_alpaca/`
88
- Training utilities and backbone-specific training scripts.
89
-
90
- - `eval/eval.sh`
91
- Evaluation script based on `lm-evaluation-harness`.
92
-
93
- ### Analysis
94
- - `analysis/analyse.py`
95
- Optional task-type classification script for analyzing selected data.
96
-
97
- ### Environment
98
- - `requirements.txt`
99
- Required Python packages for this repository.
100
-
101
- ---
102
-
103
- ## 🗂️ Repository Structure
104
-
105
- ```text
106
- .
107
- ├── README.md
108
- ├── README_zh.md
109
- ├── requirements.txt
110
- ├── ADG/
111
- │ ├── ADG_llama.py
112
- │ └── ADG_qwen.py
113
- ├── generation/
114
- │ ├── generation.py
115
- │ └── embedding/
116
- │ └── embed.py
117
- ├── analysis/
118
- │ └── analyse.py
119
- ├── eval/
120
- │ └── eval.sh
121
- └── train/
122
- ├── train_llama.sh
123
- ├── train_qwen.sh
124
- └── training/
125
- └── stanford_alpaca/
126
- ├── train_llama.py
127
- ├── train_qwen.py
128
- ├── utils.py
129
- └── configs/
130
- ```
131
-
132
- ---
133
 
134
  ## ⚙️ Installation
135
 
@@ -138,6 +61,7 @@ We recommend Python 3.10 or above.
138
  Example:
139
 
140
  ```bash
141
  conda create -n adg python=3.12.9
142
  conda activate adg
143
  pip install -r requirements.txt
@@ -186,25 +110,6 @@ After answer generation, the intermediate JSONL file contains records like:
186
 
187
  ---
188
 
189
- ## 🔄 Pipeline
190
-
191
- The practical workflow is:
192
-
193
- ```text
194
- instruction pool
195
- -> generation/generation.py
196
- -> multi-sample answer JSONL
197
- -> generation/embedding/embed.py
198
- -> instruction embeddings + cluster labels
199
- -> ADG/ADG_llama.py or ADG/ADG_qwen.py
200
- -> top / middle / bottom selected subsets
201
- -> train/train_*.sh
202
- -> finetuned checkpoints
203
- -> eval/eval.sh
204
- ```
205
-
206
- ---
207
-
208
  ## 🚀 Quick Start
209
 
210
  ### Step 1. Prepare the instruction pool
@@ -333,93 +238,15 @@ The evaluation script currently includes:
333
 
334
  ---
335
 
336
- ## 📊 ADG Scoring Intuition
337
-
338
- ADG is built around two complementary signals derived from multiple sampled answers:
339
-
340
- - **Dispersion magnitude**
341
- Measures how widely the sampled answers spread in representation space.
342
-
343
- - **Shape anisotropy**
344
- Measures whether the spread is multi-directional rather than dominated by a single direction.
345
-
346
- The final ADG score combines these two parts, and the selected subset is obtained through semantic bin-wise ranking. This design helps avoid collapsing selection into only a few dense instruction regions.
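The two signals described above can be pictured with a small numeric sketch. This is illustrative only: the repository's actual metrics live in `ADG/ADG_llama.py` and `ADG/ADG_qwen.py`, and the function names and exact formulas here are assumptions, not the released implementation.

```python
import numpy as np

def dispersion_magnitude(answers: np.ndarray) -> float:
    """Mean distance of sampled-answer embeddings to their centroid.

    `answers` is an (n_samples, dim) array of answer embeddings
    for one instruction; larger values mean wider spread.
    """
    centroid = answers.mean(axis=0)
    return float(np.linalg.norm(answers - centroid, axis=1).mean())

def shape_anisotropy(answers: np.ndarray) -> float:
    """Share of variance NOT captured by the top principal direction.

    Near 0 when the spread is dominated by a single direction,
    larger when it is multi-directional. Expects at least two
    samples in at least two dimensions.
    """
    centered = answers - answers.mean(axis=0)
    # Eigenvalues of the sample covariance, sorted largest first.
    eigvals = np.linalg.eigvalsh(np.cov(centered, rowvar=False))[::-1]
    total = eigvals.sum()
    return float(1.0 - eigvals[0] / total) if total > 0 else 0.0
```

A spread lying entirely along one axis scores near zero anisotropy, while an isotropic spread scores higher, matching the "multi-directional rather than single-direction" intuition above.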
347
-
348
- ---
349
-
350
- ## 🛠️ Script Notes
351
-
352
- ### `generation/generation.py`
353
- Main functionality:
354
- - load the base model,
355
- - sample multiple answers for each instruction,
356
- - save generated answers in JSONL format,
357
- - support distributed generation.
358
-
359
- ### `generation/embedding/embed.py`
360
- Main functionality:
361
- - build instruction embeddings,
362
- - run clustering,
363
- - save instruction embeddings and cluster labels,
364
- - provide the semantic bins used by ADG selection.
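The semantic bins can be pictured as a plain k-means pass over instruction embeddings. This is a self-contained sketch, not the released script: `embed.py`'s embedding model, clustering algorithm, and hyperparameters are defined in the repository, and `cluster_bins` is a hypothetical name.

```python
import numpy as np

def cluster_bins(embeddings: np.ndarray, k: int,
                 iters: int = 20, seed: int = 0) -> np.ndarray:
    """Assign each instruction embedding to one of k semantic bins
    using a minimal k-means loop (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct embeddings.
    centers = embeddings[rng.choice(len(embeddings), size=k, replace=False)]
    labels = np.zeros(len(embeddings), dtype=int)
    for _ in range(iters):
        # Distance from every embedding to every center, then nearest-center labels.
        dists = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned embeddings.
        for c in range(k):
            if (labels == c).any():
                centers[c] = embeddings[labels == c].mean(axis=0)
    return labels
```

The returned labels play the role of the cluster assignments that the ADG selector consumes for bin-wise ranking.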
365
-
366
- ### `ADG/ADG_llama.py`
367
- Main functionality:
368
- - read the generated-answer JSONL file,
369
- - compute answer-geometry metrics,
370
- - combine metrics into the ADG score,
371
- - perform proportional cluster-based selection,
372
- - save `top.json`, `middle.json`, and `bottom.json`.
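The proportional cluster-based selection step can be sketched as below. This is a simplified stand-in for the logic in `ADG/ADG_llama.py`: the real script's split ratios, tie-breaking, and output format may differ, and `split_by_cluster` is a name invented for this sketch.

```python
from collections import defaultdict

def split_by_cluster(scores, labels, frac=1 / 3):
    """Rank items by ADG score within each cluster, then slice each
    cluster into top / middle / bottom, so the selected subset is
    spread across semantic bins instead of concentrating in one
    dense cluster. Returns three lists of item indices."""
    by_cluster = defaultdict(list)
    for idx, (score, cluster) in enumerate(zip(scores, labels)):
        by_cluster[cluster].append((score, idx))
    top, middle, bottom = [], [], []
    for items in by_cluster.values():
        items.sort(reverse=True)               # highest ADG score first
        cut = max(1, round(len(items) * frac))
        top += [i for _, i in items[:cut]]
        middle += [i for _, i in items[cut:len(items) - cut]]
        bottom += [i for _, i in items[len(items) - cut:]]
    return top, middle, bottom
```

Writing the three index lists out as `top.json`, `middle.json`, and `bottom.json` would then mirror the outputs the README describes.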
373
-
374
- ### `ADG/ADG_qwen.py`
375
- Main functionality:
376
- - compute ADG metrics for Qwen-generated answers,
377
- - support checkpoint-based resumption,
378
- - perform the same top / middle / bottom selection pipeline.
379
-
380
- ### `analysis/analyse.py`
381
- Main functionality:
382
- - classify instructions into coarse task categories,
383
- - support optional data-level analysis of selected subsets.
384
-
385
- ### `train/train_llama.sh` and `train/train_qwen.sh`
386
- Main functionality:
387
- - launch distributed full fine-tuning,
388
- - use the selected subset for instruction tuning.
389
-
390
- ### `eval/eval.sh`
391
- Main functionality:
392
- - run benchmark evaluation with `lm-evaluation-harness`,
393
- - support reasoning, knowledge, and coding tasks.
394
-
395
- ---
396
-
397
- ## ❓ Common Issues
398
-
399
- ### 1. Path configuration is not updated
400
- Most scripts use placeholder paths. Update all required paths before running.
401
-
402
- ### 2. Inconsistent model and intermediate files
403
- Make sure the generation backbone, embedding backbone, ADG scoring script, and training script are aligned.
404
-
405
- ### 3. Missing intermediate files
406
- The selector depends on:
407
- - generated answer JSONL,
408
- - instruction embeddings,
409
- - clustering results.
410
-
411
- Run the previous stages before starting ADG selection.
412
-
413
- ### 4. GPU memory pressure
414
- Generation, embedding, and scoring all use hidden-state-based processing. You may need to reduce batch size or adjust GPU allocation depending on your hardware.
415
-
416
- ### 5. Evaluation dependency is not installed
417
- `eval/eval.sh` depends on `lm-evaluation-harness`. Install it separately before running evaluation.
418
-
419
- ---
420
-
421
  ## 📖 Citation
422
 
423
- If you use this repository, please cite the paper.
424
 
425
  ---
 
14
 
15
  <h1>Instruction Data Selection via Answer Divergence</h1>
16
  <p>
17
+ <strong>English</strong> | <a href="https://huggingface.co/WisdomShell/ADG-Alpaca-GPT4-LLaMa3-8B/blob/main/README_zh.md">简体中文</a>
18
  </p>
19
 
20
  <a href="https://wisdomshell.github.io/ADG/"><img src="https://img.shields.io/badge/Project-Page-green?logo=githubpages&logoColor=white" /></a>
 
53
  - benchmark evaluation,
54
  - optional task-type analysis.
55
 
56
 
57
  ## ⚙️ Installation
58
 
 
61
  Example:
62
 
63
  ```bash
64
+ git clone https://github.com/WisdomShell/ADG.git
65
  conda create -n adg python=3.12.9
66
  conda activate adg
67
  pip install -r requirements.txt
 
110
 
111
  ---
112
 
113
  ## 🚀 Quick Start
114
 
115
  ### Step 1. Prepare the instruction pool
 
238
 
239
  ---
240
 
241
  ## 📖 Citation
242
 
243
+ ```bibtex
244
+ @article{li2026instruction,
245
+ title={Instruction Data Selection via Answer Divergence},
246
+ author={Li, Bo and Wang, Mingda and Zhang, Shikun and Ye, Wei},
247
+ journal={arXiv preprint arXiv:2604.10448},
248
+ year={2026}
249
+ }
250
+ ```
251
 
252
  ---