Rephrase dataset description
#5
by cmpatino (HF Staff) - opened

app/src/content/article.mdx
CHANGED
@@ -224,7 +224,7 @@ Below is an example of the system and user prompts we pass to the model for the
 
 ### Dataset
 
-We sourced all the prompts from the [Jiayi-Pan/Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) dataset. Our dataset contains 80k training prompts and 10k testing prompts selected randomly. We then generated responses from the `Qwen/Qwen2.5-7B-Instruct` and `Qwen/Qwen3-4B-Instruct-2507` teacher models, including only the prompts that had the correct answers from the teachers.
+We sourced all the prompts from the [Jiayi-Pan/Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) dataset. Our full dataset contains 80k training prompts and 10k testing prompts, selected randomly. We then generated responses from the `Qwen/Qwen2.5-7B-Instruct` and `Qwen/Qwen3-4B-Instruct-2507` teacher models, keeping only the prompts for which the teacher produced a correct answer. The published dataset contains 30.4k prompts with `Qwen/Qwen2.5-7B-Instruct` generations and 27.7k with `Qwen/Qwen3-4B-Instruct-2507` generations. We use the 30.4k-prompt training split for all the on-policy experiments, since those use the student's generations instead of the teacher's completions.
 
 <iframe
   src="https://huggingface.co/datasets/HuggingFaceTB/Countdown-Task-GOLD/embed/viewer/verified_Qwen3-4B-Instruct-2507/train"
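
The filtering described in the paragraph keeps only prompts whose teacher response is verifiably correct. A minimal sketch of what such a Countdown verifier might look like, assuming the answer is a plain arithmetic expression; the function name and answer format are illustrative assumptions, not the article's actual code:

```python
import re

def verify_countdown_answer(answer: str, numbers: list[int], target: int) -> bool:
    """Return True if `answer` uses exactly `numbers` and evaluates to `target`."""
    # Reject anything that is not plain integer arithmetic.
    if not re.fullmatch(r"[\d+\-*/() ]+", answer):
        return False
    # The multiset of literals in the expression must match the given numbers.
    used = [int(tok) for tok in re.findall(r"\d+", answer)]
    if sorted(used) != sorted(numbers):
        return False
    try:
        # Safe here because the character whitelist above excludes names/calls.
        return eval(answer) == target
    except (SyntaxError, ZeroDivisionError):
        return False
```

A response like `(25 - 5) * 4` for numbers `[4, 5, 25]` and target `80` would pass, while an expression that ignores a number or misses the target would be dropped from the dataset.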