Update README.md
---
license: mit
---

## Achieving Superior Performance over QwQ-32B Using Only 965 Strategically Curated Samples

### Model description

Most existing methods focus on distilling DeepSeek-R1 to improve reasoning ability. However, to the best of our knowledge, no distilled model has surpassed DeepSeek-R1 or QwQ-32B. We introduce NTele-R1-32B-DS, a state-of-the-art mathematical reasoning model that outperforms QwQ-32B across common reasoning benchmarks, including AIME2024/2025, MATH500 and GPQA-Diamond.

Notably, NTele-R1-32B-DS is the first to achieve **more than 80/70 on the challenging AIME2024/2025**.

| Model | Trained From | Release Date | AIME2024 (ours/reported) | AIME2025 (ours/reported) | MATH500 (ours/reported) | GPQA-Diamond (ours/reported) |
|-------|-------|-------|-------|-------|-------|-------|
| QwQ-32B | - | 25.3.6 | 76.25 / 79.5 | 67.30 / - | 94.6 / - | 63.6 / - |
| DeepSeek-32B-Distill | Qwen2.5-32B-Instruct | 25.1.20 | 64.17 / 72.6 | 55.21 / - | 89.8 / 94.3 | 62.1 / 62.1 |
| Light-R1-32B-DS | DeepSeek-R1-Distill-Qwen-32B | 25.3.12 | 74.79 / 78.1 | 68.54 / 65.9 | 92 / - | **69.19 / 68.0** |
| AReal-boba-SFT-32B | DeepSeek-R1-Distill-Qwen-32B | 25.3.30 | 70.63 / 78.8 | 63.54 / 62.1 | 88.8 / - | 64.65 / 60.1 |
| NTele-R1-32B-DS | DeepSeek-R1-Distill-Qwen-32B | 25.4.17 | **80.42** / - | **73.54** / - | **95.4** / - | 66.16 / - |

### Data Curation

We start from the S1 dataset and apply the following procedure:

1. QwQ-32B as a Better Teacher:
   - We find that QwQ-32B, with its smoother flow in CoT reasoning, serves as a better teacher than DeepSeek-R1. For each question in the S1 dataset, we sampled 50 responses from QwQ-32B.
2. Focusing on Harder Questions:
   - We evaluated the correctness of the responses to each question, then filtered out the easier questions whose pass rate exceeded 0.6.
3. Diverse Reasoning Paths Break the Limitation of Distillation:
   - To maximize the diversity of reasoning paths, we computed the Levenshtein distance between all answers to each question and selected up to 5 answers per question with the greatest distances, yielding a final dataset of 965 samples.
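The three curation steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the `curate` helper, its input format, and the top-k "total distance to the other answers" selection heuristic are assumptions; only the 0.6 pass-rate threshold and the up-to-5 selection come from the description.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def curate(responses_by_question, pass_threshold=0.6, k=5):
    """responses_by_question maps each question to a list of
    (answer_text, is_correct) pairs sampled from the teacher."""
    dataset = []
    for question, responses in responses_by_question.items():
        # Step 2: drop easy questions whose pass rate exceeds the threshold.
        pass_rate = sum(ok for _, ok in responses) / len(responses)
        if pass_rate > pass_threshold:
            continue
        # Step 3: keep up to k answers that are most distant from the rest
        # (one plausible reading of "greatest distances").
        answers = [text for text, _ in responses]
        def diversity(i):
            return sum(levenshtein(answers[i], answers[j])
                       for j in range(len(answers)) if j != i)
        keep = sorted(range(len(answers)), key=diversity, reverse=True)[:k]
        dataset.extend((question, answers[i]) for i in keep)
    return dataset
```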

You can access our [dataset](https://huggingface.co/datasets/ZTE-AIM/NTele-R1-Data) to obtain the 965 training samples.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6810671ebd0225e7f0fe85d5/w7Ngin9gApGyLCVb5eGkb.png)

### Evaluation

We evaluate models with [SkyThought](https://github.com/NovaSky-AI/SkyThought).

### Training Details

NTele-R1-32B-DS was trained from DeepSeek-32B-Distill on 8x H800 GPUs.

#### Training hyperparameters

- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 6
- total_train_batch_size: 48
- total_eval_batch_size: 48
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10.0
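The batch-size entries are internally consistent: the total train batch size is the per-device batch multiplied by the device count and the gradient accumulation steps. A quick check:

```python
train_batch_size = 1             # per-device micro-batch
num_devices = 8                  # 8x H800
gradient_accumulation_steps = 6

total_train_batch_size = (train_batch_size
                          * num_devices
                          * gradient_accumulation_steps)
print(total_train_batch_size)    # 48, matching the reported value
```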