ZTss committed · Commit 8070be1 (verified) · Parent: b2f96ef

Update README.md

Files changed (1): README.md (+10 −8)
README.md CHANGED

@@ -46,14 +46,16 @@ We also found that increasing the number of topics on low-length data increases
 
 # Evaluation
 
-| Model | Dataset Size | AIME_2024 | MATH_500 |
-|---|---|---|---|
-| o1-Preview | - | 40.0 | 81.4 |
-| DeepSeek-R1-Distill-Qwen-32B | 800k | 72.6 | 94.3 |
-| OpenThinker-32B | 114k | 66.0 | 90.6 |
-| DeepSeek-R1 | >800K | 79.8 | 97.3 |
-| s1.1-32B | 1K | 56.0(56.7) | 94.4(95.4) |
-| LONG1 | 1K | 63.3 | 95.6 |
+| Model | Dataset Size | MATH_500 | AIME_2024 | AIME_2025 | GPQA_Diamond |
+|---|---|---|---|---|---|
+| s1-32B | 1k | 92.6 | 50.0 | 26.7 | 56.6 |
+| s1.1-32B | 1k | 89.0 | 64.7 | 49.3 | 60.1 |
+| LIMO | 0.8k | <u>94.8</u> | 57.1 | 49.3 | <u>66.7</u> |
+| OpenThinker-32B | 114k | 90.6 | <u>66.0</u> | <u>53.3</u> | 61.6 |
+| DeepSeek-R1-Distill-Qwen-32B | 800k | 93.0 | **72.6** | **55.9** | 62.1 |
+| Long1-32B | 1k | **95.6** | 50.7 | <u>53.3</u> | **71.1** |
+
+Performance comparison of different models across multiple reasoning benchmarks (pass@1). The best result on each benchmark is shown in bold, the second best underlined. Results for s1 are reported without budget forcing, and the without-budget-forcing results for s1.1 are taken from Open Thoughts.
 
 
 # Uses
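The table added above reports pass@1 scores. As background (not part of this commit), pass@k is commonly computed with the unbiased estimator popularized by the HumanEval evaluation; a minimal sketch, where `n` sampled solutions yield `c` correct ones:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct.

    If fewer than k samples are incorrect, some k-subset must
    contain a correct sample, so the estimate is exactly 1.
    """
    if n - c < k:
        return 1.0
    # Probability that a random k-subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k = 1 this reduces to the fraction of correct samples:
print(pass_at_k(4, 1, 1))  # 0.25
```

With k = 1 the estimator is simply c / n, i.e. the average per-sample accuracy reported in the table.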