Improve language tag

#3
by lbourdois - opened
Files changed (1)
  README.md +121 -108
README.md CHANGED
@@ -1,108 +1,121 @@
- 
- ---
- license: apache-2.0
- datasets:
- - LONG-1k
- base_model:
- - Qwen/Qwen2.5-32B-Instruct
- pipeline_tag: text-generation
- library_name: transformers
- ---
+ ---
+ license: apache-2.0
+ datasets:
+ - LONG-1k
+ base_model:
+ - Qwen/Qwen2.5-32B-Instruct
+ pipeline_tag: text-generation
+ library_name: transformers
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ ---


# Model Description

- **Paper:** https://arxiv.org/abs/2503.18069

Difficult problems, which often produce long reasoning traces, are widely recognized as key to enhancing the performance of reasoning models. However, such high-challenge problems are scarce, limiting the size of available datasets. In this paper, we propose a simple method to decouple training from problem difficulty. First, we empirically demonstrate that reasoning length, rather than problem difficulty, primarily influences the performance of trained models. Second, we identify a scaling law on reasoning length, showing that model performance increases log-linearly as the reasoning data length grows. Finally, we introduce a straightforward technique to generate reasoning data of arbitrary length, and show that the synthesized data is effective for training reasoning models. After fine-tuning the Qwen2.5-32B-Instruct language model on our Long1K dataset, our model, Long1-32B, achieves remarkable performance with only 1,000 training samples, reaching 95.6% accuracy on MATH500 and 71.1% on GPQA Diamond, outperforming DeepSeek-R1-Distill-Qwen-32B.
1. Challenging a common assumption: We question the prevalent belief that problem difficulty is the most critical factor. Instead, our experiments suggest that reasoning length is key to training high-performance reasoning models. This insight allows us to build large-scale, long-reasoning datasets without being constrained by the rarity of extremely difficult problems.
2. Identifying a scaling law on reasoning length: We observe that model performance improves nearly linearly as the length of training data increases exponentially. This highlights the efficiency gains achievable by focusing on the length of reasoning sequences.
3. Proposing a simple synthesis method: We introduce a technique to generate arbitrarily long reasoning data. Using this method, we release the Long1K dataset, on which our Long1K-32B model is fine-tuned. The model surpasses existing baselines on benchmarks such as MATH500 and GPQA Diamond, demonstrating that extended reasoning sequences can significantly enhance model performance.



# Details

In this work, we first ran two sets of experiments: pairing synthetic long problems with synthetic difficult problems, and pairing synthetic long problems with the original difficult problems. The results are shown in the figure below. The models perform similarly on mathematical reasoning whenever the training token lengths are similar, from which we conclude that the key factor affecting a model's reasoning performance is not problem difficulty.

![Figure 1](fig1.png)


Therefore, we shifted our focus from the difficulty of mathematical problems to their length, hypothesizing that length is the key factor in building reasoning models. To this end, we explored the effect of different token lengths on the model's reasoning ability at the same difficulty level. First, we divided token length into four levels: 1.5k, 3k, 6k, and 12k. We then set the number of questions to 500 and ran the experiments on the Qwen2.5-32B model. The results are shown below: on the MATH500 dataset, performance increases nearly linearly as the reasoning length grows.

![Figure 2](fig2.png)
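The log-linear trend above can be illustrated with a least-squares fit of accuracy against log2 of the reasoning length. A minimal sketch in plain Python; the accuracy values are made-up placeholders, not the paper's measurements:

```python
import math

# Hypothetical (length, accuracy) pairs for illustration only --
# not the paper's measured numbers.
lengths = [1500, 3000, 6000, 12000]
accuracies = [0.80, 0.84, 0.88, 0.92]

# Least-squares fit of accuracy = a * log2(length) + b.
xs = [math.log2(n) for n in lengths]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(accuracies) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

# With each doubling of length adding a fixed accuracy gain, the fit is
# exact: performance grows linearly in log2(length).
print(f"accuracy = {a:.3f} * log2(length) + {b:.3f}")
```

Under this framing, "close to linear" means a constant accuracy gain per doubling of the reasoning length.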

In addition, we compared the reasoning processes of two models trained with reasoning lengths of 1.5k and 12k on the MATH500 test set, covering both successful and failed reasoning attempts. Our analysis compared the average reasoning token length and the top 10 most frequently used words during reasoning. The goal was to understand why the model trained with a reasoning length of 12k achieved an accuracy improvement of over 5%.

| Reasoning Length | Correct/Wrong | Average Tokens | Top 10 Frequently Occurring Words |
|------------------|---------------|----------------|------------------------------------------------------------------------------------------------|
| 1.5k | Correct | 2147.65 | the(5.30%) is(3.24%) so(1.98%) of(1.45%) to(1.44%) and(1.25%) that(1.17%) let(1.08%) **wait(1.07%)** **but(0.91%)** |
| 12k | Correct | 4716.27 | the(4.92%) is(3.04%) so(1.83%) to(1.41%) of(1.25%) and(1.19%) **but(1.19%)** let(0.93%) that(0.90%) **wait(0.81%)** |
| 1.5k | Wrong | 8247.21 | **but(5.05%)** the(5.00%) **wait(3.78%)** is(3.24%) of(1.29%) so(1.26%) therefore(1.16%) to(1.08%) and(1.01%) that(0.70%) |
| 12k | Wrong | 15694.54 | the(5.12%) is(2.85%) to(1.64%) and(1.42%) **but(1.27%)** of(1.20%) so(1.08%) **wait(0.80%)** that(0.80%) in(0.75%) |
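Word frequencies like those in the table can be computed with a simple token count over the reasoning traces. A minimal sketch, run here on a single made-up trace rather than the actual MATH500 outputs:

```python
import re
from collections import Counter

# A made-up reasoning trace for illustration; the real analysis runs
# over the model's generated traces on MATH500.
trace = "So the sum is 10. Wait, but the second term is 3, so the sum is 13."

# Lowercase and keep only alphabetic words, dropping numbers/punctuation.
words = re.findall(r"[a-z]+", trace.lower())
counts = Counter(words)
total = len(words)

# Report each word's share of all word tokens, most frequent first.
for word, count in counts.most_common(10):
    print(f"{word}({count / total:.2%})")
```

Connective words such as "wait" and "but" are of particular interest, since the table shows them spiking in the wrong answers of the 1.5k-trained model.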




# Training Data
We conducted our experiments using our own synthesized [LONG1k](https://huggingface.co/datasets/ZTss/LONG1k) dataset. LONG1k is a composite dataset built for model training from two sources, OpenThoughts-114k and s1.1. On one hand, we randomly select two mathematical problems from OpenThoughts-114k and concatenate their problems, reasoning processes, and answers with different linking words to increase the prompt length. On the other hand, to keep the model from overfitting to two-problem prompts and to improve robustness, we also extract a number of mathematical problems that meet the length requirements from the s1.1 dataset and merge them into LONG1k. The final LONG1k training data thus consists of these two parts; in different experiments, the ratio between the two parts and the token lengths are adjusted according to the experimental requirements.
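The concatenation step can be sketched as follows. The record fields and linking phrases here are assumptions for illustration, not the exact ones used to build LONG1k:

```python
import random

# Made-up records standing in for OpenThoughts-114k problems.
problems = [
    {"question": "What is 2 + 3?", "reasoning": "2 + 3 = 5.", "answer": "5"},
    {"question": "What is 4 * 6?", "reasoning": "4 * 6 = 24.", "answer": "24"},
]

# Hypothetical linking phrases; varying these lengthens and diversifies
# the synthesized prompts.
linkers = ["After solving that, consider:", "Next, a second problem:"]

def synthesize_long_sample(pair, rng):
    """Concatenate two problems (question, reasoning, answer) into one
    long training sample, joined by a randomly chosen linking phrase."""
    parts = []
    for i, p in enumerate(pair):
        parts.append(f"{p['question']}\n{p['reasoning']}\nAnswer: {p['answer']}")
        if i == 0:
            parts.append(rng.choice(linkers))
    return "\n".join(parts)

sample = synthesize_long_sample(problems, random.Random(0))
print(sample)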


# Evaluation

| Model | Dataset Size | MATH_500 | AIME_2024 | AIME_2025 | GPQA_Diamond |
|---|---|---|---|---|---|
| s1-32B | 1k | 92.6 | 50.0 | 26.7 | 56.6 |
| s1.1-32B | 1k | 89.0 | 64.7 | 49.3 | 60.1 |
| LIMO | 0.8k | <u>94.8</u> | 57.1 | 49.3 | <u>66.7</u> |
| OpenThinker-32B | 114k | 90.6 | <u>66.0</u> | <u>53.3</u> | 61.6 |
| DeepSeek-R1-Distill-Qwen-32B | 800k | 93.0 | **72.6** | **55.9** | 62.1 |
| Long1-32B | 1k | **95.6** | 50.7 | <u>53.3</u> | **71.1** |

Performance comparison of different models across multiple reasoning benchmarks (pass@1). The best result on each benchmark is highlighted in bold, with the second-best underlined. The s1 results do not use budget forcing, and the s1.1 results without budget forcing are taken from Open Thoughts.
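With a single greedy sample per problem, pass@1 reduces to the fraction of problems answered correctly. A tiny sketch with made-up per-problem results:

```python
# Made-up correctness flags for one run over ten problems --
# illustration only, not benchmark data.
results = [True, True, False, True, False, True, True, True, False, True]

# pass@1 with one sample per problem is just the mean correctness.
pass_at_1 = sum(results) / len(results)
print(f"pass@1 = {pass_at_1:.1%}")
```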


# Uses
We have uploaded our reasoning and evaluation scripts. If you are interested in using them, please follow the steps below.

## Requirements
```
pip install -r requirements.txt
```
## Reasoning
After downloading the model, run the following command to perform inference.
```
bash predict.sh
```


## Evaluation
Run the following command to compute the evaluation metrics.
```
python calc_metric_lc.py
```

# Authors

- **Name**: Si Shen
  **Organization**: Nanjing University of Science and Technology

- **Name**: Zhixiao Zhao
  **Organization**: Nanjing Agricultural University

- **Name**: Chang Liu
  **Organization**: Nanjing Agricultural University

- **Name**: Tiansheng Zheng
  **Organization**: Nanjing Agricultural University

- **Name**:
  **Organization**: Nanjing Agricultural University

- **Name**: Fei Huang
  **Organization**: Nanjing University of Science and Technology

- **Name**: Danhao Zhu
  **Email**: zhudanhao@jspi.cn
  **Organization**: Jiangsu Police Institute