Improve model card with Github link
#1
by nielsr (HF Staff) - opened

README.md CHANGED
---
base_model:
- Qwen/Qwen2.5-32B-Instruct
datasets:
- LONG-1k
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
---

# Long Is More Important Than Difficult for Training Reasoning Models

The model was presented in the paper [Long Is More Important Than Difficult for Training Reasoning Models](https://huggingface.co/papers/2503.18069).

# Paper abstract

Difficult problems, which often result in long reasoning traces, are widely recognized as key factors for enhancing the performance of reasoning models. However, such high-challenge problems are scarce, limiting the size of available datasets. In this paper, we propose a simple method to decouple the reliance on problem difficulty. First, we empirically demonstrate that reasoning length, rather than problem difficulty, primarily influences the performance of trained models. Second, we identify a scaling law on reasoning length, showing that model performance increases in a log-linear fashion as the reasoning data length grows. Finally, we introduce a straightforward technique to generate reasoning data of arbitrary length, and show that synthesized data is effective for training reasoning models. After fine-tuning the Qwen2.5-32B-Instruct language model on our Long1K dataset, we present our model, Long1K-32B, which achieves remarkable performance with only 1,000 training samples, achieving 95.6% accuracy on MATH and 71.1% on GPQA, outperforming DeepSeek-R1-Distill-Qwen-32B. The model, code, and dataset are all open-sourced, available at [https://huggingface.co/ZTss/LONG1](https://huggingface.co/ZTss/LONG1).

# Model Description

This model, Long1K-32B, is a fine-tuned version of Qwen2.5-32B-Instruct trained on the Long1K dataset. The key finding of the accompanying paper is that reasoning length, more so than problem difficulty, significantly impacts the performance of reasoning models.

# Key Findings

1. **Challenging a common assumption:** The paper challenges the prevalent belief that problem difficulty is the most critical factor in training high-performance reasoning models. Experiments suggest that reasoning length is key.
2. **Identifying a scaling law on reasoning length:** Model performance improves nearly linearly as the length of training data increases exponentially.
3. **Proposing a simple synthesis method:** A technique for generating arbitrarily long reasoning data is introduced. The Long1K dataset, used to fine-tune Long1K-32B, is created using this method.

# Detail

In this work, we first conducted two sets of experiments: comparing synthesized long problems against synthesized difficult problems, and comparing synthesized long problems against the original difficult problems. The results are shown in the figure below. The models performed similarly on mathematical reasoning whenever the training token lengths were similar, so we conclude that the key factor affecting a model's reasoning ability is not problem difficulty.



Therefore, we shifted our focus from the difficulty of mathematical problems to their length, hypothesizing that length is the key factor in building reasoning models. To this end, we explored the effect of different token lengths on the model's reasoning ability at the same difficulty level. First, we divided the training data into four token-length levels: 1.5k, 3k, 6k, and 12k. We then fixed the number of questions at 500 and ran experiments on the Qwen2.5-32B model. The results are shown below: on the MATH500 dataset, performance increases close to linearly as the reasoning length increases.
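The near-linear gain per doubling of reasoning length described above is a log-linear relationship. The sketch below is illustrative only: `a` and `b` are free coefficients that would be fit to measured benchmark accuracies, and the values used here are arbitrary placeholders, not numbers from the paper.

```python
import math

def loglinear_accuracy(avg_tokens: float, a: float, b: float) -> float:
    # Log-linear scaling sketch: accuracy grows by a constant amount b for
    # every doubling of average reasoning-trace length. a and b are free
    # parameters (the 0.0 and 2.0 below are arbitrary placeholders).
    return a + b * math.log2(avg_tokens)

# Each doubling of length (1.5k -> 3k and 6k -> 12k) adds the same gain b:
gain_low = loglinear_accuracy(3_000, 0.0, 2.0) - loglinear_accuracy(1_500, 0.0, 2.0)
gain_high = loglinear_accuracy(12_000, 0.0, 2.0) - loglinear_accuracy(6_000, 0.0, 2.0)
assert abs(gain_low - gain_high) < 1e-9  # constant gain per doubling
```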


In addition, we compared the reasoning processes of two models trained with reasoning lengths of 1.5k and 12k, respectively, on the MATH500 test set, including both successful and failed reasoning attempts. Our analysis included statistical comparisons of the average reasoning token length and the top 10 most frequently used words during reasoning. The goal was to understand why the model trained with a reasoning length of 12k achieved an accuracy improvement of over 5%.

| Dataset Size | Correct/Wrong | Average Tokens | Top 10 Frequently Occurring Words |
|--------------|---------------|----------------|-----------------------------------|
| 12k | Wrong | 15694.54 | the(5.12%) is(2.85%) to(1.64%) and(1.42%) **but(1.27%)** of(1.20%) so(1.08%) **wait(0.80%)** that(0.80%) in(0.75%) |
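Word-frequency statistics like those in the table can be reproduced with a simple token counter. The sketch below is illustrative, not the analysis code used in the paper:

```python
import re
from collections import Counter

def top_words(text: str, k: int = 10) -> list:
    # Lowercase the trace, split it into word tokens, and report each word's
    # share of all tokens, mirroring the percentage columns in the table above.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return [(word, count / total) for word, count in counts.most_common(k)]

sample = "wait, the answer is wrong, so the sum is not the same"
ranked = top_words(sample, 3)
```

Running this over the collected reasoning traces yields frequency tables like the one above, where connective words such as "but" and "wait" mark self-correction behavior in the longer traces.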
# Training Data

We conducted relevant experiments using our own synthesized [LONG1k](https://huggingface.co/datasets/ZTss/LONG1k) dataset. LONG1k is a composite dataset generated for model training from two datasets, OpenThoughts114k and s1.1. We randomly selected two mathematical problems from OpenThoughts114k and concatenated them using linking words to increase prompt length. To avoid overfitting, we also included mathematical problems from the s1.1 dataset that met the length requirements. The ratio of problem lengths and markers was dynamically adjusted in different experiments.
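The concatenation idea can be sketched as follows. The linking phrases below are illustrative placeholders, not the actual markers used to build LONG1k:

```python
import random

# Hypothetical linking phrases; the real markers used for LONG1k may differ.
LINKERS = [
    "After finishing the problem above, please solve the following problem as well.",
    "Once you have answered the first question, continue with this second question.",
]

def synthesize_long_problem(problem_a: str, problem_b: str, rng=None) -> str:
    # Concatenate two problems with a linking phrase to lengthen the prompt,
    # which in turn elicits a longer combined reasoning trace.
    rng = rng or random.Random()
    linker = rng.choice(LINKERS)
    return f"{problem_a}\n\n{linker}\n\n{problem_b}"

combined = synthesize_long_problem("Compute 2 + 2.", "Factor x^2 - 1.")
```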
# Evaluation
|---|---|---|---|---|---|
| s1-32B | 1k | 92.6 | 50.0 | 26.7 | 56.6 |
| s1.1-32B | 1k | 89.0 | 64.7 | 49.3 | 60.1 |
| LIMO | 0.8k | 94.8 | 57.1 | 49.3 | 66.7 |
| OpenThinker-32B | 114k | 90.6 | 66.0 | 53.3 | 61.6 |
| DeepSeek-R1-Distill-Qwen-32B | 800K | 93.0 | 72.6 | 55.9 | 62.1 |
| Long1K-32B | 1K | **95.6** | 50.7 | 53.3 | **71.1** |
Performance comparison of different models across multiple reasoning benchmarks (pass@1). The best result for each benchmark is highlighted in bold, the second-best underlined. The s1 results do not use budget forcing; the s1.1 results without budget forcing come from Open Thoughts.
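As a reminder of the metric, pass@1 here is the percentage of benchmark problems whose single sampled answer is correct; a minimal sketch:

```python
def pass_at_1(correct_flags) -> float:
    # pass@1 with one sample per problem: the fraction of problems whose
    # single generated answer is correct, expressed as a percentage.
    return 100.0 * sum(correct_flags) / len(correct_flags)

score = pass_at_1([True, False, True, True])  # 3 of 4 problems solved
```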
# Uses

The model can be used for text generation on reasoning-style prompts with the `transformers` library, following its Qwen2.5-Instruct chat format.
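A minimal inference sketch with `transformers` (assumptions: the checkpoint is hosted at `ZTss/LONG1` as linked above and keeps the Qwen2.5-Instruct chat template; the system prompt and generation settings are illustrative, not prescribed by the paper):

```python
def build_messages(problem: str) -> list:
    # Qwen2.5-Instruct-style chat messages; the system prompt is illustrative.
    return [
        {"role": "system", "content": "You are a helpful assistant. Reason step by step."},
        {"role": "user", "content": problem},
    ]

if __name__ == "__main__":
    # Assumption: the fine-tuned checkpoint is published at ZTss/LONG1.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "ZTss/LONG1"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    prompt = tokenizer.apply_chat_template(
        build_messages("How many positive divisors does 360 have?"),
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Long reasoning traces need a generous token budget.
    outputs = model.generate(**inputs, max_new_tokens=16384)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True))
```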
# Authors

- **Name**: Tiansheng Zheng
**Organization**: Nanjing Agricultural University

- **Name**:
**Organization**: Nanjing Agricultural University

- **Name**: Fei Huang

- **Name**: Danhao Zhu
**Email**: zhudanhao@jspi.cn
**Organization**: Jiangsu Police Institute

# Code and Dataset

The code and dataset for this model are available on GitHub: [https://github.com/ZTss/LONG1](https://github.com/ZTss/LONG1)
|