---
base_model:
  - Qwen/Qwen2.5-32B-Instruct
datasets:
  - LONG-1k
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
---

# Long Is More Important Than Difficult for Training Reasoning Models

The model was presented in the paper Long Is More Important Than Difficult for Training Reasoning Models.

## Paper abstract

Difficult problems, which often result in long reasoning traces, are widely recognized as key factors for enhancing the performance of reasoning models. However, such high-challenge problems are scarce, limiting the size of available datasets. In this paper, we propose a simple method to decouple the reliance on problem difficulty. First, we empirically demonstrate that reasoning length, rather than problem difficulty, primarily influences the performance of trained models. Second, we identify a scaling law on reasoning length, showing that model performance increases in a log-linear fashion as the reasoning data length grows. Finally, we introduce a straightforward technique to generate reasoning data of arbitrary length, and show that synthesized data is effective for training reasoning models. After fine-tuning the Qwen2.5-32B-Instruct language model on our Long1K dataset, we present our model, Long1K-32B, which achieves remarkable performance with only 1,000 training samples, achieving 95.6% accuracy on MATH and 71.1% on GPQA, outperforming DeepSeek-R1-Distill-Qwen-32B. The model, code, and dataset are all open-sourced, available at https://huggingface.co/ZTss/LONG1.

## Model Description

This model, Long1K-32B, is a fine-tuned version of Qwen2.5-32B-Instruct trained on the Long1K dataset. The key finding of the accompanying paper is that reasoning length, more so than problem difficulty, significantly impacts the performance of reasoning models.

## Key Findings

  1. Challenging a common assumption: The paper challenges the prevalent belief that problem difficulty is the most critical factor in training high-performance reasoning models. Experiments suggest that reasoning length is key.
  2. Identifying a scaling law on reasoning length: Model performance improves nearly linearly as the length of training data increases exponentially.
  3. Proposing a simple synthesis method: A technique for generating arbitrarily long reasoning data is introduced. The Long1K dataset, used to fine-tune Long1K-32B, is created using this method.

## Details

In this work, we first ran two sets of experiments: comparing synthetic long problems with synthetic difficult problems, and comparing synthetic long problems with the original difficult problems. The results are shown in the figure below. The models perform similarly on mathematical reasoning when their training token lengths are similar, so we conclude that problem difficulty is not the key factor driving reasoning performance.

![Comparison of models trained on long vs. difficult problems](img_3.png)

Therefore, we shifted our focus from the difficulty of mathematical problems to the length of their reasoning traces, hypothesizing that length is the key factor in building reasoning models. To test this, we examined the effect of different token lengths on reasoning ability at a fixed difficulty level. We first divided the training data into four token-length levels: 1.5k, 3k, 6k, and 12k. We then fixed the number of questions at 500 per level and fine-tuned Qwen2.5-32B on each. The results are shown below: on the MATH500 dataset, performance increases nearly linearly as the reasoning length level grows.

![MATH500 accuracy across the four token-length levels](img_2.png)
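Because the four length levels double at each step, the reported trend amounts to a log-linear fit of accuracy against reasoning length. The sketch below illustrates such a fit with a plain least-squares regression; the accuracy values are hypothetical placeholders, not the measured MATH500 numbers:

```python
import math

# Token-length levels used in the experiment.
lengths = [1500, 3000, 6000, 12000]
# Hypothetical accuracies (placeholders, not the paper's measurements).
accuracy = [80.0, 84.0, 88.0, 92.0]

# Least-squares fit of accuracy = a + b * log2(length).
xs = [math.log2(n) for n in lengths]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(accuracy) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = mean_y - b * mean_x

# With the placeholder data, each doubling of length adds b points.
print(f"accuracy = {a:.2f} + {b:.2f} * log2(length)")
```

Under this model, each doubling of the reasoning length adds a constant number of accuracy points, which is the log-linear scaling behaviour described above.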

In addition, we compared the reasoning processes on the MATH500 test set of two models trained with reasoning lengths of 1.5k and 12k, covering both successful and failed reasoning attempts. Our analysis compared the average reasoning token length and the ten most frequently used words during reasoning, aiming to understand why the model trained with 12k-length reasoning achieved an accuracy improvement of over 5%.

| Dataset Size | Correct/Wrong | Average Tokens | Top 10 Most Frequent Words |
|---|---|---|---|
| 1.5k | Correct | 2147.65 | the (5.30%), is (3.24%), so (1.98%), of (1.45%), to (1.44%), and (1.25%), that (1.17%), let (1.08%), wait (1.07%), but (0.91%) |
| 12k | Correct | 4716.27 | the (4.92%), is (3.04%), so (1.83%), to (1.41%), of (1.25%), and (1.19%), but (1.19%), let (0.93%), that (0.90%), wait (0.81%) |
| 1.5k | Wrong | 8247.21 | but (5.05%), the (5.00%), wait (3.78%), is (3.24%), of (1.29%), so (1.26%), therefore (1.16%), to (1.08%), and (1.01%), that (0.70%) |
| 12k | Wrong | 15694.54 | the (5.12%), is (2.85%), to (1.64%), and (1.42%), but (1.27%), of (1.20%), so (1.08%), wait (0.80%), that (0.80%), in (0.75%) |
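Statistics of this kind can be computed with a short script. The sketch below uses toy traces and a simple whitespace/regex tokenizer rather than the actual MATH500 outputs and the model's tokenizer:

```python
import re
from collections import Counter

def trace_stats(traces, top_k=10):
    """Average whitespace-token count and top-k word frequencies (as %)."""
    token_counts = [len(t.split()) for t in traces]
    avg_tokens = sum(token_counts) / len(traces)

    words = Counter()
    for t in traces:
        # Lowercase and keep alphabetic words (with apostrophes).
        words.update(re.findall(r"[a-z']+", t.lower()))
    total = sum(words.values())
    top = [(w, 100.0 * c / total) for w, c in words.most_common(top_k)]
    return avg_tokens, top

# Toy reasoning traces standing in for the real model outputs.
traces = [
    "So the answer is 7. Wait, let me check that the sum is 7.",
    "The integral is zero, so the area of the region is zero.",
]
avg, top = trace_stats(traces)
print(f"average tokens: {avg:.2f}")
for word, pct in top:
    print(f"{word}({pct:.2f}%)")
```

On real traces the same counting reproduces the table's pattern, e.g. the high frequency of "wait" and "but" in failed attempts.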

## Training Data

We conducted our experiments using our own synthesized LONG1k dataset. LONG1k is a composite dataset generated for model training from two sources, OpenThoughts114k and s1.1. We randomly selected two mathematical problems from OpenThoughts114k and concatenated them with linking words to increase prompt length. To avoid overfitting, we also included mathematical problems from the s1.1 dataset that met the length requirements. The ratio of problem lengths and the linking markers were adjusted dynamically across experiments.
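The concatenation step can be sketched as follows. The linking phrase and the sampling scheme here are illustrative assumptions, not the exact ones used to build LONG1k:

```python
import random

# Hypothetical linking phrase; the actual linking words are not specified here.
LINK = "After solving the problem above, now consider the following problem."

def synthesize_long_problem(pool, rng):
    """Concatenate two distinct randomly chosen problems to lengthen the prompt."""
    p1, p2 = rng.sample(pool, 2)
    return f"{p1}\n\n{LINK}\n\n{p2}"

# Toy problem pool standing in for OpenThoughts114k math problems.
pool = [
    "Find the sum of the first 100 positive integers.",
    "How many primes are less than 30?",
    "Compute the area of a circle with radius 3.",
]
print(synthesize_long_problem(pool, random.Random(0)))
```

A reasoning trace for the combined prompt then covers both sub-problems, which is what makes the synthesized training samples arbitrarily long.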

## Evaluation

| Model | Dataset Size | MATH500 | AIME 2024 | AIME 2025 | GPQA Diamond |
|---|---|---|---|---|---|
| s1-32B | 1k | 92.6 | 50.0 | 26.7 | 56.6 |
| s1.1-32B | 1k | 89.0 | 64.7 | 49.3 | 60.1 |
| LIMO | 0.8k | 94.8 | 57.1 | 49.3 | 66.7 |
| OpenThinker-32B | 114k | 90.6 | 66.0 | 53.3 | 61.6 |
| DeepSeek-R1-Distill-Qwen-32B | 800k | 93.0 | **72.6** | **55.9** | 62.1 |
| Long1-32B | 1k | **95.6** | 50.7 | 53.3 | **71.1** |

Performance comparison of different models across multiple reasoning benchmarks (pass@1). The best result for each benchmark is highlighted in bold. The s1 results are reported without budget forcing, and the s1.1 results without budget forcing are taken from Open Thoughts.

## Uses

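A minimal loading and generation sketch with Hugging Face transformers. The repo id below is an assumption based on this model card's location (adjust it to the actual repository), and long reasoning traces generally need a generous `max_new_tokens` budget:

```python
# Minimal generation sketch for a Qwen2.5-based chat model.
MODEL_ID = "ZTss/LONG1k-32B"  # assumed repo id; substitute the real one

def build_messages(question):
    """Wrap a question in the chat format used by Qwen2.5-based models."""
    return [{"role": "user", "content": question}]

def generate(question, max_new_tokens=8192):
    # Imported lazily so the sketch can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    prompt = tokenizer.apply_chat_template(
        build_messages(question), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

print(build_messages("How many primes are less than 30?"))
```

Running the 32B checkpoint requires substantial GPU memory; `device_map="auto"` (via accelerate) will shard it across available devices.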

## Authors

- Si Shen, Nanjing University of Science and Technology
- Zhixiao Zhao, Nanjing Agricultural University
- Chang Liu, Nanjing Agricultural University
- Tiansheng Zheng, Nanjing Agricultural University
- Nanjing Agricultural University
- Fei Huang, Nanjing University of Science and Technology
- Danhao Zhu (zhudanhao@jspi.cn), Jiangsu Police Institute

## Code and Dataset

The code and dataset for this model are available on GitHub: https://github.com/ZTss/LONG1