---
base_model:
  - Qwen/Qwen2.5-32B-Instruct
datasets:
  - LONG-1k
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
---

# Long Is More Important Than Difficult for Training Reasoning Models

The model was presented in the paper Long Is More Important Than Difficult for Training Reasoning Models.

## Paper abstract

Difficult problems, which often result in long reasoning traces, are widely recognized as key factors for enhancing the performance of reasoning models. However, such high-challenge problems are scarce, limiting the size of available datasets. In this paper, we propose a simple method to decouple the reliance on problem difficulty. First, we empirically demonstrate that reasoning length, rather than problem difficulty, primarily influences the performance of trained models. Second, we identify a scaling law on reasoning length, showing that model performance increases in a log-linear fashion as the reasoning data length grows. Finally, we introduce a straightforward technique to generate reasoning data of arbitrary length, and show that synthesized data is effective for training reasoning models. After fine-tuning the Qwen2.5-32B-Instruct language model on our Long1K dataset, we present our model, Long1K-32B, which achieves remarkable performance with only 1,000 training samples, achieving 95.6% accuracy on MATH and 71.1% on GPQA, outperforming DeepSeek-R1-Distill-Qwen-32B. The model, code, and dataset are all open-sourced, available at https://huggingface.co/ZTss/LONG1.

## Model Description

This model, Long1K-32B, is a fine-tuned version of Qwen2.5-32B-Instruct trained on the Long1K dataset. The key finding of the accompanying paper is that reasoning length, more so than problem difficulty, significantly impacts the performance of reasoning models.

## Key Findings

  1. Challenging a common assumption: The paper challenges the prevalent belief that problem difficulty is the most critical factor in training high-performance reasoning models. Experiments suggest that reasoning length is key.
  2. Identifying a scaling law on reasoning length: Model performance improves nearly linearly as the length of training data increases exponentially.
  3. Proposing a simple synthesis method: A technique for generating arbitrarily long reasoning data is introduced. The Long1K dataset, used to fine-tune Long1K-32B, is created using this method.

## Details

In this work, we first ran two sets of experiments: comparing synthetic long problems with synthetic difficult problems, and comparing synthetic long problems with the original difficult problems. The results are shown in the figure below. The models perform similarly on mathematical reasoning when their training token lengths are similar, so we conclude that problem difficulty is not the key factor driving reasoning performance.

![Comparison of models trained on long vs. difficult problems](img_3.png)

Therefore, we shifted our focus from the difficulty of mathematical problems to the length of their reasoning traces, hypothesizing that length is the key factor in building reasoning models. To test this, we examined the effect of different token lengths on reasoning ability at a fixed difficulty level. We first divided the training data into four token-length levels: 1.5k, 3k, 6k, and 12k. We then fixed the number of questions at 500 per level and fine-tuned Qwen2.5-32B on each. The results are shown below: on the MATH500 dataset, performance increases nearly linearly as the reasoning length level grows.

![MATH500 accuracy across the four token-length levels](img_2.png)
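Because the four length levels double at each step, the reported trend amounts to a log-linear fit of accuracy against reasoning length. The sketch below illustrates such a fit with a plain least-squares regression; the accuracy values are hypothetical placeholders, not the measured MATH500 numbers:

```python
import math

# Token-length levels used in the experiment.
lengths = [1500, 3000, 6000, 12000]
# Hypothetical accuracies (placeholders, not the paper's measurements).
accuracy = [80.0, 84.0, 88.0, 92.0]

# Least-squares fit of accuracy = a + b * log2(length).
xs = [math.log2(n) for n in lengths]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(accuracy) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = mean_y - b * mean_x

# With the placeholder data, each doubling of length adds b points.
print(f"accuracy = {a:.2f} + {b:.2f} * log2(length)")
```

Under this model, each doubling of the reasoning length adds a constant number of accuracy points, which is the log-linear scaling behaviour described above.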

In addition, we compared the reasoning processes on the MATH500 test set of two models trained with reasoning lengths of 1.5k and 12k, covering both successful and failed reasoning attempts. Our analysis compared the average reasoning token length and the ten most frequently used words during reasoning, aiming to understand why the model trained with 12k-length reasoning achieved an accuracy improvement of over 5%.

| Dataset Size | Correct/Wrong | Average Tokens | Top 10 Most Frequent Words |
|---|---|---|---|
| 1.5k | Correct | 2147.65 | the (5.30%), is (3.24%), so (1.98%), of (1.45%), to (1.44%), and (1.25%), that (1.17%), let (1.08%), wait (1.07%), but (0.91%) |
| 12k | Correct | 4716.27 | the (4.92%), is (3.04%), so (1.83%), to (1.41%), of (1.25%), and (1.19%), but (1.19%), let (0.93%), that (0.90%), wait (0.81%) |
| 1.5k | Wrong | 8247.21 | but (5.05%), the (5.00%), wait (3.78%), is (3.24%), of (1.29%), so (1.26%), therefore (1.16%), to (1.08%), and (1.01%), that (0.70%) |
| 12k | Wrong | 15694.54 | the (5.12%), is (2.85%), to (1.64%), and (1.42%), but (1.27%), of (1.20%), so (1.08%), wait (0.80%), that (0.80%), in (0.75%) |
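Statistics of this kind can be computed with a short script. The sketch below uses toy traces and a simple whitespace/regex tokenizer rather than the actual MATH500 outputs and the model's tokenizer:

```python
import re
from collections import Counter

def trace_stats(traces, top_k=10):
    """Average whitespace-token count and top-k word frequencies (as %)."""
    token_counts = [len(t.split()) for t in traces]
    avg_tokens = sum(token_counts) / len(traces)

    words = Counter()
    for t in traces:
        # Lowercase and keep alphabetic words (with apostrophes).
        words.update(re.findall(r"[a-z']+", t.lower()))
    total = sum(words.values())
    top = [(w, 100.0 * c / total) for w, c in words.most_common(top_k)]
    return avg_tokens, top

# Toy reasoning traces standing in for the real model outputs.
traces = [
    "So the answer is 7. Wait, let me check that the sum is 7.",
    "The integral is zero, so the area of the region is zero.",
]
avg, top = trace_stats(traces)
print(f"average tokens: {avg:.2f}")
for word, pct in top:
    print(f"{word}({pct:.2f}%)")
```

On real traces the same counting reproduces the table's pattern, e.g. the high frequency of "wait" and "but" in failed attempts.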

## Training Data

We conducted our experiments using our own synthesized LONG1k dataset. LONG1k is a composite dataset generated for model training from two sources, OpenThoughts114k and s1.1. We randomly selected two mathematical problems from OpenThoughts114k and concatenated them with linking words to increase prompt length. To avoid overfitting, we also included mathematical problems from the s1.1 dataset that met the length requirements. The ratio of problem lengths and the linking markers were adjusted dynamically across experiments.
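The concatenation step can be sketched as follows. The linking phrase and the sampling scheme here are illustrative assumptions, not the exact ones used to build LONG1k:

```python
import random

# Hypothetical linking phrase; the actual linking words are not specified here.
LINK = "After solving the problem above, now consider the following problem."

def synthesize_long_problem(pool, rng):
    """Concatenate two distinct randomly chosen problems to lengthen the prompt."""
    p1, p2 = rng.sample(pool, 2)
    return f"{p1}\n\n{LINK}\n\n{p2}"

# Toy problem pool standing in for OpenThoughts114k math problems.
pool = [
    "Find the sum of the first 100 positive integers.",
    "How many primes are less than 30?",
    "Compute the area of a circle with radius 3.",
]
print(synthesize_long_problem(pool, random.Random(0)))
```

A reasoning trace for the combined prompt then covers both sub-problems, which is what makes the synthesized training samples arbitrarily long.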

## Evaluation

| Model | Dataset Size | MATH500 | AIME 2024 | AIME 2025 | GPQA Diamond |
|---|---|---|---|---|---|
| s1-32B | 1k | 92.6 | 50.0 | 26.7 | 56.6 |
| s1.1-32B | 1k | 89.0 | 64.7 | 49.3 | 60.1 |
| LIMO | 0.8k | 94.8 | 57.1 | 49.3 | 66.7 |
| OpenThinker-32B | 114k | 90.6 | 66.0 | 53.3 | 61.6 |
| DeepSeek-R1-Distill-Qwen-32B | 800k | 93.0 | **72.6** | **55.9** | 62.1 |
| Long1-32B | 1k | **95.6** | 50.7 | 53.3 | **71.1** |

Performance comparison of different models across multiple reasoning benchmarks (pass@1). The best result for each benchmark is highlighted in bold. The s1 results are reported without budget forcing, and the s1.1 results without budget forcing are taken from Open Thoughts.

## Uses

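A minimal loading and generation sketch with Hugging Face transformers. The repo id below is an assumption based on this model card's location (adjust it to the actual repository), and long reasoning traces generally need a generous `max_new_tokens` budget:

```python
# Minimal generation sketch for a Qwen2.5-based chat model.
MODEL_ID = "ZTss/LONG1k-32B"  # assumed repo id; substitute the real one

def build_messages(question):
    """Wrap a question in the chat format used by Qwen2.5-based models."""
    return [{"role": "user", "content": question}]

def generate(question, max_new_tokens=8192):
    # Imported lazily so the sketch can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    prompt = tokenizer.apply_chat_template(
        build_messages(question), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

print(build_messages("How many primes are less than 30?"))
```

Running the 32B checkpoint requires substantial GPU memory; `device_map="auto"` (via accelerate) will shard it across available devices.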

## Authors

- Si Shen, Nanjing University of Science and Technology
- Zhixiao Zhao, Nanjing Agricultural University
- Chang Liu, Nanjing Agricultural University
- Tiansheng Zheng, Nanjing Agricultural University
- Nanjing Agricultural University
- Fei Huang, Nanjing University of Science and Technology
- Danhao Zhu (zhudanhao@jspi.cn), Jiangsu Police Institute

## Code and Dataset

The code and dataset for this model are available on GitHub: https://github.com/ZTss/LONG1