|
|
--- |
|
|
base_model: |
|
|
- tencent/DRIVE-SFT |
|
|
library_name: transformers |
|
|
--- |
|
|
<div align="center"> |
|
|
|
|
|
# DRIVE: <font color=#6495ED >D</font>ata Curation Best Practices for <font color=#6495ED >R</font>einforcement Learning w<font color=#6495ED >I</font>th <font color=#6495ED >VE</font>rifiable Reward in Competitive Code Generation |
|
|
|
|
|
**Hunyuan Team, Tencent** |
|
|
|
|
|
</div> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://arxiv.org/abs/2511.06307">π Paper</a> β’ |
|
|
<a href="https://huggingface.co/tencent/DRIVE-SFT">π SFT Model </a> β’ |
|
|
<a href="https://huggingface.co/tencent/DRIVE-RL">π RL Model </a> β’ |
|
|
<a href="#citation"><b>π Citation</b></a> |
|
|
</p> |
|
|
|
|
|
|
|
|
----- |
|
|
|
|
|
## Abstract |
|
|
|
|
|
Recent reasoning-first models have spurred a resurgence of interest in RLVR (Reinforcement Learning with Verifiable Reward). However, advances are dominated by mathematics, with competitive-programming code generation being relatively underexplored. This work investigates how to construct RLVR datasets and presents practical training techniques that yield strong performance. |
|
|
|
|
|
Our pipeline begins with Supervised Fine-Tuning (SFT) distilled from strong open-source models. This is followed by a **two-stage RL process** using executable, testcase-driven rewards: |
|
|
|
|
|
1. **Stage 1 (Entropy Expansion):** Training on a large, uniformly distributed set of problems with moderate rollouts (8) and a shorter context (24k) to expand entropy and mitigate repetition. |
|
|
2. **Stage 2 (Hard-Focus Curriculum):** Updating on a small, high-quality set of *challenging* problems using Pre-GRPO with a large rollout budget (64) under a hard-focus curriculum. |
|
|
|
|
|
We implement our method on Qwen2.5-32B and achieve state-of-the-art performance among models of similar scale, comparable to leading systems like DeepSeek v3.1. |
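
For reference, the released checkpoints can be loaded with `transformers`. The snippet below is a minimal, illustrative sketch: the prompt and generation settings are our own placeholder choices, and it assumes the model follows the standard Qwen2.5 chat template.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/DRIVE-RL"  # or "tencent/DRIVE-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Write a Python solution: read n integers from stdin and print their sum."}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Competitive-programming reasoning traces can be long, so allow a generous completion budget.
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```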
|
|
|
|
|
## The DRIVE Pipeline |
|
|
|
|
|
Our training pipeline consists of two main phases: Supervised Fine-Tuning (SFT) and a Two-Stage Reinforcement Learning process, as illustrated below. |
|
|
|
|
|
 |
|
|
|
|
|
> *Figure 2: The training pipeline of our models.* |
|
|
|
|
|
### Phase 1: Supervised Fine-Tuning (SFT) |
|
|
|
|
|
We begin by fine-tuning Qwen2.5-32B. The key innovation in this stage is **Difficulty-Aware Sampling** (a minimal sketch follows the list below):
|
|
|
|
|
* We first classify all competitive programming prompts into three categories: easy, medium, and hard. |
|
|
* To force the model to focus on more challenging problems, we **duplicate hard samples twice** in the final SFT dataset. |
|
|
* We also augment this with general-purpose coding and reasoning-intensive data to improve overall capabilities. |
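
As a rough illustration of this sampling scheme (the helper name, field names, and exact replication factor are our own assumptions, not the paper's implementation):

```python
import random

def build_sft_mixture(problems, hard_repeats=2, seed=0):
    """Difficulty-aware sampling sketch: hard prompts appear `hard_repeats` times.

    `problems` is a list of dicts with a "difficulty" field in {"easy", "medium", "hard"}.
    """
    mixture = []
    for p in problems:
        copies = hard_repeats if p["difficulty"] == "hard" else 1
        mixture.extend([p] * copies)
    random.Random(seed).shuffle(mixture)  # spread the duplicated hard samples across training
    return mixture
```

General-purpose coding and reasoning-intensive data would then be mixed on top of this competitive-programming core.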
|
|
|
|
|
### Phase 2: Two-Stage Reinforcement Learning |
|
|
|
|
|
After SFT, the model still suffers from low entropy, repetitive generation, and poor performance on hard problems. Our two-stage RL process, driven by the executable, testcase-driven reward, directly addresses these issues.
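
To make the reward concrete, here is a minimal sketch of an executable, testcase-driven verifier. The function name and test-case fields (`stdin`, `expected_stdout`) are illustrative; the actual judging harness (sandboxing, resource limits, special judges) is more involved.

```python
import subprocess

def testcase_reward(program_src: str, test_cases: list[dict], time_limit_s: float = 2.0) -> float:
    """Binary reward: 1.0 only if the candidate program passes every test case."""
    for case in test_cases:
        try:
            result = subprocess.run(
                ["python3", "-c", program_src],   # run the candidate solution
                input=case["stdin"],
                capture_output=True,
                text=True,
                timeout=time_limit_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0                             # time-limit exceeded
        if result.returncode != 0:
            return 0.0                             # runtime error
        if result.stdout.strip() != case["expected_stdout"].strip():
            return 0.0                             # wrong answer
    return 1.0
```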
|
|
|
|
|
**Stage 1: Entropy Expansion** |
|
|
|
|
|
* **Goal:** Increase output diversity and reduce repetitive patterns. |
|
|
* **Data:** A large, uniformly distributed set of \~9k problems. |
|
|
* **Method:** We use 8 rollouts and a shorter 24k token length. As shown in Figure 3, this "24k-style" training (blue line) successfully increases entropy, while standard training (orange line) leads to entropy collapse. |
|
|
|
|
|
 |
|
|
|
|
|
> *Figure 3: The entropy comparison of 24k-style training and 32k-style training.* |
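
For reference, the generation entropy tracked in Figure 3 can be estimated from per-token logits over sampled completions; below is a minimal PyTorch sketch of the metric (our own illustration, not the training code):

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, completion_mask: torch.Tensor) -> float:
    """Masked mean of per-token entropy over the generated (completion) tokens.

    logits:          [batch, seq_len, vocab_size] scores for sampled rollouts.
    completion_mask: [batch, seq_len], 1 for completion tokens, 0 for prompt/padding.
    """
    log_probs = F.log_softmax(logits.float(), dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # [batch, seq_len]
    return (token_entropy * completion_mask).sum().item() / completion_mask.sum().item()
```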
|
|
|
|
|
**Stage 2: Hard-Focus Curriculum** |
|
|
|
|
|
* **Goal:** Master the most challenging problems. |
|
|
* **Data:** A small, high-quality set of difficult problems, shrunk over successive rounds (e.g., the 72, then 50, then 32 hardest cases from LiveCode V6). |
|
|
* **Method:** We apply a "hard-focus curriculum" that progressively retains only the most difficult instances (see the sketch below). Crucially, we use a **large rollout budget (64-80 rollouts)** in this stage, which we found essential for stable gains on hard problems. |
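
A minimal sketch of this hard-focus selection, assuming per-problem difficulty is estimated as the pass rate over the large rollout budget (the helper name and subset sizes below are illustrative):

```python
def hard_focus_subset(problems, pass_rates, keep_n):
    """Keep only the `keep_n` hardest problems, where lower rollout pass rate = harder.

    `pass_rates[i]` is the fraction of the 64-80 rollouts for `problems[i]` that passed
    all test cases (i.e., received reward 1).
    """
    ranked = sorted(zip(problems, pass_rates), key=lambda pair: pair[1])  # hardest first
    return [p for p, _ in ranked[:keep_n]]

# Each curriculum round: re-estimate pass rates with the large rollout budget,
# then shrink the training set (e.g., 72 -> 50 -> 32 hardest cases).
```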
|
|
|
|
|
## Key Results |
|
|
|
|
|
Our final 32B model, **DRIVE-RL**, achieves state-of-the-art performance among similarly sized models and is competitive with larger 64k-context models. |
|
|
|
|
|
 |
|
|
|
|
|
> *Figure 1: Performance of our models on various benchmarks.* |
|
|
|
|
|
### Pass@1 Performance Comparison |
|
|
|
|
|
The two-stage RL pipeline provides significant improvements over the SFT baseline, particularly on challenging benchmarks. We see a **+58.3% relative improvement** on Codeforces OJ. |
|
|
|
|
|
| Model | LiveCode 08-11 | LiveCode V5 | LiveCode V6 | LeetCode Weekly (32) | Codeforces OJ (33) | |
|
|
| :--- | :---: | :---: | :---: | :---: | :---: | |
|
|
| DeepSeek-V3.1 (64k) | 0.692 | 0.713 | 0.693 | 0.688 | 0.161 | |
|
|
| Seed1.6-0715 (64k) | 0.803 | 0.824 | 0.770 | 0.743 | 0.188 | |
|
|
| Qwen3-235B-2507 (64k)| 0.681 | 0.713 | 0.646 | 0.688 | 0.200 | |
|
|
| --- | --- | --- | --- | --- | --- | |
|
|
| SFT model (32k) | 0.602 | 0.594 | 0.549 | 0.578 | 0.115 | |
|
|
| RL Stage 1 model (24k) | 0.625 | 0.627 | 0.634 | 0.603 | 0.112 | |
|
|
| **DRIVE-RL model (32k)** | **0.699** | **0.697** | **0.703** | **0.653** | **0.182** | |
|
|
| *Rel. Improvement (RL vs SFT)* | *+16.1%* | *+17.3%* | *+28.1%* | *+13.0%* | *+58.3%* | |
|
|
|
|
|
*(Data sourced from Table 2 in our paper)* |
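
Pass@1 here can be read as the standard estimator: the per-problem fraction of sampled solutions that pass all tests, averaged over problems. A minimal sketch under that assumption (see the paper for the exact evaluation protocol):

```python
def pass_at_1(results_per_problem: dict[str, list[bool]]) -> float:
    """Standard pass@1 estimate: mean over problems of the fraction of samples that pass.

    `results_per_problem` maps a problem id to per-sample pass/fail outcomes.
    """
    rates = [sum(outcomes) / len(outcomes) for outcomes in results_per_problem.values()]
    return sum(rates) / len(rates)
```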
|
|
|
|
|
### Key Findings |
|
|
|
|
|
1. **Difficulty-aware training is crucial:** Standard RL struggles with hard problems. Our hard-focus curriculum (Stage 2) is essential for pushing the model's capabilities. |
|
|
2. **Entropy expansion is necessary:** Skipping Stage 1 (Entropy Expansion) and training *only* on hard cases hurts generalization to out-of-distribution benchmarks. Both stages are necessary. |
|
|
3. **Large rollouts for hard problems:** A large rollout budget (e.g., 64+) is essential for mastering challenging cases. |
|
|
4. **Scaling:** The DRIVE strategy shows strong, positive scaling trends when applied to a large-scale internal MoE model. |
|
|
|
|
|
<a id="citation"></a> |
|
|
## Citation |
|
|
|
|
|
If you find this work useful, please cite our paper: |
|
|
|
|
|
```bibtex |
|
|
@misc{zhu2025drivedatacurationbest, |
|
|
title={DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation}, |
|
|
author={Speed Zhu and Jianwei Cai and Guang Chen and Lulu Wu and Saiyong Yang and Wiggin Zhou}, |
|
|
year={2025}, |
|
|
eprint={2511.06307}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.LG}, |
|
|
url={https://arxiv.org/abs/2511.06307}, |
|
|
} |
|
|
``` |
|
|
|
|
|
|
|
|
## License |
|
|
|
|
|
This repository's model is licensed as follows: |
|
|
|
|
|
- **DRIVE-RL Model**: Licensed under [LICENSE.txt](LICENSE.txt) |
|
|
|
|
|
Please refer to the license file before using the model. |
|
|
|