|
|
--- |
|
|
base_model: |
|
|
- tencent/DRIVE-SFT |
|
|
library_name: transformers |
|
|
--- |
|
|
<div align="center"> |
|
|
|
|
|
# DRIVE: <font color=#6495ED >D</font>ata Curation Best Practices for <font color=#6495ED >R</font>einforcement Learning w<font color=#6495ED >I</font>th <font color=#6495ED >VE</font>rifiable Reward in Competitive Code Generation |
|
|
|
|
|
**Hunyuan Team, Tencent** |
|
|
|
|
|
</div> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://arxiv.org/abs/2511.06307">π Paper</a> β’ |
|
|
<a href="https://huggingface.co/tencent/DRIVE-SFT">π SFT Model </a> β’ |
|
|
<a href="https://huggingface.co/tencent/DRIVE-RL">π RL Model </a> β’ |
|
|
<a href="#citation"><b>π Citation</b></a> |
|
|
</p> |
|
|
|
|
|
|
|
|
----- |
|
|
|
|
|
## Abstract |
|
|
|
|
|
Recent reasoning-first models have spurred a resurgence of interest in RLVR (Reinforcement Learning with Verifiable Reward). However, advances are dominated by mathematics, with competitive-programming code generation being relatively underexplored. This work investigates how to construct RLVR datasets and presents practical training techniques that yield strong performance. |
|
|
|
|
|
Our pipeline begins with Supervised Fine-Tuning (SFT) distilled from strong open-source models. This is followed by a **two-stage RL process** using executable, testcase-driven rewards: |
|
|
|
|
|
1. **Stage 1 (Entropy Expansion):** Training on a large, uniformly distributed set of problems with moderate rollouts (8) and a shorter context (24k) to expand entropy and mitigate repetition. |
|
|
2. **Stage 2 (Hard-Focus Curriculum):** Updating on a small, high-quality set of *challenging* problems using Pre-GRPO with a large rollout budget (64) under a hard-focus curriculum. |
|
|
|
|
|
We implement our method on Qwen2.5-32B and achieve state-of-the-art performance among models of similar scale, comparable to leading systems like DeepSeek v3.1. |
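
For reference, the released checkpoints can be loaded with `transformers`. The snippet below is a minimal, illustrative sketch: the prompt and generation settings are our own placeholder choices, and it assumes the model follows the standard Qwen2.5 chat template.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/DRIVE-RL"  # or "tencent/DRIVE-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Write a Python solution: read n integers from stdin and print their sum."}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Competitive-programming reasoning traces can be long, so allow a generous completion budget.
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```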
|
|
|
|
|
## The DRIVE Pipeline |
|
|
|
|
|
Our training pipeline consists of two main phases: Supervised Fine-Tuning (SFT) and a Two-Stage Reinforcement Learning process, as illustrated below. |
|
|
|
|
|
 |
|
|
|
|
|
> *Figure 2: The training pipeline of our models.* |
|
|
|
|
|
### Phase 1: Supervised Fine-Tuning (SFT) |
|
|
|
|
|
We begin by fine-tuning Qwen2.5-32B. The key innovation in this stage is **Difficulty-Aware Sampling** (a minimal sketch follows the list below):
|
|
|
|
|
* We first classify all competitive programming prompts into three categories: easy, medium, and hard. |
|
|
* To force the model to focus on more challenging problems, we **duplicate hard samples twice** in the final SFT dataset. |
|
|
* We also augment this with general-purpose coding and reasoning-intensive data to improve overall capabilities. |
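
As a rough illustration of this sampling scheme (the helper name, field names, and exact replication factor are our own assumptions, not the paper's implementation):

```python
import random

def build_sft_mixture(problems, hard_repeats=2, seed=0):
    """Difficulty-aware sampling sketch: hard prompts appear `hard_repeats` times.

    `problems` is a list of dicts with a "difficulty" field in {"easy", "medium", "hard"}.
    """
    mixture = []
    for p in problems:
        copies = hard_repeats if p["difficulty"] == "hard" else 1
        mixture.extend([p] * copies)
    random.Random(seed).shuffle(mixture)  # spread the duplicated hard samples across training
    return mixture
```

General-purpose coding and reasoning-intensive data would then be mixed on top of this competitive-programming core.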
|
|
|
|
|
### Phase 2: Two-Stage Reinforcement Learning |
|
|
|
|
|
After SFT, the model still suffers from low entropy, repetitive generation, and poor performance on hard problems. Our two-stage RL process, driven by the executable, testcase-driven reward, directly addresses these issues.
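
To make the reward concrete, here is a minimal sketch of an executable, testcase-driven verifier. The function name and test-case fields (`stdin`, `expected_stdout`) are illustrative; the actual judging harness (sandboxing, resource limits, special judges) is more involved.

```python
import subprocess

def testcase_reward(program_src: str, test_cases: list[dict], time_limit_s: float = 2.0) -> float:
    """Binary reward: 1.0 only if the candidate program passes every test case."""
    for case in test_cases:
        try:
            result = subprocess.run(
                ["python3", "-c", program_src],   # run the candidate solution
                input=case["stdin"],
                capture_output=True,
                text=True,
                timeout=time_limit_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0                             # time-limit exceeded
        if result.returncode != 0:
            return 0.0                             # runtime error
        if result.stdout.strip() != case["expected_stdout"].strip():
            return 0.0                             # wrong answer
    return 1.0
```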
|
|
|
|
|
**Stage 1: Entropy Expansion** |
|
|
|
|
|
* **Goal:** Increase output diversity and reduce repetitive patterns. |
|
|
* **Data:** A large, uniformly distributed set of \~9k problems. |
|
|
* **Method:** We use 8 rollouts and a shorter 24k token length. As shown in Figure 3, this "24k-style" training (blue line) successfully increases entropy, while standard training (orange line) leads to entropy collapse. |
|
|
|
|
|
 |
|
|
|
|
|
> *Figure 3: The entropy comparison of 24k-style training and 32k-style training.* |
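
For reference, the generation entropy tracked in Figure 3 can be estimated from per-token logits over sampled completions; below is a minimal PyTorch sketch of the metric (our own illustration, not the training code):

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, completion_mask: torch.Tensor) -> float:
    """Masked mean of per-token entropy over the generated (completion) tokens.

    logits:          [batch, seq_len, vocab_size] scores for sampled rollouts.
    completion_mask: [batch, seq_len], 1 for completion tokens, 0 for prompt/padding.
    """
    log_probs = F.log_softmax(logits.float(), dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # [batch, seq_len]
    return (token_entropy * completion_mask).sum().item() / completion_mask.sum().item()
```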
|
|
|
|
|
**Stage 2: Hard-Focus Curriculum** |
|
|
|
|
|
* **Goal:** Master the most challenging problems. |
|
|
* **Data:** A small, high-quality set of difficult problems, shrunk over successive rounds (e.g., the 72, then 50, then 32 hardest cases from LiveCode V6). |
|
|
* **Method:** We apply a "hard-focus curriculum" that progressively retains only the most difficult instances (see the sketch below). Crucially, we use a **large rollout budget (64-80 rollouts)** in this stage, which we found essential for stable gains on hard problems. |
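
A minimal sketch of this hard-focus selection, assuming per-problem difficulty is estimated as the pass rate over the large rollout budget (the helper name and subset sizes below are illustrative):

```python
def hard_focus_subset(problems, pass_rates, keep_n):
    """Keep only the `keep_n` hardest problems, where lower rollout pass rate = harder.

    `pass_rates[i]` is the fraction of the 64-80 rollouts for `problems[i]` that passed
    all test cases (i.e., received reward 1).
    """
    ranked = sorted(zip(problems, pass_rates), key=lambda pair: pair[1])  # hardest first
    return [p for p, _ in ranked[:keep_n]]

# Each curriculum round: re-estimate pass rates with the large rollout budget,
# then shrink the training set (e.g., 72 -> 50 -> 32 hardest cases).
```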
|
|
|
|
|
## Key Results |
|
|
|
|
|
Our final 32B model, **DRIVE-RL**, achieves state-of-the-art performance among similarly sized models and is competitive with larger 64k-context models. |
|
|
|
|
|
 |
|
|
|
|
|
> *Figure 1: Performance of our models on various benchmarks.* |
|
|
|
|
|
### Pass@1 Performance Comparison |
|
|
|
|
|
The two-stage RL pipeline provides significant improvements over the SFT baseline, particularly on challenging benchmarks. We see a **+58.3% relative improvement** on Codeforces OJ. |
|
|
|
|
|
| Model | LiveCode 08-11 | LiveCode V5 | LiveCode V6 | LeetCode Weekly (32) | Codeforces OJ (33) | |
|
|
| :--- | :---: | :---: | :---: | :---: | :---: | |
|
|
| DeepSeek-V3.1 (64k) | 0.692 | 0.713 | 0.693 | 0.688 | 0.161 | |
|
|
| Seed1.6-0715 (64k) | 0.803 | 0.824 | 0.770 | 0.743 | 0.188 | |
|
|
| Qwen3-235B-2507 (64k)| 0.681 | 0.713 | 0.646 | 0.688 | 0.200 | |
|
|
| --- | --- | --- | --- | --- | --- | |
|
|
| SFT model (32k) | 0.602 | 0.594 | 0.549 | 0.578 | 0.115 | |
|
|
| RL Stage 1 model (24k) | 0.625 | 0.627 | 0.634 | 0.603 | 0.112 | |
|
|
| **DRIVE-RL model (32k)** | **0.699** | **0.697** | **0.703** | **0.653** | **0.182** | |
|
|
| *Rel. Improvement (RL vs SFT)* | *+16.1%* | *+17.3%* | *+28.1%* | *+13.0%* | *+58.3%* | |
|
|
|
|
|
*(Data sourced from Table 2 in our paper)* |
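
Pass@1 here can be read as the standard estimator: the per-problem fraction of sampled solutions that pass all tests, averaged over problems. A minimal sketch under that assumption (see the paper for the exact evaluation protocol):

```python
def pass_at_1(results_per_problem: dict[str, list[bool]]) -> float:
    """Standard pass@1 estimate: mean over problems of the fraction of samples that pass.

    `results_per_problem` maps a problem id to per-sample pass/fail outcomes.
    """
    rates = [sum(outcomes) / len(outcomes) for outcomes in results_per_problem.values()]
    return sum(rates) / len(rates)
```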
|
|
|
|
|
### Key Findings |
|
|
|
|
|
1. **Difficulty-aware training is crucial:** Standard RL struggles with hard problems. Our hard-focus curriculum (Stage 2) is essential for pushing the model's capabilities. |
|
|
2. **Entropy expansion is necessary:** Skipping Stage 1 (Entropy Expansion) and training *only* on hard cases hurts generalization to out-of-distribution benchmarks. Both stages are necessary. |
|
|
3. **Large rollouts for hard problems:** A large rollout budget (e.g., 64+) is essential for mastering challenging cases. |
|
|
4. **Scaling:** The DRIVE strategy shows strong, positive scaling trends when applied to a large-scale internal MoE model. |
|
|
|
|
|
<a id="citation"></a> |
|
|
## Citation |
|
|
|
|
|
If you find this work useful, please cite our paper: |
|
|
|
|
|
```bibtex |
|
|
@misc{zhu2025drivedatacurationbest, |
|
|
title={DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation}, |
|
|
author={Speed Zhu and Jianwei Cai and Guang Chen and Lulu Wu and Saiyong Yang and Wiggin Zhou}, |
|
|
year={2025}, |
|
|
eprint={2511.06307}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.LG}, |
|
|
url={https://arxiv.org/abs/2511.06307}, |
|
|
} |
|
|
``` |
|
|
|
|
|
|
|
|
## License |
|
|
|
|
|
This repository's model is licensed as follows: |
|
|
|
|
|
- **DRIVE-RL Model**: Licensed under [LICENSE.txt](LICENSE.txt) |
|
|
|
|
|
Please refer to the license file before using the model. |
|
|
|