Composition-RL-8B

Composition-RL is a data-efficient Reinforcement Learning with Verifiable Rewards (RLVR) approach that addresses the scarcity of informative training signals by automatically composing multiple verifiable problems into a single, harder compositional prompt.

This specific checkpoint is the 8B version, initialized from Qwen3-8B-Base and trained on the MATH-Composition-199K dataset.

Model Description

As RLVR training progresses, models often master the "easy" prompts: their pass rate reaches 1, so those prompts contribute little further learning signal. Composition-RL mitigates this by composing existing problems into new, more complex, yet still verifiable questions, which keeps the difficulty high and the training signal informative throughout training.
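
To make the pass-rate argument concrete, here is a toy sketch in Python. It is not taken from the Composition-RL codebase. It assumes a GRPO-style group-normalized advantage, which this card does not specify, and the helper names (`pass_rate`, `group_advantages`, `composed_pass_rate`) are hypothetical. The last helper additionally assumes that a composed prompt is solved only when both sub-problems are solved and that the two outcomes are independent, which is a simplification for illustration only.

```python
from statistics import mean, pstdev


def pass_rate(rewards):
    """Fraction of sampled responses the verifier marked correct (reward 1)."""
    return mean(rewards)


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumption: GRPO-style normalization, used here only for illustration.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def composed_pass_rate(p1, p2):
    """Pass rate of a two-problem composition.

    Assumption: both sub-problems must be solved, and outcomes are independent.
    """
    return p1 * p2


# A mastered prompt: every rollout is correct, so the pass rate is 1.
easy = [1, 1, 1, 1]
# A harder prompt: outcomes are mixed.
hard = [1, 0, 0, 1]

# With identical rewards, every advantage is zero: no learning signal.
print(pass_rate(easy), group_advantages(easy))
# With mixed rewards, the advantages are non-zero.
print(pass_rate(hard), group_advantages(hard))
# Composing two fairly easy problems gives a lower pass rate than either alone.
print(composed_pass_rate(0.9, 0.9))
```

Under these assumptions, a prompt with pass rate 1 produces all-zero advantages, while a composed prompt is harder than each of its parts, so sampled responses are more likely to disagree and produce non-zero advantages.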

Usage

For evaluation and data generation instructions, please refer to the official GitHub repository.

Citation

If you find this work helpful for your research, please consider citing:

@article{xu2026composition-rl,
  title={Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models},
  author={Xu, Xin and Bai, Clive and Yang, Kai Rural and Chen, Tianhao and Chen, Yangkun and Liu, Weijie and Chen, Hao and Wang, Yang and Yang, Saiyong and Yang, Can},
  journal={arXiv preprint arXiv:2602.12036},
  year={2026}
}