---
license: mit
datasets:
- RLVR-SvS/Variational-DAPO
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: reinforcement-learning
---

# Model Card for SvS-Code-7B (from Qwen2.5-7B-Instruct)

[🌐 Website] | [🤗 Dataset] | [🤖 Models] | [📜 Paper] | [🐱 GitHub] | [🐦 Twitter] | [📕 Rednote]

The official model checkpoints for SvS. The model is trained on a subset of coding tasks from the PRIME-RL dataset (included in this repository as `12k_code_rl.parquet`; see the loading sketch under Training Data below).

# Inference

We recommend using the official chat template of the Qwen2.5 Instruct models, as in the example below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RLVR-SvS/SvS-Qwen-Code-7B"
device = "cuda"  # the device to load the model onto

# Load the model and tokenizer; device_map="auto" places the weights
# on the available GPU(s) automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Write a quick sort algorithm."
messages = [
    {"role": "user", "content": prompt}
]

# Render the conversation with the Qwen2.5 chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8192
)
# Strip the prompt tokens so only the newly generated text remains.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

# Cite Us

If you find the model helpful, please consider citing our paper:

```bibtex
@misc{liang2025pass1selfplayvariationalproblem,
      title={Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR},
      author={Xiao Liang and Zhongzhi Li and Yeyun Gong and Yelong Shen and Ying Nian Wu and Zhijiang Guo and Weizhu Chen},
      year={2025},
      eprint={2508.14029},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14029},
}
```
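
# Training Data

The training file mentioned above can be downloaded and inspected directly. The following is a minimal sketch, assuming `12k_code_rl.parquet` sits at the root of this model repository; no particular column schema is assumed, so the script simply prints what the file actually contains.

```python
# Minimal sketch: download and inspect the bundled training data.
# Assumes 12k_code_rl.parquet is stored at the repository root.
from huggingface_hub import hf_hub_download
import pandas as pd

parquet_path = hf_hub_download(
    repo_id="RLVR-SvS/SvS-Qwen-Code-7B",
    filename="12k_code_rl.parquet",
)

df = pd.read_parquet(parquet_path)
print(df.shape)             # number of examples and fields
print(df.columns.tolist())  # the actual column names in the file
print(df.iloc[0])           # a sample record
```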