| | --- |
| | base_model: |
| | - deepseek-ai/DeepSeek-R1-Distill-Qwen-7B |
| | language: |
| | - en |
| | license: mit |
| | metrics: |
| | - accuracy |
| | pipeline_tag: text-generation |
| | library_name: transformers |
| | tags: |
| | - RLinf |
| | - reinforcement-learning |
| | model-index: |
| | - name: RLinf-math-7B |
| |   results: |
| |   - task: |
| |       type: math |
| |     dataset: |
| |       name: AIME24 |
| |       type: aime_2024 |
| |     metrics: |
| |     - type: accuracy |
| |       value: 68.328125 |
| |   - task: |
| |       type: math |
| |     dataset: |
| |       name: AIME25 |
| |       type: aime_2025 |
| |     metrics: |
| |     - type: accuracy |
| |       value: 52.19375 |
| |   - task: |
| |       type: stem |
| |     dataset: |
| |       name: GPQA-diamond |
| |       type: gpqa_diamond |
| |     metrics: |
| |     - type: accuracy |
| |       value: 48.178124999999994 |
| | --- |
| | |
| | <div align="center"> |
| | <img src="logo.svg" alt="RLinf-logo" width="500"/> |
| | </div> |
| |
|
| | The model was presented in the paper [RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training](https://huggingface.co/papers/2510.06710). |
| |
|
| | <div align="center"> |
| | <!-- <a href="TODO"><img src="https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv"></a> --> |
| | <!-- <a href="TODO"><img src="https://img.shields.io/badge/HuggingFace-yellow?logo=huggingface&logoColor=white" alt="Hugging Face"></a> --> |
| | <a href="https://github.com/RLinf/RLinf"><img src="https://img.shields.io/badge/Github-blue"></a> |
| | <a href="https://rlinf.readthedocs.io/en/latest/"><img src="https://img.shields.io/badge/Documentation-Purple?color=8A2BE2&logo=readthedocs"></a> |
| | <!-- <a href="TODO"><img src="https://devin.ai/assets/deepwiki-badge.png" alt="Ask DeepWiki.com" style="height:20px;"></a> |
| | <a href="TODO"><img src="https://img.shields.io/badge/微信-green?logo=wechat&"></a> --> |
| | </div> |
| |
|
| | <h1 align="center">RLinf: Reinforcement Learning Infrastructure for Agentic AI</h1> |
| |
|
| | [RLinf](https://github.com/RLinf/RLinf) is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development. |
| |
|
| |
|
| | <div align="center"> |
| | <img src="overview.png" alt="RLinf-overview" width="600"/> |
| | </div> |
| |
|
| | ## Model Description |
| | The RLinf-math series is trained from DeepSeek-R1-Distill-Qwen (1.5B and 7B variants), using the same base models and training datasets as AReaL. In the benchmarks reported below, each RLinf-math model achieves the highest average score among the compared models of the same size. |
| |
|
| | We adopt Group Relative Policy Optimization (GRPO) with token-level loss aggregation, focusing on mathematical reasoning and long chain-of-thought (CoT) tasks. |
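
To make the two ideas above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not RLinf's actual implementation: it assumes rewards are normalized by the mean and population standard deviation of their sampling group (a common GRPO formulation), and it contrasts token-level aggregation with per-sequence averaging. The function names and the `eps` constant are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other responses sampled for the same prompt.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_level_loss(per_token_losses):
    """Average over every token in the batch, so each token has equal weight."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count


def sequence_level_loss(per_token_losses):
    """Average within each response first, then across responses (for contrast)."""
    return mean(mean(seq) for seq in per_token_losses)


# One short response (2 tokens) and one long response (6 tokens).
losses = [[1.0, 1.0], [0.0] * 6]
print(token_level_loss(losses))     # 0.25
print(sequence_level_loss(losses))  # 0.5
```

With token-level aggregation, a long chain-of-thought response contributes in proportion to its length rather than being down-weighted to a single per-sequence average.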
| |
|
| | ## Evaluation and Results |
| | We trained and evaluated two models using RLinf: |
| |
|
| | - RLinf-math-1.5B Model (based on DeepSeek-R1-Distill-Qwen-1.5B) |
| |   - Recommended sampling settings: `temperature = 0.6`, `top_p = 0.95` |
| |
|
| | - RLinf-math-7B Model (based on DeepSeek-R1-Distill-Qwen-7B) |
| |   - Recommended sampling settings: `temperature = 1.0`, `top_p = 0.95` |
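
The recommended settings above can be kept in a small lookup table so that evaluation scripts pick the right values for each model. This is only a sketch: `RECOMMENDED_SAMPLING` and `generation_kwargs` are hypothetical names, and the `do_sample=True` entry and the default `max_new_tokens` are assumptions added here, since `transformers` only applies `temperature` and `top_p` when sampling is enabled.

```python
# Recommended sampling settings from the list above, keyed by Hub model ID.
RECOMMENDED_SAMPLING = {
    "RLinf/RLinf-math-1.5B": {"temperature": 0.6, "top_p": 0.95},
    "RLinf/RLinf-math-7B": {"temperature": 1.0, "top_p": 0.95},
}


def generation_kwargs(model_name, max_new_tokens=512):
    """Build keyword arguments suitable for `model.generate(**inputs, **kwargs)`."""
    settings = RECOMMENDED_SAMPLING[model_name]
    # Assumption: sampling must be switched on for temperature/top_p to apply.
    return {"do_sample": True, "max_new_tokens": max_new_tokens, **settings}


print(generation_kwargs("RLinf/RLinf-math-1.5B"))
```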
| |
|
| | ### Benchmark Results |
| |
|
| | **1.5B models**. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-1.5B using RL. The best score in each column is in bold. |
| |
|
| | | Model | AIME 24 | AIME 25 | GPQA-diamond | Average | |
| | | ------------------------------------------ | --------- | --------- | ------------ | --------- | |
| | | [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | 28.33 | 24.90 | 27.45 | 26.89 | |
| | | [DeepMath-1.5B](https://huggingface.co/zwhe99/DeepMath-1.5B) | 37.80 | 30.42 | 32.11 | 33.44 | |
| | | [DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview) | 40.41 | 30.93 | 27.54 | 32.96 | |
| | | [AReaL-1.5B-Preview-Stage-3](https://huggingface.co/inclusionAI/AReaL-1.5B-Preview-Stage-3) | 40.73 | 31.56 | 28.10 | 33.46 | |
| | | AReaL-1.5B-retrain* | 44.42 | 34.27 | 33.81 | 37.50 | |
| | | [FastCuRL-1.5B-V3](https://huggingface.co/Nickyang/FastCuRL-1.5B-V3) | 43.65 | 32.49 | 35.00 | 37.05 | |
| | | [RLinf-math-1.5B](https://huggingface.co/RLinf/RLinf-math-1.5B) | **48.44** | **35.63** | **38.46** | **40.84** | |
| |
|
| | \* We retrain the model using the default settings for 600 steps. |
| |
|
| | **7B models**. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-7B using RL. The best score in each column is in bold. |
| |
|
| | | Model | AIME 24 | AIME 25 | GPQA-diamond | Average | |
| | | ---------------------------------------- | --------- | --------- | ------------ | --------- | |
| | | [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | 54.90 | 40.20 | 45.48 | 46.86 | |
| | | [AReaL-boba-RL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) | 61.66 | 49.38 | 46.93 | 52.66 | |
| | | [Skywork-OR1-7B](https://huggingface.co/Skywork/Skywork-OR1-7B) | 66.87 | 52.49 | 44.43 | 54.60 | |
| | | [Polaris-7B-Preview](https://huggingface.co/POLARIS-Project/Polaris-7B-Preview) | **68.55** | 51.24 | 43.88 | 54.56 | |
| | | [AceMath-RL-Nemotron-7B](https://huggingface.co/nvidia/AceMath-RL-Nemotron-7B) | 67.30 | **55.00** | 45.57 | 55.96 | |
| | | [RLinf-math-7B](https://huggingface.co/RLinf/RLinf-math-7B) | 68.33 | 52.19 | **48.18** | **56.23** | |
| |
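
The card does not state how the Average column is computed, but the reported values are consistent with an unweighted mean of the three benchmark scores. The short check below recomputes it from numbers copied from the tables above; the names `REPORTED_1_5B`, `REPORTED_7B`, `unweighted_mean`, and `max_deviation` are hypothetical.

```python
# Each row: (AIME 24, AIME 25, GPQA-diamond, reported Average), copied from the tables.
REPORTED_1_5B = {
    "DeepSeek-R1-Distill-Qwen-1.5B": (28.33, 24.90, 27.45, 26.89),
    "DeepMath-1.5B": (37.80, 30.42, 32.11, 33.44),
    "DeepScaleR-1.5B-Preview": (40.41, 30.93, 27.54, 32.96),
    "AReaL-1.5B-Preview-Stage-3": (40.73, 31.56, 28.10, 33.46),
    "AReaL-1.5B-retrain": (44.42, 34.27, 33.81, 37.50),
    "FastCuRL-1.5B-V3": (43.65, 32.49, 35.00, 37.05),
    "RLinf-math-1.5B": (48.44, 35.63, 38.46, 40.84),
}
REPORTED_7B = {
    "DeepSeek-R1-Distill-Qwen-7B": (54.90, 40.20, 45.48, 46.86),
    "AReaL-boba-RL-7B": (61.66, 49.38, 46.93, 52.66),
    "Skywork-OR1-7B": (66.87, 52.49, 44.43, 54.60),
    "Polaris-7B-Preview": (68.55, 51.24, 43.88, 54.56),
    "AceMath-RL-Nemotron-7B": (67.30, 55.00, 45.57, 55.96),
    "RLinf-math-7B": (68.33, 52.19, 48.18, 56.23),
}


def unweighted_mean(scores):
    """Plain arithmetic mean of the benchmark scores."""
    return sum(scores) / len(scores)


def max_deviation(table):
    """Largest gap between the recomputed mean and the reported Average."""
    return max(abs(unweighted_mean(row[:3]) - row[3]) for row in table.values())


for name, row in {**REPORTED_1_5B, **REPORTED_7B}.items():
    print(f"{name}: recomputed {unweighted_mean(row[:3]):.2f}, reported {row[3]:.2f}")
```

Every recomputed mean matches the reported Average to within rounding at two decimal places.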
|
| |
|
| |
|
| | ## How to Use |
| | Example with Hugging Face `transformers`: |
| |
|
| | ```python |
| | from transformers import AutoModelForCausalLM, AutoTokenizer |
| | |
| | model_name = "RLinf/RLinf-math-7B" |
| | tokenizer = AutoTokenizer.from_pretrained(model_name) |
| | model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto") |
| | |
| | prompt = "Solve: If x^2 + 2x + 1 = 0, what is x?" |
| | |
| | # Tokenize the prompt and move the tensors to the model's device. |
| | inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
| | outputs = model.generate( |
| |     **inputs, |
| |     max_new_tokens=512, |
| |     do_sample=True,   # enable sampling so temperature and top_p take effect |
| |     temperature=1.0,  # recommended for 7B |
| |     top_p=0.95, |
| | ) |
| | |
| | print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
| | ``` |
| |
|
| | ## License |
| | This code repository and the model weights are licensed under the MIT License. |