Improve model card: Add pipeline tag, library name, paper and code links
#1 by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,7 +1,84 @@
 ---
-license: mit
-datasets:
-- Open-Reasoner-Zero/orz_math_57k_collection
 base_model:
 - Qwen/Qwen2.5-7B
-
+datasets:
+- Open-Reasoner-Zero/orz_math_57k_collection
+license: mit
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- code-generation
+- tool-use
+- mathematical-reasoning
+- rlhf
+---

# Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving

This repository contains the **ZeroTIR** model, a large language model fine-tuned for mathematical problem solving through spontaneous Python code generation and execution. The model was introduced in the paper:

📄 [**Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving**](https://huggingface.co/papers/2505.07773)

## Model Description

Large Language Models (LLMs) often struggle with mathematical reasoning tasks that require precise, verifiable computation. While Reinforcement Learning (RL) from outcome-based rewards enhances text-based reasoning, it remains important to understand how agents autonomously learn to leverage external tools such as code execution. This work investigates RL from outcome-based rewards for Tool-Integrated Reasoning (ZeroTIR), training base LLMs to spontaneously generate and execute Python code for mathematical problems without any supervised tool-use examples.

The central contribution is demonstrating that key metrics scale predictably as RL training progresses: increased training steps lead to strongly correlated increases in spontaneous code-execution frequency, average response length, and, critically, final task accuracy. This suggests a quantifiable relationship between the computational effort invested in training and the emergence of effective, tool-augmented reasoning strategies. ZeroTIR significantly surpasses non-tool ZeroRL baselines on challenging math benchmarks.
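
To make the outcome-based reward concrete: it can be pictured as a binary check on the final answer, with no credit given for intermediate tool calls. The sketch below is illustrative only; the `\boxed{...}` answer convention and the 0/1 reward values are assumptions for the example, not the paper's exact implementation.

```python
import re

def extract_final_answer(text: str):
    """Pull the last \\boxed{...} expression out of a response.
    The boxed-answer convention is an assumption for illustration."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def outcome_reward(response: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 iff the final answer matches the reference.
    Intermediate code execution earns no reward on its own."""
    predicted = extract_final_answer(response)
    return 1.0 if predicted == ground_truth.strip() else 0.0

# A correct final answer earns full reward, regardless of how it was reached.
print(outcome_reward(r"The area is $\boxed{75}$ square units.", "75"))  # 1.0
```

Under such a reward the model is never told to use the interpreter; code execution is reinforced only indirectly, because responses that verify their arithmetic in code are correct more often.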
## Usage

The model can be loaded with the `transformers` library for text generation. The example below poses a math problem and pre-opens a Python code fence, since the model is trained to solve such problems by writing code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assuming the model ID on the Hugging Face Hub is named as below:
model_id = "Open-Reasoner-Zero/Agent-RL-Scaling-Law-ZeroTIR-Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Example mathematical problem; the trailing ```python fence invites the model
# to begin a code block for spontaneous code execution.
prompt = (
    "A rectangle has a perimeter of 40 units. Its length is 3 times its width. "
    "What is the area of the rectangle? Provide your reasoning and use Python "
    "code to verify your answer.\n"
    "```python\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
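
Note that a plain `generate` call only produces text; it does not run the code the model writes. To reproduce tool-integrated reasoning end to end, you need a small harness that executes each generated code block and feeds the captured output back into the context. The sketch below, reusing `model`, `tokenizer`, and `prompt` from above, is a minimal assumed version: the `output` fenced-block feedback format, the stop condition, and the round limit are illustrative guesses rather than the exact protocol from the paper, and `stop_strings` requires a recent `transformers` release.

```python
import contextlib
import io

def run_python(code: str) -> str:
    """Execute a generated code block and capture stdout.
    Demo only: a real deployment should sandbox this call."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as e:
        buf.write(f"{type(e).__name__}: {e}")
    return buf.getvalue().strip()

def tir_generate(prompt: str, max_rounds: int = 4) -> str:
    """Alternate generation and code execution until no new code block appears."""
    text = prompt
    for _ in range(max_rounds):
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        out = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            stop_strings=["```\n"],  # pause whenever a fenced block closes
            tokenizer=tokenizer,
        )
        text = tokenizer.decode(out[0], skip_special_tokens=True).rstrip()
        # If the text now ends with a closed ```python block, run it and append
        # its output; otherwise the model has produced its final answer.
        last_code = text.rfind("```python")
        last_out = text.rfind("```output")
        if text.endswith("```") and last_code > last_out:
            code = text[last_code + len("```python"):-3]
            text += f"\n```output\n{run_python(code)}\n```\n"
        else:
            break
    return text

print(tir_generate(prompt))
```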
## Code

The official code and further details regarding this work can be found in the GitHub repository:

💻 [**https://github.com/yyht/openrlhf_async_pipline**](https://github.com/yyht/openrlhf_async_pipline)
## Citation

If you use this model or the associated research, please cite the paper:

```bibtex
@article{agentrlscalinglaw2025,
  title={Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving},
  author={},
  journal={arXiv preprint arXiv:2505.07773},
  year={2025}
}
```