---
license: apache-2.0
base_model: Qwen/Qwen3-32B
tags:
- reinforcement-learning
- rl
- ppo
- skyrl
- code
- reasoning
datasets:
- penfever/r2egym_gpt5_codex_solved_tasks_256_subset
pipeline_tag: text-generation
model-index:
- name: Qwen3-32B-R2EGYM-256-3epochs
  results: []
---

# Qwen3-32B-R2EGYM-256-3epochs

This model is a reinforcement-learning fine-tuned version of [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B), trained with the SkyRL framework using fully asynchronous PPO on coding and reasoning tasks from the R2EGYM benchmark.

## Training Details

### Framework

- **Training framework**: [SkyRL](https://github.com/NovaSkyAI/SkyRL) (fully async PPO)
- **Parallelism strategy**: FSDP2 with CPU offload
- **Agent**: Terminus-2 (terminal-based coding agent with thinking enabled)

### Dataset

- **Dataset**: [penfever/r2egym_gpt5_codex_solved_tasks_256_subset](https://huggingface.co/datasets/penfever/r2egym_gpt5_codex_solved_tasks_256_subset)
- **Number of tasks**: 256
- **Evaluation set**: OpenThoughts-TB-dev (70 tasks)

### Hyperparameters

| Parameter | Value |
|---|---|
| **Epochs** | 3 |
| **Total steps** | 12 (4 steps/epoch) |
| **Learning rate** | 1e-5 |
| **Weight decay** | 0.0 |
| **Train batch size** | 64 |
| **Micro train batch size per GPU** | 1 |
| **Advantage estimator** | rloo_n |
| **KL loss** | disabled |
| **Samples per prompt** | 8 |
| **Max prompt length** | 2,048 |
| **Max generate length** | 30,720 |
| **RoPE scaling** | yarn (factor=4.0, original_max_position_embeddings=32,768) |

### Infrastructure

| Component | Configuration |
|---|---|
| **Policy nodes** | 4 nodes x 4 GPUs |
| **Reference model nodes** | 4 nodes x 4 GPUs |
| **Inference engines** | 26 (tensor parallelism = 2) |
| **Parallel generation workers** | 96 |
| **Concurrent sandbox trials** | 96 |
| **Total training nodes** | 17 |

### Training Notes

- Training was resumed from a step-9 checkpoint.
- The model uses Terminus-2, a terminal-based coding agent that interacts with sandboxed Docker environments to solve programming tasks.
- Thinking mode was enabled during training (`--enable_thinking`).

## Usage

This model can be used as a drop-in replacement for Qwen3-32B, with improved coding and reasoning capabilities.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "laion/Qwen3-32B-R2EGYM-256-3epochs",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("laion/Qwen3-32B-R2EGYM-256-3epochs")
```
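The yarn RoPE scaling in the Hyperparameters table follows the long-context recipe Qwen documents for the Qwen3 series: a `rope_scaling` entry in `config.json`. A sketch of the relevant fragment, assuming the released checkpoint keeps the training-time settings (factor 4.0 over the 32,768-token base context extends the usable window to roughly 131k tokens):

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}
```

Check the checkpoint's own `config.json` before relying on the extended context, since some releases ship with yarn disabled by default.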
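The `rloo_n` advantage estimator listed under Hyperparameters refers to a leave-one-out (RLOO) baseline computed over the 8 samples drawn per prompt: each rollout's reward is baselined against the mean reward of the other rollouts in its group, so no learned critic is needed. A minimal sketch of the idea (an illustration only, not SkyRL's actual implementation; the function name is made up):

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages for one prompt's group of sampled rollouts.

    Each sample is baselined against the mean reward of the *other*
    n-1 samples in its group, keeping the estimator unbiased without
    a value network.
    """
    n = rewards.shape[0]
    # Mean of the other n-1 rewards, computed for every sample at once:
    loo_baseline = (rewards.sum() - rewards) / (n - 1)
    return rewards - loo_baseline

# Example: 8 rollouts of one prompt with binary task rewards,
# matching the 8 samples-per-prompt setting above.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
adv = rloo_advantages(rewards)
```

By construction the advantages in each group sum to zero, so successful rollouts are pushed up exactly as much as failed ones are pushed down.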