---
license: apache-2.0
base_model: Qwen/Qwen3-4B-Thinking-2507
library_name: transformers
pipeline_tag: text-generation
datasets:
- Spreadsheet-RL/Spreadsheet-RL
language:
- en
tags:
- spreadsheet
- excel
- reinforcement-learning
- grpo
- agents
- tool-use
- verl
- qwen3
---

# Spreadsheet-RL-4B

[**Project Page**](https://spreadsheet-rl.github.io/) | [**Paper**](https://arxiv.org/abs/2605.22642) | [**Dataset**](https://huggingface.co/datasets/Spreadsheet-RL/Spreadsheet-RL) | [**Code**](https://github.com/Spreadsheet-RL/Spreadsheet-RL)

Spreadsheet-RL-4B is the RL-trained 4B spreadsheet agent checkpoint from **Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning**. It starts from [`Qwen/Qwen3-4B-Thinking-2507`](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507) and is post-trained with outcome-based reinforcement learning in Spreadsheet Gym, a multi-turn Microsoft Excel environment with spreadsheet-native tools, sandboxed code execution, and Excel-based recalculation rewards.

This checkpoint is intended to be used with the Spreadsheet-RL agent harness and tool environment. Loading it as a plain chat model can be useful for inspection, but it will not reproduce the paper results without Spreadsheet Gym, the tool set, and the reward/evaluation pipeline.

## News

- 2026-05-23: Released the Spreadsheet-RL-4B model checkpoint on Hugging Face at [`Spreadsheet-RL/Spreadsheet-RL-4B`](https://huggingface.co/Spreadsheet-RL/Spreadsheet-RL-4B).

## Model Details

| Field | Value |
| --- | --- |
| Base model | [`Qwen/Qwen3-4B-Thinking-2507`](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507) |
| Training method | GRPO with outcome-based rewards |
| Environment | Spreadsheet Gym with Microsoft Excel 365, spreadsheet-native tools, SandboxFusion code execution, and async Excel recalculation/reward service |
| Training data | Spreadsheet-RL training split: 5,928 filtered ExcelForum tasks |
| Evaluation | SpreadsheetBench and Domain-Spreadsheet |
| License | Apache-2.0, following the base model license |

## Training Configuration

For full details, please see the paper. The released 4B run uses:

| Hyperparameter | Value |
| --- | --- |
| Algorithm | GRPO; KL-regularized against a frozen reference model |
| Training steps | 60 |
| Prompt/response limits | 4,096 / 27,648 tokens |
| Rollout sampling | temperature 0.6; top-p 0.95; top-k 20 |
| Batching | 64 prompts/step; 16 rollouts/prompt; 1,024 rollouts/step |
| Multi-turn caps | max assistant turns 20; max user turns 20; max tool-response length 8,192 |
| Optimizer | AdamW; learning rate 1e-6; weight decay 0.01; betas (0.9, 0.999); grad clip 1.0 |
| KL loss | low-var KL; coefficient 0.001 |
| Actor update batching | mini-batch 32; dynamic batch sizing enabled |
| Hardware | 1 node x 4 NVIDIA H100 GPUs |
| Training time | about 40 hours wall-clock for the 4B run |

## Results

Spreadsheet-RL improves the same 4B base model through spreadsheet-native interaction design, comprehensive tool access, and RL post-training.

| Benchmark | Base | + Native Harness | + Full Tools | Spreadsheet-RL-4B |
| --- | ---: | ---: | ---: | ---: |
| SpreadsheetBench Pass@1 | 12.0 | 15.6 | 19.3 | 23.4 |

On Domain-Spreadsheet, Spreadsheet-RL improves overall Pass@1 from 8.4 to 17.2 over 1,660 evaluation rollouts.

## Usage

Install the standard Transformers stack and load the checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Spreadsheet-RL/Spreadsheet-RL-4B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
```

For task evaluation and agent rollouts, use the full Spreadsheet-RL codebase with the released dataset and Spreadsheet Gym:

```bash
hf download Spreadsheet-RL/Spreadsheet-RL --repo-type dataset --local-dir data
git clone https://github.com/Spreadsheet-RL/Spreadsheet-RL.git
```

The default training/evaluation harness is maintained in the code repository under `configs/`, `scripts/`, `reward/`, and `verl/`.

## Citation

```bibtex
@misc{chi2026spreadsheetrl,
  title         = {Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning},
  author        = {Banghao Chi and Yining Xie and Mingyuan Wu and Jingcheng Yang and Jize Jiang and Zhaoheng Li and Shengyi Qian and Minjia Zhang and Klara Nahrstedt and Rui Hou and Xiangjun Fan and Hanchao Yu},
  year          = {2026},
  eprint        = {2605.22642},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  doi           = {10.48550/arXiv.2605.22642},
  url           = {https://arxiv.org/abs/2605.22642}
}
```