SETA-RL: Qwen3-8B Fine-tuned with Reinforcement Learning

This model is Qwen3-8B fine-tuned with reinforcement learning (GRPO via the AReaL framework) on the SETA training dataset.

It is a checkpoint released alongside the paper SETA: Scaling Environments for Terminal Agents (anonymous submission under double-blind review).

Model Details

  • Base model: Qwen/Qwen3-8B
  • Training method: GRPO (Group Relative Policy Optimization)
  • Training data: SETA-Synth (synthesized terminal-agent tasks)
  • Reward function: pass_ratio_with_bonus (+0.5 bonus when all unit tests pass; see the sketch below)
  • Context length: 32,768 tokens
  • Thinking: disabled during training (/no_think)
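
The reward function is named but not defined on this card. As a rough illustration only, assuming the "pass ratio" is the fraction of unit tests that pass and the bonus is added on top when all of them pass, it could look like the sketch below (the actual SETA implementation may differ):

def pass_ratio_with_bonus(passed: int, total: int, bonus: float = 0.5) -> float:
    # Hypothetical reconstruction: fraction of unit tests passed,
    # plus a flat bonus when every test passes.
    if total == 0:
        return 0.0
    ratio = passed / total
    return ratio + bonus if passed == total else ratio

Under this reading the raw reward lies in [0, 1.5]; the reward scaling and bias listed under Training Configuration would presumably be applied on top of this raw value.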

Intended Use

This model is designed for terminal-agent use: completing multi-step, shell-based tasks inside a Docker container environment using tools such as shell_exec, shell_view, and shell_write_content_to_file.
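
For illustration, a single step of such a rollout might look like the snippet below. Only the tool names come from this card; the message layout, argument names, and return format are assumptions made for the example.

# Hypothetical agent step; field and argument names are illustrative only.
step = [
    {"role": "user", "content": "Create hello.txt containing the text 'hello world'."},
    {"role": "assistant", "tool_call": {
        "name": "shell_write_content_to_file",            # tool name from the card
        "arguments": {"path": "hello.txt", "content": "hello world\n"},
    }},
    {"role": "tool", "content": ""},                       # environment output fed back to the model
    {"role": "assistant", "tool_call": {
        "name": "shell_exec",                              # tool name from the card
        "arguments": {"command": "cat hello.txt"},
    }},
]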

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AnonymousSubmissionUnderDouble-BlindRevi/seta-env-rl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
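
Continuing from the snippet above, a minimal generation example; enable_thinking=False is an assumption meant to mirror the /no_think setting used during training and relies on the standard Qwen3 chat template:

messages = [{"role": "user", "content": "List the files in the current directory sorted by size."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,  # assumption: mirrors the /no_think mode used during training
    return_tensors="pt",
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))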

For evaluation with the SETA framework, serve the model via SGLang and run the evaluation script:

python -m sglang.launch_server --model AnonymousSubmissionUnderDouble-BlindRevi/seta-env-rl --port 30000

python scripts/evaluation/eval.py \
    --config scripts/evaluation/configs/eval_default_qwen3_8b.yaml \
    terminal_env.model.model_type=AnonymousSubmissionUnderDouble-BlindRevi/seta-env-rl \
    terminal_env.model.url=http://localhost:30000/v1 \
    dataset=seta-env
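
The url above points at SGLang's OpenAI-compatible endpoint, so the served model can be sanity-checked before launching the full evaluation, for example with the openai Python client (a quick sketch; any OpenAI-compatible client works):

from openai import OpenAI

# Point an OpenAI-compatible client at the local SGLang server started above.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="AnonymousSubmissionUnderDouble-BlindRevi/seta-env-rl",
    messages=[{"role": "user", "content": "Print a one-line shell command that shows the current date."}],
    max_tokens=128,
)
print(response.choices[0].message.content)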

Training Configuration

Key hyperparameters (see scripts/areal/configs/config_train_local_seta_env.yaml in the companion code repository):

  • Learning rate: 1.70e-5
  • LR scheduler: constant
  • Optimizer: AdamW (β₁ = 0.9, β₂ = 0.999)
  • Weight decay: 0.017
  • ε-clip: 0.4
  • Reward scaling: 10.0 (see the sketch after this list)
  • Reward bias: −0.5
  • KL coefficient: 0.0
  • Trajectories per task: 16
  • Total epochs: 40
  • GPUs: 8 × (4 rollout + 2 trainer)
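
In a standard GRPO setup, the reward scaling and bias transform each trajectory's raw reward before advantages are computed relative to the other trajectories sampled for the same task. A sketch of that step, assuming the common scale-then-normalize form (the exact formulation used by AReaL may differ):

import numpy as np

def grpo_advantages(raw_rewards, scale=10.0, bias=-0.5):
    # Shift and scale the raw rewards (the order of bias and scale here is an assumption),
    # then normalize within the group of trajectories sampled for one task.
    r = scale * (np.asarray(raw_rewards, dtype=np.float64) + bias)
    return (r - r.mean()) / (r.std() + 1e-6)

# Example: a group of 16 rollouts for one task with mixed unit-test outcomes.
group = [1.5] * 4 + [0.4] * 6 + [0.0] * 6
print(grpo_advantages(group).round(2))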

Limitations

  • Evaluated on terminal-agent benchmarks; performance on general language tasks is not characterized.
  • The model operates without chain-of-thought reasoning (/no_think mode).

License

Apache 2.0 (inherited from Qwen3-8B base).
