SETA-RL: Qwen3-8B Fine-tuned with Reinforcement Learning

This model is Qwen3-8B fine-tuned with reinforcement learning (GRPO via the AReaL framework) on the SETA training dataset.

It is a checkpoint released alongside the paper SETA: Scaling Environments for Terminal Agents (anonymous submission under double-blind review).

Model Details

  • Base model: Qwen/Qwen3-8B
  • Training method: GRPO (Group Relative Policy Optimization)
  • Training data: SETA-Synth (synthesized terminal-agent tasks)
  • Reward function: pass_ratio_with_bonus (+0.5 bonus when all unit tests pass; see the sketch below)
  • Context length: 32,768 tokens
  • Thinking: disabled during training (/no_think)
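
The reward function is named but not defined on this card. As a rough illustration only, assuming the "pass ratio" is the fraction of unit tests that pass and the bonus is added on top when all of them pass, it could look like the sketch below (the actual SETA implementation may differ):

def pass_ratio_with_bonus(passed: int, total: int, bonus: float = 0.5) -> float:
    # Hypothetical reconstruction: fraction of unit tests passed,
    # plus a flat bonus when every test passes.
    if total == 0:
        return 0.0
    ratio = passed / total
    return ratio + bonus if passed == total else ratio

Under this reading the raw reward lies in [0, 1.5]; the reward scaling and bias listed under Training Configuration would presumably be applied on top of this raw value.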

Intended Use

This model is designed for terminal-agent use: completing multi-step, shell-based tasks inside a Docker container environment using tools such as shell_exec, shell_view, and shell_write_content_to_file.
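
For illustration, a single step of such a rollout might look like the snippet below. Only the tool names come from this card; the message layout, argument names, and return format are assumptions made for the example.

# Hypothetical agent step; field and argument names are illustrative only.
step = [
    {"role": "user", "content": "Create hello.txt containing the text 'hello world'."},
    {"role": "assistant", "tool_call": {
        "name": "shell_write_content_to_file",            # tool name from the card
        "arguments": {"path": "hello.txt", "content": "hello world\n"},
    }},
    {"role": "tool", "content": ""},                       # environment output fed back to the model
    {"role": "assistant", "tool_call": {
        "name": "shell_exec",                              # tool name from the card
        "arguments": {"command": "cat hello.txt"},
    }},
]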

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AnonymousSubmissionUnderDouble-BlindRevi/seta-env-rl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
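
Continuing from the snippet above, a minimal generation example; enable_thinking=False is an assumption meant to mirror the /no_think setting used during training and relies on the standard Qwen3 chat template:

messages = [{"role": "user", "content": "List the files in the current directory sorted by size."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,  # assumption: mirrors the /no_think mode used during training
    return_tensors="pt",
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))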

For evaluation with the SETA framework, serve the model via SGLang and run the evaluation script:

python -m sglang.launch_server --model AnonymousSubmissionUnderDouble-BlindRevi/seta-env-rl --port 30000

python scripts/evaluation/eval.py \
    --config scripts/evaluation/configs/eval_default_qwen3_8b.yaml \
    terminal_env.model.model_type=AnonymousSubmissionUnderDouble-BlindRevi/seta-env-rl \
    terminal_env.model.url=http://localhost:30000/v1 \
    dataset=seta-env
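
The url above points at SGLang's OpenAI-compatible endpoint, so the served model can be sanity-checked before launching the full evaluation, for example with the openai Python client (a quick sketch; any OpenAI-compatible client works):

from openai import OpenAI

# Point an OpenAI-compatible client at the local SGLang server started above.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="AnonymousSubmissionUnderDouble-BlindRevi/seta-env-rl",
    messages=[{"role": "user", "content": "Print a one-line shell command that shows the current date."}],
    max_tokens=128,
)
print(response.choices[0].message.content)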

Training Configuration

Key hyperparameters (see scripts/areal/configs/config_train_local_seta_env.yaml in the companion code repository):

  • Learning rate: 1.70e-5
  • LR scheduler: constant
  • Optimizer: AdamW (β₁ = 0.9, β₂ = 0.999)
  • Weight decay: 0.017
  • ε-clip: 0.4
  • Reward scaling: 10.0 (see the sketch after this list)
  • Reward bias: −0.5
  • KL coefficient: 0.0
  • Trajectories per task: 16
  • Total epochs: 40
  • GPUs: 8 × (4 rollout + 2 trainer)
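
In a standard GRPO setup, the reward scaling and bias transform each trajectory's raw reward before advantages are computed relative to the other trajectories sampled for the same task. A sketch of that step, assuming the common scale-then-normalize form (the exact formulation used by AReaL may differ):

import numpy as np

def grpo_advantages(raw_rewards, scale=10.0, bias=-0.5):
    # Shift and scale the raw rewards (the order of bias and scale here is an assumption),
    # then normalize within the group of trajectories sampled for one task.
    r = scale * (np.asarray(raw_rewards, dtype=np.float64) + bias)
    return (r - r.mean()) / (r.std() + 1e-6)

# Example: a group of 16 rollouts for one task with mixed unit-test outcomes.
group = [1.5] * 4 + [0.4] * 6 + [0.0] * 6
print(grpo_advantages(group).round(2))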

Limitations

  • Evaluated on terminal-agent benchmarks; performance on general language tasks is not characterized.
  • The model operates without chain-of-thought reasoning (/no_think mode).

License

Apache 2.0 (inherited from Qwen3-8B base).
