ablation-pymethods2test-seqnorm-15-8B

RLOO length-bias ablation, arm1: the seqnorm (base) arm. RL-trained from the a3 pre-RL base model on the exp_rpt_pymethods2test-large task set with SkyRL, using the RLOO-n advantage estimator and the seq_mean_token_sum_norm_global loss reduction.
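For readers unfamiliar with the two training knobs named above, here is a minimal sketch, not SkyRL's implementation. It assumes the standard RLOO leave-one-out baseline (each sample's advantage is its reward minus the mean reward of the other n-1 samples for the same prompt). It also assumes that seq_mean_token_sum_norm behaves like a Dr. GRPO-style reduction: token losses are summed within each sequence, divided by a fixed constant rather than the sequence's own length, and then averaged over sequences. The `norm_const` parameter is an illustrative stand-in, and the sketch does not attempt to model whatever the `_global` suffix adds.

```python
from typing import List, Sequence


def rloo_advantages(rewards: Sequence[float]) -> List[float]:
    """Leave-one-out advantage for the n samples of one prompt:
    A_i = r_i - mean_{j != i} r_j."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least 2 samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def seq_mean_token_sum_norm(
    token_losses: Sequence[Sequence[float]], norm_const: float
) -> float:
    """Assumed reduction: per-sequence token-loss sum divided by a fixed
    constant (not the sequence length), then averaged over sequences."""
    per_seq = [sum(seq) / norm_const for seq in token_losses]
    return sum(per_seq) / len(per_seq)


def token_mean(token_losses: Sequence[Sequence[float]]) -> float:
    """Contrast case: plain average over every token in the batch."""
    flat = [x for seq in token_losses for x in seq]
    return sum(flat) / len(flat)


if __name__ == "__main__":
    # Four samples for one prompt, two of which pass the tests.
    print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))
    # One 4-token and one 1-token sequence, normalized by a fixed constant of 4.
    print(seq_mean_token_sum_norm([[1.0, 1.0, 1.0, 1.0], [2.0]], 4.0))
```

Because the divisor is fixed, every token carries the same weight regardless of how long its sequence is. Dividing by each sequence's own length would instead up-weight tokens in short responses, which is the kind of length bias this ablation targets.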

Base model: laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink (a Qwen3-8B SFT)
Dataset: DCAgent/exp_rpt_pymethods2test-large
Checkpoint: global_step_15, selected by the trailing-5 EMA of reward/avg_raw_reward (alpha=1/3) computed over the full chain (max_steps=80, hf_save_interval=5)
Training framework: SkyRL (FSDP2, vLLM) on Jupiter (GH200)
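The checkpoint selection rule (trailing-5 EMA of reward/avg_raw_reward, alpha=1/3, evaluated at saved steps) can be sketched as follows. This is a hypothetical reading of the rule, not the script actually used. It assumes that each saved step is scored by an EMA seeded with the first value of the 5 most recent logged rewards up to and including that step, and that the highest-scoring step wins. The function names `ema` and `select_checkpoint` are illustrative, and the rewards in the usage example are synthetic, not taken from this run.

```python
from typing import Dict, List, Optional


def ema(values: List[float], alpha: float) -> float:
    """EMA seeded with the first value: m_t = alpha * x_t + (1 - alpha) * m_{t-1}."""
    m = values[0]
    for x in values[1:]:
        m = alpha * x + (1 - alpha) * m
    return m


def select_checkpoint(
    reward_by_step: Dict[int, float],
    saved_steps: List[int],
    window: int = 5,
    alpha: float = 1 / 3,
) -> int:
    """Hypothetical rule: score each saved step by the EMA of the `window`
    most recent logged rewards ending at that step; return the best step."""
    logged = sorted(reward_by_step)
    best_step: Optional[int] = None
    best_score = float("-inf")
    for step in saved_steps:
        history = [reward_by_step[s] for s in logged if s <= step][-window:]
        if not history:
            continue  # no rewards logged yet at this checkpoint
        score = ema(history, alpha)
        if score > best_score:
            best_step, best_score = step, score
    if best_step is None:
        raise ValueError("no saved step has logged rewards")
    return best_step


if __name__ == "__main__":
    # Synthetic rewards that rise until step 15 and then collapse.
    rewards = {s: s / 20 for s in range(1, 16)}
    rewards.update({s: 0.1 for s in range(16, 21)})
    # With hf_save_interval=5, candidate checkpoints are every 5th step.
    print(select_checkpoint(rewards, saved_steps=list(range(5, 21, 5))))
```

Under this reading, smoothing over a short trailing window keeps a single lucky step from dominating the choice, while still tracking a late collapse in reward.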

The launch config is included as rl_config.yaml.

Training Traces

Training-time Daytona/Harbor rollouts for this run are uploaded as a companion dataset: penfever/ablation-pymethods2test-seqnorm

The dataset contains the last episode of each trial (exported with make_and_upload_trace_dataset --episodes last), i.e. the same rollouts the policy was trained on after rollback/truncation.


Model size: 8B params (Safetensors, BF16)
