llamacle_drgrpo_v1_step30 - Dr. GRPO RL on top of llamacle_v6_clean (step 30)

Continuation of ceselder/llamacle_v6_clean_step1875, trained with online Dr. GRPO RL on the 2,500 held-out FineWeb LoRAs from the v6 pretrain. Each cycle uses 32 prompts x K=16 rollouts (sub-batched as 4 x K=4), with lr=7e-6 and eps=0.2/0.28. Training runs NF4-DDP across 6 B200s and uses forward_ckpt_inject for the backward pass (activation checkpointing).
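For readers unfamiliar with the objective, here is a minimal sketch of the per-rollout quantities a Dr. GRPO-style update works with. It is not this run's training code. It assumes that eps=0.2/0.28 denotes the lower and upper clip ranges of the policy ratio, and it uses the Dr. GRPO convention of mean-centring rewards within a group of K rollouts without dividing by the group standard deviation. The function names are hypothetical.

```python
def dr_grpo_advantages(rewards):
    """Group-relative advantages: reward minus the group mean.

    Dr. GRPO omits the division by the group standard deviation
    that standard GRPO applies.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate for a single token.

    ASSUMPTION: eps_low/eps_high are our reading of "eps=0.2/0.28";
    the model card does not spell out which value is which.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


# Example with one group of K=4 judge scores (made-up values).
group_rewards = [1.0, 2.0, 3.0, 6.0]
advantages = dr_grpo_advantages(group_rewards)
objective_terms = [clipped_surrogate(1.5, a) for a in advantages]
```

With asymmetric clipping, a positive-advantage token can raise its probability ratio up to 1 + eps_high before the gradient is cut off, while a negative-advantage token is clipped once the ratio falls below 1 - eps_low.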

This is step 30 of 80 (mid-run checkpoint).

Score progression by step (mean judge score, 1-10 scale, over kept rollouts)

1: 4.08, 2: 3.85, 3: 4.45, 4: 4.95, 5: 4.95, 6: 5.17, 7: 5.77, 8: 4.21, 9: 4.82, 10: 4.73
11: 3.09, 12: 3.46, 13: 4.80, 14: 5.24, 15: 5.23, 16: 1.91, 17: 5.34, 18: 4.85, 19: 3.77, 20: 4.68
21: 3.90, 22: 2.34, 23: 4.79, 24: 4.93, 25: 4.40, 26: 4.70, 27: 4.64, 28: 4.01, 29: 5.62, 30: 4.51
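Because the per-step scores are noisy, it can help to look at windowed means and the extreme steps rather than individual values. The snippet below is a small sketch that does this for the 30 scores listed above; the helper name and the window size of 10 are arbitrary choices for illustration, not part of the training setup.

```python
# Mean judge score per step, copied from the list above (steps 1..30).
scores = [
    4.08, 3.85, 4.45, 4.95, 4.95, 5.17, 5.77, 4.21, 4.82, 4.73,
    3.09, 3.46, 4.80, 5.24, 5.23, 1.91, 5.34, 4.85, 3.77, 4.68,
    3.90, 2.34, 4.79, 4.93, 4.40, 4.70, 4.64, 4.01, 5.62, 4.51,
]


def window_means(values, size):
    """Mean of each consecutive, non-overlapping window of `size` values."""
    means = []
    for start in range(0, len(values), size):
        window = values[start:start + size]
        means.append(sum(window) / len(window))
    return means


means_by_10 = window_means(scores, 10)

# Steps are 1-indexed in the model card, so add 1 to the list index.
lowest_step = min(range(len(scores)), key=lambda i: scores[i]) + 1
highest_step = max(range(len(scores)), key=lambda i: scores[i]) + 1

for block, mean in enumerate(means_by_10):
    first = block * 10 + 1
    print(f"steps {first}-{first + 9}: mean {mean:.2f}")
print(f"lowest score at step {lowest_step}, highest at step {highest_step}")
```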
