llamacle_drgrpo_v1_step20 - Dr. GRPO RL on top of llamacle_v6_clean (step 20)

Best loracle checkpoint by AB Llama-70B any-match: 74.1% (vs. 55.2% for the pretrain baseline).

AB Llama-70B (3 prompts x 2 rollouts, judge=Sonnet 4.6)

Overall any-match: 0.741 (baseline 0.552)
synth_docs_high: 0.867 (baseline 0.667)
synth_docs_kto: 0.857 (baseline 0.857)
transcripts_high: 0.467 (baseline 0.200)
transcripts_kto: 0.786 (baseline 0.500)
rollout_mean: 0.517
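The card does not define the two scores above. A minimal sketch of one plausible reading follows, assuming an item counts toward any-match if at least one of its 3 × 2 = 6 judged rollouts is a match, and that rollout_mean is the fraction of all individual rollouts judged a match. The function name and input layout are hypothetical, not the evaluation code actually used.

```python
from statistics import mean


def any_match_and_rollout_mean(judgements: dict[str, list[bool]]) -> tuple[float, float]:
    """Aggregate judge verdicts into (any_match, rollout_mean).

    Hypothetical layout: `judgements` maps an item id to the judge's
    verdict for each of its rollouts (assumed 3 prompts x 2 rollouts).
    """
    # any-match: an item is a hit if at least one rollout was judged a match
    any_match = mean(1.0 if any(v) else 0.0 for v in judgements.values())
    # rollout_mean: fraction of all individual rollouts judged a match
    all_verdicts = [x for v in judgements.values() for x in v]
    rollout_mean = mean(1.0 if x else 0.0 for x in all_verdicts)
    return any_match, rollout_mean
```

Under this reading, any-match is always at least rollout_mean, which is consistent with the 0.741 vs. 0.517 reported above.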

Continued from ceselder/llamacle_v6_clean_step1875 with online Dr. GRPO RL on the 2,500 held-out FineWeb LoRAs. Setup: 32 prompts per cycle x K=16 rollouts (sub-batched as 4 x K=4), lr=7e-6, eps=0.2/0.28, NF4 quantization with DDP across 6 B200 GPUs.
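A minimal sketch of a Dr. GRPO-style update for one group of K rollouts, following the published Dr. GRPO recipe rather than this model's training code. It assumes that eps=0.2/0.28 is an asymmetric (low/high) clip range and that per-token log-probs are available as plain lists. Masking, NF4, DDP and sub-batching are omitted, and all names are hypothetical.

```python
import math
from statistics import mean


def drgrpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages, Dr. GRPO style: subtract the group mean
    reward but do NOT divide by the group standard deviation (GRPO does)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def drgrpo_loss(
    logp_new: list[list[float]],
    logp_old: list[list[float]],
    advantages: list[float],
    max_len: int,
    eps_low: float = 0.2,   # assumed meaning of eps=0.2 (lower clip)
    eps_high: float = 0.28,  # assumed meaning of eps=0.28 (upper clip)
) -> float:
    """Clipped policy-gradient loss for one group of rollouts.

    Token terms are summed and divided by a fixed constant (K * max_len)
    instead of each rollout's own length, which is how Dr. GRPO removes
    GRPO's per-response length normalization.
    """
    total = 0.0
    for new_seq, old_seq, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_seq, old_seq):
            ratio = math.exp(lp_new - lp_old)  # importance ratio per token
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            # PPO-style pessimistic objective, negated because we minimize
            total += -min(ratio * adv, clipped * adv)
    return total / (len(advantages) * max_len)
```

With K=16 and 32 prompts per cycle, one cycle would score 32 x 16 = 512 rollouts, with one advantage group per prompt.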

Model tree: meta-llama/Llama-3.1-70B (base) → meta-llama/Llama-3.3-70B-Instruct (finetune) → ceselder/llamacle_drgrpo_v1_step20 (finetune)