art87able committed on
Commit
aedbd2e
verified
1 Parent(s): 8cc969e

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +21 -4
README.md CHANGED
@@ -74,11 +74,28 @@ rollouts cost zero quality. An independent re-run on a fresh GPU reproduced the
  mismatches**), so the lossless speedup (and therefore the rollout-cost cut) is reproducible, not a
  one-off.
 
- **The environment is executable, not a stub.** `spec_rl` is a `verifiers` **v1 taskset+harness** and runs
- end-to-end via `prime eval run`: a 12-problem HumanEval slice on Laguna XS.2 (greedy, thinking off) scores a
  **mean dense reward of 0.85** (`results/spec_rl_eval.json`; per-rollout rewards include a fractional 0.2,
- showing the dense unit-test signal). Point the same env at the DFlash endpoint via `configs/endpoints.toml`
- and the byte-identical greedy completions yield the **same reward**; the rollouts just arrive faster.
 
  **The honest open problem:** in RL the policy moves every batch (e.g. ART's LoRA), so a drafter
  trained on the *base* model drifts → acceptance τ decays → the speedup erodes across training. Within
 
  mismatches**), so the lossless speedup (and therefore the rollout-cost cut) is reproducible, not a
  one-off.
 
+ **The environment is executable, not a stub, and public.** `spec_rl` is a `verifiers` environment that runs
+ end-to-end via both `prime eval run` and hosted `prime train`, published to the Prime Environments Hub as
+ [`art87able/spec-rl`](https://app.primeintellect.ai/dashboard/environments/art87able/spec-rl) (reusable, one
+ `prime env install` away). A 12-problem HumanEval slice on Laguna XS.2 (greedy, thinking off) scores a
  **mean dense reward of 0.85** (`results/spec_rl_eval.json`; per-rollout rewards include a fractional 0.2,
+ showing the dense unit-test signal). On a fresh **non-HumanEval, Adaption-generated** code set the *same* env
+ scores **0.917** (`results/spec_rl_adaption_eval.json`) with one env-var swap, answering "is it just
+ HumanEval?". Point the env at the DFlash endpoint via `configs/endpoints.toml` and the byte-identical greedy
+ completions yield the **same reward**; the rollouts just arrive faster.
+
+ **We post-trained Laguna XS.2 for real, not just evaluated it.** We used a **free** hosted GRPO run (`prime train`,
+ 20 steps, batch 64 × 8 rollouts, lr 1e-6) on `art87able/spec-rl`, with online evaluation on a **disjoint
+ held-out split** (HumanEval 50–74, via `eval_base_model=true`). The held-out dense reward rose from **0.90
+ (untrained base) → 0.96 (post-trained)**; every one of the four post-training checkpoints (0.92–0.96) beat
+ the base, which is the minimum of all five eval points (`results/rl_after.json`, `results/rl_train_curve.json`).
+ We report the magnitude honestly: this is a **modest** gain. The split is near-saturated (~0.90 base leaves
+ little headroom) and greedy MoE eval is not bit-reproducible run-to-run (`results/determinism_check.json`:
+ identical reruns gave 0.85 / 1.0 / 1.0), so +0.06 sits inside the eval-noise band even as the trend is
+ consistently positive. The point is not a large jump; it is that **the environment trains the model, not just
+ scores it**, and the reward moved on data the policy never trained on. Run cost: **$0** (hosted Laguna
+ training is free). The lossless 2.76× decode remains the headline. This run is the capstone: the cheaper-RL
+ claim is now demonstrated by an actual RL run, not only derived from the decode A/B.
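
The noise-band remark in the added paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: it hard-codes the figures quoted in the text rather than reading the results files, uses a simple max-minus-min spread as a stand-in for "eval noise" (an assumption, not the author's stated method), and assumes the rerun spread measured in `results/determinism_check.json` carries over to the held-out split.

```python
from statistics import mean

# Figures quoted in the README text; nothing here is loaded from the results files.
rerun_scores = [0.85, 1.0, 1.0]   # identical greedy reruns (determinism check)
base_reward = 0.90                # held-out dense reward, untrained base
trained_reward = 0.96             # held-out dense reward, post-trained


def noise_band(scores):
    """Max-minus-min spread of repeated runs: a crude proxy for eval noise."""
    return max(scores) - min(scores)


gain = trained_reward - base_reward
band = noise_band(rerun_scores)

print(f"gain = {gain:+.2f}, rerun spread = {band:.2f}, rerun mean = {mean(rerun_scores):.2f}")
print("gain within noise band:", gain < band)
```

Under these assumptions the gain is smaller than the rerun spread, which is consistent with the text's own caveat that the improvement is modest and sits inside the noise band.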
 
  **The honest open problem:** in RL the policy moves every batch (e.g. ART's LoRA), so a drafter
  trained on the *base* model drifts → acceptance τ decays → the speedup erodes across training. Within