Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -74,11 +74,28 @@ rollouts cost zero quality. An independent re-run on a fresh GPU reproduced the
|
|
| 74 |
mismatches**), so the lossless speedup, and therefore the rollout-cost cut, is reproducible, not a
|
| 75 |
one-off.
|
| 76 |
|
| 77 |
-
**The environment is executable, not a stub.** `spec_rl` is a `verifiers`
|
| 78 |
-
end-to-end via `prime eval run`
|
| 79 |
**mean dense reward of 0.85** (`results/spec_rl_eval.json`; per-rollout rewards include a fractional 0.2,
|
| 80 |
-
showing the dense unit-test signal).
|
| 81 |
-
|
| 82 |
|
| 83 |
**The honest open problem:** in RL the policy moves every batch (e.g. ART's LoRA), so a drafter
|
| 84 |
trained on the *base* model drifts → acceptance τ decays → the speedup erodes across training. Within
|
|
|
|
| 74 |
mismatches**), so the lossless speedup, and therefore the rollout-cost cut, is reproducible, not a
|
| 75 |
one-off.
|
| 76 |
|
| 77 |
+
**The environment is executable, not a stub, and public.** `spec_rl` is a `verifiers` environment that runs
|
| 78 |
+
end-to-end via both `prime eval run` and hosted `prime train`, published to the Prime Environments Hub as
|
| 79 |
+
[`art87able/spec-rl`](https://app.primeintellect.ai/dashboard/environments/art87able/spec-rl) (reusable, one
|
| 80 |
+
`prime env install` away). A 12-problem HumanEval slice on Laguna XS.2 (greedy, thinking off) scores a
|
| 81 |
**mean dense reward of 0.85** (`results/spec_rl_eval.json`; per-rollout rewards include a fractional 0.2,
|
| 82 |
+
showing the dense unit-test signal). On a fresh **non-HumanEval, Adaption-generated** code set the *same* env
|
| 83 |
+
scores **0.917** (`results/spec_rl_adaption_eval.json`) with one env-var swap, answering "is it just
|
| 84 |
+
HumanEval?". Point the env at the DFlash endpoint via `configs/endpoints.toml` and the byte-identical greedy
|
| 85 |
+
completions yield the **same reward**; the rollouts just arrive faster.
|
| 86 |
+
|
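As a rough illustration of what a "dense unit-test signal" can look like, the sketch below scores a candidate function by the fraction of test cases it passes. This is an assumption made for illustration, not the actual `spec_rl` reward: the names `dense_reward` and `buggy_abs` and the test cases are hypothetical, and the real environment may weight or aggregate tests differently.

```python
from typing import Callable, Sequence, Tuple


def dense_reward(candidate: Callable, tests: Sequence[Tuple[tuple, object]]) -> float:
    """Hypothetical dense reward: the fraction of (args, expected) cases that pass.

    Assumption: partial credit is proportional to the number of passing tests.
    A candidate that raises on a case simply fails that case.
    """
    if not tests:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            # A crash counts as a failed test, not as a failed rollout.
            pass
    return passed / len(tests)


def buggy_abs(x: int) -> int:
    # Deliberately wrong for negative inputs.
    return x


# Five illustrative cases; only the non-negative one passes.
abs_tests = [((3,), 3), ((-1,), 1), ((-2,), 2), ((-5,), 5), ((-7,), 7)]
reward = dense_reward(buggy_abs, abs_tests)
```

Under these assumptions a partly correct completion earns a fractional reward (one of five cases here) rather than a flat zero, which is the kind of signal the fractional per-rollout reward above points to.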
| 87 |
+
**We post-trained Laguna XS.2 for real, not just evaluated it.** A **free** hosted GRPO run (`prime train`,
|
| 88 |
+
20 steps, batch 64 × 8 rollouts, lr 1e-6) on `art87able/spec-rl`, with online evaluation on a **disjoint
|
| 89 |
+
held-out split** (HumanEval 50–74, via `eval_base_model=true`). The held-out dense reward rose from **0.90
|
| 90 |
+
(untrained base) → 0.96 (post-trained)**; every one of the four post-training checkpoints (0.92–0.96) beat
|
| 91 |
+
the base, which is the minimum of all five eval points (`results/rl_after.json`, `results/rl_train_curve.json`).
|
| 92 |
+
We report the magnitude honestly: this is a **modest** gain. The split is near-saturated (~0.90 base leaves
|
| 93 |
+
little headroom) and greedy MoE eval is not bit-reproducible run-to-run (`results/determinism_check.json`:
|
| 94 |
+
identical reruns gave 0.85 / 1.0 / 1.0), so +0.06 sits inside the eval-noise band even as the trend is
|
| 95 |
+
consistently positive. The point is not a large jump; it is that **the environment trains the model, not just
|
| 96 |
+
scores it**, and the reward moved on data the policy never trained on. Run cost: **$0** (hosted Laguna
|
| 97 |
+
training is free). The lossless 2.76× decode remains the headline; this is the capstone that the cheaper-RL
|
| 98 |
+
claim is now demonstrated by an actual RL run, not only derived from the decode A/B.
|
| 99 |
|
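The noise-band argument above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text (base 0.90, post-trained 0.96, reruns of 0.85 / 1.0 / 1.0). Treating the max-minus-min spread of the reruns as the noise band is an assumption made here for illustration, as is the helper name `within_noise_band`; the text does not say whether the determinism reruns used the same split as the held-out evaluation.

```python
from typing import Sequence, Tuple


def within_noise_band(base: float, post: float, reruns: Sequence[float]) -> Tuple[float, float, bool]:
    """Return (gain, spread, gain < spread).

    Assumption: the run-to-run noise band is approximated by the range
    (max - min) of rewards from nominally identical reruns.
    """
    gain = post - base
    spread = max(reruns) - min(reruns)
    return gain, spread, gain < spread


# Figures quoted in the text above.
gain, spread, inside = within_noise_band(base=0.90, post=0.96, reruns=[0.85, 1.0, 1.0])
print(f"gain={gain:.2f} spread={spread:.2f} inside_noise_band={inside}")
```

With these inputs the gain (about 0.06) is smaller than the rerun spread (about 0.15), consistent with reading the improvement as a positive trend rather than a statistically separated jump.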
| 100 |
**The honest open problem:** in RL the policy moves every batch (e.g. ART's LoRA), so a drafter
|
| 101 |
trained on the *base* model drifts → acceptance τ decays → the speedup erodes across training. Within
|