poolside-laguna-hackathon/lean-laguna

@@ -74,11 +74,28 @@ rollouts cost zero quality. An independent re-run on a fresh GPU reproduced the
 mismatches**), so the lossless speedup — and therefore the rollout-cost cut — is reproducible, not a
 one-off.
-**The environment is executable, not a stub.** `spec_rl` is a `verifiers` **v1 taskset+harness** and runs
-end-to-end via `prime eval run`: a 12-problem HumanEval slice on Laguna XS.2 (greedy, thinking off) scores a
 **mean dense reward of 0.85** (`results/spec_rl_eval.json`; per-rollout rewards include a fractional 0.2,
-showing the dense unit-test signal). Point the same env at the DFlash endpoint via `configs/endpoints.toml`
-and the byte-identical greedy completions yield the **same reward** — the rollouts just arrive faster.
 **The honest open problem:** in RL the policy moves every batch (e.g. ART's LoRA), so a drafter
 trained on the *base* model drifts → acceptance τ decays → the speedup erodes across training. Within

 mismatches**), so the lossless speedup — and therefore the rollout-cost cut — is reproducible, not a
 one-off.
+**The environment is executable, not a stub — and public.** `spec_rl` is a `verifiers` environment that runs
+end-to-end via both `prime eval run` and hosted `prime train`, published to the Prime Environments Hub as
+[`art87able/spec-rl`](https://app.primeintellect.ai/dashboard/environments/art87able/spec-rl) (reusable, one
+`prime env install` away). A 12-problem HumanEval slice on Laguna XS.2 (greedy, thinking off) scores a
 **mean dense reward of 0.85** (`results/spec_rl_eval.json`; per-rollout rewards include a fractional 0.2,
+showing the dense unit-test signal). On a fresh **non-HumanEval, Adaption-generated** code set the *same* env
+scores **0.917** (`results/spec_rl_adaption_eval.json`) with one env-var swap — answering "is it just
+HumanEval?". Point the env at the DFlash endpoint via `configs/endpoints.toml` and the byte-identical greedy
+completions yield the **same reward** — the rollouts just arrive faster.
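
The "dense unit-test signal" above can be illustrated with a minimal sketch. It assumes the reward is the fraction of unit tests a completion passes; that scoring rule is an assumption for illustration and is not confirmed by the README. The names `dense_reward`, `my_abs`, and the test list are hypothetical and are not part of `spec_rl` or `verifiers`.

```python
from typing import Callable, Dict, List


def dense_reward(candidate_src: str, entry_point: str,
                 tests: List[Callable[[Callable], bool]]) -> float:
    """Return the fraction of tests passed (assumed scoring rule, not spec_rl's)."""
    namespace: Dict[str, object] = {}
    try:
        # Define the candidate function; a completion that fails to load scores 0.
        exec(candidate_src, namespace)
        fn = namespace[entry_point]
    except Exception:
        return 0.0
    passed = 0
    for test in tests:
        try:
            if test(fn):
                passed += 1
        except Exception:
            # A test that raises counts as a failure, not a crash of the scorer.
            pass
    return passed / len(tests) if tests else 0.0


# Hypothetical task: implement absolute value; five unit tests.
TESTS = [
    lambda f: f(3) == 3,
    lambda f: f(-3) == 3,
    lambda f: f(-1) == 1,
    lambda f: f(-7) == 7,
    lambda f: f(-0.5) == 0.5,
]

buggy = "def my_abs(x):\n    return x\n"            # correct only for x >= 0
correct = "def my_abs(x):\n    return -x if x < 0 else x\n"

partial_score = dense_reward(buggy, "my_abs", TESTS)    # 1 of 5 tests pass
full_score = dense_reward(correct, "my_abs", TESTS)     # 5 of 5 tests pass
```

Under this assumed rule, a completion that passes one of five tests earns a fractional reward instead of a flat zero, which is the kind of partial credit a value like 0.2 suggests.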
+**We post-trained Laguna XS.2 for real, not just evaluated it.** We launched a **free** hosted GRPO run (`prime train`,
+20 steps, batch 64 × 8 rollouts, lr 1e-6) on `art87able/spec-rl`, with online evaluation on a **disjoint
+held-out split** (HumanEval 50–74, via `eval_base_model=true`). The held-out dense reward rose from **0.90
+(untrained base) → 0.96 (post-trained)** — every one of the four post-training checkpoints (0.92–0.96) beat
+the base, which is the minimum of all five eval points (`results/rl_after.json`, `results/rl_train_curve.json`).
+We report the magnitude honestly: this is a **modest** gain — the split is near-saturated (~0.90 base leaves
+little headroom) and greedy MoE eval is not bit-reproducible run-to-run (`results/determinism_check.json`:
+identical reruns gave 0.85 / 1.0 / 1.0), so +0.06 sits inside the eval-noise band even as the trend is
+consistently positive. The point is not a large jump; it is that **the environment trains the model, not just
+scores it** — and the reward moved on data the policy never trained on. Run cost: **$0** (hosted Laguna
+training is free). The lossless 2.76× decode speedup remains the headline; this run is the capstone: the cheaper-RL
+claim is now demonstrated by an actual RL run, not only derived from the decode A/B.
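
The noise-band argument can be checked with a few lines of arithmetic. This sketch uses only the numbers quoted above (0.90 base, 0.96 post-trained, reruns of 0.85 / 1.0 / 1.0) and assumes the rerun spread is a usable proxy for eval noise on the held-out split; the variable names are illustrative.

```python
# Numbers quoted in the README text above.
base_reward = 0.90                    # untrained base, held-out split
post_reward = 0.96                    # post-trained checkpoint, held-out split
rerun_rewards = [0.85, 1.0, 1.0]      # identical greedy reruns (determinism check)

# Observed gain from post-training.
gain = post_reward - base_reward

# Crude noise band: range of rewards across nominally identical reruns.
# Assumption: this spread approximates run-to-run noise on the held-out split.
noise_spread = max(rerun_rewards) - min(rerun_rewards)

# The gain is "inside the noise band" if it is smaller than that spread.
within_noise = gain < noise_spread

print(f"gain = {gain:+.2f}, rerun spread = {noise_spread:.2f}, "
      f"within noise band: {within_noise}")
```

With these inputs the gain is +0.06 against a rerun spread of 0.15, consistent with describing the gain as modest rather than conclusive.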
 **The honest open problem:** in RL the policy moves every batch (e.g. ART's LoRA), so a drafter
 trained on the *base* model drifts → acceptance τ decays → the speedup erodes across training. Within