---
license: apache-2.0
base_model: poolside/Laguna-XS.2
pipeline_tag: text-generation
tags:
- laguna
- laguna-xs.2
- poolside
- moe
- speculative-decoding
- dflash
- inference
- vllm
- lossless
---
# Lean Laguna: Laguna XS.2 + DFlash, lossless single-GPU speedup

*Project: **Lean Laguna**, making Laguna XS.2 cheaper to run and to post-train on a single GPU.*

> **One-line claim:** Laguna XS.2 generates **2.76× faster on a single GPU** (**19.6 → 54.2
> tokens/sec**) with **byte-identical greedy output** (0 / 14 mismatches) on a mixed-difficulty code
> set (2.47× corroborated on a trivial set; **lossless in both**) vs the no-speculator baseline.

Speculative decoding with Poolside's **DFlash** speculator on **Laguna XS.2**, served in vLLM on
one GPU. The throughput win is measured; the output is provably **lossless under greedy decoding**
(token-for-token identical to baseline) and distribution-preserving under sampling.

Unlike lossy compression (expert pruning or low-bit quantization, which trade output fidelity for a
smaller footprint), this approach changes **nothing** about what the model emits: it cuts the *number*
of expensive forward passes, not the model itself. Lossless speed, not a smaller-but-different model.

Submission for the Poolside Research Hackathon, Foundations track
(`poolside-laguna-hackathon` HF org).

## Goal & judging criteria

> **Meaningfully improve Laguna XS.2, either by:** expanding model use cases (computer use,
> multi-agent coordination, evaluation design); *or* **reducing cost & latency** (optimizations,
> speed, quantization). **For:** an economically valuable task (a function/application); *or*
> **any novel research idea.**
>
> **Scored on: GENERALISABILITY · REPRODUCIBILITY · TECHNICAL CONTRIBUTIONS.**

Lean Laguna sits on **reduce cost & latency** for **a novel research idea** (lossless
speculative decoding → cheaper RL rollouts), and is built to score on all three axes:

- **Generalisability:** any target + drafter via one `--speculative-config`; the `spec_rl` env +
  `configs/endpoints.toml` point any RL run at any OpenAI-compatible endpoint; the reward is a
  swappable seam (a *reusable RL environment + reward signal*, which is a listed submission idea).
- **Reproducibility:** greedy byte-parity + directly-measured throughput behind `make` targets and a
  one-command HF-Jobs run (below); anyone can re-run the before/after table. (τ from `/metrics` read at
  the γ+1 ceiling on both runs; we treat it as unreliable and **don't quote it**. A HumanEval pass@1
  sweep is a documented next step; greedy parity is the stronger guarantee.)
- **Technical contributions:** a measured, provably-lossless throughput win (**2.76×** on a
  mixed-difficulty code set, 0 mismatches; 2.47× corroborated on a trivial set) on the *released*
  Laguna XS.2 + DFlash, carried into **cheaper RL rollouts**. The posed research stretches are the
  open problem of **speculative decoding under a moving RL policy** (drafter staleness) and NVFP4
  attention-weight calibration.

### Cheaper RL rollouts: the generalisability + frontier story

The speedup is a *decode-time* property, so it carries into any RL trainer whose rollout phase is
OpenAI-compatible vLLM inference, e.g. **`verifiers`** envs (our `spec_rl`, or third-party Hub envs
like [`pandelis/zerolang-editing`](https://app.primeintellect.ai/dashboard/environments/pandelis/zerolang-editing):
install + repoint `endpoints.toml`, zero code change) and **[OpenPipe ART](https://github.com/openpipe/art)**
(GRPO + LoRA, rollouts served via vLLM). Drop `--speculative-config` into the rollout server →
cheaper rollouts.

**As a cost number (derived from the measured A/B, not a separate RL run):** rollout generation is
decode-bound, so the measured **2.76× decode throughput** means each rollout needs ≈1/2.76 of the
GPU-seconds; at any fixed GPU $/hr that is a **~64% lower cost per rollout** (1 − 1/2.76 ≈ 0.64).
Because the completions are byte-identical under greedy decoding, the reward signal is unchanged
*by construction*, so the cheaper rollouts cost zero quality. An independent re-run on a fresh GPU
reproduced the effect (**2.63×, 0/14 mismatches**), so the lossless speedup, and therefore the
rollout-cost cut, is reproducible, not a one-off.

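The cost arithmetic above can be checked in a few lines. This is a minimal sketch, assuming (as the paragraph does) that rollout cost is proportional to GPU-seconds and that generation is fully decode-bound; the helper name `rollout_cost_reduction` is hypothetical.

```python
def rollout_cost_reduction(speedup: float) -> float:
    """Fractional cost saving per rollout when decode runs `speedup` times faster.

    Assumption: cost scales with GPU-seconds and generation is decode-bound.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


headline = rollout_cost_reduction(2.76)  # measured mixed-set A/B
rerun = rollout_cost_reduction(2.63)     # independent re-run on a fresh GPU

print(f"headline: {headline:.1%}  re-run: {rerun:.1%}")
```

Both speedups land in the low-to-mid 60% range, so the "~64%" figure does not hinge on which of the two runs is used.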
**The environment is executable, not a stub, and public.** `spec_rl` is a `verifiers` environment that runs
end-to-end via both `prime eval run` and hosted `prime train`, published to the Prime Environments Hub as
[`art87able/spec-rl`](https://app.primeintellect.ai/dashboard/environments/art87able/spec-rl) (reusable, one
`prime env install` away). A 12-problem HumanEval slice on Laguna XS.2 (greedy, thinking off) scores a
**mean dense reward of 0.85** (`results/spec_rl_eval.json`; per-rollout rewards include a fractional 0.2,
showing the dense unit-test signal). On a fresh **non-HumanEval, Adaption-generated** code set the *same* env
scores **0.917** (`results/spec_rl_adaption_eval.json`) with one env-var swap, which answers "is it just
HumanEval?". Point the env at the DFlash endpoint via `configs/endpoints.toml` and the byte-identical greedy
completions yield the **same reward**; the rollouts just arrive faster.

**We post-trained Laguna XS.2 for real, not just evaluated it.** We ran a **free** hosted GRPO run
(`prime train`, 20 steps, batch 64 × 8 rollouts, lr 1e-6) on `art87able/spec-rl`, with online evaluation on a
**disjoint held-out split** (HumanEval 50–74, via `eval_base_model=true`). The held-out dense reward rose from
**0.90 (untrained base) → 0.96 (post-trained)**: every one of the four post-training checkpoints (0.92–0.96)
beat the base, which is the minimum of all five eval points (`results/rl_after.json`,
`results/rl_train_curve.json`). We report the magnitude honestly: this is a **modest** gain. The split is
near-saturated (~0.90 base leaves little headroom) and greedy MoE eval is not bit-reproducible run-to-run
(`results/determinism_check.json`: identical reruns gave 0.85 / 1.0 / 1.0), so +0.06 sits inside the
eval-noise band even though the trend is consistently positive. The point is not a large jump; it is that
**the environment trains the model, not just scores it**, and the reward moved on data the policy never
trained on. Run cost: **$0** (hosted Laguna training is free). The lossless 2.76× decode remains the
headline; this run is the capstone, because the cheaper-RL claim is now demonstrated by an actual RL run,
not only derived from the decode A/B.

**The honest open problem:** in RL the policy moves every batch (e.g. ART's LoRA), so a drafter
trained on the *base* model drifts → acceptance τ decays → the speedup erodes across training. Within
a batch the policy is frozen, so the per-batch win is real; the frontier is keeping the drafter useful
as the policy moves (periodic drafter distillation, hidden-state-conditioned drafters, or measuring
and amortizing the re-sync cost). This is the "novel research idea" axis, stated plainly.

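To make the "acceptance decays, so the speedup erodes" step concrete, here is a toy model, not a measurement. It assumes each drafted token is accepted independently with a fixed probability `alpha`, a standard simplification that real drafters do not satisfy exactly; the `alpha` values in the loop are hypothetical. Under that assumption, the expected number of tokens committed per target pass with draft length γ is the geometric sum 1 + α + … + α^γ.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens committed per target forward pass.

    Toy assumption: every drafted token is accepted independently with
    probability `alpha`. The accepted prefix is followed by exactly one
    target-chosen token (a correction or the bonus token), which gives the
    geometric sum (1 - alpha**(gamma + 1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted: the gamma+1 ceiling
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Hypothetical acceptance rates, standing in for a drafter drifting from the policy.
for alpha in (0.95, 0.85, 0.70, 0.50):
    print(alpha, round(expected_tokens_per_round(alpha, gamma=7), 2))
```

In this model the ceiling at γ = 7 is 8 tokens per pass, and a drafter that never matches still commits one token per pass, which is why staleness erodes the speedup rather than breaking correctness.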

---

## Method

- **Target model:** `poolside/Laguna-XS.2`, a 33.4B-total / 3B-active MoE: single GPU, FP8 native,
  128K (→256K) context, Apache 2.0, built for agentic coding.
- **Draft model:** `poolside/Laguna-XS.2-speculator.dflash`, a 0.6B-parameter draft model
  (block-diffusion-style speculative-decoding method).
- **How it works:** DFlash proposes **γ = 7** candidate tokens per round; Laguna XS.2 verifies all
  7 in a **single forward pass** and commits the longest matching prefix plus one free bonus token.
  Same output, fewer expensive target passes.
- **Why lossless:** under greedy decoding the target only commits tokens equal to its own argmax,
  so the output is token-identical to the baseline. Under sampling, vLLM's rejection sampling
  preserves the target's output distribution. **This is a decode-time property, independent of training.**
- **Regime:** the win lands at **low batch / memory-bound decode**, the single-GPU, single-agent
  case. It shrinks (and can invert) at high batch / compute-bound. See the honesty note below.

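The greedy commit rule described in the list above can be sketched with toy functions. This is an illustration under stated assumptions, not vLLM's or DFlash's implementation: `toy_target` and `toy_drafter` are hypothetical stand-ins for the two models, the drafter here proposes tokens one at a time (DFlash drafts differently), and the list comprehension that builds `target_next` stands in for what a real server obtains from one target forward pass.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def commit_greedy(draft: Sequence[int], target_next: Sequence[int]) -> List[int]:
    """Longest draft prefix matching the target argmax, plus one target token."""
    if len(target_next) != len(draft) + 1:
        raise ValueError("need one target prediction per draft prefix")
    committed: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target_next[i]:
            break
        committed.append(tok)
    # Correction token on a mismatch, or the free bonus token if all matched.
    committed.append(target_next[len(committed)])
    return committed


def speculative_greedy(prompt: List[int], target: NextToken, drafter: NextToken,
                       gamma: int, max_new: int) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        draft: List[int] = []
        for _ in range(gamma):
            draft.append(drafter(out + draft))
        # Stand-in for one target pass: argmax after every draft prefix.
        target_next = [target(out + draft[:i]) for i in range(gamma + 1)]
        out.extend(commit_greedy(draft, target_next))
    return out[: len(prompt) + max_new]


def plain_greedy(prompt: List[int], target: NextToken, max_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def toy_target(ctx: Sequence[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_drafter(ctx: Sequence[int]) -> int:
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50  # deliberately wrong sometimes


spec = speculative_greedy([1, 2, 3], toy_target, toy_drafter, gamma=7, max_new=40)
base = plain_greedy([1, 2, 3], toy_target, max_new=40)
print(spec == base)
```

Every committed token equals the target's own argmax at that position, so the drafter's quality affects only how many tokens each round yields, never which tokens are emitted.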
### The exact vLLM flag

Baseline and DFlash differ by **one flag only**; that is the whole experiment:

```bash
--speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
```

Requires **vLLM ≥ 0.21.0** and `VLLM_USE_DEEP_GEMM=0`.
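
When scripting several serves (for example, one per draft length), it can help to generate the flag rather than hand-edit the JSON. The sketch below uses only the Python standard library; the helper name `speculative_flag` is hypothetical and is not taken from this repository's scripts.

```python
import json
import shlex


def speculative_flag(gamma: int = 7) -> str:
    """Build the --speculative-config argument for a given draft length."""
    config = {
        "model": "poolside/Laguna-XS.2-speculator.dflash",
        "num_speculative_tokens": gamma,
        "method": "dflash",
    }
    # Compact separators reproduce the one-line JSON shown above;
    # shlex.quote makes it safe to paste into a shell command.
    payload = json.dumps(config, separators=(",", ":"))
    return "--speculative-config " + shlex.quote(payload)


print(speculative_flag())
```

With the default `gamma=7` this prints the same flag as the block above, so the baseline/DFlash difference stays a single argument.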

---

## Results

Same prompts, same `max_tokens`, **temperature 0 (greedy)**, same single GPU,
`--tensor-parallel-size 1`. Only `--speculative-config` differs between the two servers.

Measured on an **H200**, vLLM 0.22.0, `--enforce-eager`, `--max-model-len 4096`, greedy, over three runs: a
**14-prompt mixed-difficulty** code set (trivial `fib`/`is_prime` → hard `lcs`/`dijkstra`/`LRUCache`),
a corroborating **20-prompt trivial** set, and an **independent re-run** of the mixed set on a fresh
H200. The lossless speedup reproduced (2.63×; run-to-run range ~2.6–2.8×, byte-identical every time).

| Metric | Baseline | + DFlash | Δ |
|---|---|---|---|
| tokens/sec, mixed-difficulty (N=14) | 19.6 | 54.2 | **2.76×** ✓ |
| tokens/sec, trivial (N=20) | 19.5 | 48.1 | **2.47×** ✓ |
| tokens/sec, mixed re-run (N=14, fresh GPU) | 19.8 | 52.1 | **2.63×** ✓ |
| greedy parity | n/a | **identical** | **0 mismatches every run** (0/14, 0/20, 0/14) ✓ |
| HumanEval pass@1 | not run† | not run† | n/a |

- **tokens/sec is the headline win**, directly measured wall-clock. The speedup *holds and is larger*
  on the harder, more diverse set (**2.76×**) than on the trivial one (2.47×), and output is
  byte-identical in **both**.
- **No acceptance-length (τ) claim, on purpose.** vLLM's `/metrics` τ pinned at *exactly* the γ+1
  ceiling (8.0) on **both** runs, and per-prompt deltas didn't resolve a distribution. That is almost
  certainly a metrics artifact, not true 100% acceptance. So we report only the directly-measured
  speedup + parity and treat τ as unreliable. *The metric we can't trust, we don't quote.*
- **parity** = baseline vs DFlash greedy outputs are token-identical; this is the lossless proof.
- **† No TTFT or HumanEval pass@1 row.** This minimal A/B measured throughput + byte-parity only; the
  harness did not isolate true time-to-first-token, and a full HumanEval pass@1 sweep is a documented
  next step. Byte-identical greedy output ⇒ identical pass@1 *by construction*, so parity is the
  stronger guarantee here.

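The Δ column can be recomputed from the table's tokens/sec figures. A small sketch follows; note that the table reports tokens/sec rounded to one decimal, so a ratio recomputed from those rounded values may differ from the quoted speedup in the last digit. The 0.01 tolerance is an illustrative choice, not part of the original harness.

```python
# (baseline tok/s, DFlash tok/s, quoted speedup) copied from the results table.
runs = {
    "mixed (N=14)": (19.6, 54.2, 2.76),
    "trivial (N=20)": (19.5, 48.1, 2.47),
    "mixed re-run (N=14)": (19.8, 52.1, 2.63),
}

recomputed = {name: dflash / base for name, (base, dflash, _) in runs.items()}

for name, (_, _, quoted) in runs.items():
    delta = abs(recomputed[name] - quoted)
    print(f"{name}: recomputed {recomputed[name]:.3f}, quoted {quoted:.2f}, |diff| {delta:.3f}")
```

All three recomputed ratios agree with the quoted speedups to within 0.01, which is the most the one-decimal rounding of the inputs allows one to check.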
### γ-sweep (throughput-optimal, lossless)

Sweeping `num_speculative_tokens` (γ, the draft length) on fresh DFlash serves, baseline measured
once (γ-independent: **19.95 tok/s**), decode tok/s over the 14-prompt mixed set, greedy, byte-parity
vs baseline checked at *every* γ (`results/gamma_sweep.json`):

| γ | tokens/sec | speedup | lossless |
|---|---|---|---|
| 3 | 44.72 | 2.24× | ✓ 0/14 |
| 5 | 52.59 | 2.64× | ✓ 0/14 |
| 7 (card default) | 51.74 | 2.59× | ✓ 0/14 |
| **9 (γ\*)** | **52.96** | **2.65×** | ✓ 0/14 |
| 11 | 48.40 | 2.43× | ✓ 0/14 |

The curve **rises then falls**: it climbs from γ=3, plateaus across γ=5–9, **peaks at γ\*=9 (2.65×)**,
then **regresses at γ=11 (2.43×)**. This is the classic acceptance/overhead tradeoff: past the point where
extra drafted tokens are still accepted, more draft slots only raise verify cost and waste compute on
rejects faster than they add accepted tokens. The card default **γ=7 sits within ~2.4% of the
optimum**, and (the load-bearing point) **every γ is byte-lossless** (0/14 mismatches): the
throughput-optimal γ is also exactly lossless. This is a third, independent corroboration of the
headline (the 2.76× / 2.63× decode A/Bs being the first two).

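The sweep table's derived columns follow directly from its tokens/sec values and the 19.95 tok/s baseline. The sketch below recomputes the speedups, picks the throughput-optimal γ, and measures how far the card default sits from it; the variable names are illustrative.

```python
BASELINE_TOK_S = 19.95  # measured once, independent of gamma

# gamma -> decode tokens/sec, copied from the sweep table
sweep = {3: 44.72, 5: 52.59, 7: 51.74, 9: 52.96, 11: 48.40}

speedup = {g: tok_s / BASELINE_TOK_S for g, tok_s in sweep.items()}
best_gamma = max(sweep, key=sweep.get)

# Shortfall of the card default (gamma=7) relative to the best throughput.
default_gap = 1.0 - sweep[7] / sweep[best_gamma]

for g in sorted(sweep):
    print(f"gamma={g:2d}  {sweep[g]:6.2f} tok/s  {speedup[g]:.2f}x")
print(f"best gamma: {best_gamma}, default gap: {default_gap:.1%}")
```

Measured against the optimum the gap is about 2.3%, and measured against the γ=7 throughput it is about 2.4%, so the "~2.4%" statement holds under either denominator.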
### Reward-invariance (by construction)

The `spec_rl` dense unit-test reward (`fraction_passing`) scores a **mean 0.85** over a 12-problem
HumanEval slice via the canonical eval path against hosted Laguna, and the **self-served vLLM
baseline reproduces that 0.85 exactly** (`results/reward_invariance.json`), a clean corroboration.
Reward-invariance under DFlash holds **by construction**: lossless greedy decode (proven byte-identical
in the decode A/B at every γ) ⇒ identical rollout text ⇒ identical reward, just generated faster. We
**do not claim DFlash improves reward.** A live reward probe of the γ=7 DFlash run returned a higher
number than baseline with a few completions differing, but that is run-to-run greedy MoE
nondeterminism across two separate serves on longer generations, *not* a DFlash quality change, so
we decline to over-interpret it, the same discipline we apply to acceptance length τ. The
by-construction guarantee, anchored on the measured byte-parity, is the claim that matters.

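The "identical text ⇒ identical reward" chain can be expressed as two small checks. This is a sketch under assumptions: `toy_reward` is a hypothetical stand-in for the environment's `fraction_passing` reward, the completions are made-up strings, and the argument only holds for a reward that is a deterministic function of the completion text.

```python
from typing import Callable, List, Sequence


def parity_mismatches(baseline: Sequence[str], candidate: Sequence[str]) -> List[int]:
    """Indices of prompts whose two completions differ at the byte level."""
    if len(baseline) != len(candidate):
        raise ValueError("both runs must cover the same prompts")
    return [
        i for i, (a, b) in enumerate(zip(baseline, candidate))
        if a.encode("utf-8") != b.encode("utf-8")
    ]


def rewards_match(baseline: Sequence[str], candidate: Sequence[str],
                  reward: Callable[[str], float]) -> bool:
    """True when a deterministic reward gives the same score on both runs."""
    return all(reward(a) == reward(b) for a, b in zip(baseline, candidate))


def toy_reward(completion: str) -> float:
    # Hypothetical stand-in for a unit-test reward: 1.0 if the text returns a value.
    return 1.0 if "return" in completion else 0.0


baseline_out = ["def f(x):\n    return x + 1", "def g():\n    pass"]
dflash_out = list(baseline_out)  # lossless decode: same text, produced faster

mismatches = parity_mismatches(baseline_out, dflash_out)
print(len(mismatches), rewards_match(baseline_out, dflash_out, toy_reward))
```

Zero mismatches is the measured property; equal rewards then follow without scoring the DFlash run at all, which is why a live probe across two separate serves adds noise rather than evidence.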

---

## How to reproduce

**The exact run that produced the numbers above** is one self-contained command on Hugging Face Jobs
(no ssh; serves baseline → measures → re-serves with DFlash → measures → byte-parity), funded by the
HF Jobs credit pool:

```bash
hf jobs uv run --flavor h200 --timeout 1500 --detach --secrets HF_TOKEN scripts/hf_job_ab.py
# then: hf jobs logs <id> → the [job] RESULT / BASELINE_JSON / DFLASH_JSON / PARITY_JSON lines
```

`scripts/hf_job_ab.py` pins the working vLLM env (Triton MoE + Torch sampler + FlashAttention, so no
CUDA toolkit is needed in the slim image; see `THE_JOURNEY.md` for *why*). Below is the equivalent
local two-server flow for any CUDA box with the released weights (vLLM ≥ 0.21.0):

```bash
# 1. Baseline server (speed floor)
python scripts/serve_vllm.py --mode baseline --run # serves on :8000
# 2. Benchmark baseline (separate shell)
python bench/measure.py --base-url http://localhost:8000 --model laguna \
--label baseline --n 20 --out results/baseline.json
# 3. DFlash server: same command + the one --speculative-config flag
python scripts/serve_vllm.py --mode dflash --run
python bench/measure.py --base-url http://localhost:8000 --model laguna \
--label dflash --n 20 --out results/dflash.json
# 4. Quality + lossless parity
python evals/humaneval_subset.py --base-url http://localhost:8000 --model laguna \
--n 25 --out results/humaneval_dflash.json
# --parity queries two servers at once (--base-url and --base-url-b),
# so both the baseline and the DFlash server must be up on separate ports
python evals/humaneval_subset.py --parity \
--base-url http://localhost:8000 --base-url-b http://localhost:8001 --model laguna --n 25
```

The results table above is the diff of `results/baseline.json` and `results/dflash.json` plus the
parity result. τ can be read from vLLM's `/metrics`, but as noted under Results that reading pinned at
the γ+1 ceiling, so we do not quote it.

---

## Honesty note: the low-batch regime

This is deliberately a **single-GPU, low-concurrency** result: one box, one agent, maximum
tokens/sec.

Speculative decoding helps **most at low batch size / memory-bound decode**, where each step
reloads the active weights to emit a single token and doing useful work for several tokens per
pass is a large win. It helps **less at high batch size / compute-bound decode**: once the GPU is
saturated, the matmuls dominate and the extra verify work for rejected drafts can slightly hurt.
At very high concurrency you would tune γ down or turn speculation off.

The reported speedup numbers are for the low-batch single-GPU regime on coding-style prompts; no τ
or acceptance figures are quoted (see Results). The lossless claim (greedy parity) holds regardless
of regime: it is a correctness property of the verification step, not a function of batch size.

---

## License

Apache 2.0, inheriting `poolside/Laguna-XS.2`.