---
license: apache-2.0
base_model: poolside/Laguna-XS.2
pipeline_tag: text-generation
tags:
- laguna
- laguna-xs.2
- poolside
- moe
- speculative-decoding
- dflash
- inference
- vllm
- lossless
---

# Lean Laguna: Laguna XS.2 + DFlash, lossless single-GPU speedup

*Project: **Lean Laguna**, making Laguna XS.2 cheaper to run and to post-train on a single GPU.*


> **One-line claim:** Laguna XS.2 generates **2.76× faster on a single GPU** (**19.6 → 54.2
> tokens/sec**) with **byte-identical greedy output** (0 / 14 mismatches) on a mixed-difficulty code
> set (2.47× corroborated on a trivial set; **lossless in both**) vs the no-speculator baseline.


Speculative decoding with Poolside's **DFlash** speculator on **Laguna XS.2**, served in vLLM on
one GPU. The throughput win is measured; the output is provably **lossless under greedy decoding**
(token-for-token identical to baseline) and distribution-preserving under sampling.


Unlike lossy compression (expert pruning or low-bit quantization, which trade output fidelity for a
smaller footprint), this approach changes **nothing** about what the model emits: it cuts the *number*
of expensive forward passes, not the model itself. Lossless speed, not a smaller-but-different model.


Submission for the Poolside Research Hackathon, Foundations track
(`poolside-laguna-hackathon` HF org).


## Goal & judging criteria

> **Meaningfully improve Laguna XS.2, either by:** expanding model use cases (computer use,
> multi-agent coordination, evaluation design); *or* **reducing cost & latency** (optimizations,
> speed, quantization). **For:** an economically valuable task (a function/application); *or*
> **any novel research idea.**
> **Scored on: GENERALISABILITY · REPRODUCIBILITY · TECHNICAL CONTRIBUTIONS.**


Lean Laguna sits on **reduce cost & latency** for **a novel research idea** (lossless
speculative decoding → cheaper RL rollouts), and is built to score on all three axes:


- **Generalisability**: any target + drafter via one `--speculative-config`; the `spec_rl` env +
  `configs/endpoints.toml` point any RL run at any OpenAI-compatible endpoint; the reward is a
  swappable seam (a *reusable RL environment + reward signal*, which is a listed submission idea).
- **Reproducibility**: greedy byte-parity + directly measured throughput behind `make` targets and a
  one-command HF-Jobs run (below); anyone can re-run the before/after table. (τ from `/metrics` read at
  the γ+1 ceiling on both runs; we treat it as unreliable and **don't quote it**. A HumanEval pass@1
  sweep is a documented next step; greedy parity is the stronger guarantee.)
- **Technical contributions**: a measured, provably lossless throughput win (**2.76×** on a
  mixed-difficulty code set, 0 mismatches; 2.47× corroborated on a trivial set) on the *released*
  Laguna XS.2 + DFlash, carried into **cheaper RL rollouts**; the open problem of **speculative
  decoding under a moving RL policy** (drafter staleness) and NVFP4 attention-weight calibration are
  the posed research stretches.


### Cheaper RL rollouts: the generalisability + frontier story

The speedup is a *decode-time* property, so it carries into any RL trainer whose rollout phase is
OpenAI-compatible vLLM inference, e.g. **`verifiers`** envs (our `spec_rl`, or third-party Hub envs
like [`pandelis/zerolang-editing`](https://app.primeintellect.ai/dashboard/environments/pandelis/zerolang-editing):
install + repoint `endpoints.toml`, zero code change) and **[OpenPipe ART](https://github.com/openpipe/art)**
(GRPO + LoRA, rollouts served via vLLM). Drop `--speculative-config` into the rollout server →
cheaper rollouts.


**As a cost number (derived from the measured A/B, not a separate RL run):** rollout generation is
decode-bound, so the measured **2.76× decode throughput** means each rollout needs ≈1/2.76 of the
GPU-seconds; at any fixed GPU $/hr that is a **~64% lower cost per rollout** (1 − 1/2.76 ≈ 0.64).
Because the completions are byte-identical under greedy decoding, the reward signal is unchanged
*by construction*, so the cheaper rollouts cost zero quality. An independent re-run on a fresh GPU
reproduced the effect (**2.63×, 0/14 mismatches**), so the lossless speedup, and therefore the
rollout-cost cut, is reproducible, not a one-off.

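The cost figure is plain arithmetic on the measured speedups, and can be re-derived with a few lines. This is a minimal sketch: the helper name `cost_reduction` is ours for illustration, and it assumes, as the paragraph above does, that rollout cost scales with decode GPU-seconds at a fixed hourly price.

```python
def cost_reduction(speedup: float) -> float:
    """Fractional drop in GPU-seconds per rollout for a given decode speedup.

    Assumes rollout cost is proportional to decode time at a fixed GPU $/hr.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Speedups taken from the measured A/B runs reported in this card.
for label, speedup in [("mixed-difficulty", 2.76), ("fresh-GPU re-run", 2.63)]:
    saving = cost_reduction(speedup)
    print(f"{label}: {speedup:.2f}x faster -> {saving:.1%} lower cost per rollout")
```

The same function applies to any other measured speedup; it says nothing about phases of an RL step that are not decode-bound (for example the gradient update).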

**The environment is executable, not a stub.** `spec_rl` is a `verifiers` **v1 taskset+harness** and runs
end-to-end via `prime eval run`: a 12-problem HumanEval slice on Laguna XS.2 (greedy, thinking off) scores a
**mean dense reward of 0.85** (`results/spec_rl_eval.json`; per-rollout rewards include a fractional 0.2,
showing the dense unit-test signal). Point the same env at the DFlash endpoint via `configs/endpoints.toml`
and the byte-identical greedy completions yield the **same reward**; the rollouts just arrive faster.


**The honest open problem:** in RL the policy moves every batch (e.g. ART's LoRA), so a drafter
trained on the *base* model drifts → acceptance τ decays → the speedup erodes across training. Within
a batch the policy is frozen, so the per-batch win is real; the frontier is keeping the drafter useful
as the policy moves (periodic drafter distillation, hidden-state-conditioned drafters, or measuring
and amortizing the re-sync cost). This is the "novel research idea" axis, stated plainly.


---

## Method


- **Target model:** `poolside/Laguna-XS.2`, a 33.4B-total / 3B-active MoE, single GPU, FP8 native,
  128K (→256K) context, Apache 2.0, built for agentic coding.
- **Draft model:** `poolside/Laguna-XS.2-speculator.dflash`, a 0.6B-parameter draft model
  (block-diffusion-style speculative-decoding method).
- **How it works:** DFlash proposes **γ = 7** candidate tokens per round; Laguna XS.2 verifies all
  7 in a **single forward pass** and commits the longest matching prefix plus one free bonus token.
  Same output, fewer expensive target passes.
- **Why lossless:** under greedy decoding the target only commits tokens equal to its own argmax,
  so the output is token-identical to the baseline. Under sampling, vLLM's rejection sampling
  preserves the target's output distribution. **Decode-time property, independent of training.**
- **Regime:** the win lands at **low batch / memory-bound decode**, the single-GPU, single-agent
  case. It shrinks (and can invert) at high batch / compute-bound. See the honesty note below.

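The "how it works" and "why lossless" bullets can be illustrated with a toy greedy verifier. This is a sketch under stated assumptions, not vLLM's or DFlash's implementation: `toy_target` and `toy_drafter` are hypothetical stand-ins for the two models, the drafter here proposes tokens one at a time (DFlash drafts differently), and the single verify pass is simulated by repeated calls to the target. What it does show faithfully is the commit rule: every committed token is the target's own argmax, so the output cannot differ from plain greedy decoding.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a context to its greedy (argmax) next token


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out[len(prompt):]


def speculative_greedy_decode(
    target: NextToken, drafter: NextToken, prompt: List[int], max_new: int, gamma: int = 7
) -> Tuple[List[int], int]:
    """Draft gamma tokens, verify them, commit matching prefix + one target token."""
    out = list(prompt)
    verify_passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. Drafter proposes gamma candidate tokens.
        draft: List[int] = []
        for _ in range(gamma):
            draft.append(drafter(out + draft))
        # 2. One verify pass (simulated): target argmax at each of gamma + 1 positions.
        verify_passes += 1
        committed: List[int] = []
        for i in range(gamma + 1):
            expected = target(out + committed)
            committed.append(expected)  # always the target's own token
            if i == gamma or draft[i] != expected:
                break  # bonus token reached, or first mismatch corrected
        out.extend(committed)
    return out[len(prompt):len(prompt) + max_new], verify_passes


def toy_target(ctx: List[int]) -> int:
    return (31 * sum(ctx) + len(ctx)) % 50


def toy_drafter(ctx: List[int]) -> int:
    # Agrees with the target except at every fifth position (hypothetical error pattern).
    guess = toy_target(ctx)
    return guess if len(ctx) % 5 else (guess + 1) % 50


prompt = [1, 2, 3]
baseline = greedy_decode(toy_target, prompt, max_new=40)
speculative, passes = speculative_greedy_decode(toy_target, toy_drafter, prompt, max_new=40)
print("identical output:", speculative == baseline)
print("target passes:", passes, "vs baseline", len(baseline))
```

Even with an imperfect drafter the two outputs match token for token; only the number of target passes changes.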

### The exact vLLM flag

Baseline and DFlash differ by **one flag only**; that is the whole experiment:


```bash
--speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
```


Requires **vLLM ≥ 0.21.0** and `VLLM_USE_DEEP_GEMM=0`.

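To make the "one flag only" claim concrete, the two server command lines can be generated from a single function so that they cannot drift apart. This is a sketch: it assumes the servers are started with the stock `vllm serve` entry point, whereas the repo's own `scripts/serve_vllm.py` wrapper may pass additional options that are not shown here.

```python
import json
import shlex
from typing import List

TARGET = "poolside/Laguna-XS.2"
SPEC_CONFIG = {
    "model": "poolside/Laguna-XS.2-speculator.dflash",
    "num_speculative_tokens": 7,
    "method": "dflash",
}


def serve_argv(speculative: bool) -> List[str]:
    """Build the server argv; the DFlash variant only appends --speculative-config."""
    argv = ["vllm", "serve", TARGET, "--tensor-parallel-size", "1"]
    if speculative:
        argv += ["--speculative-config", json.dumps(SPEC_CONFIG, separators=(",", ":"))]
    return argv


for mode in (False, True):
    # VLLM_USE_DEEP_GEMM=0 is the environment setting named in the requirement above.
    print("VLLM_USE_DEEP_GEMM=0 " + shlex.join(serve_argv(mode)))
```

Generating both commands from one place keeps the A/B honest: any other option added later applies to baseline and DFlash alike.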

---

## Results


Same prompts, same `max_tokens`, **temperature 0 (greedy)**, same single GPU,
`--tensor-parallel-size 1`. Only `--speculative-config` differs between the two servers.


Measured on an **H200**, vLLM 0.22.0, `--enforce-eager`, `--max-model-len 4096`, greedy. A
**14-prompt mixed-difficulty** code set (trivial `fib`/`is_prime` → hard `lcs`/`dijkstra`/`LRUCache`),
a corroborating **20-prompt trivial** set, and an **independent re-run** of the mixed set on a fresh
H200: the lossless speedup reproduced (2.63×; run-to-run variance ~2.6-2.8×, byte-identical every time).


| Metric | Baseline | + DFlash | Δ |
|---|---|---|---|
| tokens/sec, mixed-difficulty (N=14) | 19.6 | 54.2 | **2.76×** ✓ |
| tokens/sec, trivial (N=20) | 19.5 | 48.1 | **2.47×** ✓ |
| tokens/sec, mixed re-run (N=14, fresh GPU) | 19.8 | 52.1 | **2.63×** ✓ |
| greedy parity | n/a | **identical** | **0 mismatches every run** (0/14, 0/20, 0/14) ✓ |
| HumanEval pass@1 | not run† | not run† | n/a |


- **tokens/sec is the headline win**: directly measured wall-clock. The speedup *holds and is larger*
  on the harder, more diverse set (**2.76×**) than on the trivial one (2.47×), and output is
  byte-identical in **both**.
- **No acceptance-length (τ) claim, on purpose.** vLLM's `/metrics` τ pinned at *exactly* the γ+1
  ceiling (8.0) on **both** runs, and per-prompt deltas didn't resolve a distribution; this is almost
  certainly a metrics artifact, not true 100% acceptance. So we report only the directly measured
  speedup + parity and treat τ as unreliable. *The metric we can't trust, we don't quote.*
- **parity** = baseline vs DFlash greedy outputs are token-identical; this is the lossless proof.
- **† No TTFT row, and HumanEval pass@1 was not run.** This minimal A/B measured throughput +
  byte-parity only; the harness did not isolate true time-to-first-token, and a full HumanEval pass@1
  sweep is a documented next step. Byte-identical greedy output ⇒ identical pass@1 *by construction*,
  so parity is the stronger guarantee here.

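The two quantities in the table, throughput ratio and byte-parity, reduce to a few lines of logic. The sketch below is illustrative only: `parity_report` and `speedup` are hypothetical helper names, and the real `bench/measure.py` and parity check may store and compare results differently. It compares completions at the byte level, which is the strictest reading of "byte-identical".

```python
from typing import Dict, List


def parity_report(baseline: List[str], candidate: List[str]) -> Dict[str, object]:
    """Count prompts whose completions differ at the byte level."""
    if len(baseline) != len(candidate):
        raise ValueError("both runs must cover the same prompts")
    mismatched = [
        i
        for i, (a, b) in enumerate(zip(baseline, candidate))
        if a.encode("utf-8") != b.encode("utf-8")
    ]
    return {"n": len(baseline), "mismatches": len(mismatched), "indices": mismatched}


def speedup(baseline_tps: float, candidate_tps: float) -> float:
    """Throughput ratio; > 1 means the candidate server is faster."""
    return candidate_tps / baseline_tps


# Throughputs from the mixed-difficulty row of the table above.
print(f"speedup: {speedup(19.6, 54.2):.2f}x")

# Toy completions (not real model output) showing a clean and a failing comparison.
print(parity_report(["def f(): return 1", "x = 2"], ["def f(): return 1", "x = 2"]))
print(parity_report(["def f(): return 1", "x = 2"], ["def f(): return 1", "x = 2 "]))
```

A single trailing space counts as a mismatch here, which is intentional: the lossless claim is about exact output, not semantically equivalent output.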

### γ-sweep (throughput-optimal, lossless)

Sweeping `num_speculative_tokens` (γ, the draft length) on fresh DFlash serves, with the baseline
measured once (γ-independent: **19.95 tok/s**), decode tok/s over the 14-prompt mixed set, greedy,
byte-parity vs baseline checked at *every* γ (`results/gamma_sweep.json`):


| γ | tokens/sec | speedup | lossless |
|---|---|---|---|
| 3 | 44.72 | 2.24× | ✓ 0/14 |
| 5 | 52.59 | 2.64× | ✓ 0/14 |
| 7 (card default) | 51.74 | 2.59× | ✓ 0/14 |
| **9 (γ\*)** | **52.96** | **2.65×** | ✓ 0/14 |
| 11 | 48.40 | 2.43× | ✓ 0/14 |


The curve **rises then falls**: it climbs from γ=3, plateaus across γ=5-9, **peaks at γ\*=9 (2.65×)**,
then **regresses at γ=11 (2.43×)**. This is the classic acceptance/overhead tradeoff (past the point
where extra drafted tokens are still accepted, more draft slots only raise verify cost and waste
compute on rejects faster than they add accepted tokens). The card default **γ=7 sits within ~2.4% of
the optimum**, and (the load-bearing point) **every γ is byte-lossless** (0/14 mismatches): the
throughput-optimal γ is also exactly lossless. This is a third, independent corroboration of the
headline (the 2.76× / 2.63× decode A/Bs being the first two).

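The speedup column and the choice of γ\* follow directly from the measured throughputs. The snippet below recomputes them from the numbers in the table; the variable names are ours, and the layout of `results/gamma_sweep.json` is not assumed, so the values are inlined.

```python
BASELINE_TPS = 19.95  # baseline decode tok/s, measured once

# gamma -> decode tok/s on the 14-prompt mixed set, copied from the sweep table
SWEEP = {3: 44.72, 5: 52.59, 7: 51.74, 9: 52.96, 11: 48.40}
CARD_DEFAULT = 7

speedups = {gamma: tps / BASELINE_TPS for gamma, tps in SWEEP.items()}
best_gamma = max(SWEEP, key=SWEEP.get)
# How much faster the optimum is than the card default, as a fraction of the default.
gain_over_default = SWEEP[best_gamma] / SWEEP[CARD_DEFAULT] - 1.0

for gamma in sorted(SWEEP):
    marker = " <- optimum" if gamma == best_gamma else ""
    print(f"gamma={gamma:>2}: {SWEEP[gamma]:.2f} tok/s, {speedups[gamma]:.2f}x{marker}")
print(f"optimum beats gamma={CARD_DEFAULT} by {gain_over_default:.1%}")
```

With differences this small between γ=5, 7 and 9, and run-to-run variance of the size reported above, the plateau is better read as "any γ from 5 to 9" than as a sharp optimum at 9.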

### Reward-invariance (by construction)

The `spec_rl` dense unit-test reward (`fraction_passing`) scores a **mean 0.85** over a 12-problem
HumanEval slice via the canonical eval path against hosted Laguna, and the **self-served vLLM
baseline reproduces that 0.85 exactly** (`results/reward_invariance.json`), a clean corroboration.


Reward-invariance under DFlash holds **by construction**: lossless greedy decode (proven byte-identical
in the decode A/B at every γ) ⇒ identical rollout text ⇒ identical reward, just generated faster. We
**do not claim DFlash improves reward.** A live reward probe of the γ=7 DFlash run returned a higher
number than baseline with a few completions differing, but that is run-to-run greedy MoE
nondeterminism across two separate serves on longer generations, *not* a DFlash quality change, so
we decline to over-interpret it, the same discipline we apply to acceptance length τ. The
by-construction guarantee, anchored on the measured byte-parity, is the claim that matters.

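The by-construction argument rests on the reward being a deterministic function of the completion text alone. The sketch below illustrates that with a toy dense reward. It borrows only the name `fraction_passing` from the text; the body is a hypothetical stand-in, not the `spec_rl` implementation, and it runs code with `exec` without any sandboxing, which a real environment should not do.

```python
from typing import List


def fraction_passing(completion: str, tests: List[str]) -> float:
    """Toy dense reward: share of assertion snippets the completion passes."""
    namespace: dict = {}
    try:
        exec(completion, namespace)  # toy only: no sandbox, no timeout
    except Exception:
        return 0.0
    passed = 0
    for test in tests:
        try:
            exec(test, namespace)
            passed += 1
        except Exception:
            pass  # a failed or crashing test earns no credit
    return passed / len(tests) if tests else 0.0


TESTS = [
    "assert add(1, 0) == 1",
    "assert add(1, 2) == 3",
    "assert add(0, 1) == 1",
    "assert add(2, 2) == 4",
    "assert add(3, 4) == 7",
]

buggy = "def add(a, b):\n    return a\n"      # passes only the first test
correct = "def add(a, b):\n    return a + b\n"

print(fraction_passing(buggy, TESTS))    # partial credit
print(fraction_passing(correct, TESTS))  # full credit

# Same text in, same reward out: which server produced the text is irrelevant.
baseline_text, dflash_text = correct, correct
print(fraction_passing(baseline_text, TESTS) == fraction_passing(dflash_text, TESTS))
```

The buggy completion shows how a fractional per-rollout reward arises; the last line is the invariance itself, which holds for any reward that only looks at the text.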

---

## How to reproduce

**The exact run that produced the numbers above**: one self-contained command on Hugging Face Jobs
(no ssh; serves baseline → measures → re-serves with DFlash → measures → byte-parity), funded by the
HF Jobs credit pool:


```bash
hf jobs uv run --flavor h200 --timeout 1500 --detach --secrets HF_TOKEN scripts/hf_job_ab.py
# then: hf jobs logs <id>  → the [job] RESULT / BASELINE_JSON / DFLASH_JSON / PARITY_JSON lines
```


`scripts/hf_job_ab.py` pins the working vLLM env (Triton MoE + Torch sampler + FlashAttention, so no
CUDA toolkit is needed in the slim image; see `THE_JOURNEY.md` for *why*). Below is the equivalent
local two-server flow for any CUDA box with the released weights (vLLM ≥ 0.21.0):


```bash
# 1. Baseline server (speed floor)
python scripts/serve_vllm.py --mode baseline --run        # serves on :8000

# 2. Benchmark baseline (separate shell)
python bench/measure.py --base-url http://localhost:8000 --model laguna \
    --label baseline --n 20 --out results/baseline.json

# 3. DFlash server: same command + the one --speculative-config flag
#    (step 3 benchmarks :8000 again, so stop the baseline server first)
python scripts/serve_vllm.py --mode dflash --run
python bench/measure.py --base-url http://localhost:8000 --model laguna \
    --label dflash --n 20 --out results/dflash.json

# 4. Quality + lossless parity
#    (--parity compares two live endpoints: --base-url vs --base-url-b)
python evals/humaneval_subset.py --base-url http://localhost:8000 --model laguna \
    --n 25 --out results/humaneval_dflash.json
python evals/humaneval_subset.py --parity \
    --base-url http://localhost:8000 --base-url-b http://localhost:8001 --model laguna --n 25
```


The results table above is the diff of `results/baseline.json` and `results/dflash.json` plus the
parity result. τ can be read from vLLM's `/metrics`, but we treat that reading as unreliable and do
not report it (see Results).


---

## Honesty note: the low-batch regime

This is deliberately a **single-GPU, low-concurrency** result: one box, one agent, maximum
tokens/sec.


Speculative decoding helps **most at low batch size / memory-bound decode**, where each step
reloads the active weights to emit a single token and doing useful work for several tokens per
pass is a large win. It helps **less at high batch size / compute-bound decode**: once the GPU is
saturated, the matmuls dominate and the extra verify work for rejected drafts can slightly hurt.
At very high concurrency you would tune γ down or turn speculation off.


The reported speedup is for the low-batch single-GPU regime on coding-style prompts; τ and
acceptance rates are deliberately not reported (see Results). The lossless claim (greedy parity)
holds regardless of regime: it is a correctness property of the verification step, not a function
of batch size.

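The regime dependence can be sketched with the standard back-of-envelope model for speculative decoding. Everything in it is an assumption made for illustration: a fixed per-token acceptance probability `alpha`, a verify pass whose cost grows by `verify_cost_per_token` for each drafted position (near 0 when memory-bound, approaching 1 when compute-bound), and a fixed `draft_cost`, all in units of one baseline decode step. None of these values were measured in this project, and in particular `alpha = 0.8` is not a claim about DFlash.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected committed tokens per verify pass with i.i.d. acceptance rate alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)  # all gamma drafts accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def modeled_speedup(
    alpha: float, gamma: int, verify_cost_per_token: float, draft_cost: float
) -> float:
    """Tokens per pass divided by pass cost, both relative to plain decoding."""
    pass_cost = 1.0 + verify_cost_per_token * gamma + draft_cost
    return expected_tokens_per_pass(alpha, gamma) / pass_cost


ALPHA = 0.8       # illustrative only, not a measured acceptance rate
DRAFT_COST = 0.1  # illustrative only

for regime, k in [("memory-bound", 0.0), ("mixed", 0.25), ("compute-bound", 1.0)]:
    row = ", ".join(
        f"gamma={g}: {modeled_speedup(ALPHA, g, k, DRAFT_COST):.2f}x" for g in (3, 7, 11)
    )
    print(f"{regime:>13} (k={k}): {row}")
```

Under these assumptions the modeled speedup is well above 1 when extra verified positions are nearly free, and drops below 1 once each one costs as much as a full decode step, which is the inversion described above. Correctness is untouched by any of these parameters; only throughput moves.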

---

## License

Apache 2.0, inheriting `poolside/Laguna-XS.2`.