license: apache-2.0
base_model: poolside/Laguna-XS.2
pipeline_tag: text-generation
tags:
- laguna
- laguna-xs.2
- poolside
- moe
- speculative-decoding
- dflash
- inference
- vllm
- lossless
Lean Laguna: Laguna XS.2 + DFlash, lossless single-GPU speedup
Project: Lean Laguna, making Laguna XS.2 cheaper to run and to post-train on a single GPU.
One-line claim: Laguna XS.2 generates 2.76× faster on a single GPU (19.6 → 54.2 tokens/sec) with byte-identical greedy output (0 / 14 mismatches) on a mixed-difficulty code set, versus the no-speculator baseline. A trivial set corroborates this at 2.47×, also lossless.
Speculative decoding with Poolside's DFlash speculator on Laguna XS.2, served in vLLM on one GPU. The throughput win is measured; the output is provably lossless under greedy decoding (token-for-token identical to baseline) and distribution-preserving under sampling.
Unlike lossy compression (expert pruning or low-bit quantization, which trade output fidelity for a smaller footprint), this approach changes nothing about what the model emits: it cuts the number of expensive forward passes, not the model itself. Lossless speed, not a smaller-but-different model.
Submission for the Poolside Research Hackathon, Foundations track
(poolside-laguna-hackathon HF org).
Goal & judging criteria
Meaningfully improve Laguna XS.2, either by: expanding model use cases (computer use, multi-agent coordination, evaluation design); or reducing cost & latency (optimizations, speed, quantization). For: an economically valuable task (a function/application); or any novel research idea. Scored on: GENERALISABILITY · REPRODUCIBILITY · TECHNICAL CONTRIBUTIONS.
Lean Laguna sits on reduce cost & latency for a novel research idea (lossless speculative decoding → cheaper RL rollouts), and is built to score on all three axes:
- Generalisability: any target + drafter via one --speculative-config; the spec_rl env + configs/endpoints.toml point any RL run at any OpenAI-compatible endpoint; the reward is a swappable seam (a reusable RL environment + reward signal, which is a listed submission idea).
- Reproducibility: greedy byte-parity + directly-measured throughput behind make targets and a one-command HF-Jobs run (below); anyone can re-run the before/after table. (τ from /metrics read at the γ+1 ceiling on both runs, so we treat it as unreliable and don't quote it. A HumanEval pass@1 sweep is a documented next step; greedy parity is the stronger guarantee.)
- Technical contributions: a measured, provably-lossless throughput win (2.76× on a mixed-difficulty code set, 0 mismatches; 2.47× corroborated on a trivial set) on the released Laguna XS.2 + DFlash, carried into cheaper RL rollouts; the open problem of speculative decoding under a moving RL policy (drafter staleness) and NVFP4 attention-weight calibration as the posed research stretches.
Cheaper RL rollouts: the generalisability + frontier story
The speedup is a decode-time property, so it carries into any RL trainer whose rollout phase is
OpenAI-compatible vLLM inference, e.g. verifiers envs (our spec_rl, or third-party Hub envs
like pandelis/zerolang-editing: install + repoint endpoints.toml, zero code change) and OpenPipe ART
(GRPO + LoRA, rollouts served via vLLM). Drop --speculative-config into the rollout server →
cheaper rollouts.
As a cost number (derived from the measured A/B, not a separate RL run): rollout generation is decode-bound, so the measured 2.76× decode throughput means each rollout needs ≈1/2.76 ≈ 0.36 of the GPU-seconds. At any fixed GPU $/hr that is a ~64% lower cost per rollout. Because the completions are byte-identical under greedy decoding, the reward signal is unchanged by construction, so the cheaper rollouts cost zero quality. An independent re-run on a fresh GPU reproduced the effect (2.63×, 0/14 mismatches), so the lossless speedup, and therefore the rollout-cost cut, is reproducible rather than a one-off.
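The arithmetic behind the ~64% figure can be sketched as below. This is a minimal illustration, not part of the repo: the function name is hypothetical, and it assumes rollout cost is proportional to decode time at a fixed GPU price.

```python
def rollout_cost_reduction(baseline_tps: float, spec_tps: float) -> float:
    """Fractional cost saving per rollout at a fixed GPU $/hr.

    Assumes generation is decode-bound, so GPU-seconds per token
    scale as 1 / throughput (an assumption, not a measurement).
    """
    if baseline_tps <= 0 or spec_tps <= 0:
        raise ValueError("throughputs must be positive")
    return 1.0 - baseline_tps / spec_tps


# Throughputs from the mixed-difficulty row of the results table.
saving = rollout_cost_reduction(19.6, 54.2)
print(f"{saving:.1%}")  # 63.8%
```

Plugging in the re-run row (19.8 and 52.1 tokens/sec) gives a slightly smaller saving, which is the run-to-run spread showing up in cost terms.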
The environment is executable, not a stub. spec_rl is a verifiers v1 taskset+harness and runs
end-to-end via prime eval run: a 12-problem HumanEval slice on Laguna XS.2 (greedy, thinking off) scores a
mean dense reward of 0.85 (results/spec_rl_eval.json; per-rollout rewards include a fractional 0.2,
showing the dense unit-test signal). Point the same env at the DFlash endpoint via configs/endpoints.toml
and the byte-identical greedy completions yield the same reward; the rollouts just arrive faster.
The honest open problem: in RL the policy moves every batch (e.g. ART's LoRA), so a drafter trained on the base model drifts → acceptance τ decays → the speedup erodes across training. Within a batch the policy is frozen, so the per-batch win is real; the frontier is keeping the drafter useful as the policy moves (periodic drafter distillation, hidden-state-conditioned drafters, or measuring and amortizing the re-sync cost). This is the "novel research idea" axis, stated plainly.
Method
- Target model: poolside/Laguna-XS.2, a 33.4B-total / 3B-active MoE, single GPU, FP8 native, 128K (→256K) context, Apache 2.0, built for agentic coding.
- Draft model: poolside/Laguna-XS.2-speculator.dflash, a 0.6B-parameter draft model (block-diffusion-style speculative-decoding method).
- How it works: DFlash proposes γ = 7 candidate tokens per round; Laguna XS.2 verifies all 7 in a single forward pass and commits the longest matching prefix plus one free bonus token. Same output, fewer expensive target passes.
- Why lossless: under greedy decoding the target only commits tokens equal to its own argmax, so the output is token-identical to the baseline. Under sampling, vLLM's rejection sampling preserves the target's output distribution. This is a decode-time property, independent of training.
- Regime: the win lands at low batch / memory-bound decode, which is the single-GPU, single-agent case. It shrinks (and can invert) at high batch / compute-bound. See the honesty note below.
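The greedy verification rule above can be sketched in a few lines. This is a toy illustration, not vLLM's or DFlash's implementation: `toy_target`, `verify_greedy` and `greedy_baseline` are hypothetical names, and the target is called once per position for readability, whereas a real verifier scores all draft positions in one forward pass.

```python
from typing import Callable, List, Sequence

TargetArgmax = Callable[[Sequence[int]], int]


def verify_greedy(target_argmax: TargetArgmax,
                  context: Sequence[int],
                  draft: Sequence[int]) -> List[int]:
    """Tokens committed in one speculative round under greedy decoding."""
    ctx = list(context)
    committed: List[int] = []
    for tok in draft:
        expected = target_argmax(ctx)
        if tok != expected:
            # First mismatch: commit the target's own token and stop.
            committed.append(expected)
            return committed
        committed.append(tok)
        ctx.append(tok)
    # Every draft token matched: the target's next token is the bonus.
    committed.append(target_argmax(ctx))
    return committed


def greedy_baseline(target_argmax: TargetArgmax,
                    context: Sequence[int], n: int) -> List[int]:
    """Plain greedy decoding of n tokens, one target call per token."""
    ctx = list(context)
    for _ in range(n):
        ctx.append(target_argmax(ctx))
    return ctx[len(context):]


def toy_target(ctx: Sequence[int]) -> int:
    """Stand-in 'model': the next token is the last token plus one, mod 10."""
    return (ctx[-1] + 1) % 10


# The draft goes wrong at its third token; the committed tokens still
# equal what plain greedy decoding would have produced.
print(verify_greedy(toy_target, [0], [1, 2, 9, 4]))  # [1, 2, 3]
print(greedy_baseline(toy_target, [0], 3))           # [1, 2, 3]
```

Every committed token is the target's own argmax at that position, which is why a bad draft can only cost speed and never change the output.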
The exact vLLM flag
Baseline and DFlash differ by one flag only; that is the whole experiment:
--speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
Requires vLLM ≥ 0.21.0 and VLLM_USE_DEEP_GEMM=0.
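If the server is launched from a script, the flag can be built programmatically so the JSON never gets mangled by shell quoting. This is a small sketch using only the standard library; the variable names are illustrative, and the values are the ones given above.

```python
import json
import shlex

# The speculative-decoding settings, exactly as in the flag above.
spec_config = {
    "model": "poolside/Laguna-XS.2-speculator.dflash",
    "num_speculative_tokens": 7,
    "method": "dflash",
}

# Compact JSON, then shell-quote it so it survives as one argument.
flag = "--speculative-config " + shlex.quote(
    json.dumps(spec_config, separators=(",", ":"))
)

# Environment override stated in the requirement line above.
env_overrides = {"VLLM_USE_DEEP_GEMM": "0"}

print(flag)
```

Dropping `flag` from the command line is then the only difference between the DFlash server and the baseline server.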
Results
Same prompts, same max_tokens, temperature 0 (greedy), same single GPU,
--tensor-parallel-size 1. Only --speculative-config differs between the two servers.
Measured on an H200, vLLM 0.22.0, --enforce-eager, --max-model-len 4096, greedy. A
14-prompt mixed-difficulty code set (from trivial fib/is_prime to hard lcs/dijkstra/LRUCache),
a corroborating 20-prompt trivial set, and an independent re-run of the mixed set on a fresh
H200. The lossless speedup reproduced (2.63×; run-to-run spread ~2.6–2.8×, byte-identical every time).
| Metric | Baseline | + DFlash | Δ |
|---|---|---|---|
| tokens/sec, mixed-difficulty (N=14) | 19.6 | 54.2 | 2.76× ✓ |
| tokens/sec, trivial (N=20) | 19.5 | 48.1 | 2.47× ✓ |
| tokens/sec, mixed re-run (N=14, fresh GPU) | 19.8 | 52.1 | 2.63× ✓ |
| greedy parity | n/a | identical | 0 mismatches every run (0/14, 0/20, 0/14) ✓ |
| HumanEval pass@1 | not run† | not run† | n/a |
- tokens/sec is the headline win: directly measured wall-clock. The speedup holds and is larger on the harder, more diverse set (2.76×) than on the trivial one (2.47×), and output is byte-identical in both.
- No acceptance-length (τ) claim, on purpose. vLLM's /metrics τ pinned at exactly the γ+1 ceiling (8.0) on both runs, and per-prompt deltas didn't resolve a distribution. That is almost certainly a metrics artifact, not true 100% acceptance, so we report only the directly-measured speedup + parity and treat τ as unreliable. The metric we can't trust, we don't quote.
- parity = baseline vs DFlash greedy outputs are token-identical. This is the lossless proof.
- † No TTFT or HumanEval-pass@1 row. This MIN A/B measured throughput + byte-parity only; the harness did not isolate true time-to-first-token, and a full HumanEval pass@1 sweep is a documented next step. Byte-identical greedy output implies identical pass@1 by construction, so parity is the stronger guarantee here.
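The two quantities in the table, speedup and byte-parity, reduce to very small computations. The sketch below is illustrative only: the helper names are hypothetical and it does not reproduce the repo's bench/measure.py or its JSON schema. Note that the rounded table inputs give 54.2 / 19.6 ≈ 2.765, so the quoted 2.76× was presumably computed from unrounded throughputs.

```python
from typing import Sequence


def speedup(baseline_tps: float, spec_tps: float) -> float:
    """Throughput ratio of the speculative run over the baseline run."""
    return spec_tps / baseline_tps


def count_mismatches(baseline_outputs: Sequence[str],
                     spec_outputs: Sequence[str]) -> int:
    """Number of prompts whose completions differ at the byte level."""
    if len(baseline_outputs) != len(spec_outputs):
        raise ValueError("both runs must cover the same prompts")
    return sum(
        a.encode("utf-8") != b.encode("utf-8")
        for a, b in zip(baseline_outputs, spec_outputs)
    )


# (baseline, + DFlash) tokens/sec, as rounded in the table above.
runs = {
    "mixed (N=14)": (19.6, 54.2),
    "trivial (N=20)": (19.5, 48.1),
    "mixed re-run (N=14)": (19.8, 52.1),
}
for name, (base, spec) in runs.items():
    print(f"{name}: {speedup(base, spec):.3f}x")

# Parity passes only when the mismatch count is exactly zero.
print(count_mismatches(["def f(): pass"], ["def f(): pass"]))  # 0
```

A parity check of this kind is deliberately strict: a single differing byte in any completion counts as a mismatch.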
How to reproduce
The exact run that produced the numbers above is one self-contained command on Hugging Face Jobs (no ssh; serves baseline → measures → re-serves with DFlash → measures → byte-parity), funded by the HF Jobs credit pool:
hf jobs uv run --flavor h200 --timeout 1500 --detach --secrets HF_TOKEN scripts/hf_job_ab.py
# then: hf jobs logs <id> → the [job] RESULT / BASELINE_JSON / DFLASH_JSON / PARITY_JSON lines
scripts/hf_job_ab.py pins the working vLLM env (Triton MoE + Torch sampler + FlashAttention, so no
CUDA toolkit is needed in the slim image; see THE_JOURNEY.md for why). Below is the equivalent
local two-server flow for any CUDA box with the released weights (vLLM ≥ 0.21.0):
# 1. Baseline server (speed floor)
python scripts/serve_vllm.py --mode baseline --run # serves on :8000
# 2. Benchmark baseline (separate shell)
python bench/measure.py --base-url http://localhost:8000 --model laguna \
--label baseline --n 20 --out results/baseline.json
# 3. DFlash server: same command + the one --speculative-config flag
python scripts/serve_vllm.py --mode dflash --run
python bench/measure.py --base-url http://localhost:8000 --model laguna \
--label dflash --n 20 --out results/dflash.json
# 4. Quality + lossless parity (--parity compares two live endpoints: --base-url and --base-url-b)
python evals/humaneval_subset.py --base-url http://localhost:8000 --model laguna \
--n 25 --out results/humaneval_dflash.json
python evals/humaneval_subset.py --parity \
--base-url http://localhost:8000 --base-url-b http://localhost:8001 --model laguna --n 25
The results table above is the diff of results/baseline.json and results/dflash.json plus the
parity result. τ is read from vLLM's /metrics.
Honesty note: the low-batch regime
This is deliberately a single-GPU, low-concurrency result: one box, one agent, maximum tokens/sec.
Speculative decoding helps most at low batch size / memory-bound decode, where each step reloads the active weights to emit a single token and doing useful work for several tokens per pass is a large win. It helps less at high batch size / compute-bound decode: once the GPU is saturated, the matmuls dominate and the extra verify work for rejected drafts can slightly hurt. At very high concurrency you would tune γ down or turn speculation off.
The reported speedup is for the low-batch single-GPU regime on coding-style prompts; τ and acceptance numbers are not quoted because the /metrics reading was unreliable. The lossless claim (greedy parity) holds regardless of regime: it is a correctness property of the verification step, not a function of batch size.
License
Apache 2.0, inheriting poolside/Laguna-XS.2.