Upload README.md with huggingface_hub

c0da39b verified 42 minutes ago

13.7 kB

	---
	license: apache-2.0
	base_model: poolside/Laguna-XS.2
	pipeline_tag: text-generation
	tags:
	- laguna
	- laguna-xs.2
	- poolside
	- moe
	- speculative-decoding
	- dflash
	- inference
	- vllm
	- lossless
	---

	# Lean Laguna — Laguna XS.2 + DFlash, lossless single-GPU speedup

	Project: Lean Laguna* — making Laguna XS.2 cheaper to run and to post-train on a single GPU.*

	> One-line claim: Laguna XS.2 generates 2.76× faster on a single GPU — **19.6 → 54.2
	> tokens/sec — with byte-identical greedy output** (0 / 14 mismatches) on a mixed-difficulty code
	> set (2.47× corroborated on a trivial set; lossless in both) vs the no-speculator baseline.

	Speculative decoding with Poolside's DFlash speculator on Laguna XS.2, served in vLLM on
	one GPU. The throughput win is measured; the output is provably lossless under greedy decoding
	(token-for-token identical to baseline) and distribution-preserving under sampling.

	Unlike lossy compression — expert pruning or low-bit quantization, which trade output fidelity for a
	smaller footprint — this approach changes nothing about what the model emits: it cuts the number
	of expensive forward passes, not the model itself. Lossless speed, not a smaller-but-different model.

	Submission for the Poolside Research Hackathon — Foundations track
	(`poolside-laguna-hackathon` HF org).

	## Goal & judging criteria

	> Meaningfully improve Laguna XS.2, either by: expanding model use cases (computer use,
	> multi-agent coordination, evaluation design); or reducing cost & latency (optimizations,
	> speed, quantization). For: an economically valuable task (a function/application); or
	> any novel research idea.
	> Scored on: GENERALISABILITY · REPRODUCIBILITY · TECHNICAL CONTRIBUTIONS.

	Lean Laguna sits on reduce cost & latency for a novel research idea (lossless
	speculative decoding → cheaper RL rollouts), and is built to score all three axes:

	- Generalisability — any target + drafter via one `--speculative-config`; the `spec_rl` env +
	`configs/endpoints.toml` point any RL run at any OpenAI-compatible endpoint; the reward is a
	swappable seam (a reusable RL environment + reward signal — a listed submission idea).
	- Reproducibility — greedy byte-parity + directly-measured throughput behind `make` targets and a
	one-command HF-Jobs run (below); anyone re-runs the before/after table. (τ from `/metrics` read at
	the γ+1 ceiling on both runs → we treat it as unreliable and don't quote it. HumanEval pass@1
	sweep = a documented next step; greedy parity is the stronger guarantee.)
	- Technical contributions — a measured, provably-lossless throughput win (2.76× on a
	mixed-difficulty code set, 0 mismatches; 2.47× corroborated on a trivial set) on the released
	Laguna XS.2 + DFlash, carried into cheaper RL rollouts; the open problem of **speculative
	decoding under a moving RL policy** (drafter staleness) and NVFP4 attention-weight calibration as
	the posed research stretches.

	### Cheaper RL rollouts — the generalisability + frontier story

	The speedup is a decode-time property, so it carries into any RL trainer whose rollout phase is
	OpenAI-compatible vLLM inference — e.g. `verifiers` envs (our `spec_rl`, or third-party Hub envs
	like [`pandelis/zerolang-editing`](https://app.primeintellect.ai/dashboard/environments/pandelis/zerolang-editing)
	— install + repoint `endpoints.toml`, zero code change) and [OpenPipe ART](https://github.com/openpipe/art)
	(GRPO + LoRA, rollouts served via vLLM). Drop `--speculative-config` into the rollout server →
	cheaper rollouts.

	As a cost number (derived from the measured A/B — not a separate RL run): rollout generation is
	decode-bound, so the measured 2.76× decode throughput is ≈2.76× fewer GPU-seconds per rollout —
	at any fixed GPU $/hr that is a ~64% lower cost per rollout. Because the completions are
	byte-identical under greedy decoding, the reward signal is unchanged by construction, so the cheaper
	rollouts cost zero quality. An independent re-run on a fresh GPU reproduced the effect (**2.63×, 0/14
	mismatches**), so the lossless speedup — and therefore the rollout-cost cut — is reproducible, not a
	one-off.

	The environment is executable, not a stub. `spec_rl` is a `verifiers` v1 taskset+harness and runs
	end-to-end via `prime eval run`: a 12-problem HumanEval slice on Laguna XS.2 (greedy, thinking off) scores a
	mean dense reward of 0.85 (`results/spec_rl_eval.json`; per-rollout rewards include a fractional 0.2,
	showing the dense unit-test signal). Point the same env at the DFlash endpoint via `configs/endpoints.toml`
	and the byte-identical greedy completions yield the same reward — the rollouts just arrive faster.

	The honest open problem: in RL the policy moves every batch (e.g. ART's LoRA), so a drafter
	trained on the base model drifts → acceptance τ decays → the speedup erodes across training. Within
	a batch the policy is frozen, so the per-batch win is real; the frontier is keeping the drafter useful
	as the policy moves (periodic drafter distillation, hidden-state-conditioned drafters, or measuring
	and amortizing the re-sync cost). This is the "novel research idea" axis, stated plainly.

	---

	## Method

	- Target model: `poolside/Laguna-XS.2` — 33.4B-total / 3B-active MoE, single GPU, FP8 native,
	128K (→256K) context, Apache 2.0, built for agentic coding.
	- Draft model: `poolside/Laguna-XS.2-speculator.dflash` — a 0.6B-parameter draft model
	(block-diffusion-style speculative-decoding method).
	- How it works: DFlash proposes γ = 7 candidate tokens per round; Laguna XS.2 verifies all
	7 in a single forward pass and commits the longest matching prefix plus one free bonus token.
	Same output, fewer expensive target passes.
	- Why lossless: under greedy decoding the target only commits tokens equal to its own argmax,
	so the output is token-identical to the baseline. Under sampling, vLLM's rejection sampling
	preserves the target's output distribution. Decode-time property — independent of training.
	- Regime: the win lands at low batch / memory-bound decode — the single-GPU, single-agent
	case. It shrinks (and can invert) at high batch / compute-bound. See the honesty note below.

	### The exact vLLM flag

	Baseline and DFlash differ by one flag only — that is the whole experiment:

	```bash
	--speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
	```

	Requires vLLM ≥ 0.21.0 and `VLLM_USE_DEEP_GEMM=0`.

	---

	## Results

	Same prompts, same `max_tokens`, temperature 0 (greedy), same single GPU,
	`--tensor-parallel-size 1`. Only `--speculative-config` differs between the two servers.

	Measured on an H200, vLLM 0.22.0, `--enforce-eager`, `--max-model-len 4096`, greedy. A
	14-prompt mixed-difficulty code set (trivial `fib`/`is_prime` → hard `lcs`/`dijkstra`/`LRUCache`),
	a corroborating 20-prompt trivial set, and an independent re-run of the mixed set on a fresh
	H200 — the lossless speedup reproduced (2.63×; run-to-run variance ~2.6–2.8×, byte-identical every time).

	\| Metric \| Baseline \| + DFlash \| Δ \|
	\|---\|---\|---\|---\|
	\| tokens/sec — mixed-difficulty (N=14) \| 19.6 \| 54.2 \| 2.76× ↑ \|
	\| tokens/sec — trivial (N=20) \| 19.5 \| 48.1 \| 2.47× ↑ \|
	\| tokens/sec — mixed re-run (N=14, fresh GPU) \| 19.8 \| 52.1 \| 2.63× ↑ \|
	\| greedy parity \| — \| identical \| 0 mismatches every run (0/14, 0/20, 0/14) ✓ \|
	\| HumanEval pass@1 \| not run† \| not run† \| — \|

	- tokens/sec is the headline win — directly measured wall-clock. The speedup holds and is larger
	on the harder, more diverse set (2.76×) than on the trivial one (2.47×), and output is
	byte-identical in both.
	- No acceptance-length (τ) claim — on purpose. vLLM's `/metrics` τ pinned at exactly the γ+1
	ceiling (8.0) on both runs, and per-prompt deltas didn't resolve a distribution — almost
	certainly a metrics artifact, not true 100% acceptance. So we report only the directly-measured
	speedup + parity and treat τ as unreliable. The metric we can't trust, we don't quote.
	- parity = baseline vs DFlash greedy outputs are token-identical — the lossless proof.
	- †No TTFT or HumanEval-pass@1 row. This MIN A/B measured throughput + byte-parity only; the
	harness did not isolate true time-to-first-token, and a full HumanEval pass@1 sweep is a documented
	next step. Byte-identical greedy output ⇒ identical pass@1 by construction, so parity is the
	stronger guarantee here.

	### γ-sweep (throughput-optimal, lossless)

	Sweeping `num_speculative_tokens` (γ — the draft length) on fresh DFlash serves, baseline measured
	once (γ-independent: 19.95 tok/s), decode tok/s over the 14-prompt mixed set, greedy, byte-parity
	vs baseline checked at every γ (`results/gamma_sweep.json`):

	\| γ \| tokens/sec \| speedup \| lossless \|
	\|---\|---\|---\|---\|
	\| 3 \| 44.72 \| 2.24× \| ✓ 0/14 \|
	\| 5 \| 52.59 \| 2.64× \| ✓ 0/14 \|
	\| 7 (card default) \| 51.74 \| 2.59× \| ✓ 0/14 \|
	\| *9 (γ\) \| 52.96 \| 2.65×** \| ✓ 0/14 \|
	\| 11 \| 48.40 \| 2.43× \| ✓ 0/14 \|

	The curve rises then falls: it climbs from γ=3, plateaus across γ=5–9, *peaks at γ\=9 (2.65×)**,
	then regresses at γ=11 (2.43×) — the classic acceptance/overhead tradeoff (past the point where
	extra drafted tokens are still accepted, more draft slots only raise verify cost and waste compute on
	rejects faster than they add accepted tokens). The card default **γ=7 sits within ~2.4% of the
	optimum, and — the load-bearing point — every γ is byte-lossless** (0/14 mismatches): the
	throughput-optimal γ is also exactly lossless. This is a third, independent corroboration of the
	headline (the 2.76× / 2.63× decode A/Bs being the first two).

	### Reward-invariance (by construction)

	The `spec_rl` dense unit-test reward (`fraction_passing`) scores a mean 0.85 over a 12-problem
	HumanEval slice via the canonical eval path against hosted Laguna — and the **self-served vLLM
	baseline reproduces that 0.85 exactly** (`results/reward_invariance.json`), a clean corroboration.

	Reward-invariance under DFlash holds by construction: lossless greedy decode (proven byte-identical
	in the decode A/B at every γ) ⇒ identical rollout text ⇒ identical reward, just generated faster. We
	do not claim DFlash improves reward. A live reward probe of the γ=7 DFlash run returned a higher
	number than baseline with a few completions differing, but that is run-to-run greedy MoE
	nondeterminism across two separate serves on longer generations — not a DFlash quality change — so
	we decline to over-interpret it, the same discipline we apply to acceptance length τ. The
	by-construction guarantee, anchored on the measured byte-parity, is the claim that matters.

	---

	## How to reproduce

	The exact run that produced the numbers above — one self-contained command on Hugging Face Jobs
	(no ssh; serves baseline → measures → re-serves with DFlash → measures → byte-parity), funded by the
	HF Jobs credit pool:

	```bash
	hf jobs uv run --flavor h200 --timeout 1500 --detach --secrets HF_TOKEN scripts/hf_job_ab.py
	# then: hf jobs logs <id> → the [job] RESULT / BASELINE_JSON / DFLASH_JSON / PARITY_JSON lines
	```

	`scripts/hf_job_ab.py` pins the working vLLM env (Triton MoE + Torch sampler + FlashAttention, so no
	CUDA toolkit is needed in the slim image — see `THE_JOURNEY.md` for why). Below is the equivalent
	local two-server flow for any CUDA box with the released weights (vLLM ≥ 0.21.0):

	```bash
	# 1. Baseline server (speed floor)
	python scripts/serve_vllm.py --mode baseline --run # serves on :8000

	# 2. Benchmark baseline (separate shell)
	python bench/measure.py --base-url http://localhost:8000 --model laguna \
	--label baseline --n 20 --out results/baseline.json

	# 3. DFlash server — same command + the one --speculative-config flag
	python scripts/serve_vllm.py --mode dflash --run
	python bench/measure.py --base-url http://localhost:8000 --model laguna \
	--label dflash --n 20 --out results/dflash.json

	# 4. Quality + lossless parity
	python evals/humaneval_subset.py --base-url http://localhost:8000 --model laguna \
	--n 25 --out results/humaneval_dflash.json
	python evals/humaneval_subset.py --parity \
	--base-url http://localhost:8000 --base-url-b http://localhost:8001 --model laguna --n 25
	```

	The results table above is the diff of `results/baseline.json` and `results/dflash.json` plus the
	parity result. τ is read from vLLM's `/metrics`.

	---

	## Honesty note — the low-batch regime

	This is deliberately a single-GPU, low-concurrency result: one box, one agent, maximum
	tokens/sec.

	Speculative decoding helps most at low batch size / memory-bound decode, where each step
	reloads the active weights to emit a single token and doing useful work for several tokens per
	pass is a large win. It helps less at high batch size / compute-bound decode — once the GPU is
	saturated, the matmuls dominate and the extra verify work for rejected drafts can slightly hurt.
	At very high concurrency you would tune γ down or turn speculation off.

	The reported speedup, τ, and acceptance numbers are for the low-batch single-GPU regime on
	coding-style prompts. The lossless claim (greedy parity) holds regardless of regime — it is a
	correctness property of the verification step, not a function of batch size.

	---

	## License

	Apache 2.0, inheriting `poolside/Laguna-XS.2`.