license: apache-2.0
base_model: poolside/Laguna-XS.2
pipeline_tag: text-generation
tags:
  - laguna
  - laguna-xs.2
  - poolside
  - moe
  - speculative-decoding
  - dflash
  - inference
  - vllm
  - lossless

Lean Laguna — Laguna XS.2 + DFlash, lossless single-GPU speedup

Project: Lean Laguna — making Laguna XS.2 cheaper to run and to post-train on a single GPU.

One-line claim: Laguna XS.2 generates 2.76× faster on a single GPU (19.6 → 54.2 tokens/sec) than the no-speculator baseline, with byte-identical greedy output (0 / 14 mismatches) on a mixed-difficulty code set. A trivial set corroborates the result at 2.47×, also with zero mismatches.

Speculative decoding with Poolside's DFlash speculator on Laguna XS.2, served in vLLM on one GPU. The throughput win is measured; the output is provably lossless under greedy decoding (token-for-token identical to baseline) and distribution-preserving under sampling.

Unlike lossy compression — expert pruning or low-bit quantization, which trade output fidelity for a smaller footprint — this approach changes nothing about what the model emits: it cuts the number of expensive forward passes, not the model itself. Lossless speed, not a smaller-but-different model.

Submission for the Poolside Research Hackathon — Foundations track (poolside-laguna-hackathon HF org).

Goal & judging criteria

Meaningfully improve Laguna XS.2 by either:
expanding model use cases (computer use, multi-agent coordination, evaluation design), or
reducing cost & latency (optimizations, speed, quantization).
For: an economically valuable task (a function/application), or any novel research idea.
Scored on: GENERALISABILITY · REPRODUCIBILITY · TECHNICAL CONTRIBUTIONS.

Lean Laguna targets the reduce-cost-and-latency goal with a novel research idea (lossless speculative decoding → cheaper RL rollouts), and is built to score on all three axes:

Generalisability — any target + drafter via one --speculative-config; the spec_rl env + configs/endpoints.toml point any RL run at any OpenAI-compatible endpoint; the reward is a swappable seam (a reusable RL environment + reward signal — a listed submission idea).
Reproducibility — greedy byte-parity + directly-measured throughput behind make targets and a one-command HF-Jobs run (below); anyone re-runs the before/after table. (τ from /metrics read at the γ+1 ceiling on both runs → we treat it as unreliable and don't quote it. HumanEval pass@1 sweep = a documented next step; greedy parity is the stronger guarantee.)
Technical contributions — a measured, provably-lossless throughput win (2.76× on a mixed-difficulty code set, 0 mismatches; 2.47× corroborated on a trivial set) on the released Laguna XS.2 + DFlash, carried into cheaper RL rollouts; the open problem of speculative decoding under a moving RL policy (drafter staleness) and NVFP4 attention-weight calibration as the posed research stretches.

Cheaper RL rollouts — the generalisability + frontier story

The speedup is a decode-time property, so it carries into any RL trainer whose rollout phase is OpenAI-compatible vLLM inference — e.g. verifiers envs (our spec_rl, or third-party Hub envs like pandelis/zerolang-editing — install + repoint endpoints.toml, zero code change) and OpenPipe ART (GRPO + LoRA, rollouts served via vLLM). Drop --speculative-config into the rollout server → cheaper rollouts.

As a cost number (derived from the measured A/B, not from a separate RL run): rollout generation is decode-bound, so the measured 2.76× decode throughput means each rollout takes ≈1/2.76 of the GPU-seconds. At any fixed GPU $/hr that is a ~64% lower cost per rollout (1 − 1/2.76 ≈ 0.64). Because the completions are byte-identical under greedy decoding, the reward signal is unchanged by construction, so the cheaper rollouts cost zero quality. An independent re-run on a fresh GPU reproduced the effect (2.63×, 0/14 mismatches), so the lossless speedup, and therefore the rollout-cost cut, is reproducible rather than a one-off.
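The cost arithmetic in the paragraph above can be checked in a few lines. This is a sketch under the same assumption the text makes (rollout cost is proportional to decode-bound GPU-seconds); the function name is illustrative and not part of the repo.

```python
def rollout_cost_reduction(speedup: float) -> float:
    """Fraction of per-rollout cost saved when decode throughput rises by `speedup`.

    Assumes cost is proportional to GPU-seconds and that rollout time is
    entirely decode-bound (the assumption stated in the text).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    # Time per rollout scales as 1/speedup, so the saved fraction is 1 - 1/speedup.
    return 1.0 - 1.0 / speedup


# Speedups quoted in this README's results table.
for label, speedup in [("mixed", 2.76), ("trivial", 2.47), ("mixed re-run", 2.63)]:
    saved = rollout_cost_reduction(speedup)
    print(f"{label}: {speedup:.2f}x faster -> {saved:.1%} lower cost per rollout")
```

The GPU price cancels out of the ratio, which is why the saving holds at any fixed $/hr.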

The environment is executable, not a stub. spec_rl is a verifiers v1 taskset+harness and runs end-to-end via prime eval run: a 12-problem HumanEval slice on Laguna XS.2 (greedy, thinking off) scores a mean dense reward of 0.85 (results/spec_rl_eval.json; per-rollout rewards include a fractional 0.2, showing the dense unit-test signal). Point the same env at the DFlash endpoint via configs/endpoints.toml and the byte-identical greedy completions yield the same reward — the rollouts just arrive faster.

The honest open problem: in RL the policy moves every batch (e.g. ART's LoRA), so a drafter trained on the base model drifts → acceptance τ decays → the speedup erodes across training. Within a batch the policy is frozen, so the per-batch win is real; the frontier is keeping the drafter useful as the policy moves (periodic drafter distillation, hidden-state-conditioned drafters, or measuring and amortizing the re-sync cost). This is the "novel research idea" axis, stated plainly.
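To make the staleness effect concrete, here is a toy model, not a measurement. It uses the standard speculative-decoding estimate of tokens committed per target pass, (1 − α^(γ+1)) / (1 − α), which assumes every drafted token is accepted independently with the same probability α. The α values below are hypothetical; this README reports no acceptance numbers.

```python
def expected_tokens_per_pass(alpha: float, gamma: int = 7) -> float:
    """Expected tokens committed per target forward pass.

    alpha: per-token acceptance probability (ASSUMED i.i.d., a simplification).
    gamma: number of drafted tokens per round (7 in this README).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        # Every draft accepted: gamma drafted tokens plus the bonus token.
        return float(gamma + 1)
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Hypothetical decay of acceptance as the RL policy drifts from the base model.
for alpha in (0.9, 0.8, 0.6, 0.4):
    print(f"alpha={alpha:.1f}: {expected_tokens_per_pass(alpha):.2f} tokens per target pass")
```

Tokens per pass is an upper bound on the wall-clock gain, since drafting and verification have their own cost. The point of the sketch is only the shape of the curve: the benefit falls quickly once acceptance drops.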

Method

Target model: poolside/Laguna-XS.2 — 33.4B-total / 3B-active MoE, single GPU, FP8 native, 128K (→256K) context, Apache 2.0, built for agentic coding.
Draft model: poolside/Laguna-XS.2-speculator.dflash — a 0.6B-parameter draft model (block-diffusion-style speculative-decoding method).
How it works: DFlash proposes γ = 7 candidate tokens per round; Laguna XS.2 verifies all 7 in a single forward pass and commits the longest matching prefix plus one free bonus token. Same output, fewer expensive target passes.
Why lossless: under greedy decoding the target only commits tokens equal to its own argmax, so the output is token-identical to the baseline. Under sampling, vLLM's rejection sampling preserves the target's output distribution. Decode-time property — independent of training.
Regime: the win lands at low batch / memory-bound decode — the single-GPU, single-agent case. It shrinks (and can invert) at high batch / compute-bound. See the honesty note below.
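The verify-and-commit rule described above can be sketched with toy models. This is an illustration of greedy speculative verification in general, not vLLM's or DFlash's implementation: the target and drafter here are hypothetical integer functions, and the toy drafter proposes tokens one by one whereas DFlash is block-diffusion-style. Only the commit rule is the point.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_baseline(target_next: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain greedy decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


def greedy_speculative(target_next: NextToken, draft_next: NextToken,
                       prompt: List[int], max_new: int, gamma: int = 7) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verify rounds)."""
    out = list(prompt)
    limit = len(prompt) + max_new
    rounds = 0
    while len(out) < limit:
        # Drafter proposes gamma tokens from the committed prefix.
        ctx = list(out)
        proposal = []
        for _ in range(gamma):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # Target verifies. A real engine scores all positions in one forward
        # pass; the toy calls target_next per position but counts one round.
        rounds += 1
        ctx = list(out)
        for tok in proposal:
            want = target_next(ctx)
            ctx.append(want)  # only the target's own argmax is ever committed
            if want != tok:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            ctx.append(target_next(ctx))  # all drafts accepted: bonus token
        out = ctx
    return out[:limit], rounds


def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except at every fourth position.
    guess = toy_target(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


baseline = greedy_baseline(toy_target, [1, 2, 3], max_new=40)
spec, rounds = greedy_speculative(toy_target, toy_draft, [1, 2, 3], max_new=40)
print("identical:", spec == baseline, "| verify rounds:", rounds, "vs 40 baseline passes")
```

Because every committed token is the target's own argmax, the output matches the baseline regardless of how good the drafter is; drafter quality only changes how many rounds are needed.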

The exact vLLM flag

Baseline and DFlash differ by one flag only — that is the whole experiment:

--speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'

Requires vLLM ≥ 0.21.0 and VLLM_USE_DEEP_GEMM=0.
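One way to keep the two server commands honest is to build both from the same argument list, so the only difference is the speculative flag. The sketch below assumes the server is launched with the `vllm serve` CLI; the repo's scripts/serve_vllm.py may assemble its command differently. The remaining flags are the ones this README lists under Results.

```python
import json
import shlex
from typing import List

# Speculator settings exactly as given in this README.
SPEC_CONFIG = {
    "model": "poolside/Laguna-XS.2-speculator.dflash",
    "num_speculative_tokens": 7,
    "method": "dflash",
}


def serve_command(use_dflash: bool) -> List[str]:
    """Build the server argv; baseline and DFlash share every other flag."""
    cmd = [
        "vllm", "serve", "poolside/Laguna-XS.2",  # ASSUMED launcher: the vllm CLI
        "--tensor-parallel-size", "1",
        "--enforce-eager",
        "--max-model-len", "4096",
    ]
    if use_dflash:
        # json.dumps avoids hand-written quoting mistakes in the JSON value.
        cmd += ["--speculative-config", json.dumps(SPEC_CONFIG, separators=(",", ":"))]
    return cmd


# Print shell-ready commands, with the required environment variable in front.
for mode in (False, True):
    print("VLLM_USE_DEEP_GEMM=0", shlex.join(serve_command(use_dflash=mode)))
```

Passing the list to a process launcher instead of a shell string sidesteps quoting entirely; `shlex.join` is used here only to show what the equivalent shell line looks like.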

Results

Same prompts, same max_tokens, temperature 0 (greedy), same single GPU, --tensor-parallel-size 1. Only --speculative-config differs between the two servers.

Measured on an H200, vLLM 0.22.0, --enforce-eager, --max-model-len 4096, greedy. Three runs: a 14-prompt mixed-difficulty code set (trivial fib/is_prime → hard lcs/dijkstra/LRUCache), a corroborating 20-prompt trivial set, and an independent re-run of the mixed set on a fresh H200. The lossless speedup reproduced on the re-run (2.63×; run-to-run variance ~2.6–2.8×, byte-identical every time).

| Metric | Baseline | + DFlash | Δ |
|---|---|---|---|
| tokens/sec, mixed-difficulty (N=14) | 19.6 | 54.2 | 2.76× ↑ |
| tokens/sec, trivial (N=20) | 19.5 | 48.1 | 2.47× ↑ |
| tokens/sec, mixed re-run (N=14, fresh GPU) | 19.8 | 52.1 | 2.63× ↑ |
| greedy parity | n/a | identical | 0 mismatches every run (0/14, 0/20, 0/14) ✓ |
| HumanEval pass@1 | not run† | not run† | n/a |

tokens/sec is the headline win — directly measured wall-clock. The speedup holds and is larger on the harder, more diverse set (2.76×) than on the trivial one (2.47×), and output is byte-identical in both.
No acceptance-length (τ) claim — on purpose. vLLM's /metrics τ pinned at exactly the γ+1 ceiling (8.0) on both runs, and per-prompt deltas didn't resolve a distribution — almost certainly a metrics artifact, not true 100% acceptance. So we report only the directly-measured speedup + parity and treat τ as unreliable. The metric we can't trust, we don't quote.
parity = baseline vs DFlash greedy outputs are token-identical — the lossless proof.
†No TTFT or HumanEval pass@1 row. This minimal A/B measured throughput and byte-parity only; the harness did not isolate true time-to-first-token, and a full HumanEval pass@1 sweep is a documented next step. Byte-identical greedy output ⇒ identical pass@1 by construction, so parity is the stronger guarantee here.
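The parity row reduces to a simple comparison. The sketch below shows one way to count mismatches between two lists of completions; the function name and sample strings are hypothetical, and the repo's own check (evals/humaneval_subset.py --parity) may be implemented differently.

```python
from typing import List, Sequence


def parity_mismatches(baseline: Sequence[str], candidate: Sequence[str]) -> List[int]:
    """Return the indices of prompts whose completions differ byte-for-byte."""
    if len(baseline) != len(candidate):
        raise ValueError("both runs must cover the same prompts")
    return [
        i
        for i, (a, b) in enumerate(zip(baseline, candidate))
        # Compare encoded bytes so that whitespace differences also count.
        if a.encode("utf-8") != b.encode("utf-8")
    ]


# Hypothetical completions standing in for the two servers' greedy outputs.
baseline_out = ["def fib(n):\n    return n", "def is_prime(n):\n    return n > 1"]
dflash_out = list(baseline_out)

bad = parity_mismatches(baseline_out, dflash_out)
print(f"{len(bad)} / {len(baseline_out)} mismatches")
```

Returning the indices rather than a bare count makes it easy to inspect any prompt that does diverge.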

How to reproduce

The exact run that produced the numbers above — one self-contained command on Hugging Face Jobs (no ssh; serves baseline → measures → re-serves with DFlash → measures → byte-parity), funded by the HF Jobs credit pool:

hf jobs uv run --flavor h200 --timeout 1500 --detach --secrets HF_TOKEN scripts/hf_job_ab.py
# then: hf jobs logs <id>  →  the [job] RESULT / BASELINE_JSON / DFLASH_JSON / PARITY_JSON lines

scripts/hf_job_ab.py pins the working vLLM env (Triton MoE + Torch sampler + FlashAttention, so no CUDA toolkit is needed in the slim image — see THE_JOURNEY.md for why). Below is the equivalent local two-server flow for any CUDA box with the released weights (vLLM ≥ 0.21.0):

# 1. Baseline server (speed floor)
python scripts/serve_vllm.py --mode baseline --run        # serves on :8000

# 2. Benchmark baseline (separate shell)
python bench/measure.py --base-url http://localhost:8000 --model laguna \
    --label baseline --n 20 --out results/baseline.json

# 3. DFlash server: same command + the one --speculative-config flag
#    (if it binds the same :8000 port, stop the baseline server first)
python scripts/serve_vllm.py --mode dflash --run
python bench/measure.py --base-url http://localhost:8000 --model laguna \
    --label dflash --n 20 --out results/dflash.json

# 4. Quality + lossless parity
python evals/humaneval_subset.py --base-url http://localhost:8000 --model laguna \
    --n 25 --out results/humaneval_dflash.json
# --parity compares two live endpoints, so both servers must be up at once,
# one on :8000 and one on :8001
python evals/humaneval_subset.py --parity \
    --base-url http://localhost:8000 --base-url-b http://localhost:8001 --model laguna --n 25

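The results table above is the diff of results/baseline.json and results/dflash.json plus the parity result. τ is also read from vLLM's /metrics but is not reported, because it read at the γ+1 ceiling on both runs (see the τ note under Results).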

Honesty note — the low-batch regime

This is deliberately a single-GPU, low-concurrency result: one box, one agent, maximum tokens/sec.

Speculative decoding helps most at low batch size / memory-bound decode, where each step reloads the active weights to emit a single token and doing useful work for several tokens per pass is a large win. It helps less at high batch size / compute-bound decode — once the GPU is saturated, the matmuls dominate and the extra verify work for rejected drafts can slightly hurt. At very high concurrency you would tune γ down or turn speculation off.

The reported speedup is for the low-batch single-GPU regime on coding-style prompts; τ and acceptance numbers are deliberately not reported. The lossless claim (greedy parity) holds regardless of regime: it is a correctness property of the verification step, not a function of batch size.

License

Apache 2.0, inheriting poolside/Laguna-XS.2.