metadata
license: apache-2.0
base_model: poolside/Laguna-XS.2
pipeline_tag: text-generation
tags:
  - laguna
  - laguna-xs.2
  - poolside
  - moe
  - speculative-decoding
  - dflash
  - inference
  - vllm
  - lossless

Lean Laguna: Laguna XS.2 + DFlash, lossless single-GPU speedup

Project: Lean Laguna, making Laguna XS.2 cheaper to run and to post-train on a single GPU.

One-line claim: Laguna XS.2 generates 2.76× faster on a single GPU (19.6 → 54.2 tokens/sec) with byte-identical greedy output (0 / 14 mismatches) on a mixed-difficulty code set, versus the no-speculator baseline. A trivial set corroborates the result at 2.47×, and output is lossless in both.

Speculative decoding with Poolside's DFlash speculator on Laguna XS.2, served in vLLM on one GPU. The throughput win is measured; the output is provably lossless under greedy decoding (token-for-token identical to baseline) and distribution-preserving under sampling.

Unlike lossy compression (expert pruning or low-bit quantization, which trade output fidelity for a smaller footprint), this approach changes nothing about what the model emits: it cuts the number of expensive forward passes, not the model itself. Lossless speed, not a smaller-but-different model.

Submission for the Poolside Research Hackathon, Foundations track (poolside-laguna-hackathon HF org).

Goal & judging criteria

Meaningfully improve Laguna XS.2, either by: expanding model use cases (computer use, multi-agent coordination, evaluation design); or reducing cost & latency (optimizations, speed, quantization). For: an economically valuable task (a function/application); or any novel research idea. Scored on: GENERALISABILITY · REPRODUCIBILITY · TECHNICAL CONTRIBUTIONS.

Lean Laguna targets the "reducing cost & latency" goal through a novel research idea (lossless speculative decoding → cheaper RL rollouts), and is built to score on all three axes:

  • Generalisability: any target + drafter via one --speculative-config; the spec_rl env + configs/endpoints.toml point any RL run at any OpenAI-compatible endpoint; the reward is a swappable seam (a reusable RL environment + reward signal, which is a listed submission idea).
  • Reproducibility: greedy byte-parity + directly measured throughput behind make targets and a one-command HF-Jobs run (below); anyone can re-run the before/after table. (τ from /metrics read at the γ+1 ceiling on both runs, so we treat it as unreliable and don't quote it. A HumanEval pass@1 sweep is a documented next step; greedy parity is the stronger guarantee.)
  • Technical contributions: a measured, provably lossless throughput win (2.76× on a mixed-difficulty code set, 0 mismatches; 2.47× corroborated on a trivial set) on the released Laguna XS.2 + DFlash, carried into cheaper RL rollouts. The posed research stretches are the open problem of speculative decoding under a moving RL policy (drafter staleness) and NVFP4 attention-weight calibration.

Cheaper RL rollouts: the generalisability + frontier story

The speedup is a decode-time property, so it carries into any RL trainer whose rollout phase is OpenAI-compatible vLLM inference, e.g. verifiers envs (our spec_rl, or third-party Hub envs like pandelis/zerolang-editing: install, repoint endpoints.toml, zero code change) and OpenPipe ART (GRPO + LoRA, rollouts served via vLLM). Drop --speculative-config into the rollout server → cheaper rollouts.

As a cost number (derived from the measured A/B, not a separate RL run): rollout generation is decode-bound, so the measured 2.76× decode throughput means ≈2.76× fewer GPU-seconds per rollout. At any fixed GPU $/hr that is a ~64% lower cost per rollout (1 − 1/2.76 ≈ 0.64). Because the completions are byte-identical under greedy decoding, the reward signal is unchanged by construction, so the cheaper rollouts come at no cost in quality. An independent re-run on a fresh GPU reproduced the effect (2.63×, 0/14 mismatches), so the lossless speedup, and therefore the rollout-cost cut, is reproducible, not a one-off.
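
The cost arithmetic above can be checked in a few lines. This is a minimal sketch that assumes, as the paragraph does, that rollout time is entirely decode-bound and that the GPU price per hour is fixed; the function name `rollout_cost_reduction` is illustrative and not part of the repository.

```python
def rollout_cost_reduction(speedup: float) -> float:
    """Fractional saving in cost per rollout for a given decode speedup.

    Assumption: cost scales with GPU-seconds, and GPU-seconds scale with
    1 / throughput (decode-bound rollouts at a fixed $/hr).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Speedups reported in the results table (mixed, trivial, mixed re-run).
for label, s in [("mixed", 2.76), ("trivial", 2.47), ("mixed re-run", 2.63)]:
    print(f"{label}: {rollout_cost_reduction(s):.1%} lower cost per rollout")
```

If part of a rollout is not decode-bound (prompt prefill, reward computation, environment steps), the end-to-end saving would be smaller than this figure.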

The environment is executable, not a stub. spec_rl is a verifiers v1 taskset+harness and runs end-to-end via prime eval run: a 12-problem HumanEval slice on Laguna XS.2 (greedy, thinking off) scores a mean dense reward of 0.85 (results/spec_rl_eval.json; per-rollout rewards include a fractional 0.2, showing the dense unit-test signal). Point the same env at the DFlash endpoint via configs/endpoints.toml and the byte-identical greedy completions yield the same reward; the rollouts just arrive faster.

The honest open problem: in RL the policy moves every batch (e.g. ART's LoRA), so a drafter trained on the base model goes stale → acceptance τ decays → the speedup erodes across training. Within a batch the policy is frozen, so the per-batch win is real; the frontier is keeping the drafter useful as the policy moves (periodic drafter distillation, hidden-state-conditioned drafters, or measuring and amortizing the re-sync cost). This is the "novel research idea" axis, stated plainly.
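
To get a feel for how quickly a stale drafter erodes the win, here is a rough sketch under a simplifying assumption that is not measured in this project: each drafted token is accepted independently with the same probability `alpha`. Under that idealized model, the expected number of tokens committed per verify pass is the geometric sum 1 + alpha + ... + alpha^γ. The function name and the `alpha` values are illustrative only.

```python
def expected_tokens_per_round(alpha: float, gamma: int = 7) -> float:
    """Expected tokens committed per target verify pass.

    Idealized model (an assumption, not a measurement): every drafted token is
    accepted independently with probability alpha. The accepted draft prefix
    plus the one token the target always contributes gives
    sum_{k=0..gamma} alpha**k = (1 - alpha**(gamma + 1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted: the gamma + 1 ceiling
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Hypothetical acceptance rates as the policy drifts away from the drafter.
for alpha in (0.95, 0.9, 0.8, 0.6, 0.4):
    print(f"alpha={alpha:.2f} -> {expected_tokens_per_round(alpha):.2f} tokens/round")
```

Tokens per round is an upper bound on the wall-clock gain, since it ignores the drafter's own cost and the extra verify work. Real acceptance is also position- and context-dependent, so this only illustrates the shape of the decay.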


Method

  • Target model: poolside/Laguna-XS.2, a 33.4B-total / 3B-active MoE; single GPU, FP8 native, 128K (→256K) context, Apache 2.0, built for agentic coding.
  • Draft model: poolside/Laguna-XS.2-speculator.dflash, a 0.6B-parameter draft model (block-diffusion-style speculative-decoding method).
  • How it works: DFlash proposes γ = 7 candidate tokens per round; Laguna XS.2 verifies all 7 in a single forward pass and commits the longest matching prefix plus one further token of its own (its correction at the first mismatch, or a free bonus token if all 7 match). Same output, fewer expensive target passes.
  • Why lossless: under greedy decoding the target only commits tokens equal to its own argmax, so the output is token-identical to the baseline. Under sampling, vLLM's rejection sampling preserves the target's output distribution. This is a decode-time property, independent of training.
  • Regime: the win lands at low batch / memory-bound decode, i.e. the single-GPU, single-agent case. It shrinks (and can invert) at high batch / compute-bound. See the honesty note below.
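
The greedy verification rule in the bullets above can be sketched on a toy model. This is an illustrative sketch, not vLLM's or DFlash's implementation: `toy_target` and `toy_drafter` are made-up stand-ins, and the target is queried once per position for readability, whereas a real target scores all drafted positions in a single forward pass.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    context: Sequence[int],
    draft: Sequence[int],
    target_argmax: Callable[[Sequence[int]], int],
) -> List[int]:
    """Commit the longest draft prefix matching the target's argmax, plus one target token."""
    committed: List[int] = []
    prefix = list(context)
    for tok in draft:
        expected = target_argmax(prefix)
        if tok != expected:
            committed.append(expected)  # target's correction at the first mismatch
            return committed
        committed.append(tok)
        prefix.append(tok)
    committed.append(target_argmax(prefix))  # all drafts matched: bonus token
    return committed


def toy_target(prefix: Sequence[int]) -> int:
    # Hypothetical deterministic "model": next token depends on the last one.
    return (prefix[-1] * 3 + 1) % 11


def toy_drafter(prefix: Sequence[int], gamma: int) -> List[int]:
    # Hypothetical imperfect drafter: deliberately wrong at some positions.
    seq, draft = list(prefix), []
    for _ in range(gamma):
        nxt = toy_target(seq)
        if len(seq) % 5 == 0:
            nxt = (nxt + 1) % 11  # deliberate miss
        draft.append(nxt)
        seq.append(nxt)
    return draft


def greedy_baseline(prompt: Sequence[int], n: int) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n:
        out.append(toy_target(out))
    return out[len(prompt):]


def greedy_speculative(prompt: Sequence[int], n: int, gamma: int = 7) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n:
        out.extend(verify_greedy(out, toy_drafter(out, gamma), toy_target))
    return out[len(prompt):][:n]


# Parity check: speculative output equals plain greedy output on the toy model.
print(greedy_speculative([1], 40) == greedy_baseline([1], 40))
```

Every committed token is, by construction, the target's own argmax for the prefix before it, which is why the drafter's quality affects only speed and never the output.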

The exact vLLM flag

Baseline and DFlash differ by one flag only; that is the whole experiment:

--speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'

Requires vLLM ≥ 0.21.0 and VLLM_USE_DEEP_GEMM=0.
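
For scripting the A/B, the two server commands can be generated from one base command so that they differ by exactly that flag. This is a sketch under assumptions: it calls `vllm serve` directly with the settings quoted in the Results section, whereas the repository wraps serving in scripts/serve_vllm.py, whose exact arguments are not shown here. The helper name `serve_command` is hypothetical.

```python
import json
import shlex
from typing import List

# The speculative config from this README, as a Python dict.
SPEC_CONFIG = {
    "model": "poolside/Laguna-XS.2-speculator.dflash",
    "num_speculative_tokens": 7,
    "method": "dflash",
}


def serve_command(use_dflash: bool) -> List[str]:
    """Build the argv for a baseline or DFlash server (assumed invocation)."""
    cmd = [
        "vllm", "serve", "poolside/Laguna-XS.2",
        "--tensor-parallel-size", "1",
        "--max-model-len", "4096",
        "--enforce-eager",
    ]
    if use_dflash:
        # The only difference between the two servers.
        cmd += ["--speculative-config", json.dumps(SPEC_CONFIG, separators=(",", ":"))]
    return cmd


# Print shell-ready commands; VLLM_USE_DEEP_GEMM=0 is set as an env var prefix.
for use_dflash in (False, True):
    print("VLLM_USE_DEEP_GEMM=0 " + shlex.join(serve_command(use_dflash)))
```

Serialising the config with `json.dumps` and quoting it with `shlex.join` avoids hand-escaping the JSON on the command line.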


Results

Same prompts, same max_tokens, temperature 0 (greedy), same single GPU, --tensor-parallel-size 1. Only --speculative-config differs between the two servers.

Measured on an H200, vLLM 0.22.0, --enforce-eager, --max-model-len 4096, greedy. Three runs: a 14-prompt mixed-difficulty code set (trivial fib/is_prime → hard lcs/dijkstra/LRUCache), a corroborating 20-prompt trivial set, and an independent re-run of the mixed set on a fresh H200. The lossless speedup reproduced (2.63×; run-to-run variance ~2.6–2.8×, byte-identical every time).

| Metric | Baseline | + DFlash | Δ |
|---|---|---|---|
| tokens/sec, mixed-difficulty (N=14) | 19.6 | 54.2 | 2.76× ↑ |
| tokens/sec, trivial (N=20) | 19.5 | 48.1 | 2.47× ↑ |
| tokens/sec, mixed re-run (N=14, fresh GPU) | 19.8 | 52.1 | 2.63× ↑ |
| greedy parity | - | identical | 0 mismatches every run (0/14, 0/20, 0/14) ✓ |
| HumanEval pass@1 | not run† | not run† | - |

  • tokens/sec is the headline win, measured directly as wall-clock throughput. The speedup holds and is larger on the harder, more diverse set (2.76×) than on the trivial one (2.47×), and output is byte-identical in both.
  • No acceptance-length (τ) claim, on purpose. vLLM's /metrics τ pinned at exactly the γ+1 ceiling (8.0) on both runs, and per-prompt deltas didn't resolve a distribution: almost certainly a metrics artifact, not true 100% acceptance. So we report only the directly measured speedup + parity and treat τ as unreliable. The metric we can't trust, we don't quote.
  • parity = baseline vs DFlash greedy outputs are token-identical. This is the lossless proof.
  • †No TTFT row, and HumanEval pass@1 was not run. This minimal A/B measured throughput + byte-parity only; the harness did not isolate true time-to-first-token, and a full HumanEval pass@1 sweep is a documented next step. Byte-identical greedy output ⇒ identical pass@1 by construction, so parity is the stronger guarantee here.
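
The two reported quantities, byte-parity and speedup, reduce to very little code. The sketch below is a hypothetical stand-in for what the harness computes, not the repository's bench or eval scripts; `parity_report` and `speedup` are illustrative names, and the example completions are made up.

```python
from typing import Dict, List


def parity_report(baseline: List[str], speculative: List[str]) -> Dict[str, object]:
    """Compare paired completions byte-for-byte and list any mismatching prompts."""
    if len(baseline) != len(speculative):
        raise ValueError("both runs must cover the same prompts")
    mismatched = [
        i
        for i, (a, b) in enumerate(zip(baseline, speculative))
        if a.encode("utf-8") != b.encode("utf-8")
    ]
    return {"n": len(baseline), "mismatches": len(mismatched), "indices": mismatched}


def speedup(baseline_tps: float, dflash_tps: float) -> float:
    """Throughput ratio: DFlash tokens/sec over baseline tokens/sec."""
    return dflash_tps / baseline_tps


# Made-up completions, only to show the report shape.
print(parity_report(["def fib(n): ...", "x = 1"], ["def fib(n): ...", "x = 1"]))

# Ratios recomputed from the rounded tokens/sec values in the table above.
for label, base, fast in [("mixed", 19.6, 54.2), ("trivial", 19.5, 48.1), ("re-run", 19.8, 52.1)]:
    print(f"{label}: {speedup(base, fast):.3f}x")
```

Recomputing from the rounded table entries gives ratios within about 0.01 of the reported ones, which is what rounding of the tokens/sec values would produce.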

How to reproduce

The exact run that produced the numbers above is one self-contained command on Hugging Face Jobs (no ssh; serves baseline → measures → re-serves with DFlash → measures → byte-parity), funded by the HF Jobs credit pool:

hf jobs uv run --flavor h200 --timeout 1500 --detach --secrets HF_TOKEN scripts/hf_job_ab.py
# then: hf jobs logs <id>  →  the [job] RESULT / BASELINE_JSON / DFLASH_JSON / PARITY_JSON lines

scripts/hf_job_ab.py pins the working vLLM env (Triton MoE + Torch sampler + FlashAttention, so no CUDA toolkit is needed in the slim image; see THE_JOURNEY.md for why). Below is the equivalent local two-server flow for any CUDA box with the released weights (vLLM ≥ 0.21.0):

# 1. Baseline server (speed floor)
python scripts/serve_vllm.py --mode baseline --run        # serves on :8000

# 2. Benchmark baseline (separate shell)
python bench/measure.py --base-url http://localhost:8000 --model laguna \
    --label baseline --n 20 --out results/baseline.json

# 3. DFlash server: same command + the one --speculative-config flag
python scripts/serve_vllm.py --mode dflash --run
python bench/measure.py --base-url http://localhost:8000 --model laguna \
    --label dflash --n 20 --out results/dflash.json

# 4. Quality + lossless parity
python evals/humaneval_subset.py --base-url http://localhost:8000 --model laguna \
    --n 25 --out results/humaneval_dflash.json
# Parity compares two live servers, so both must be running at once
# (as written: one on :8000 and the other on :8001)
python evals/humaneval_subset.py --parity \
    --base-url http://localhost:8000 --base-url-b http://localhost:8001 --model laguna --n 25

The results table above is the diff of results/baseline.json and results/dflash.json plus the parity result. τ is read from vLLM's /metrics but, as explained under Results, is treated as unreliable and not reported.


Honesty note: the low-batch regime

This is deliberately a single-GPU, low-concurrency result: one box, one agent, maximum tokens/sec.

Speculative decoding helps most at low batch size / memory-bound decode, where each step reloads the active weights to emit a single token and doing useful work for several tokens per pass is a large win. It helps less at high batch size / compute-bound decode: once the GPU is saturated, the matmuls dominate and the extra verify work for rejected drafts can slightly hurt. At very high concurrency you would tune γ down or turn speculation off.

The reported speedup is for the low-batch single-GPU regime on coding-style prompts; no τ or acceptance numbers are quoted (see Results). The lossless claim (greedy parity) holds regardless of regime: it is a correctness property of the verification step, not a function of batch size.


License

Apache 2.0, inheriting poolside/Laguna-XS.2.