poolside-laguna-hackathon
/

lean-laguna

Text Generation

Mixture of Experts

speculative-decoding

Model card Files Files and versions

art87able commited on about 2 hours ago

Commit

c0da39b

·

verified ·

1 Parent(s): ee4e9e6

Upload README.md with huggingface_hub

Files changed (1) hide show

README.md +36 -0

README.md CHANGED Viewed

@@ -146,6 +146,42 @@ H200 — the lossless speedup reproduced (2.63×; run-to-run variance ~2.6–2.8
   next step. Byte-identical greedy output ⇒ identical pass@1 *by construction*, so parity is the
   stronger guarantee here.
 ---
 ## How to reproduce

   next step. Byte-identical greedy output ⇒ identical pass@1 *by construction*, so parity is the
   stronger guarantee here.
+### γ-sweep (throughput-optimal, lossless)
+Sweeping `num_speculative_tokens` (γ — the draft length) on fresh DFlash serves, baseline measured
+once (γ-independent: **19.95 tok/s**), decode tok/s over the 14-prompt mixed set, greedy, byte-parity
+vs baseline checked at *every* γ (`results/gamma_sweep.json`):
+| γ | tokens/sec | speedup | lossless |
+|---|---|---|---|
+| 3 | 44.72 | 2.24× | ✓ 0/14 |
+| 5 | 52.59 | 2.64× | ✓ 0/14 |
+| 7 (card default) | 51.74 | 2.59× | ✓ 0/14 |
+| **9 (γ\*)** | **52.96** | **2.65×** | ✓ 0/14 |
+| 11 | 48.40 | 2.43× | ✓ 0/14 |
+The curve **rises then falls**: it climbs from γ=3, plateaus across γ=5–9, **peaks at γ\*=9 (2.65×)**,
+then **regresses at γ=11 (2.43×)** — the classic acceptance/overhead tradeoff (past the point where
+extra drafted tokens are still accepted, more draft slots only raise verify cost and waste compute on
+rejects faster than they add accepted tokens). The card default **γ=7 sits within ~2.4% of the
+optimum**, and — the load-bearing point — **every γ is byte-lossless** (0/14 mismatches): the
+throughput-optimal γ is also exactly lossless. This is a third, independent corroboration of the
+headline (the 2.76× / 2.63× decode A/Bs being the first two).
+### Reward-invariance (by construction)
+The `spec_rl` dense unit-test reward (`fraction_passing`) scores a **mean 0.85** over a 12-problem
+HumanEval slice via the canonical eval path against hosted Laguna — and the **self-served vLLM
+baseline reproduces that 0.85 exactly** (`results/reward_invariance.json`), a clean corroboration.
+Reward-invariance under DFlash holds **by construction**: lossless greedy decode (proven byte-identical
+in the decode A/B at every γ) ⇒ identical rollout text ⇒ identical reward, just generated faster. We
+**do not claim DFlash improves reward.** A live reward probe of the γ=7 DFlash run returned a higher
+number than baseline with a few completions differing, but that is run-to-run greedy MoE
+nondeterminism across two separate serves on longer generations — *not* a DFlash quality change — so
+we decline to over-interpret it, the same discipline we apply to acceptance length τ. The
+by-construction guarantee, anchored on the measured byte-parity, is the claim that matters.
 ---
 ## How to reproduce