Upload README.md with huggingface_hub
Browse files
README.md
CHANGED
|
@@ -146,6 +146,42 @@ H200 — the lossless speedup reproduced (2.63×; run-to-run variance ~2.6–2.8
|
|
| 146 |
next step. Byte-identical greedy output ⇒ identical pass@1 *by construction*, so parity is the
|
| 147 |
stronger guarantee here.
|
| 148 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 149 |
---
|
| 150 |
|
| 151 |
## How to reproduce
|
|
|
|
| 146 |
next step. Byte-identical greedy output ⇒ identical pass@1 *by construction*, so parity is the
|
| 147 |
stronger guarantee here.
|
| 148 |
|
| 149 |
+
### γ-sweep (throughput-optimal, lossless)
|
| 150 |
+
|
| 151 |
+
Sweeping `num_speculative_tokens` (γ — the draft length) on fresh DFlash serves, baseline measured
|
| 152 |
+
once (γ-independent: **19.95 tok/s**), decode tok/s over the 14-prompt mixed set, greedy, byte-parity
|
| 153 |
+
vs baseline checked at *every* γ (`results/gamma_sweep.json`):
|
| 154 |
+
|
| 155 |
+
| γ | tokens/sec | speedup | lossless |
|
| 156 |
+
|---|---|---|---|
|
| 157 |
+
| 3 | 44.72 | 2.24× | ✓ 0/14 |
|
| 158 |
+
| 5 | 52.59 | 2.64× | ✓ 0/14 |
|
| 159 |
+
| 7 (card default) | 51.74 | 2.59× | ✓ 0/14 |
|
| 160 |
+
| **9 (γ\*)** | **52.96** | **2.65×** | ✓ 0/14 |
|
| 161 |
+
| 11 | 48.40 | 2.43× | ✓ 0/14 |
|
| 162 |
+
|
| 163 |
+
The curve **rises then falls**: it climbs from γ=3, plateaus across γ=5–9, **peaks at γ\*=9 (2.65×)**,
|
| 164 |
+
then **regresses at γ=11 (2.43×)** — the classic acceptance/overhead tradeoff (past the point where
|
| 165 |
+
extra drafted tokens are still accepted, more draft slots only raise verify cost and waste compute on
|
| 166 |
+
rejects faster than they add accepted tokens). The card default **γ=7 sits within ~2.4% of the
|
| 167 |
+
optimum**, and — the load-bearing point — **every γ is byte-lossless** (0/14 mismatches): the
|
| 168 |
+
throughput-optimal γ is also exactly lossless. This is a third, independent corroboration of the
|
| 169 |
+
headline (the 2.76× / 2.63× decode A/Bs being the first two).
|
| 170 |
+
|
| 171 |
+
### Reward-invariance (by construction)
|
| 172 |
+
|
| 173 |
+
The `spec_rl` dense unit-test reward (`fraction_passing`) scores a **mean 0.85** over a 12-problem
|
| 174 |
+
HumanEval slice via the canonical eval path against hosted Laguna — and the **self-served vLLM
|
| 175 |
+
baseline reproduces that 0.85 exactly** (`results/reward_invariance.json`), a clean corroboration.
|
| 176 |
+
|
| 177 |
+
Reward-invariance under DFlash holds **by construction**: lossless greedy decode (proven byte-identical
|
| 178 |
+
in the decode A/B at every γ) ⇒ identical rollout text ⇒ identical reward, just generated faster. We
|
| 179 |
+
**do not claim DFlash improves reward.** A live reward probe of the γ=7 DFlash run returned a higher
|
| 180 |
+
number than baseline with a few completions differing, but that is run-to-run greedy MoE
|
| 181 |
+
nondeterminism across two separate serves on longer generations — *not* a DFlash quality change — so
|
| 182 |
+
we decline to over-interpret it, the same discipline we apply to acceptance length τ. The
|
| 183 |
+
by-construction guarantee, anchored on the measured byte-parity, is the claim that matters.
|
| 184 |
+
|
| 185 |
---
|
| 186 |
|
| 187 |
## How to reproduce
|