art87able committed · Commit c0da39b · verified · 1 Parent(s): ee4e9e6

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +36 -0
README.md CHANGED
@@ -146,6 +146,42 @@ H200 — the lossless speedup reproduced (2.63×; run-to-run variance ~2.6–2.8
  next step. Byte-identical greedy output ⇒ identical pass@1 *by construction*, so parity is the
  stronger guarantee here.
 
+ ### γ-sweep (throughput-optimal, lossless)
+
+ Sweep of `num_speculative_tokens` (γ, the draft length) on fresh DFlash serves. The baseline is
+ measured once because it does not depend on γ (**19.95 tok/s**). Metric: decode tok/s over the
+ 14-prompt mixed set, greedy, with byte-parity vs baseline checked at *every* γ
+ (`results/gamma_sweep.json`):
+
+ | γ | tokens/sec | speedup vs baseline | lossless (byte mismatches) |
+ |---|---|---|---|
+ | 3 | 44.72 | 2.24× | ✓ 0/14 |
+ | 5 | 52.59 | 2.64× | ✓ 0/14 |
+ | 7 (card default) | 51.74 | 2.59× | ✓ 0/14 |
+ | **9 (γ\*)** | **52.96** | **2.65×** | ✓ 0/14 |
+ | 11 | 48.40 | 2.43× | ✓ 0/14 |
+
+ The curve **rises then falls**: it climbs from γ=3, plateaus across γ=5–9, **peaks at γ\*=9 (2.65×)**,
+ then **regresses at γ=11 (2.43×)**. This is the classic acceptance/overhead tradeoff: once extra
+ drafted tokens stop being accepted, more draft slots raise verify cost and waste compute on rejects
+ faster than they add accepted tokens. The card default **γ=7 sits within ~2.4% of the optimum**, and
+ **every γ is byte-lossless** (0/14 mismatches), which is the load-bearing point: the
+ throughput-optimal γ is also exactly lossless. This is a third, independent corroboration of the
+ headline (the 2.76× / 2.63× decode A/Bs being the first two).
+
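The speedup column and the "within ~2.4%" figure follow directly from the table. A minimal sketch of that arithmetic is below; the numbers are copied from the table, while the dict layout and function names are illustrative assumptions and not the schema of `results/gamma_sweep.json`.

```python
# Values copied from the γ-sweep table; the data layout here is hypothetical.
BASELINE_TOK_S = 19.95
SWEEP_TOK_S = {3: 44.72, 5: 52.59, 7: 51.74, 9: 52.96, 11: 48.40}
CARD_DEFAULT_GAMMA = 7


def speedups(sweep, baseline):
    """Decode speedup per γ: DFlash tok/s divided by the γ-independent baseline."""
    return {gamma: tok_s / baseline for gamma, tok_s in sweep.items()}


def best_gamma(sweep):
    """γ with the highest measured decode throughput."""
    return max(sweep, key=sweep.get)


def margin_over(sweep, gamma):
    """Fraction by which the best γ's throughput exceeds the given γ's."""
    return sweep[best_gamma(sweep)] / sweep[gamma] - 1.0


for gamma, s in sorted(speedups(SWEEP_TOK_S, BASELINE_TOK_S).items()):
    print(f"gamma={gamma:>2}: {s:.2f}x")
print(f"best gamma: {best_gamma(SWEEP_TOK_S)}")
print(f"margin over default: {margin_over(SWEEP_TOK_S, CARD_DEFAULT_GAMMA):.1%}")
```

With spreads this small across γ=5–9, the margin is comparable to run-to-run variance, so the ranking inside the plateau should be read loosely.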
+ ### Reward-invariance (by construction)
+
+ The `spec_rl` dense unit-test reward (`fraction_passing`) scores a **mean 0.85** over a 12-problem
+ HumanEval slice via the canonical eval path against hosted Laguna, and the **self-served vLLM
+ baseline reproduces that 0.85 exactly** (`results/reward_invariance.json`), a clean corroboration.
+
+ Reward-invariance under DFlash holds **by construction**: lossless greedy decode (measured
+ byte-identical in the decode A/B at every γ) ⇒ identical rollout text ⇒ identical reward, just
+ generated faster. We **do not claim DFlash improves reward.** A live reward probe of the γ=7 DFlash
+ run returned a higher number than baseline, with a few completions differing, but that is run-to-run
+ greedy MoE nondeterminism across two separate serves on longer generations, *not* a DFlash quality
+ change, so we decline to over-interpret it, the same discipline we apply to acceptance length τ. The
+ by-construction guarantee, anchored on the measured byte-parity, is the claim that matters.
+
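The chain "byte-identical text ⇒ identical reward" can be sketched as follows. This is an illustration under stated assumptions: `count_byte_mismatches`, `rewards_match`, and `toy_reward` are hypothetical names, and `toy_reward` is only a stand-in for a deterministic scorer such as `fraction_passing`, not the `spec_rl` implementation.

```python
from typing import Callable, Sequence


def count_byte_mismatches(baseline: Sequence[str], candidate: Sequence[str]) -> int:
    """Number of prompts whose completions differ at the byte level."""
    if len(baseline) != len(candidate):
        raise ValueError("completion lists must cover the same prompts")
    return sum(
        b.encode("utf-8") != c.encode("utf-8")
        for b, c in zip(baseline, candidate)
    )


def rewards_match(
    baseline: Sequence[str],
    candidate: Sequence[str],
    reward_fn: Callable[[str], float],
) -> bool:
    """True when a deterministic reward gives the same score for every pair."""
    return [reward_fn(t) for t in baseline] == [reward_fn(t) for t in candidate]


def toy_reward(text: str) -> float:
    """Hypothetical deterministic reward: 1.0 if the completion returns a value."""
    return 1.0 if "return" in text else 0.0


baseline_out = ["def inc(x):\n    return x + 1\n", "def noop():\n    pass\n"]
dflash_out = list(baseline_out)  # lossless decode: same bytes, produced faster

mismatches = count_byte_mismatches(baseline_out, dflash_out)
print(f"{mismatches}/{len(baseline_out)} mismatches")
# Zero mismatches implies equal rewards for any reward that depends only on the text.
print(rewards_match(baseline_out, dflash_out, toy_reward))
```

The implication only runs one way: equal rewards do not show byte parity, which is why the parity check is the stronger measurement.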
  ---
 
  ## How to reproduce