SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection
Paper: arXiv 2605.02888
This is the trained acceptance rate predictor from the SpecKV paper. It selects the optimal speculation length (gamma) per step using draft model signals, achieving 56.0% more tokens per speculation step than the fixed gamma=4 default.
```python
import pickle

import numpy as np

# Load the pickled model (unpickling requires scikit-learn to be installed).
with open("speckv_mlp16.pkl", "rb") as f:
    model = pickle.load(f)

# At each speculation step, extract these from the draft token distributions:
draft_entropy = 1.5      # mean entropy across draft tokens
draft_confidence = 0.72  # mean top-1 confidence
max_entropy = 2.3        # max entropy in the step
min_confidence = 0.45    # min confidence in the step
comp_enc = 0             # 0=fp16, 1=int8, 2=nf4

# Pick the gamma with the highest expected tokens per step.
best_gamma, best_expected = 2, 0
for gamma in [2, 4, 6, 8]:
    features = np.array([[draft_entropy, draft_confidence, max_entropy,
                          min_confidence, comp_enc, gamma]])
    # Predicted acceptance rate, clipped to the valid range [0, 1].
    pred_ar = np.clip(model.predict(features)[0], 0, 1)
    expected_tokens = pred_ar * gamma + 1
    if expected_tokens > best_expected:
        best_expected = expected_tokens
        best_gamma = gamma

print(f"Use gamma={best_gamma} (expected {best_expected:.1f} tokens)")
```
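The four per-step signals above are summary statistics of the draft model's per-token probability distributions. A minimal sketch of how they could be computed follows. The function name `draft_features`, the `(num_draft_tokens, vocab_size)` input layout, and the use of natural-log (nats) entropy are assumptions for illustration, not details confirmed by the model card, so check them against the paper before relying on the values.

```python
import numpy as np

def draft_features(probs, eps=1e-12):
    """Summarize draft distributions for one speculation step.

    probs: array of shape (num_draft_tokens, vocab_size); each row is a
    probability distribution over the vocabulary (hypothetical layout).
    """
    probs = np.asarray(probs, dtype=np.float64)
    # Shannon entropy per draft position, in nats (assumed unit).
    entropy = -np.sum(probs * np.log(probs + eps), axis=-1)
    # Top-1 probability per draft position.
    confidence = probs.max(axis=-1)
    return {
        "draft_entropy": float(entropy.mean()),
        "draft_confidence": float(confidence.mean()),
        "max_entropy": float(entropy.max()),
        "min_confidence": float(confidence.min()),
    }

# Toy example: two draft positions over a 4-token vocabulary.
toy = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.25, 0.25, 0.25, 0.25]])
feats = draft_features(toy)
```

The returned dictionary uses the same names as the variables in the example above, so its values can be placed directly into the feature vector ahead of `comp_enc` and `gamma`.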
If you do not want a sklearn dependency, load the raw weights:
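```python
import numpy as np

weights = np.load("speckv_mlp16_weights.npz")
W1, b1 = weights["W1"], weights["b1"]  # (6, 16), (16,)
W2, b2 = weights["W2"], weights["b2"]  # (16, 1), (1,)

def predict(x):
    """Return the raw (unclipped) acceptance-rate prediction for one 6-feature vector."""
    x = np.asarray(x, dtype=np.float64).reshape(1, -1)
    h = np.maximum(0, x @ W1 + b1)  # ReLU
    # .item() avoids NumPy's deprecated float() conversion of a non-scalar array.
    return (h @ W2 + b2).item()
```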
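With the raw weights, all candidate gammas can be scored in one matrix multiply instead of one call per gamma. The sketch below mirrors the selection loop from the scikit-learn example, including clipping to [0, 1]. The helper `select_gamma` and the stand-in weights in the demo are hypothetical and only there so the snippet runs without the `.npz` file; in practice pass the `W1`, `b1`, `W2`, `b2` loaded from it.

```python
import numpy as np

def select_gamma(W1, b1, W2, b2, step_features, gammas=(2, 4, 6, 8)):
    """Pick the gamma that maximizes pred_ar * gamma + 1.

    step_features: the five per-step values, in the order
    [draft_entropy, draft_confidence, max_entropy, min_confidence, comp_enc].
    """
    gammas = np.asarray(gammas, dtype=np.float64)
    base = np.asarray(step_features, dtype=np.float64)
    # One row per candidate gamma: the five shared features plus gamma itself.
    X = np.column_stack([np.tile(base, (len(gammas), 1)), gammas])
    h = np.maximum(0.0, X @ W1 + b1)  # hidden layer with ReLU
    pred_ar = np.clip((h @ W2 + b2).ravel(), 0.0, 1.0)
    expected = pred_ar * gammas + 1.0
    best = int(np.argmax(expected))  # first maximum wins, as in the loop version
    return int(gammas[best]), float(expected[best])

# Stand-in weights (NOT the trained ones): constant predicted acceptance of 0.8.
W1_demo, b1_demo = np.zeros((6, 16)), np.ones(16)
W2_demo, b2_demo = np.full((16, 1), 0.05), np.zeros(1)
gamma, expected = select_gamma(W1_demo, b1_demo, W2_demo, b2_demo,
                               [1.5, 0.72, 2.3, 0.45, 0])
```

Because the stand-in predictor returns the same acceptance rate for every gamma, the largest gamma wins in the demo; with the trained weights the prediction depends on gamma and the choice varies from step to step.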
| Property | Value |
|---|---|
| Architecture | MLP, 1 hidden layer, 16 units, ReLU |
| Input | 6 features (entropy, confidence, max/min variants, compression, gamma) |
| Output | Acceptance rate prediction (0-1) |
| Training data | 5,112 step-level records |
| Test MSE | 0.090 |
| Test correlation | 0.685 |
| Decision overhead | 0.34 ms (4 predictions per decision) |
| Improvement over fixed gamma=4 | 56.0% more tokens per speculation step |
| Statistical significance | p < 0.001 |
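For readers who want to compute comparable numbers on their own step-level records, the sketch below calculates the two test metrics from the table on held-out data. It assumes "correlation" means the Pearson coefficient, which the table does not state explicitly. The function name `evaluate` and the toy arrays are illustrative only and are not data from the paper.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return (MSE, Pearson correlation) between observed and predicted acceptance rates."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    mse = float(np.mean((y_true - y_pred) ** 2))
    # Off-diagonal entry of the 2x2 correlation matrix is the Pearson r.
    corr = float(np.corrcoef(y_true, y_pred)[0, 1])
    return mse, corr

# Toy values for illustration only.
y_true = np.array([0.2, 0.5, 0.9, 0.4])
y_pred = np.array([0.3, 0.4, 0.8, 0.5])
mse, corr = evaluate(y_true, y_pred)
```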
- speckv_mlp16.pkl - Full scikit-learn model (pickle)
- speckv_mlp16_weights.npz - Raw NumPy weights (W1, b1, W2, b2)
- config.json - Model configuration and metadata
- requirements.txt - Python dependencies

```bibtex
@article{shukla2026speckv,
  title={SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection},
  author={Shukla, Shikhar},
  journal={arXiv preprint},
  year={2026}
}
```