SpecKV-MLP16: Adaptive Gamma Selector for Speculative Decoding

This is the trained acceptance-rate predictor from the SpecKV paper. At each speculation step it uses draft-model signals to choose the speculation length (gamma) with the highest predicted token yield, producing 56.0% more tokens per speculation step than the fixed gamma=4 default.

Quick Start

import pickle
import numpy as np

# Load the scikit-learn model (scikit-learn must be installed; only unpickle files you trust)
with open("speckv_mlp16.pkl", "rb") as f:
    model = pickle.load(f)

# at each speculation step, extract these from draft token distributions:
draft_entropy = 1.5       # mean entropy across draft tokens
draft_confidence = 0.72   # mean top-1 confidence
max_entropy = 2.3         # max entropy in the step
min_confidence = 0.45     # min confidence in the step
comp_enc = 0              # 0=fp16, 1=int8, 2=nf4

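# Pick the gamma with the highest expected tokens per step
best_gamma, best_expected = 2, 0.0
for gamma in [2, 4, 6, 8]:
    features = np.array([[draft_entropy, draft_confidence, max_entropy, min_confidence, comp_enc, gamma]])
    # Predicted acceptance rate, clipped to the valid range [0, 1]
    pred_ar = float(np.clip(model.predict(features)[0], 0, 1))
    # Expected accepted draft tokens, plus one token from the target model's verification pass
    expected_tokens = pred_ar * gamma + 1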
    if expected_tokens > best_expected:
        best_expected = expected_tokens
        best_gamma = gamma

print(f"Use gamma={best_gamma} (expected {best_expected:.1f} tokens)")
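
The four step-level signals above can be computed from the draft model's per-token probability distributions. The following is a minimal sketch, not the paper's extraction code: the helper name `draft_features` is hypothetical, the input is assumed to be already softmax-normalized, and entropy is assumed to be measured in nats, since the card does not state the unit or log base.

```python
import numpy as np

def draft_features(probs):
    """Summarize per-token draft distributions into four step-level signals.

    probs: array of shape (num_draft_tokens, vocab_size); each row is assumed
    to be a normalized probability distribution.
    """
    probs = np.asarray(probs, dtype=np.float64)
    # Shannon entropy per draft token (assumption: natural log, i.e. nats).
    safe = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    entropy = -(probs * np.log(safe)).sum(axis=-1)
    # Top-1 confidence per draft token.
    confidence = probs.max(axis=-1)
    return {
        "draft_entropy": float(entropy.mean()),
        "draft_confidence": float(confidence.mean()),
        "max_entropy": float(entropy.max()),
        "min_confidence": float(confidence.min()),
    }

# Toy example: two draft tokens over a 4-word vocabulary.
toy = np.array([[0.70, 0.10, 0.10, 0.10],
                [0.25, 0.25, 0.25, 0.25]])
feats = draft_features(toy)
```

If the model was trained on entropies in bits rather than nats, divide the entropy values by `np.log(2)` before building the feature vector.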

Framework-Agnostic Loading

If you want to avoid the scikit-learn dependency, load the raw weights and run the forward pass in NumPy:

import numpy as np

weights = np.load("speckv_mlp16_weights.npz")
W1, b1 = weights["W1"], weights["b1"]  # (6, 16), (16,)
W2, b2 = weights["W2"], weights["b2"]  # (16, 1), (1,)

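def predict(x):
    # x: one feature vector of shape (6,) or (1, 6), in the Quick Start feature order
    h = np.maximum(0, x @ W1 + b1)  # hidden layer with ReLU
    # .item() extracts the single output; float() on a non-0-d array is deprecated in NumPy >= 1.25
    # The result is unclipped; apply np.clip(..., 0, 1) as in the Quick Start
    return (h @ W2 + b2).item()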
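
Because the network is a single small matrix product, all four candidate gammas can be scored in one batched forward pass instead of a loop. The sketch below is one way to do that under stated assumptions: `select_gamma` is a hypothetical helper, and the weights are random stand-ins with the documented shapes so the snippet runs without the `.npz` file. Substitute the real `W1`, `b1`, `W2`, `b2` in practice.

```python
import numpy as np

GAMMAS = np.array([2, 4, 6, 8], dtype=np.float64)

def select_gamma(step_features, W1, b1, W2, b2, gammas=GAMMAS):
    """Return (best_gamma, expected_tokens) using one batched forward pass.

    step_features: the five per-step inputs in the Quick Start order
    (draft_entropy, draft_confidence, max_entropy, min_confidence, comp_enc).
    """
    base = np.asarray(step_features, dtype=np.float64)
    # One row per candidate gamma: shared features plus gamma as the last column.
    X = np.column_stack([np.tile(base, (len(gammas), 1)), gammas])
    h = np.maximum(0.0, X @ W1 + b1)                    # hidden layer, ReLU
    pred_ar = np.clip((h @ W2 + b2).ravel(), 0.0, 1.0)  # acceptance rates in [0, 1]
    expected = pred_ar * gammas + 1.0                   # same scoring rule as the Quick Start
    best = int(np.argmax(expected))                     # ties resolve to the smaller gamma
    return int(gammas[best]), float(expected[best])

# Random stand-in weights (NOT the trained model) with the documented shapes.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(6, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 1)), rng.normal(size=1)

gamma, expected = select_gamma([1.5, 0.72, 2.3, 0.45, 0], W1, b1, W2, b2)
```

With the trained weights, this should select the same gamma as the Quick Start loop, since it applies the same clipping and the same `pred_ar * gamma + 1` score.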

Model Details

| Property | Value |
|---|---|
| Architecture | MLP, 1 hidden layer, 16 units, ReLU |
| Input | 6 features (entropy, confidence, max/min variants, compression, gamma) |
| Output | Acceptance rate prediction (0-1) |
| Training data | 5,112 step-level records |
| Test MSE | 0.090 |
| Test correlation | 0.685 |
| Decision overhead | 0.34 ms (4 predictions per decision) |
| Improvement over fixed gamma=4 | 56.0% |
| Statistical significance | p < 0.001 |
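
To give a sense of scale, the parameter count follows directly from the documented layer shapes (6 inputs, 16 hidden units, 1 output). This is a small arithmetic sketch; the variable names are illustrative only.

```python
# Layer sizes as documented: 6 inputs -> 16 hidden units (ReLU) -> 1 output.
layer_sizes = [6, 16, 1]

n_params = sum(
    fan_in * fan_out + fan_out  # weight matrix plus bias vector
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])
)
print(n_params)  # 129
```

The breakdown is 6*16 + 16 = 112 parameters in the hidden layer and 16*1 + 1 = 17 in the output layer.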

Files

  • speckv_mlp16.pkl - Full scikit-learn model (pickle)
  • speckv_mlp16_weights.npz - Raw numpy weights (W1, b1, W2, b2)
  • config.json - Model configuration and metadata
  • requirements.txt - Python dependencies

Citation

@article{shukla2026speckv,
  title={SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection},
  author={Shukla, Shikhar},
  journal={arXiv preprint},
  year={2026}
}
