# Fuzzy Speculative Decoding
Fuzzy Speculative Decoding (FSD) is a decoding algorithm that generalizes standard Speculative Decoding (SD) by accepting candidate tokens based on distribution divergence thresholds rather than enforcing strict distributional equivalence. This creates a tunable tradeoff between generation quality and inference speed, letting users balance the two based on their specific needs.
## Motivation
Standard Speculative Decoding enforces strict distributional equivalence to the target model, which limits potential speedups. In practice, however, near-equivalent distributions often produce outputs of comparable quality. By allowing controlled divergence from the target model distribution, FSD lets users trade small deviations in generation quality for significant inference speed gains.
**Key Benefits:**

- **Tunable Quality-Speed Tradeoff**: Adjust the `fsd_threshold` parameter to control the balance between generation quality and inference speed
- **Significant Speed Improvements**: Generate over 5 tokens per second faster than standard SD with only an approximate 2% absolute reduction in benchmark accuracy
- **Maintained Performance**: In many cases, FSD matches standard SD benchmark accuracy while running over 2 tokens per second faster
- **Flexible Acceptance Criteria**: Choose from multiple divergence metrics (KL divergence, Jensen-Shannon divergence, or draft token probabilities) to best suit your use case
This implementation extends the standard speculative decoding algorithm with additional divergence metrics for more flexible candidate acceptance, supporting KL divergence, Jensen-Shannon divergence, and draft token-based acceptance criteria.
This implementation is based on the paper "Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff" (ACL Findings 2025). See the Citation section below for full citation details.
## Features

- **Fuzzy Speculative Decoding (FSD)**: Accepts candidate tokens based on distribution divergence thresholds
- **Multiple Divergence Types**:
  - `kl`: KL divergence between candidate and target distributions
  - `js`: Jensen-Shannon divergence
  - `draft_tokens`: Absolute difference in draft token probabilities
- **Standard Speculative Decoding**: Falls back to standard speculative decoding acceptance when the FSD threshold is not met
## Usage

### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load models
target_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
assistant_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Prepare input
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate with custom fuzzy speculative decoding
outputs = target_model.generate(
    **inputs,
    assistant_model=assistant_model,
    custom_generate="maxholsman/fuzzy-spec-dec",
    trust_remote_code=True,
    fsd_threshold=0.0,    # FSD acceptance threshold
    fsd_div_type="kl",    # Divergence type: "kl", "js", or "draft_tokens"
    do_sample=True,
    temperature=0.7,
    max_new_tokens=100,
)

# Decode result
generated_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(generated_text)
```
### Custom Parameters

- `fsd_threshold` (float, default: 0.0): Threshold for fuzzy speculative decoding acceptance. Tokens with divergence below this threshold are automatically accepted. Lower values enforce stricter equivalence (closer to standard SD, higher quality but slower), while higher values allow more divergence (faster inference with potential quality tradeoffs). Tune this parameter to achieve your desired quality-speed tradeoff.
- `fsd_div_type` (str, default: `"kl"`): Type of divergence metric to use:
  - `"kl"`: KL divergence (D_KL(candidate || target)), measuring how much information is lost when using the candidate distribution to approximate the target
  - `"js"`: Jensen-Shannon divergence, a symmetric and bounded measure of distribution similarity
  - `"draft_tokens"`: Absolute difference in draft token probabilities, a simpler metric based on probability differences
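To make the three metrics concrete, here is a minimal NumPy sketch of how each could be computed for a single candidate position. This is an illustration of the definitions above, not the repository's actual implementation; the function names and the `eps` smoothing constant are choices made for this example.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), with eps smoothing for zeros."""
    p = p + eps
    q = q + eps
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon: 0.5*KL(p||m) + 0.5*KL(q||m) with m = (p+q)/2.
    Symmetric in p and q, and bounded above by ln(2)."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def draft_token_divergence(p_draft: np.ndarray, p_target: np.ndarray,
                           token_id: int) -> float:
    """|p_draft(token) - p_target(token)| for the sampled draft token."""
    return float(abs(p_draft[token_id] - p_target[token_id]))

# Toy next-token distributions over a 4-token vocabulary
draft = np.array([0.70, 0.20, 0.05, 0.05])
target = np.array([0.60, 0.25, 0.10, 0.05])

print(kl_divergence(draft, target))                        # small positive value
print(js_divergence(draft, target))                        # bounded by ln 2
print(draft_token_divergence(draft, target, token_id=0))   # ~0.10
```

Note that `"kl"` is asymmetric (swapping candidate and target changes the value), while `"js"` is symmetric and bounded, which can make a single threshold easier to interpret across positions.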
## How It Works

- The assistant model generates candidate tokens
- The target model evaluates these candidates
- For each candidate position:
  - If the FSD divergence ≤ threshold, the token is accepted
  - Otherwise, standard speculative decoding acceptance is applied
- Accepted tokens are kept; a rejected token triggers resampling from the target model
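The per-position decision above can be sketched as follows. This is a simplified illustration, not the repository's code: `div` stands for whichever metric `fsd_div_type` selects, and the fallback is the standard speculative sampling rule, which accepts a draft token with probability `min(1, p_target / p_draft)`.

```python
import numpy as np

rng = np.random.default_rng(0)

def accept_token(div: float, threshold: float,
                 p_target: float, p_draft: float) -> bool:
    """FSD acceptance for one candidate position.

    Auto-accept when the divergence is within the threshold; otherwise
    fall back to standard speculative decoding, which accepts with
    probability min(1, p_target / p_draft).
    """
    if div <= threshold:
        return True
    return bool(rng.random() < min(1.0, p_target / p_draft))

# With threshold 0.0 and positive divergence, FSD reduces to standard SD:
# a draft token the target assigns higher probability is always accepted.
print(accept_token(div=0.3, threshold=0.0, p_target=0.5, p_draft=0.4))  # True
```

Raising `threshold` makes the first branch fire more often, so more draft tokens are accepted without the probabilistic check, which is where the speedup comes from.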
## Citation
If you use this code in your research, please cite the original paper:
```bibtex
@article{holsman2025fuzzy,
  title={Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff},
  author={Holsman, Maximilian and Huang, Yukun and Dhingra, Bhuwan},
  journal={ACL Findings},
  year={2025},
  url={https://arxiv.org/abs/2502.20704}
}
```
**Paper**: Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff
**Authors**: Maximilian Holsman, Yukun Huang, Bhuwan Dhingra
**Venue**: ACL Findings 2025
## License
Apache 2.0