# Fuzzy Speculative Decoding
Fuzzy Speculative Decoding (FSD) is a decoding algorithm that generalizes standard Speculative Decoding (SD) by accepting candidate tokens based on distribution divergence thresholds rather than enforcing strict distributional equivalence. This creates a tunable tradeoff between generation quality and inference speed, letting users balance the two based on their specific needs.
## Motivation
Standard Speculative Decoding enforces strict distributional equivalence to the target model, which limits potential speedups. In practice, however, near-equivalent distributions often produce outputs of comparable quality. By allowing controlled divergence from the target model distribution, FSD lets users trade small deviations in generation quality for significant inference speed gains.
**Key Benefits:**

- **Tunable Quality-Speed Tradeoff**: Adjust the `fsd_threshold` parameter to control the balance between generation quality and inference speed
- **Significant Speed Improvements**: Generate over 5 tokens per second faster than standard SD with only an approximate 2% absolute reduction in benchmark accuracy
- **Maintained Performance**: In many cases, FSD matches standard SD benchmark accuracy while running over 2 tokens per second faster
- **Flexible Acceptance Criteria**: Choose from multiple divergence metrics (KL divergence, Jensen-Shannon divergence, or draft token probabilities) to best suit your use case
This implementation extends the standard speculative decoding algorithm with additional divergence metrics for more flexible candidate acceptance, supporting KL divergence, Jensen-Shannon divergence, and draft token-based acceptance criteria.
This implementation is based on the paper "Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff" (ACL Findings 2025). See the Citation section below for full citation details.
## Features

- **Fuzzy Speculative Decoding (FSD)**: Accepts candidate tokens based on distribution divergence thresholds
- **Multiple Divergence Types**:
  - `kl`: KL divergence between candidate and target distributions
  - `js`: Jensen-Shannon divergence
  - `draft_tokens`: Absolute difference in draft token probabilities
- **Standard Speculative Decoding**: Falls back to standard speculative decoding acceptance when the FSD threshold is not met
## Usage

### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load models
target_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
assistant_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Prepare input
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate with custom fuzzy speculative decoding
outputs = target_model.generate(
    **inputs,
    assistant_model=assistant_model,
    custom_generate="maxholsman/fuzzy-spec-dec",
    trust_remote_code=True,
    fsd_threshold=0.0,    # FSD acceptance threshold
    fsd_div_type="kl",    # Divergence type: "kl", "js", or "draft_tokens"
    do_sample=True,
    temperature=0.7,
    max_new_tokens=100,
)

# Decode result
generated_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(generated_text)
```
### Custom Parameters

- `fsd_threshold` (float, default: 0.0): Threshold for fuzzy speculative decoding acceptance. Tokens with divergence below this threshold are automatically accepted. Lower values enforce stricter equivalence (closer to standard SD, higher quality but slower), while higher values allow more divergence (faster inference with potential quality tradeoffs). Tune this parameter to achieve your desired quality-speed tradeoff.
- `fsd_div_type` (str, default: `"kl"`): Type of divergence metric to use:
  - `"kl"`: KL divergence (D_KL(candidate || target)), measuring how much information is lost when using the candidate distribution to approximate the target
  - `"js"`: Jensen-Shannon divergence, a symmetric and bounded measure of distribution similarity
  - `"draft_tokens"`: Absolute difference in draft token probabilities, a simpler metric based on probability differences
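To make the three metrics concrete, here is a minimal NumPy sketch of how each could be computed for a single candidate position. This is an illustration of the definitions above, not the repository's actual implementation; the function names and the `eps` smoothing constant are choices made for this example.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), with eps smoothing for zeros."""
    p = p + eps
    q = q + eps
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon: 0.5*KL(p||m) + 0.5*KL(q||m) with m = (p+q)/2.
    Symmetric in p and q, and bounded above by ln(2)."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def draft_token_divergence(p_draft: np.ndarray, p_target: np.ndarray,
                           token_id: int) -> float:
    """|p_draft(token) - p_target(token)| for the sampled draft token."""
    return float(abs(p_draft[token_id] - p_target[token_id]))

# Toy next-token distributions over a 4-token vocabulary
draft = np.array([0.70, 0.20, 0.05, 0.05])
target = np.array([0.60, 0.25, 0.10, 0.05])

print(kl_divergence(draft, target))                        # small positive value
print(js_divergence(draft, target))                        # bounded by ln 2
print(draft_token_divergence(draft, target, token_id=0))   # ~0.10
```

Note that `"kl"` is asymmetric (swapping candidate and target changes the value), while `"js"` is symmetric and bounded, which can make a single threshold easier to interpret across positions.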
## How It Works

- The assistant model generates candidate tokens
- The target model evaluates these candidates
- For each candidate position:
  - If the FSD divergence ≤ threshold, the token is accepted
  - Otherwise, standard speculative decoding acceptance is applied
- Accepted tokens are kept; a rejected token triggers resampling from the target model
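The per-position decision above can be sketched as follows. This is a simplified illustration, not the repository's code: `div` stands for whichever metric `fsd_div_type` selects, and the fallback is the standard speculative sampling rule, which accepts a draft token with probability `min(1, p_target / p_draft)`.

```python
import numpy as np

rng = np.random.default_rng(0)

def accept_token(div: float, threshold: float,
                 p_target: float, p_draft: float) -> bool:
    """FSD acceptance for one candidate position.

    Auto-accept when the divergence is within the threshold; otherwise
    fall back to standard speculative decoding, which accepts with
    probability min(1, p_target / p_draft).
    """
    if div <= threshold:
        return True
    return bool(rng.random() < min(1.0, p_target / p_draft))

# With threshold 0.0 and positive divergence, FSD reduces to standard SD:
# a draft token the target assigns higher probability is always accepted.
print(accept_token(div=0.3, threshold=0.0, p_target=0.5, p_draft=0.4))  # True
```

Raising `threshold` makes the first branch fire more often, so more draft tokens are accepted without the probabilistic check, which is where the speedup comes from.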
## Citation
If you use this code in your research, please cite the original paper:
```bibtex
@article{holsman2025fuzzy,
  title={Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff},
  author={Holsman, Maximilian and Huang, Yukun and Dhingra, Bhuwan},
  journal={ACL Findings},
  year={2025},
  url={https://arxiv.org/abs/2502.20704}
}
```
**Paper**: Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff
**Authors**: Maximilian Holsman, Yukun Huang, Bhuwan Dhingra
**Venue**: ACL Findings 2025
## License
Apache 2.0