# Fuzzy Speculative Decoding

Fuzzy Speculative Decoding (FSD) is a decoding algorithm that generalizes standard Speculative Decoding (SD) by accepting candidate tokens based on distribution divergence thresholds rather than enforcing strict distributional equivalence. This enables a **tunable tradeoff between generation quality and inference speed**, letting users balance accuracy and runtime to fit their specific needs.

## Motivation

Standard Speculative Decoding enforces strict distributional equivalence to the target model, which limits potential speedups. In practice, however, nearly equivalent distributions often produce comparable outputs. By allowing controlled divergence from the target model distribution, FSD lets users trade small deviations in generation quality for significant gains in inference speed.

**Key Benefits:**

- **Tunable Quality-Speed Tradeoff**: Adjust the `fsd_threshold` parameter to control the balance between generation quality and inference speed
- **Significant Speed Improvements**: Generate over 5 tokens per second faster than standard SD with only around a 2% absolute reduction in benchmark accuracy
- **Maintained Performance**: In many cases, FSD matches standard SD benchmark accuracy while running over 2 tokens per second faster
- **Flexible Acceptance Criteria**: Choose among multiple divergence metrics (KL divergence, Jensen-Shannon divergence, or draft token probabilities) to best suit your use case

This implementation extends the standard speculative decoding algorithm with additional divergence metrics for more flexible candidate acceptance, supporting KL divergence, Jensen-Shannon divergence, and draft token-based acceptance criteria.
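The acceptance rule described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the package's API: `kl_divergence`, `fsd_accepts`, and the example distributions are hypothetical helpers, and the real implementation operates on model logits inside the speculative decoding loop. A draft token is accepted automatically when the divergence between the draft and target next-token distributions falls below `fsd_threshold`.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def fsd_accepts(draft_dist, target_dist, fsd_threshold):
    """Accept the draft token when the draft/target divergence is below the threshold.

    With fsd_threshold = 0.0 nothing is auto-accepted, recovering behavior
    closest to standard SD; larger thresholds accept more draft tokens.
    """
    return kl_divergence(draft_dist, target_dist) < fsd_threshold

# Toy next-token distributions over a 3-token vocabulary (hypothetical values)
target = [0.70, 0.20, 0.10]
close_draft = [0.68, 0.22, 0.10]  # nearly matches the target -> accepted
far_draft = [0.10, 0.20, 0.70]    # diverges strongly -> rejected

print(fsd_accepts(close_draft, target, fsd_threshold=0.05))  # True
print(fsd_accepts(far_draft, target, fsd_threshold=0.05))    # False
```

In the actual algorithm, rejected positions fall back to the target model's output, so the threshold only controls how often the cheaper draft distribution is trusted.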
This implementation is based on the paper **"Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff"** (ACL Findings 2025). See the [Citation](#citation) section below for full citation details.

## Features
### Custom Parameters
- **`fsd_threshold`** (float, default: 0.0): Threshold for fuzzy speculative decoding acceptance. Tokens with divergence below this threshold are automatically accepted. **Lower values** enforce stricter equivalence (closer to standard SD: higher quality, slower inference), while **higher values** allow more divergence (faster inference with a potential quality cost). Tune this parameter to reach your desired quality-speed tradeoff.
- **`fsd_div_type`** (str, default: "kl"): Type of divergence metric to use:
  - `"kl"`: KL divergence (D_KL(candidate || target)), which measures how much information is lost when the candidate distribution is used to approximate the target
  - `"js"`: Jensen-Shannon divergence, a symmetric and bounded measure of distribution similarity
  - `"draft_tokens"`: Absolute difference in draft token probabilities, a simpler metric based on probability differences

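The three `fsd_div_type` options can be compared on toy next-token distributions. The sketch below is illustrative only: the `divergence` helper is hypothetical, the package computes these quantities over model logits, and the exact form of the `"draft_tokens"` metric shown here (absolute probability difference at the sampled draft token) is an assumption based on the description above.

```python
import math

def _kl(p, q, eps=1e-12):
    """D_KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def divergence(draft_dist, target_dist, draft_token, div_type="kl"):
    """Toy versions of the three fsd_div_type metrics (illustrative only)."""
    if div_type == "kl":
        # D_KL(candidate || target): asymmetric, unbounded
        return _kl(draft_dist, target_dist)
    if div_type == "js":
        # Jensen-Shannon: symmetric and bounded, via the mixture distribution
        m = [(p + q) / 2 for p, q in zip(draft_dist, target_dist)]
        return 0.5 * _kl(draft_dist, m) + 0.5 * _kl(target_dist, m)
    if div_type == "draft_tokens":
        # Absolute probability difference at the sampled draft token
        return abs(draft_dist[draft_token] - target_dist[draft_token])
    raise ValueError(f"unknown div_type: {div_type}")

draft = [0.6, 0.3, 0.1]   # hypothetical draft-model distribution
target = [0.5, 0.4, 0.1]  # hypothetical target-model distribution
for d in ("kl", "js", "draft_tokens"):
    print(d, round(divergence(draft, target, draft_token=0, div_type=d), 4))
```

Note the practical differences: KL is asymmetric and can grow without bound, JS is symmetric and capped at ln 2, and `"draft_tokens"` looks only at the single sampled token, which makes it the cheapest but coarsest criterion.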
### How It Works