Upload README.md with huggingface_hub
README.md
CHANGED
@@ -13,14 +13,12 @@ Standard Speculative Decoding enforces strict distributional equivalence to the
 **Key Benefits:**
 - **Tunable Quality-Speed Tradeoff**: Adjust the `fsd_threshold` parameter to control the balance between generation quality and inference speed
 - **Significant Speed Improvements**: Achieve runtime improvements of over 5 tokens per second faster than standard SD with only an approximate 2% absolute reduction in benchmark accuracy
-- **Maintained Performance**: In many cases, FSD
+- **Maintained Performance**: In many cases, FSD can match standard SD benchmark accuracy while running over 2 tokens per second faster
 - **Flexible Acceptance Criteria**: Choose from multiple divergence metrics (KL divergence, Jensen-Shannon divergence, or draft token probabilities) to best suit your use case
 
-This implementation extends the standard speculative decoding algorithm with additional divergence metrics for more flexible candidate acceptance, supporting KL divergence, Jensen-Shannon divergence, and draft token-based acceptance criteria.
-
 This implementation is based on the paper **"Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff"** (ACL Findings 2025). See the [Citation](#citation) section below for full citation details.
 
-##
+## How it works
 
 - **Fuzzy Speculative Decoding (FSD)**: Accepts candidate tokens based on distribution divergence thresholds
 - **Multiple Divergence Types**:
@@ -74,10 +72,10 @@ print(generated_text)
 
 ### How It Works
 
-1. The assistant model generates candidate tokens
-2. The target model evaluates these candidates
+1. The assistant model generates candidate tokens (just like standard SD)
+2. The target model evaluates these candidates, generating distributions for all draft tokens
 3. For each candidate position:
-- If FSD divergence
+- If FSD divergence between the target and draft model distributions is less than the fsd_threshold: token is accepted
 - Otherwise: standard speculative decoding acceptance is applied
 4. Accepted tokens are kept, rejected tokens trigger resampling from the target model
 
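
The acceptance rule described in the "How It Works" steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names (`kl_divergence`, `js_divergence`, `accept_token`, `verify_draft`) are hypothetical, distributions are plain lists of probabilities, and rejected positions are resampled from the normalized residual `max(0, p - q)` as in standard speculative decoding, whereas the README only says "resampling from the target model". Only the KL and Jensen-Shannon criteria are shown, and the bonus token that standard SD samples when every draft token is accepted is omitted.

```python
import math
import random


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def accept_token(target_probs, draft_probs, token, fsd_threshold,
                 divergence=js_divergence, rng=random):
    """Decide whether one draft token is kept (hypothetical helper)."""
    # Fuzzy path: the two distributions are close enough, so accept outright.
    if divergence(target_probs, draft_probs) < fsd_threshold:
        return True
    # Fallback: standard SD acceptance with probability min(1, p(x) / q(x)).
    ratio = target_probs[token] / max(draft_probs[token], 1e-12)
    return rng.random() < min(1.0, ratio)


def verify_draft(draft_tokens, draft_dists, target_dists, fsd_threshold, rng=random):
    """Walk the draft left to right; stop at the first rejection and resample."""
    kept = []
    for token, q, p in zip(draft_tokens, draft_dists, target_dists):
        if accept_token(p, q, token, fsd_threshold, rng=rng):
            kept.append(token)
            continue
        # Assumption: resample from the residual max(0, p - q), as in standard SD.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        weights = residual if sum(residual) > 0 else p
        kept.append(rng.choices(range(len(p)), weights=weights)[0])
        break
    return kept
```

With a larger `fsd_threshold`, more positions take the fuzzy path and more draft tokens survive, which is where the speedup comes from; with a threshold of 0 the fuzzy path never fires and the sketch reduces to standard SD acceptance.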