maxholsman/fuzzy-spec-dec


maxholsman committed 14 days ago

Commit 0cef019 (verified) · 1 parent: dc6e82a

Upload README.md with huggingface_hub


Files changed (1)

README.md +5 -7


@@ -13,14 +13,12 @@ Standard Speculative Decoding enforces strict distributional equivalence to the
 **Key Benefits:**
 - **Tunable Quality-Speed Tradeoff**: Adjust the `fsd_threshold` parameter to control the balance between generation quality and inference speed
 - **Significant Speed Improvements**: Achieve runtime improvements of over 5 tokens per second faster than standard SD with only an approximate 2% absolute reduction in benchmark accuracy
-- **Maintained Performance**: In many cases, FSD matches standard SD benchmark accuracy while running over 2 tokens per second faster
 - **Flexible Acceptance Criteria**: Choose from multiple divergence metrics (KL divergence, Jensen-Shannon divergence, or draft token probabilities) to best suit your use case
-This implementation extends the standard speculative decoding algorithm with additional divergence metrics for more flexible candidate acceptance, supporting KL divergence, Jensen-Shannon divergence, and draft token-based acceptance criteria.
 This implementation is based on the paper **"Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff"** (ACL Findings 2025). See the [Citation](#citation) section below for full citation details.
-## Features
 - **Fuzzy Speculative Decoding (FSD)**: Accepts candidate tokens based on distribution divergence thresholds
 - **Multiple Divergence Types**:
@@ -74,10 +72,10 @@ print(generated_text)
 ### How It Works
-1. The assistant model generates candidate tokens
-2. The target model evaluates these candidates
 3. For each candidate position:
-   - If FSD divergence ≤ threshold: token is accepted
    - Otherwise: standard speculative decoding acceptance is applied
 4. Accepted tokens are kept, rejected tokens trigger resampling from the target model

 **Key Benefits:**
 - **Tunable Quality-Speed Tradeoff**: Adjust the `fsd_threshold` parameter to control the balance between generation quality and inference speed
 - **Significant Speed Improvements**: Achieve runtime improvements of over 5 tokens per second faster than standard SD with only an approximate 2% absolute reduction in benchmark accuracy
+- **Maintained Performance**: In many cases, FSD can match standard SD benchmark accuracy while running over 2 tokens per second faster
 - **Flexible Acceptance Criteria**: Choose from multiple divergence metrics (KL divergence, Jensen-Shannon divergence, or draft token probabilities) to best suit your use case
 This implementation is based on the paper **"Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff"** (ACL Findings 2025). See the [Citation](#citation) section below for full citation details.
+## How it works
 - **Fuzzy Speculative Decoding (FSD)**: Accepts candidate tokens based on distribution divergence thresholds
 - **Multiple Divergence Types**:
 ### How It Works
+1. The assistant model generates candidate tokens (just like standard SD)
+2. The target model evaluates these candidates, generating distributions for all draft tokens
 3. For each candidate position:
+   - If the FSD divergence between the target and draft model distributions is less than the `fsd_threshold`: token is accepted
    - Otherwise: standard speculative decoding acceptance is applied
 4. Accepted tokens are kept, rejected tokens trigger resampling from the target model
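
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the helper names (`kl_divergence`, `js_divergence`, `fsd_verify`) are hypothetical, distributions are plain lists of probabilities, Jensen-Shannon divergence is used as the default metric, and the direction of the KL term (target relative to draft) is an assumption. The fallback branch is the usual speculative-sampling rule: accept with probability min(1, p/q), otherwise resample from the normalized residual max(p - q, 0).

```python
import math
import random


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def fsd_verify(draft_tokens, draft_dists, target_dists, fsd_threshold,
               divergence=js_divergence, rng=random):
    """Verify one block of candidate tokens (hypothetical helper).

    draft_tokens: token ids proposed by the assistant (draft) model
    draft_dists:  the draft model's distribution at each candidate position
    target_dists: the target model's distribution at each candidate position
    Returns the tokens kept for this block; stops at the first rejection.
    """
    kept = []
    for tok, q, p in zip(draft_tokens, draft_dists, target_dists):
        # Step 3a: fuzzy acceptance when the two distributions are close enough.
        if divergence(p, q) < fsd_threshold:
            kept.append(tok)
            continue
        # Step 3b: otherwise fall back to the standard SD acceptance test.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            kept.append(tok)
            continue
        # Step 4: on rejection, resample from the residual of the target
        # distribution and discard the remaining candidates.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        kept.append(rng.choices(range(len(p)), weights=residual)[0])
        break
    return kept


# Usage: two candidate positions over a three-token vocabulary.
draft = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
target = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
print(fsd_verify([0, 1], draft, target, fsd_threshold=0.05))
```

In this sketch, setting `fsd_threshold` to 0 means the fuzzy branch never fires and every position goes through the standard SD test, while raising it accepts more draft tokens outright. The sketch omits the extra token that standard SD samples from the target model when every candidate is accepted.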