---
title: DSSD Demo
emoji: ๐
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
license: apache-2.0
---
# Dynamic Self-Speculative Decoding (DSSD) Demo
This demo showcases **early exit inference** with true speculative decoding.
Tokens are drafted from intermediate layers when the model is confident and then verified by the full model, which speeds up generation while **guaranteeing output identical to the full model**.
## Features
- **Speculative Decoding**: Uses early exit heads to draft tokens, then verifies them with the full model.
- **Streaming Output**: Watch the generation process live, including drafting and verification statuses.
- **Model Comparison**: Compare performance and output between DSSD and the full model side-by-side.
- **Color-coded Visualization**: Each token is colored based on which head/layer generated it.
## How it works
1. **Draft Phase**: The model tries to predict the next token(s) using early exit heads placed at intermediate layers.
2. **Verification Phase**: The full model checks the drafted tokens in a single forward pass.
3. **Acceptance**: Drafted tokens that match the full model's predictions are kept. The first mismatch is replaced with the full model's token, and drafting restarts from that position.
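
The three steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the code in `app.py`: `full_next` and `draft_next` are hypothetical stand-ins for the full model and an early exit head, both decode greedily, and the confidence check that decides when to exit early is omitted. A real implementation would also verify all drafted tokens in one batched forward pass rather than one call per position.

```python
from typing import Callable, List, Sequence

# Hypothetical type: maps a token prefix to the next token (greedy decoding).
NextToken = Callable[[Sequence[int]], int]


def speculative_generate(
    full_next: NextToken,
    draft_next: NextToken,
    prompt: Sequence[int],
    max_new_tokens: int,
    draft_len: int = 4,
) -> List[int]:
    """Draft with a cheap predictor, verify with the full model, keep the matches."""
    tokens = list(prompt)
    target_len = len(tokens) + max_new_tokens
    while len(tokens) < target_len:
        # Draft phase: propose up to draft_len tokens with the cheap predictor.
        draft: List[int] = []
        for _ in range(min(draft_len, target_len - len(tokens))):
            draft.append(draft_next(tokens + draft))

        # Verification phase: compare each drafted token with the full model's
        # choice (called per position here; one forward pass in a real model).
        accepted: List[int] = []
        for tok in draft:
            expected = full_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)       # match: keep the drafted token
            else:
                accepted.append(expected)  # first mismatch: take the full model's token
                break                      # discard the rest of the draft

        tokens.extend(accepted)
    return tokens


def greedy_generate(full_next: NextToken, prompt: Sequence[int], max_new_tokens: int) -> List[int]:
    """Reference: plain greedy decoding with the full model only."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(full_next(tokens))
    return tokens


# Toy stand-ins (assumptions for illustration only).
def toy_full(seq: Sequence[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq: Sequence[int]) -> int:
    # Agrees with the full model except after token 4, where it guesses wrong.
    return 0 if seq[-1] == 4 else toy_full(seq)
```

Because every accepted token is either confirmed or supplied by the full model, the result matches plain greedy decoding in this sketch; the speedup in practice comes from verifying several drafted tokens per full-model pass.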
## Models
- **Llama 3 8B**: Using 3 auxiliary heads at layers 8, 16, and 24.
- **Qwen 3 0.6B**: Using 4 auxiliary heads at layers 5, 11, 16, and 22.
## Quick Start (Local)
```bash
pip install -r requirements.txt
python app.py
```