A newer version of the Gradio SDK is available:
6.5.1
metadata
title: DSSD Demo
emoji: π
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
license: apache-2.0
π Dynamic Self-Speculative Decoding (DSSD) Demo
This demo showcases early exit inference with true speculative decoding. Tokens are generated from intermediate layers when the model is confident, resulting in faster generation while guaranteeing output identical to the full model.
Features
- Speculative Decoding: Uses early exit heads to draft tokens, then verifies them with the full model.
- Streaming Output: Watch the generation process live, including drafting and verification statuses.
- Model Comparison: Compare performance and output between DSSD and the full model side-by-side.
- Color-coded Visualization: Each token is colored based on which head/layer generated it.
How it works
- Draft Phase: The model tries to predict the next token(s) using early exit heads placed at intermediate layers.
- Verification Phase: The full model checks the drafted tokens in a single forward pass.
- Acceptance: Matching tokens are kept. The first mismatch is corrected, and the process restarts.
Models
- Llama 3 8B: Using 3 auxiliary heads at layers 8, 16, and 24.
- Qwen 3 0.6B: Using 4 auxiliary heads at layers 5, 11, 16, and 22.
Quick Start (Local)
pip install -r requirements.txt
python app.py