---
title: DSSD Demo
emoji: 🚀
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
license: apache-2.0
---

# 🚀 Dynamic Self-Speculative Decoding (DSSD) Demo

This demo showcases **early exit inference** with true speculative decoding.
Tokens are drafted from intermediate layers when the model is confident and then verified by the full model, which speeds up generation while **guaranteeing output identical to the full model**.
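
As a rough illustration of "exit when the model is confident", the sketch below picks the earliest head whose top softmax probability reaches a threshold. The threshold rule, the `pick_exit` name, and the toy logits are assumptions made for illustration; the demo's actual confidence criterion may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_exit(head_logits, threshold=0.9):
    """Return (layer, token_id) for the first head whose top probability
    reaches `threshold`, or (None, None) if no head is confident enough.

    `head_logits` is a list of (layer, logits) pairs ordered from the
    shallowest head to the deepest. Hypothetical helper, not the demo's API.
    """
    for layer, logits in head_logits:
        probs = softmax(logits)
        token_id = max(range(len(probs)), key=probs.__getitem__)
        if probs[token_id] >= threshold:
            return layer, token_id
    return None, None


# Toy logits over a 3-token vocabulary for heads at layers 8, 16, and 24.
heads = [
    (8, [1.0, 1.2, 0.9]),   # nearly uniform: not confident
    (16, [0.5, 6.0, 0.2]),  # strongly peaked on token 1
    (24, [0.1, 9.0, 0.0]),
]
layer, token_id = pick_exit(heads)
print(layer, token_id)
```

When no head clears the threshold, a caller would fall back to the full model for that token.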

## Features
- **Speculative Decoding**: Uses early exit heads to draft tokens, then verifies them with the full model.
- **Streaming Output**: Watch the generation process live, including drafting and verification statuses.
- **Model Comparison**: Compare performance and output between DSSD and the full model side-by-side.
- **Color-coded Visualization**: Each token is colored based on which head/layer generated it.

## How it works
1. **Draft Phase**: The model tries to predict the next token(s) using early exit heads placed at intermediate layers.
2. **Verification Phase**: The full model checks the drafted tokens in a single forward pass.
3. **Acceptance**: Matching tokens are kept. The first mismatch is corrected, and the process restarts.
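
The three steps above can be sketched as a greedy draft-and-verify loop. This is a minimal toy, not the demo's implementation: `draft_next` and `full_next` are hypothetical callables that map a token list to the next token, `draft_len` is an assumed parameter, and verification is done one position at a time, whereas a real model would score all drafted positions in a single forward pass.

```python
def greedy_generate(full_next, prompt, max_new_tokens):
    """Baseline: decode every token with the full model."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(full_next(tokens))
    return tokens


def speculative_generate(draft_next, full_next, prompt, max_new_tokens, draft_len=4):
    """Greedy speculative decoding; returns (tokens, source per new token)."""
    tokens = list(prompt)
    source = []  # "draft" if the drafted token was accepted, else "full"
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # Draft phase: propose up to draft_len tokens with the cheap predictor.
        k = min(draft_len, target_len - len(tokens))
        drafted = []
        for _ in range(k):
            drafted.append(draft_next(tokens + drafted))
        # Verification and acceptance: keep matching tokens; on the first
        # mismatch, take the full model's token and restart drafting.
        for tok in drafted:
            expected = full_next(tokens)
            tokens.append(expected)
            if tok == expected:
                source.append("draft")
            else:
                source.append("full")
                break
    return tokens, source


# Toy "models": the full model counts upward modulo 10; the draft model
# agrees except when the last token is 3 or 7.
def full_next(seq):
    return (seq[-1] + 1) % 10


def draft_next(seq):
    return 0 if seq[-1] % 4 == 3 else full_next(seq)


tokens, source = speculative_generate(draft_next, full_next, [0], 8)
print(tokens)
print(source)
```

Because every emitted token is either confirmed or replaced by the full model's greedy choice, the result equals `greedy_generate(full_next, [0], 8)`; the `source` list is the kind of per-token information a color-coded view could use.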

## Models
- **Llama 3 8B**: Using 3 auxiliary heads at layers 8, 16, and 24.
- **Qwen 3 0.6B**: Using 4 auxiliary heads at layers 5, 11, 16, and 22.

## Quick Start (Local)

```bash
pip install -r requirements.txt
python app.py
```