Dssd_Demo / README.md
valcore's picture
Update README.md
1e1f0a5 verified

A newer version of the Gradio SDK is available: 6.5.1

Upgrade
metadata
title: DSSD Demo
emoji: πŸš€
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
license: apache-2.0

πŸš€ Dynamic Self-Speculative Decoding (DSSD) Demo

This demo showcases early exit inference with true speculative decoding. Tokens are generated from intermediate layers when the model is confident, resulting in faster generation while guaranteeing output identical to the full model.

Features

  • Speculative Decoding: Uses early exit heads to draft tokens, then verifies them with the full model.
  • Streaming Output: Watch the generation process live, including drafting and verification statuses.
  • Model Comparison: Compare performance and output between DSSD and the full model side-by-side.
  • Color-coded Visualization: Each token is colored based on which head/layer generated it.

How it works

  1. Draft Phase: The model tries to predict the next token(s) using early exit heads placed at intermediate layers.
  2. Verification Phase: The full model checks the drafted tokens in a single forward pass.
  3. Acceptance: Matching tokens are kept. The first mismatch is corrected, and the process restarts.

Models

  • Llama 3 8B: Using 3 auxiliary heads at layers 8, 16, and 24.
  • Qwen 3 0.6B: Using 4 auxiliary heads at layers 5, 11, 16, and 22.

Quick Start (Local)

pip install -r requirements.txt
python app.py