Update README with honest 3x averaged benchmark numbers
README.md CHANGED
@@ -1,7 +1,20 @@
+---
+license: apache-2.0
+language: en
+tags:
+- mnn
+- speculative-decoding
+- draft-model
+- qwen3
+- tokforge
+base_model: Qwen/Qwen3-0.6B
+pipeline_tag: text-generation
+---
+
 # TokForge Acceleration Pack — Qwen3 KL-Distilled Draft Model

 ## Overview
-KL-distilled Qwen3-0.6B draft model for speculative decoding in TokForge. Trained on 10,000 teacher samples from Qwen3-8B, achieving **+
+KL-distilled Qwen3-0.6B draft model for speculative decoding in TokForge. Trained on 10,000 teacher samples from Qwen3-8B, achieving **+24-54% decode speed** across SM8850 and SM8635 devices (3-run averaged, 500-token DNS prose).

 ## What This Is
 A small (0.6B) draft model that:
@@ -10,14 +23,15 @@ A small (0.6B) draft model that:
 - KL distillation matches the teacher's full logit distribution, not just top-1 tokens
 - Results in significantly higher acceptance rates than stock or abliterated drafts

-##
-
-
-|---|---|---|---|
-
-
-
-
+## Benchmark Results (3-run averaged, 500-token DNS prose)
+
+| Device | SoC | AR Baseline | With KL v3 Draft | Uplift |
+|---|---|---|---|---|
+| RedMagic 11 Pro | SM8850 | 11.45 ± 1.12 tok/s | 15.98 ± 0.65 tok/s | **+39.5%** |
+| Samsung S26 Ultra | SM8850 | 10.89 ± 0.30 tok/s | 13.51 ± 0.49 tok/s | **+24.0%** |
+| Xiaomi Pad 7 Pro | SM8635 | 6.02 ± 0.33 tok/s | 9.27 ± 0.27 tok/s | **+54.0%** |
+
+All numbers are 3-run averages on 500-token DNS prose with matched AR baselines.
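For concreteness, the table's reduction can be reproduced with three runs per condition, assuming the ± figures are sample standard deviations. The per-run values below are invented for illustration; only their mean/std match the Xiaomi Pad 7 Pro row:

```python
from statistics import mean, stdev

# Three decode-speed runs (tok/s) per condition. Per-run values are
# invented; their mean/std match the Xiaomi Pad 7 Pro row above.
baseline_runs = [5.69, 6.02, 6.35]   # AR baseline
draft_runs = [9.00, 9.27, 9.54]      # with the KL draft enabled

base_mu, base_sd = mean(baseline_runs), stdev(baseline_runs)
draft_mu, draft_sd = mean(draft_runs), stdev(draft_runs)
uplift_pct = (draft_mu / base_mu - 1.0) * 100.0

print(f"{base_mu:.2f} ± {base_sd:.2f} -> "
      f"{draft_mu:.2f} ± {draft_sd:.2f} tok/s (+{uplift_pct:.1f}%)")
# 6.02 ± 0.33 -> 9.27 ± 0.27 tok/s (+54.0%)
```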

 ## Why KL Distillation?

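The body of this section sits in the gap before the next hunk, but the objective it names is standard distillation with a forward KL on full-vocabulary soft targets. A minimal PyTorch sketch, assuming teacher logits are precomputed per position; the function name and temperature default are illustrative, not TokForge's training code:

```python
import torch
import torch.nn.functional as F

def kl_distill_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    T: float = 1.0) -> torch.Tensor:
    """Forward KL(teacher || student) over the full vocabulary.

    Matching the whole distribution, not just the teacher's argmax,
    is what pushes speculative-decoding acceptance rates up.
    Shapes: [batch, seq, vocab].
    """
    s_logp = F.log_softmax(student_logits / T, dim=-1).flatten(0, -2)
    t_prob = F.softmax(teacher_logits / T, dim=-1).flatten(0, -2)
    # kl_div wants log-probs as input and probs as target; batchmean
    # averages over the batch*seq rows after flattening.
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * (T * T)
```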
@@ -43,8 +57,9 @@ KL divergence loss teaches the draft model to match the teacher's probability distribution

 ## Optimal Draft Config

-The draft model performs best with `thread_num: 2` and `power: high`:
+Best results: target model on OpenCL, draft on CPU, draft depth d=3, thread_num=2, power=high.
+
+`config_cpu.json`:
 ```json
 {
   "backend_type": "cpu",
@@ -59,19 +74,19 @@ The draft model performs best with `thread_num: 2` and `power: high`:
 **Why thread_num: 2?** On Android's WALT CPU governor, using too many threads (4+) can cause the scheduler to spread work across efficiency cores at low frequency. 2 threads stay on performance cores at high clock speeds.

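`config_cpu.json` is cut off by the hunk above; as a rough sketch, a draft-side config consistent with thread_num=2 and power=high could be generated like this. Every key except `backend_type` is an assumption modeled on MNN-style runtime options, not the shipped file:

```python
import json

# Sketch of a draft-side CPU config matching the notes above.
# Only "backend_type" is visible in this diff; the remaining keys are
# assumptions in the spirit of MNN-style options.
draft_config = {
    "backend_type": "cpu",
    "thread_num": 2,   # two threads: stay on the performance cores
    "power": "high",   # request high-clock scheduling
}

with open("config_cpu.json", "w") as f:
    json.dump(draft_config, f, indent=2)
```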
 ## Compatible Target Models
-- Qwen3-4B (MNN)
-- Qwen3-8B (MNN) — primary test target
-- Qwen3-14B (MNN)

-
+- **Qwen3-8B**: +24-40% uplift
+- **Qwen3-14B**: +40-70% uplift
+- **Qwen3-4B**: Disabled (degenerates — KL trained from 8B teacher)
+- **Qwen3.5**: Not compatible (different architecture: LinearAttention vs full MHA)

 ## SoC Compatibility

 | SoC | GPU | Uplift | Notes |
 |---|---|---|---|
-| SM8850 (S26
+| SM8850 (RedMagic/S26) | Adreno 840 | **+24-40%** | Primary targets |
+| SM8635 (Xiaomi Pad 7 Pro) | Adreno 735 | **+54%** | Best relative uplift |
 | SM8650 (S24/Lenovo) | Adreno 750 | Testing | May regress on 9B |
-| SM8635 (Xiaomi) | Adreno 735 | Testing | Limited by RAM |

 ## Usage
 Download via TokForge app: Settings > Spec Decode > Download Acceleration Pack
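The headline uplifts ultimately depend on how many of the d=3 drafted tokens the target accepts per verification pass. A schematic of one such step, assuming greedy verification; both callables are invented stand-ins, not TokForge APIs:

```python
from typing import Callable, List

def speculative_step(draft_next: Callable[[List[int]], int],
                     target_next: Callable[[List[int]], List[int]],
                     ctx: List[int], d: int = 3) -> List[int]:
    """One speculative-decoding step at draft depth d (d=3 above).

    draft_next(seq) -> the draft model's greedy next token.
    target_next(seq) -> the target's greedy token at each of the last
    d+1 positions, computed in one batched forward pass.
    Greedy verification is assumed; TokForge's rule may differ.
    """
    # Draft proposes d tokens autoregressively (cheap 0.6B passes).
    proposal: List[int] = []
    for _ in range(d):
        proposal.append(draft_next(ctx + proposal))
    # One target pass verifies all d proposals at once.
    verdict = target_next(ctx + proposal)  # length d + 1
    out: List[int] = []
    for drafted, wanted in zip(proposal, verdict):
        if drafted != wanted:
            out.append(wanted)   # first mismatch: emit target's token, stop
            return out
        out.append(drafted)      # match: token accepted
    out.append(verdict[d])       # all d accepted: one bonus target token
    return out
```

Each target pass thus yields between 1 and d+1 tokens, which is where the decode-speed uplift comes from: the higher the draft's acceptance rate, the closer each pass gets to d+1.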
@@ -82,7 +97,7 @@ Download via TokForge app: Settings > Spec Decode > Download Acceleration Pack
 |---|---|---|---|---|
 | v1 (abliterated) | — | — | +20% | 2026-03-19 |
 | v2 (KL 1K) | 1,024 | 0.43 | +17% | 2026-03-21 |
-| **v3 (KL 10K)** | **10,000** | **0.339** | **+
+| **v3 (KL 10K)** | **10,000** | **0.339** | **+24-54%** | **2026-03-22** |

 ---