darkmaniac7 committed
Commit 2d9453c · verified · 1 Parent(s): 5531dd9

Update README with honest 3x averaged benchmark numbers

  1. README.md +31 -16
README.md CHANGED
@@ -1,7 +1,20 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  # TokForge Acceleration Pack — Qwen3 KL-Distilled Draft Model
2
 
3
  ## Overview
4
- KL-distilled Qwen3-0.6B draft model for speculative decoding in TokForge. Trained on 10,000 teacher samples from Qwen3-8B, achieving **+41% decode speed** on SM8850 (S26 Ultra).
5
 
6
  ## What This Is
7
  A small (0.6B) draft model that:
@@ -10,14 +23,15 @@ A small (0.6B) draft model that:
10
  - KL distillation matches the teacher's full logit distribution, not just top-1 tokens
11
  - Results in significantly higher acceptance rates than stock or abliterated drafts
12
 
13
- ## Performance (Samsung S26 Ultra, SM8850)
14
 
15
- | Draft Model | Qwen3-8B tok/s | vs AR Baseline | Stability |
16
- |---|---|---|---|
17
- | No draft (AR) | 11.57 | — | Stable |
18
- | Stock 0.6B | 8.78 | -24% | Unstable |
19
- | KL v1 (1K samples) | 13.50 | +17% | Stable |
20
- | **KL v2 (10K samples)** | **16.37** | **+41%** | **Very stable** |
 
21
 
22
  ## Why KL Distillation?
23
 
@@ -43,8 +57,9 @@ KL divergence loss teaches the draft model to match the teacher's probability di
43
 
44
  ## Optimal Draft Config
45
 
46
- The draft model performs best with `thread_num: 2` and `power: high`:
47
 
 
48
  ```json
49
  {
50
  "backend_type": "cpu",
@@ -59,19 +74,19 @@ The draft model performs best with `thread_num: 2` and `power: high`:
59
  **Why thread_num: 2?** On Android's WALT CPU governor, using too many threads (4+) can cause the scheduler to spread work across efficiency cores at low frequency. 2 threads stay on performance cores at high clock speeds.
60
 
61
  ## Compatible Target Models
62
- - Qwen3-4B (MNN)
63
- - Qwen3-8B (MNN) — primary test target
64
- - Qwen3-14B (MNN)
65
 
66
- **NOT compatible** with Qwen3.5 models (different architecture: LinearAttention vs full MHA).
 
 
 
67
 
68
  ## SoC Compatibility
69
 
70
  | SoC | GPU | Uplift | Notes |
71
  |---|---|---|---|
72
- | SM8850 (S26 Ultra) | Adreno 840 | **+41%** | Primary target |
 
73
  | SM8650 (S24/Lenovo) | Adreno 750 | Testing | May regress on 9B |
74
- | SM8635 (Xiaomi) | Adreno 735 | Testing | Limited by RAM |
75
 
76
  ## Usage
77
  Download via TokForge app: Settings > Spec Decode > Download Acceleration Pack
@@ -82,7 +97,7 @@ Download via TokForge app: Settings > Spec Decode > Download Acceleration Pack
82
  |---|---|---|---|---|
83
  | v1 (abliterated) | — | — | +20% | 2026-03-19 |
84
  | v2 (KL 1K) | 1,024 | 0.43 | +17% | 2026-03-21 |
85
- | **v3 (KL 10K)** | **10,000** | **0.339** | **+41%** | **2026-03-21** |
86
 
87
  ---
88
 
 
1
+ ---
2
+ license: apache-2.0
3
+ language: en
4
+ tags:
5
+ - mnn
6
+ - speculative-decoding
7
+ - draft-model
8
+ - qwen3
9
+ - tokforge
10
+ base_model: Qwen/Qwen3-0.6B
11
+ pipeline_tag: text-generation
12
+ ---
13
+
14
  # TokForge Acceleration Pack — Qwen3 KL-Distilled Draft Model
15
 
16
  ## Overview
17
+ KL-distilled Qwen3-0.6B draft model for speculative decoding in TokForge. Trained on 10,000 teacher samples from Qwen3-8B, achieving **+24-54% decode speed** across SM8850 and SM8635 devices (3-run averaged, 500-token DNS prose).
18
 
19
  ## What This Is
20
  A small (0.6B) draft model that:
 
23
  - KL distillation matches the teacher's full logit distribution, not just top-1 tokens
24
  - Results in significantly higher acceptance rates than stock or abliterated drafts
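The difference between matching the full distribution and matching only the top-1 token can be illustrated with a small sketch. It is not the training code used for this model. The KL direction (teacher relative to draft), the `temperature` parameter, and the toy logits are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, draft_logits, temperature=1.0):
    """KL(teacher || draft) over the whole vocabulary, in nats."""
    p = softmax(teacher_logits, temperature)
    q = softmax(draft_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def top1_agrees(teacher_logits, draft_logits):
    """True when both models rank the same token first."""
    return (teacher_logits.index(max(teacher_logits))
            == draft_logits.index(max(draft_logits)))

# Toy 4-token vocabulary (hypothetical values).
teacher = [2.0, 1.5, 0.1, -1.0]
draft_sharp = [5.0, 0.0, 0.0, 0.0]    # same top-1 token, ignores the runner-up
draft_close = [2.1, 1.4, 0.0, -0.9]   # tracks the teacher's whole distribution

kl_sharp = kl_divergence(teacher, draft_sharp)
kl_close = kl_divergence(teacher, draft_close)
```

Both drafts agree with the teacher on the top-1 token, yet `kl_sharp` is much larger than `kl_close`. A top-1-only objective cannot tell these two drafts apart, while a KL objective penalizes the first one.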
25
 
26
+ ## Benchmark Results (3-run averaged, 500-token DNS prose)
27
 
28
+ | Device | SoC | AR Baseline | With KL v2 Draft | Uplift |
29
+ |--------|-----|------------|-----------------|--------|
30
+ | RedMagic 11 Pro | SM8850 | 11.45 +/- 1.12 tok/s | 15.98 +/- 0.65 tok/s | **+39.5%** |
31
+ | Samsung S26 Ultra | SM8850 | 10.89 +/- 0.30 tok/s | 13.51 +/- 0.49 tok/s | **+24.0%** |
32
+ | Xiaomi Pad 7 Pro | SM8635 | 6.02 +/- 0.33 tok/s | 9.27 +/- 0.27 tok/s | **+54.0%** |
33
+
34
+ All numbers are 3-run averages on 500-token DNS prose with matched AR baselines.
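As a sanity check, the uplift column can be recomputed from the two throughput columns. The sketch below assumes uplift is defined as `(with_draft / baseline - 1) * 100`, which the README does not state explicitly. It uses only the rounded means from the table, so small differences from the reported percentages are expected.

```python
# (AR baseline tok/s, with KL draft tok/s, reported uplift %) from the table above.
results = {
    "RedMagic 11 Pro": (11.45, 15.98, 39.5),
    "Samsung S26 Ultra": (10.89, 13.51, 24.0),
    "Xiaomi Pad 7 Pro": (6.02, 9.27, 54.0),
}

def uplift_percent(baseline, with_draft):
    """Relative decode-speed gain over the autoregressive baseline, in percent."""
    return (with_draft / baseline - 1.0) * 100.0

# Recomputed uplift per device, keyed by device name.
recomputed = {
    device: uplift_percent(baseline, with_draft)
    for device, (baseline, with_draft, _reported) in results.items()
}

for device, (_, _, reported) in results.items():
    print(f"{device}: recomputed {recomputed[device]:+.1f}%, reported {reported:+.1f}%")
```

Under that assumed definition, the recomputed values land within about 0.1 percentage point of the reported ones, which is consistent with the reported figures having been derived from unrounded means.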
35
 
36
  ## Why KL Distillation?
37
 
 
57
 
58
  ## Optimal Draft Config
59
 
60
+ Best with: target model on OpenCL, draft model on CPU, d=3, thread_num=2, power=high.
61
 
62
+ `config_cpu.json`:
63
  ```json
64
  {
65
  "backend_type": "cpu",
 
74
  **Why thread_num: 2?** On Android's WALT CPU governor, using too many threads (4+) can cause the scheduler to spread work across efficiency cores at low frequency. 2 threads stay on performance cores at high clock speeds.
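One way to apply these settings to an existing config file is sketched below. The helper name `apply_draft_overrides` and the sample input are hypothetical; only the `backend_type`, `thread_num`, and `power` keys come from this README, and any other keys in the file are passed through unchanged.

```python
import json

# Settings recommended in this README for the draft model.
DRAFT_OVERRIDES = {"backend_type": "cpu", "thread_num": 2, "power": "high"}

def apply_draft_overrides(config_text, overrides=None):
    """Return config JSON text with the draft overrides applied.

    Keys that are not overridden are kept exactly as they were.
    """
    config = json.loads(config_text)
    config.update(DRAFT_OVERRIDES if overrides is None else overrides)
    return json.dumps(config, indent=2)

# Hypothetical starting config; "other_setting" is a placeholder key.
example = '{"backend_type": "opencl", "thread_num": 4, "other_setting": "kept"}'
updated = apply_draft_overrides(example)
print(updated)
```

Reading and rewriting the file through `json` rather than editing it by hand keeps the result valid JSON and avoids accidentally dropping settings that the truncated snippet above does not show.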
75
 
76
  ## Compatible Target Models
 
 
 
77
 
78
+ - **Qwen3-8B**: +24-40% uplift
79
+ - **Qwen3-14B**: +40-70% uplift
80
+ - **Qwen3-4B**: Disabled (degenerates — KL trained from 8B teacher)
81
+ - **Qwen3.5**: Not compatible (different architecture: LinearAttention vs full MHA)
82
 
83
  ## SoC Compatibility
84
 
85
  | SoC | GPU | Uplift | Notes |
86
  |---|---|---|---|
87
+ | SM8850 (RedMagic/S26) | Adreno 840 | **+24-40%** | Primary targets |
88
+ | SM8635 (Xiaomi Pad 7 Pro) | Adreno 735 | **+54%** | Best relative uplift |
89
  | SM8650 (S24/Lenovo) | Adreno 750 | Testing | May regress on 9B |
 
90
 
91
  ## Usage
92
  Download via TokForge app: Settings > Spec Decode > Download Acceleration Pack
 
97
  |---|---|---|---|---|
98
  | v1 (abliterated) | — | — | +20% | 2026-03-19 |
99
  | v2 (KL 1K) | 1,024 | 0.43 | +17% | 2026-03-21 |
100
+ | **v3 (KL 10K)** | **10,000** | **0.339** | **+24-54%** | **2026-03-22** |
101
 
102
  ---
103