Update README with honest 3x averaged benchmark numbers
README.md CHANGED
@@ -1,7 +1,20 @@
+---
+license: apache-2.0
+language: en
+tags:
+- mnn
+- speculative-decoding
+- draft-model
+- qwen3
+- tokforge
+base_model: Qwen/Qwen3-0.6B
+pipeline_tag: text-generation
+---
+
 # TokForge Acceleration Pack — Qwen3 KL-Distilled Draft Model

 ## Overview
-KL-distilled Qwen3-0.6B draft model for speculative decoding in TokForge. Trained on 10,000 teacher samples from Qwen3-8B, achieving **+
+KL-distilled Qwen3-0.6B draft model for speculative decoding in TokForge. Trained on 10,000 teacher samples from Qwen3-8B, achieving **+24-54% decode speed** across SM8850 and SM8635 devices (3-run averaged, 500-token DNS prose).

 ## What This Is
 A small (0.6B) draft model that:
@@ -10,14 +23,15 @@ A small (0.6B) draft model that:
 - KL distillation matches the teacher's full logit distribution, not just top-1 tokens
 - Results in significantly higher acceptance rates than stock or abliterated drafts

-##
-
-
-|---|---|---|---|
-
-
-
-
+## Benchmark Results (3-run averaged, 500-token DNS prose)
+
+| Device | SoC | AR Baseline | With KL v3 Draft | Uplift |
+|---|---|---|---|---|
+| RedMagic 11 Pro | SM8850 | 11.45 ± 1.12 tok/s | 15.98 ± 0.65 tok/s | **+39.5%** |
+| Samsung S26 Ultra | SM8850 | 10.89 ± 0.30 tok/s | 13.51 ± 0.49 tok/s | **+24.0%** |
+| Xiaomi Pad 7 Pro | SM8635 | 6.02 ± 0.33 tok/s | 9.27 ± 0.27 tok/s | **+54.0%** |
+
+All numbers are 3-run averages on 500-token DNS prose with matched AR baselines.
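For concreteness, the table's reduction can be reproduced with three runs per condition, assuming the ± figures are sample standard deviations. The per-run values below are invented for illustration; only their mean/std match the Xiaomi Pad 7 Pro row:

```python
from statistics import mean, stdev

# Three decode-speed runs (tok/s) per condition. Per-run values are
# invented; their mean/std match the Xiaomi Pad 7 Pro row above.
baseline_runs = [5.69, 6.02, 6.35]   # AR baseline
draft_runs = [9.00, 9.27, 9.54]      # with the KL draft enabled

base_mu, base_sd = mean(baseline_runs), stdev(baseline_runs)
draft_mu, draft_sd = mean(draft_runs), stdev(draft_runs)
uplift_pct = (draft_mu / base_mu - 1.0) * 100.0

print(f"{base_mu:.2f} ± {base_sd:.2f} -> "
      f"{draft_mu:.2f} ± {draft_sd:.2f} tok/s (+{uplift_pct:.1f}%)")
# 6.02 ± 0.33 -> 9.27 ± 0.27 tok/s (+54.0%)
```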

 ## Why KL Distillation?

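The body of this section sits in the gap before the next hunk, but the objective it names is standard distillation with a forward KL on full-vocabulary soft targets. A minimal PyTorch sketch, assuming teacher logits are precomputed per position; the function name and temperature default are illustrative, not TokForge's training code:

```python
import torch
import torch.nn.functional as F

def kl_distill_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    T: float = 1.0) -> torch.Tensor:
    """Forward KL(teacher || student) over the full vocabulary.

    Matching the whole distribution, not just the teacher's argmax,
    is what pushes speculative-decoding acceptance rates up.
    Shapes: [batch, seq, vocab].
    """
    s_logp = F.log_softmax(student_logits / T, dim=-1).flatten(0, -2)
    t_prob = F.softmax(teacher_logits / T, dim=-1).flatten(0, -2)
    # kl_div wants log-probs as input and probs as target; batchmean
    # averages over the batch*seq rows after flattening.
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * (T * T)
```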
@@ -43,8 +57,9 @@ KL divergence loss teaches the draft model to match the teacher's probability distribution

 ## Optimal Draft Config

-The draft model performs best with `thread_num: 2` and `power: high`:
+Best results: target model on OpenCL, draft on CPU, draft depth d=3, thread_num=2, power=high.
+
+`config_cpu.json`:
 ```json
 {
   "backend_type": "cpu",
@@ -59,19 +74,19 @@ The draft model performs best with `thread_num: 2` and `power: high`:
 **Why thread_num: 2?** On Android's WALT CPU governor, using too many threads (4+) can cause the scheduler to spread work across efficiency cores at low frequency. 2 threads stay on performance cores at high clock speeds.

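`config_cpu.json` is cut off by the hunk above; as a rough sketch, a draft-side config consistent with thread_num=2 and power=high could be generated like this. Every key except `backend_type` is an assumption modeled on MNN-style runtime options, not the shipped file:

```python
import json

# Sketch of a draft-side CPU config matching the notes above.
# Only "backend_type" is visible in this diff; the remaining keys are
# assumptions in the spirit of MNN-style options.
draft_config = {
    "backend_type": "cpu",
    "thread_num": 2,   # two threads: stay on the performance cores
    "power": "high",   # request high-clock scheduling
}

with open("config_cpu.json", "w") as f:
    json.dump(draft_config, f, indent=2)
```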
 ## Compatible Target Models
-- Qwen3-4B (MNN)
-- Qwen3-8B (MNN) — primary test target
-- Qwen3-14B (MNN)

-
+- **Qwen3-8B**: +24-40% uplift
+- **Qwen3-14B**: +40-70% uplift
+- **Qwen3-4B**: Disabled (degenerates — KL trained from 8B teacher)
+- **Qwen3.5**: Not compatible (different architecture: LinearAttention vs full MHA)

 ## SoC Compatibility

 | SoC | GPU | Uplift | Notes |
 |---|---|---|---|
-| SM8850 (S26
+| SM8850 (RedMagic/S26) | Adreno 840 | **+24-40%** | Primary targets |
+| SM8635 (Xiaomi Pad 7 Pro) | Adreno 735 | **+54%** | Best relative uplift |
 | SM8650 (S24/Lenovo) | Adreno 750 | Testing | May regress on 9B |
-| SM8635 (Xiaomi) | Adreno 735 | Testing | Limited by RAM |

 ## Usage
 Download via TokForge app: Settings > Spec Decode > Download Acceleration Pack
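The headline uplifts ultimately depend on how many of the d=3 drafted tokens the target accepts per verification pass. A schematic of one such step, assuming greedy verification; both callables are invented stand-ins, not TokForge APIs:

```python
from typing import Callable, List

def speculative_step(draft_next: Callable[[List[int]], int],
                     target_next: Callable[[List[int]], List[int]],
                     ctx: List[int], d: int = 3) -> List[int]:
    """One speculative-decoding step at draft depth d (d=3 above).

    draft_next(seq) -> the draft model's greedy next token.
    target_next(seq) -> the target's greedy token at each of the last
    d+1 positions, computed in one batched forward pass.
    Greedy verification is assumed; TokForge's rule may differ.
    """
    # Draft proposes d tokens autoregressively (cheap 0.6B passes).
    proposal: List[int] = []
    for _ in range(d):
        proposal.append(draft_next(ctx + proposal))
    # One target pass verifies all d proposals at once.
    verdict = target_next(ctx + proposal)  # length d + 1
    out: List[int] = []
    for drafted, wanted in zip(proposal, verdict):
        if drafted != wanted:
            out.append(wanted)   # first mismatch: emit target's token, stop
            return out
        out.append(drafted)      # match: token accepted
    out.append(verdict[d])       # all d accepted: one bonus target token
    return out
```

Each target pass thus yields between 1 and d+1 tokens, which is where the decode-speed uplift comes from: the higher the draft's acceptance rate, the closer each pass gets to d+1.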
@@ -82,7 +97,7 @@ Download via TokForge app: Settings > Spec Decode > Download Acceleration Pack
 |---|---|---|---|---|
 | v1 (abliterated) | — | — | +20% | 2026-03-19 |
 | v2 (KL 1K) | 1,024 | 0.43 | +17% | 2026-03-21 |
-| **v3 (KL 10K)** | **10,000** | **0.339** | **+
+| **v3 (KL 10K)** | **10,000** | **0.339** | **+24-54%** | **2026-03-22** |

 ---