---
license: apache-2.0
language: en
tags:
  - mnn
  - speculative-decoding
  - draft-model
  - qwen3
  - tokforge
base_model: Qwen/Qwen3-0.6B
pipeline_tag: text-generation
---

# TokForge Acceleration Pack — Qwen3 KL-Distilled Draft Model

## Overview
A KL-distilled Qwen3-0.6B draft model for speculative decoding in TokForge, trained on 10,000 teacher-generated samples from Qwen3-8B. It delivers **+24-54% decode speed** across SM8850 and SM8635 devices (3-run averages, 500-token DNS prose).

## What This Is
A small (0.6B) draft model that:
- Runs on CPU alongside your main GPU model
- Proposes tokens ahead of the main model, which batch-verifies them and accepts the matching ones (sketched below)
- Was KL-distilled to match the teacher's full logit distribution, not just its top-1 tokens
- Reaches significantly higher acceptance rates than stock or abliterated drafts
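
To make the verify-and-accept step concrete, here is a minimal greedy speculative-decoding sketch in Python. The callables `draft_next` and `target_logits` are hypothetical stand-ins for the two models (TokForge/MNN implement this loop natively), so this only illustrates the mechanism:

```python
def speculative_step(prompt_ids, draft_next, target_logits, d=3):
    """One round of greedy speculative decoding (illustrative sketch).

    draft_next(ids)    -> int           # draft's greedy next token (hypothetical)
    target_logits(ids) -> list of vecs  # target logits per position (hypothetical)
    """
    # 1) Draft proposes d tokens autoregressively (cheap, runs on CPU).
    ctx = list(prompt_ids)
    proposed = []
    for _ in range(d):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) Target scores the prompt plus all d proposals in ONE batched
    #    forward pass instead of d sequential decode steps.
    logits = target_logits(ctx)

    # 3) Accept the longest prefix where the draft's token equals the
    #    target's greedy choice; patch the first mismatch with the
    #    target's own token and stop.
    out = []
    for i, tok in enumerate(proposed):
        target_tok = int(logits[len(prompt_ids) + i - 1].argmax())
        if tok == target_tok:
            out.append(tok)
        else:
            out.append(target_tok)
            break
    else:
        # All d proposals accepted: the target's final logits yield a bonus token.
        out.append(int(logits[-1].argmax()))
    return out  # 1 to d+1 tokens per target forward pass
```

The draft only pays off when its greedy pick frequently matches the target's; that acceptance rate is exactly what the KL training described below is designed to raise.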

## Benchmark Results (3-run averaged, 500-token DNS prose)

| Device | SoC | AR Baseline | With KL v3 Draft | Uplift |
|--------|-----|------------|-----------------|--------|
| RedMagic 11 Pro | SM8850 | 11.45 ± 1.12 tok/s | 15.98 ± 0.65 tok/s | **+39.5%** |
| Samsung S26 Ultra | SM8850 | 10.89 ± 0.30 tok/s | 13.51 ± 0.49 tok/s | **+24.0%** |
| Xiaomi Pad 7 Pro | SM8635 | 6.02 ± 0.33 tok/s | 9.27 ± 0.27 tok/s | **+54.0%** |

Baselines are matched autoregressive (AR) runs captured on the same device, build, and prompt.

## Why KL Distillation?

| Approach | How it trains | Result |
|---|---|---|
| Stock (no training) | Uses base weights | Low acceptance, often regresses |
| Abliterated | Removes refusal behavior | +5.5% acceptance, still limited |
| SFT (supervised) | Trains on hard labels (top-1 token) | Draft learns to copy text, not predict |
| **KL Distillation** | **Trains on full logit distribution** | **Draft learns WHICH tokens are likely** |

KL divergence loss teaches the draft model to match the teacher's probability distribution across ALL tokens, not just the most likely one. This is critical because MNN's greedy sampler needs the draft's top-1 to match the teacher's top-1 — and KL training optimizes exactly for this.
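
The training code is not part of this repo, but the loss described here (80% KL at temperature 1.5, 20% hard-label cross-entropy; see Training Details below) has a standard PyTorch form. A minimal sketch under that assumption:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=1.5, kl_weight=0.8):
    """Blended distillation loss: kl_weight * KL(teacher || student at temp T)
    + (1 - kl_weight) * cross-entropy on hard labels.
    logits: [batch, seq, vocab], labels: [batch, seq]."""
    vocab = student_logits.size(-1)
    # Soften both distributions with the distillation temperature.
    s_logp = F.log_softmax(student_logits / T, dim=-1).view(-1, vocab)
    t_prob = F.softmax(teacher_logits / T, dim=-1).view(-1, vocab)
    # T^2 rescaling keeps the KL gradient magnitude comparable to the CE term.
    kl = F.kl_div(s_logp, t_prob, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1))
    return kl_weight * kl + (1.0 - kl_weight) * ce
```

Raising the temperature above 1 spreads probability mass over the teacher's plausible-but-not-top tokens, giving the student a richer signal than hard labels alone.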

## Training Details

- **Teacher**: Qwen3-8B-HF (base)
- **Student**: Qwen3-0.6B-HF (base, LoRA r=16, alpha=32)
- **Data**: 10,000 teacher-generated samples (prose, code, Q&A)
- **Loss**: 80% KL divergence + 20% cross-entropy, temperature=1.5
- **Training**: 3 epochs, batch=4, grad_accum=4 (1,875 optimizer steps)
- **Final KL**: 0.339 (21% lower than v2's 0.43, which was trained on 1K samples)
- **Hardware**: 2x NVIDIA RTX PRO 6000 Blackwell (teacher GPU 0, student GPU 1)
- **Export**: MNN Q4 quantization (quant_bit=4, quant_block=128)
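
As a rough reconstruction of the student-side setup (the actual training script is not published here), the LoRA numbers above map onto `peft` like this; the `target_modules` list is an assumption, not something stated in this card:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Student from the table above; LoRA rank/alpha as reported (r=16, alpha=32).
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
student = get_peft_model(student, lora_cfg)
student.print_trainable_parameters()  # only the adapter weights train
```

With r=16 only a small fraction of the 0.6B parameters receives gradients, which keeps the distillation run lightweight relative to full fine-tuning.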

## Optimal Draft Config

Best results: target model on OpenCL, draft on CPU, draft length `d=3`, `thread_num=2`, `power=high`.

`config_cpu.json`:
```json
{
    "backend_type": "cpu",
    "thread_num": 2,
    "precision": "low",
    "memory": "low",
    "sampler_type": "greedy",
    "power": "high"
}
```

**Why `thread_num: 2`?** Under Android's WALT-based CPU scheduler, using too many threads (4+) can cause work to spread across efficiency cores at low frequency. Two threads stay on the performance cores at high clock speeds.

## Compatible Target Models

- **Qwen3-8B**: +24-40% uplift
- **Qwen3-14B**: +40-70% uplift
- **Qwen3-4B**: Disabled (output degenerates; the draft was KL-trained against the 8B teacher's distribution)
- **Qwen3.5**: Not compatible (different architecture: LinearAttention vs. full MHA)

## SoC Compatibility

| SoC | GPU | Uplift | Notes |
|---|---|---|---|
| SM8850 (RedMagic/S26) | Adreno 840 | **+24-40%** | Primary targets |
| SM8635 (Xiaomi Pad 7 Pro) | Adreno 735 | **+54%** | Best relative uplift |
| SM8650 (S24/Lenovo) | Adreno 750 | Testing | May regress on 9B |

## Usage
Download via the TokForge app: Settings > Spec Decode > Download Acceleration Pack

### Typical TokForge recipe

- target backend: `opencl`
- draft backend: `cpu`
- draft threads: `2`
- `d=3`
- greedy sampling
- low precision / low memory

## Version History

| Version | Samples | KL | Uplift | Date |
|---|---|---|---|---|
| v1 (abliterated) | — | — | +20% | 2026-03-19 |
| v2 (KL 1K) | 1,024 | 0.43 | +17% | 2026-03-21 |
| **v3 (KL 10K)** | **10,000** | **0.339** | **+24-54%** | **2026-03-22** |

---

**License:** Apache 2.0
**Source:** KL-distilled from [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) using [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) as teacher
**Built with:** [TokForge](https://tokforge.ai)

## Limitations and Intended Use

- Evidence is strongest for `Qwen3-8B` and `Qwen3-14B`.
- `Qwen3-4B` did not perform well in our recorded results and is disabled as a target.
- Device uplift depends heavily on backend routing, prompt length, and thermal state.
- This is a TokForge-specific runtime bundle, not a standard Transformers checkpoint.

## Community

- Website: [tokforge.ai](https://tokforge.ai)
- Discord: [Join the Discord](https://discord.gg/Acv3CBtfVm)
- Collection: [TokForge Mobile Draft Models](https://huggingface.co/collections/darkmaniac7/tokforge-mobile-draft-models-69c36153ea7084ce78329665)