---
license: apache-2.0
language: en
tags:
  - mnn
  - speculative-decoding
  - draft-model
  - qwen3
  - tokforge
base_model: Qwen/Qwen3-0.6B
pipeline_tag: text-generation
---

# TokForge Acceleration Pack — Qwen3 KL-Distilled Draft Model

## Overview
A KL-distilled Qwen3-0.6B draft model for speculative decoding in TokForge, trained on 10,000 teacher-generated samples from Qwen3-8B. It delivers **+24-54% decode speed** across SM8850 and SM8635 devices (3-run averages, 500-token DNS prose).

## What This Is
A small (0.6B) draft model that:
- Runs on CPU alongside your main GPU model
- Proposes tokens ahead of the main model, which batch-verifies them and accepts the matching ones (sketched below)
- Was KL-distilled to match the teacher's full logit distribution, not just its top-1 tokens
- Reaches significantly higher acceptance rates than stock or abliterated drafts
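
To make the verify-and-accept step concrete, here is a minimal greedy speculative-decoding sketch in Python. The callables `draft_next` and `target_logits` are hypothetical stand-ins for the two models (TokForge/MNN implement this loop natively), so this only illustrates the mechanism:

```python
def speculative_step(prompt_ids, draft_next, target_logits, d=3):
    """One round of greedy speculative decoding (illustrative sketch).

    draft_next(ids)    -> int           # draft's greedy next token (hypothetical)
    target_logits(ids) -> list of vecs  # target logits per position (hypothetical)
    """
    # 1) Draft proposes d tokens autoregressively (cheap, runs on CPU).
    ctx = list(prompt_ids)
    proposed = []
    for _ in range(d):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) Target scores the prompt plus all d proposals in ONE batched
    #    forward pass instead of d sequential decode steps.
    logits = target_logits(ctx)

    # 3) Accept the longest prefix where the draft's token equals the
    #    target's greedy choice; patch the first mismatch with the
    #    target's own token and stop.
    out = []
    for i, tok in enumerate(proposed):
        target_tok = int(logits[len(prompt_ids) + i - 1].argmax())
        if tok == target_tok:
            out.append(tok)
        else:
            out.append(target_tok)
            break
    else:
        # All d proposals accepted: the target's final logits yield a bonus token.
        out.append(int(logits[-1].argmax()))
    return out  # 1 to d+1 tokens per target forward pass
```

The draft only pays off when its greedy pick frequently matches the target's; that acceptance rate is exactly what the KL training described below is designed to raise.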

## Benchmark Results (3-run averaged, 500-token DNS prose)

| Device | SoC | AR Baseline | With KL v3 Draft | Uplift |
|--------|-----|------------|-----------------|--------|
| RedMagic 11 Pro | SM8850 | 11.45 ± 1.12 tok/s | 15.98 ± 0.65 tok/s | **+39.5%** |
| Samsung S26 Ultra | SM8850 | 10.89 ± 0.30 tok/s | 13.51 ± 0.49 tok/s | **+24.0%** |
| Xiaomi Pad 7 Pro | SM8635 | 6.02 ± 0.33 tok/s | 9.27 ± 0.27 tok/s | **+54.0%** |

Baselines are matched autoregressive (AR) runs captured on the same device, build, and prompt.

## Why KL Distillation?

| Approach | How it trains | Result |
|---|---|---|
| Stock (no training) | Uses base weights | Low acceptance, often regresses |
| Abliterated | Removes refusal behavior | +5.5% acceptance, still limited |
| SFT (supervised) | Trains on hard labels (top-1 token) | Draft learns to copy text, not predict |
| **KL Distillation** | **Trains on full logit distribution** | **Draft learns WHICH tokens are likely** |

KL divergence loss teaches the draft model to match the teacher's probability distribution across ALL tokens, not just the most likely one. This is critical because MNN's greedy sampler needs the draft's top-1 to match the teacher's top-1 — and KL training optimizes exactly for this.
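
The training code is not part of this repo, but the loss described here (80% KL at temperature 1.5, 20% hard-label cross-entropy; see Training Details below) has a standard PyTorch form. A minimal sketch under that assumption:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=1.5, kl_weight=0.8):
    """Blended distillation loss: kl_weight * KL(teacher || student at temp T)
    + (1 - kl_weight) * cross-entropy on hard labels.
    logits: [batch, seq, vocab], labels: [batch, seq]."""
    vocab = student_logits.size(-1)
    # Soften both distributions with the distillation temperature.
    s_logp = F.log_softmax(student_logits / T, dim=-1).view(-1, vocab)
    t_prob = F.softmax(teacher_logits / T, dim=-1).view(-1, vocab)
    # T^2 rescaling keeps the KL gradient magnitude comparable to the CE term.
    kl = F.kl_div(s_logp, t_prob, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1))
    return kl_weight * kl + (1.0 - kl_weight) * ce
```

Raising the temperature above 1 spreads probability mass over the teacher's plausible-but-not-top tokens, giving the student a richer signal than hard labels alone.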

## Training Details

- **Teacher**: Qwen3-8B-HF (base)
- **Student**: Qwen3-0.6B-HF (base, LoRA r=16, alpha=32)
- **Data**: 10,000 teacher-generated samples (prose, code, Q&A)
- **Loss**: 80% KL divergence + 20% cross-entropy, temperature=1.5
- **Training**: 3 epochs, batch=4, grad_accum=4 (1,875 optimizer steps)
- **Final KL**: 0.339 (21% lower than v2's 0.43, which was trained on 1K samples)
- **Hardware**: 2x NVIDIA RTX PRO 6000 Blackwell (teacher GPU 0, student GPU 1)
- **Export**: MNN Q4 quantization (quant_bit=4, quant_block=128)
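
As a rough reconstruction of the student-side setup (the actual training script is not published here), the LoRA numbers above map onto `peft` like this; the `target_modules` list is an assumption, not something stated in this card:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Student from the table above; LoRA rank/alpha as reported (r=16, alpha=32).
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
student = get_peft_model(student, lora_cfg)
student.print_trainable_parameters()  # only the adapter weights train
```

With r=16 only a small fraction of the 0.6B parameters receives gradients, which keeps the distillation run lightweight relative to full fine-tuning.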

## Optimal Draft Config

Best results: target model on OpenCL, draft on CPU, draft length `d=3`, `thread_num=2`, `power=high`.

`config_cpu.json`:
```json
{
    "backend_type": "cpu",
    "thread_num": 2,
    "precision": "low",
    "memory": "low",
    "sampler_type": "greedy",
    "power": "high"
}
```

**Why `thread_num: 2`?** Under Android's WALT-based CPU scheduler, using too many threads (4+) can cause work to spread across efficiency cores at low frequency. Two threads stay on the performance cores at high clock speeds.

## Compatible Target Models

- **Qwen3-8B**: +24-40% uplift
- **Qwen3-14B**: +40-70% uplift
- **Qwen3-4B**: Disabled (output degenerates; the draft was KL-trained against the 8B teacher's distribution)
- **Qwen3.5**: Not compatible (different architecture: LinearAttention vs. full MHA)

## SoC Compatibility

| SoC | GPU | Uplift | Notes |
|---|---|---|---|
| SM8850 (RedMagic/S26) | Adreno 840 | **+24-40%** | Primary targets |
| SM8635 (Xiaomi Pad 7 Pro) | Adreno 735 | **+54%** | Best relative uplift |
| SM8650 (S24/Lenovo) | Adreno 750 | Testing | May regress on 9B |

## Usage
Download via the TokForge app: Settings > Spec Decode > Download Acceleration Pack

### Typical TokForge recipe

- target backend: `opencl`
- draft backend: `cpu`
- draft threads: `2`
- `d=3`
- greedy sampling
- low precision / low memory

## Version History

| Version | Samples | KL | Uplift | Date |
|---|---|---|---|---|
| v1 (abliterated) | — | — | +20% | 2026-03-19 |
| v2 (KL 1K) | 1,024 | 0.43 | +17% | 2026-03-21 |
| **v3 (KL 10K)** | **10,000** | **0.339** | **+24-54%** | **2026-03-22** |

---

**License:** Apache 2.0
**Source:** KL-distilled from [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) using [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) as teacher
**Built with:** [TokForge](https://tokforge.ai)

## Limitations and Intended Use

- Evidence is strongest for `Qwen3-8B` and `Qwen3-14B`.
- `Qwen3-4B` did not perform well in our recorded results and is disabled as a target.
- Device uplift depends heavily on backend routing, prompt length, and thermal state.
- This is a TokForge-specific runtime bundle, not a standard Transformers checkpoint.

## Community

- Website: [tokforge.ai](https://tokforge.ai)
- Discord: [Join the Discord](https://discord.gg/Acv3CBtfVm)
- Collection: [TokForge Mobile Draft Models](https://huggingface.co/collections/darkmaniac7/tokforge-mobile-draft-models-69c36153ea7084ce78329665)