v3: KL-distilled 0.6B from Qwen3-8B (10K samples, KL=0.339, +41% uplift on SM8850)
- README.md +75 -30
- config.json +1 -1
- config_cpu.json +4 -3
- draft_config_cpu.json +4 -3
- llm.mnn +1 -1
- llm.mnn.weight +2 -2
- llm_config.json +5 -5
- tokenizer.txt +0 -0
README.md
CHANGED
@@ -1,46 +1,91 @@
- ---
- license: apache-2.0
- base_model: huihui-ai/Huihui-Qwen3-0.6B-abliterated-v2
- tags:
- - mnn
- - abliterated
- - uncensored
- - draft-model
- - spec-decode
- - tokforge
- ---

- #
-
- Abliterated Qwen3-0.6B draft model for speculative decoding in [TokForge](https://tokforge.ai).

## What This Is

-
-
-
- |-------------|----------|------------|--------|
- | Qwen3-4B | 16.5 tok/s | 23.2 tok/s | **+41%** |
- | Qwen3-8B | 11.0 tok/s | 15.6 tok/s | **+50%** |
- | Qwen3-14B | 6.4 tok/s | 10.4 tok/s | **+63%** |
-
-
- ##
-
-
-
- ##
-

## Usage

-
+ # TokForge Acceleration Pack — Qwen3 KL-Distilled Draft Model
+
+ ## Overview
+ KL-distilled Qwen3-0.6B draft model for speculative decoding in TokForge. Trained on 10,000 teacher samples from Qwen3-8B, it achieves **+41% decode speed** on SM8850 (S26 Ultra).

## What This Is
+ A small (0.6B) draft model that:
+ - Runs on CPU alongside your main GPU model
+ - Predicts tokens in parallel; the main model batch-verifies them and accepts the correct ones
+ - Is trained with KL distillation to match the teacher's full logit distribution, not just its top-1 tokens
+ - Reaches significantly higher acceptance rates than stock or abliterated drafts
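The draft-and-verify loop described above can be sketched in toy Python. `draft_next` and `target_next` are stand-ins for greedy top-1 calls into the 0.6B draft and the 8B target (not the MNN API), and the disagreement pattern is invented for illustration:

```python
def draft_next(ctx):
    # Stand-in for the 0.6B draft's greedy top-1 prediction.
    return (sum(ctx) * 31 + 7) % 100

def target_next(ctx):
    # Stand-in for the 8B target's greedy top-1: agrees with the draft
    # except on a toy "hard token" at every 5th context length.
    t = draft_next(ctx)
    return (t + 1) % 100 if len(ctx) % 5 == 0 else t

def speculative_step(ctx, k=4):
    """Draft k tokens cheaply, then let the target verify them in one batch."""
    drafted, d_ctx = [], list(ctx)
    for _ in range(k):
        t = draft_next(d_ctx)
        drafted.append(t)
        d_ctx.append(t)
    # Keep the longest prefix the target agrees with; on the first mismatch,
    # emit the target's own token instead (so every step yields >= 1 token).
    accepted, v_ctx = [], list(ctx)
    for t in drafted:
        v = target_next(v_ctx)
        if v != t:
            accepted.append(v)
            break
        accepted.append(t)
        v_ctx.append(t)
    return ctx + accepted, len(accepted)

ctx, total = [1, 2, 3], 0
for _ in range(10):
    ctx, n = speculative_step(ctx)
    total += n
print(total)  # 27 tokens from only 10 target verify passes
```

Each verify pass costs roughly one target forward over the drafted batch, so accepting ~2.7 tokens per pass instead of 1 is where the decode-speed uplift comes from.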
+ ## Performance (Samsung S26 Ultra, SM8850)
+
+ | Draft Model | Qwen3-8B tok/s | vs AR Baseline | Stability |
+ |---|---|---|---|
+ | No draft (AR) | 11.57 | — | Stable |
+ | Stock 0.6B | 8.78 | -24% | Unstable |
+ | KL-distilled, 1K samples | 13.50 | +17% | Stable |
+ | **KL-distilled, 10K samples (this release)** | **16.37** | **+41%** | **Very stable** |
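The "vs AR Baseline" column is each run's tok/s relative to the 11.57 tok/s no-draft baseline; a quick check of the rounding:

```python
baseline = 11.57  # autoregressive decode with no draft (tok/s)
runs = {"Stock 0.6B": 8.78, "KL 1K": 13.50, "KL 10K": 16.37}
# Percent uplift over the AR baseline, rounded as in the table above.
uplift = {name: round((toks / baseline - 1) * 100) for name, toks in runs.items()}
print(uplift)  # {'Stock 0.6B': -24, 'KL 1K': 17, 'KL 10K': 41}
```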
+ ## Why KL Distillation?
+
+ | Approach | How it trains | Result |
+ |---|---|---|
+ | Stock (no training) | Uses base weights | Low acceptance, often regresses |
+ | Abliterated | Removes refusal behavior | +5.5% acceptance, still limited |
+ | SFT (supervised) | Trains on hard labels (top-1 token) | Draft learns to copy text, not to predict |
+ | **KL distillation** | **Trains on the full logit distribution** | **Draft learns which tokens are likely** |
+
+ KL divergence loss teaches the draft model to match the teacher's probability distribution across all tokens, not just the most likely one. This matters because MNN's greedy sampler accepts a draft token only when the draft's top-1 matches the teacher's top-1, and KL training optimizes exactly for that.
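A minimal pure-Python sketch of such an objective, using the 80% KL / 20% cross-entropy mix at temperature 1.5 listed under Training Details. The T^2 rescaling of the KL term is a standard distillation convention assumed here, not confirmed for this repo:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over one vocabulary slice.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(student_logits, teacher_logits, target_id,
                 kl_weight=0.8, ce_weight=0.2, temperature=1.5):
    """Forward KL(teacher || student) at T=1.5 plus hard-label cross-entropy."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Hard-label cross-entropy against the teacher's chosen token, at T=1.
    ce = -math.log(softmax(student_logits)[target_id])
    # T^2 keeps the KL gradient magnitude comparable across temperatures
    # (assumed convention, see lead-in).
    return kl_weight * (temperature ** 2) * kl + ce_weight * ce

teacher = [2.0, 1.0, 0.5, -1.0]
student_good = [1.9, 1.1, 0.4, -0.9]  # matches the whole distribution
student_bad = [-1.0, 2.0, 0.1, 0.5]   # top-1 disagrees with the teacher
loss_good = distill_loss(student_good, teacher, target_id=0)
loss_bad = distill_loss(student_bad, teacher, target_id=0)
print(loss_good < loss_bad)  # True: matching the full distribution is rewarded
```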
+ ## Training Details
+
+ - **Teacher**: Qwen3-8B-HF (base)
+ - **Student**: Qwen3-0.6B-HF (base, LoRA r=16, alpha=32)
+ - **Data**: 10,000 teacher-generated samples (prose, code, Q&A)
+ - **Loss**: 80% KL divergence + 20% cross-entropy, temperature=1.5
+ - **Training**: 3 epochs, batch=4, grad_accum=4 (1,875 optimizer steps)
+ - **Final KL**: 0.339 (21% lower than the 0.43 of the 1K-sample run)
+ - **Hardware**: 2x NVIDIA RTX PRO 6000 Blackwell (teacher on GPU 0, student on GPU 1)
+ - **Export**: MNN Q4 quantization (quant_bit=4, quant_block=128)
+
+ ## Optimal Draft Config
+
+ The draft model performs best with `thread_num: 2` and `power: high`:
+
+ ```json
+ {
+   "backend_type": "cpu",
+   "thread_num": 2,
+   "precision": "low",
+   "memory": "low",
+   "sampler_type": "greedy",
+   "power": "high"
+ }
+ ```
+
+ **Why thread_num: 2?** On Android's WALT CPU governor, using too many threads (4+) can cause the scheduler to spread work across efficiency cores running at low frequency. Two threads stay on the performance cores at high clock speeds.
+
+ ## Compatible Target Models
+ - Qwen3-4B (MNN)
+ - Qwen3-8B (MNN) — primary test target
+ - Qwen3-14B (MNN)
+
+ **NOT compatible** with Qwen3.5 models (different architecture: LinearAttention vs full MHA).
+
+ ## SoC Compatibility
+
+ | SoC | GPU | Uplift | Notes |
+ |---|---|---|---|
+ | SM8850 (S26 Ultra) | Adreno 840 | **+41%** | Primary target |
+ | SM8650 (S24/Lenovo) | Adreno 750 | Testing | May regress on 9B |
+ | SM8635 (Xiaomi) | Adreno 735 | Testing | Limited by RAM |

## Usage
+ Download via the TokForge app: Settings > Spec Decode > Download Acceleration Pack
+
+ ## Version History
+
+ | Version | Samples | KL | Uplift | Date |
+ |---|---|---|---|---|
+ | v1 (abliterated) | — | — | +20% | 2026-03-19 |
+ | v2 (KL 1K) | 1,024 | 0.43 | +17% | 2026-03-21 |
+ | **v3 (KL 10K)** | **10,000** | **0.339** | **+41%** | **2026-03-21** |
+
+ ---
+
+ **License:** Apache 2.0
+ **Source:** KL-distilled from [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) using [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) as teacher
+ **Built with:** [TokForge](https://tokforge.ai)
config.json
CHANGED
@@ -7,4 +7,4 @@
  "memory": "low",
  "sampler_type": "penalty",
  "penalty": 1.1
- }
+ }
config_cpu.json
CHANGED
@@ -2,8 +2,9 @@
  "llm_model": "llm.mnn",
  "llm_weight": "llm.mnn.weight",
  "backend_type": "cpu",
- "thread_num":
+ "thread_num": 2,
  "precision": "low",
- "
- "sampler_type": "greedy"
+ "memory": "low",
+ "sampler_type": "greedy",
+ "power": "high"
  }
draft_config_cpu.json
CHANGED
@@ -2,8 +2,9 @@
  "llm_model": "llm.mnn",
  "llm_weight": "llm.mnn.weight",
  "backend_type": "cpu",
- "thread_num":
+ "thread_num": 2,
  "precision": "low",
- "
- "sampler_type": "greedy"
+ "memory": "low",
+ "sampler_type": "greedy",
+ "power": "high"
  }
llm.mnn
CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:
+ oid sha256:4386b519389eacd94d2262c457f81776b87ad2742f886c38f5516097a7f15d30
  size 503616
llm.mnn.weight
CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:
+ oid sha256:3f8042b4185570328c6ef5fac5a2aaf49f9c3d81b25693436d286b13f2a568f1
- size
+ size 335769842
llm_config.json
CHANGED
@@ -5,14 +5,14 @@
  "attention_type": "full",
  "is_mrope": false,
  "jinja": {
- "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0].role == 'system' %}\n {{- messages[0].content + '\\n\\n' }}\n {%- endif %}\n {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0].role == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0].content + '<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n {%- set index = (messages|length - 1) - loop.index0 %}\n {%- if ns.multi_step_tool and message.role == \"user\" and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}\n {%- set ns.multi_step_tool = false %}\n {%- set ns.last_query_index = index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n {{- '<|im_start|>' + message.role + '\\n' +
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0].role == 'system' %}\n {{- messages[0].content + '\\n\\n' }}\n {%- endif %}\n {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0].role == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0].content + '<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n {%- set index = (messages|length - 1) - loop.index0 %}\n {%- if ns.multi_step_tool and message.role == \"user\" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}\n {%- set ns.multi_step_tool = false %}\n {%- set ns.last_query_index = index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if message.content is string %}\n {%- set content = message.content %}\n {%- else %}\n {%- set content = '' %}\n {%- endif %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n {{- '<|im_start|>' + message.role + '\\n' + content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {%- set reasoning_content = '' %}\n {%- if message.reasoning_content is string %}\n {%- set reasoning_content = message.reasoning_content %}\n {%- else %}\n {%- if '</think>' in content %}\n {%- set reasoning_content = content.split('</think>')[0].rstrip('\\n').split('<think>')[-1].lstrip('\\n') %}\n {%- set content = content.split('</think>')[-1].lstrip('\\n') %}\n {%- endif %}\n {%- endif %}\n {%- if loop.index0 > ns.last_query_index %}\n {%- if loop.last or (not loop.last and reasoning_content) %}\n {{- '<|im_start|>' + message.role + '\\n<think>\\n' + reasoning_content.strip('\\n') + '\\n</think>\\n\\n' + content.lstrip('\\n') }}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and content) or (not loop.first) %}\n {{- '\\n' }}\n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {%- if tool_call.arguments is string %}\n {{- tool_call.arguments }}\n {%- else %}\n {{- tool_call.arguments | tojson }}\n {%- endif %}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {%- endif %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n {%- if enable_thinking is defined and enable_thinking is false %}\n {{- '<think>\\n\\n</think>\\n\\n' }}\n {%- endif %}\n{%- endif %}",
  "eos": "<|im_end|>"
  },
  "tie_embeddings": [
-
-
-
+ 248254706,
+ 326045938,
+ 9723904,
  4,
-
+ 128
  ]
  }
tokenizer.txt
CHANGED
The diff for this file is too large to render. See raw diff.