v3: KL-distilled 0.6B from Qwen3-8B (10K samples, KL=0.339, +41% uplift on SM8850)
- README.md +75 -30
- config.json +1 -1
- config_cpu.json +4 -3
- draft_config_cpu.json +4 -3
- llm.mnn +1 -1
- llm.mnn.weight +2 -2
- llm_config.json +5 -5
- tokenizer.txt +0 -0
README.md
CHANGED
@@ -1,46 +1,91 @@
- ---
- license: apache-2.0
- base_model: huihui-ai/Huihui-Qwen3-0.6B-abliterated-v2
- tags:
- - mnn
- - abliterated
- - uncensored
- - draft-model
- - spec-decode
- - tokforge
- ---

- #
-
- Abliterated Qwen3-0.6B draft model for speculative decoding in [TokForge](https://tokforge.ai).

## What This Is

-
-
-
- |-------------|----------|------------|--------|
- | Qwen3-4B | 16.5 tok/s | 23.2 tok/s | **+41%** |
- | Qwen3-8B | 11.0 tok/s | 15.6 tok/s | **+50%** |
- | Qwen3-14B | 6.4 tok/s | 10.4 tok/s | **+63%** |
-
-
- ##
-
-
-
- ##
-

## Usage

-
+ # TokForge Acceleration Pack — Qwen3 KL-Distilled Draft Model
+
+ ## Overview
+ KL-distilled Qwen3-0.6B draft model for speculative decoding in TokForge. Trained on 10,000 teacher samples from Qwen3-8B, it achieves **+41% decode speed** on SM8850 (S26 Ultra).

## What This Is
+ A small (0.6B) draft model that:
+ - Runs on CPU alongside your main GPU model
+ - Predicts tokens in parallel; the main model batch-verifies them and accepts the correct ones
+ - Is trained with KL distillation to match the teacher's full logit distribution, not just its top-1 tokens
+ - Reaches significantly higher acceptance rates than stock or abliterated drafts
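The draft-and-verify loop described above can be sketched in toy Python. `draft_next` and `target_next` are stand-ins for greedy top-1 calls into the 0.6B draft and the 8B target (not the MNN API), and the disagreement pattern is invented for illustration:

```python
def draft_next(ctx):
    # Stand-in for the 0.6B draft's greedy top-1 prediction.
    return (sum(ctx) * 31 + 7) % 100

def target_next(ctx):
    # Stand-in for the 8B target's greedy top-1: agrees with the draft
    # except on a toy "hard token" at every 5th context length.
    t = draft_next(ctx)
    return (t + 1) % 100 if len(ctx) % 5 == 0 else t

def speculative_step(ctx, k=4):
    """Draft k tokens cheaply, then let the target verify them in one batch."""
    drafted, d_ctx = [], list(ctx)
    for _ in range(k):
        t = draft_next(d_ctx)
        drafted.append(t)
        d_ctx.append(t)
    # Keep the longest prefix the target agrees with; on the first mismatch,
    # emit the target's own token instead (so every step yields >= 1 token).
    accepted, v_ctx = [], list(ctx)
    for t in drafted:
        v = target_next(v_ctx)
        if v != t:
            accepted.append(v)
            break
        accepted.append(t)
        v_ctx.append(t)
    return ctx + accepted, len(accepted)

ctx, total = [1, 2, 3], 0
for _ in range(10):
    ctx, n = speculative_step(ctx)
    total += n
print(total)  # 27 tokens from only 10 target verify passes
```

Each verify pass costs roughly one target forward over the drafted batch, so accepting ~2.7 tokens per pass instead of 1 is where the decode-speed uplift comes from.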
+ ## Performance (Samsung S26 Ultra, SM8850)
+
+ | Draft Model | Qwen3-8B tok/s | vs AR Baseline | Stability |
+ |---|---|---|---|
+ | No draft (AR) | 11.57 | — | Stable |
+ | Stock 0.6B | 8.78 | -24% | Unstable |
+ | KL-distilled, 1K samples | 13.50 | +17% | Stable |
+ | **KL-distilled, 10K samples (this release)** | **16.37** | **+41%** | **Very stable** |
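The "vs AR Baseline" column is each run's tok/s relative to the 11.57 tok/s no-draft baseline; a quick check of the rounding:

```python
baseline = 11.57  # autoregressive decode with no draft (tok/s)
runs = {"Stock 0.6B": 8.78, "KL 1K": 13.50, "KL 10K": 16.37}
# Percent uplift over the AR baseline, rounded as in the table above.
uplift = {name: round((toks / baseline - 1) * 100) for name, toks in runs.items()}
print(uplift)  # {'Stock 0.6B': -24, 'KL 1K': 17, 'KL 10K': 41}
```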
+ ## Why KL Distillation?
+
+ | Approach | How it trains | Result |
+ |---|---|---|
+ | Stock (no training) | Uses base weights | Low acceptance, often regresses |
+ | Abliterated | Removes refusal behavior | +5.5% acceptance, still limited |
+ | SFT (supervised) | Trains on hard labels (top-1 token) | Draft learns to copy text, not to predict |
+ | **KL distillation** | **Trains on the full logit distribution** | **Draft learns which tokens are likely** |
+
+ KL divergence loss teaches the draft model to match the teacher's probability distribution across all tokens, not just the most likely one. This matters because MNN's greedy sampler accepts a draft token only when the draft's top-1 matches the teacher's top-1, and KL training optimizes exactly for that.
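A minimal pure-Python sketch of such an objective, using the 80% KL / 20% cross-entropy mix at temperature 1.5 listed under Training Details. The T^2 rescaling of the KL term is a standard distillation convention assumed here, not confirmed for this repo:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over one vocabulary slice.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(student_logits, teacher_logits, target_id,
                 kl_weight=0.8, ce_weight=0.2, temperature=1.5):
    """Forward KL(teacher || student) at T=1.5 plus hard-label cross-entropy."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Hard-label cross-entropy against the teacher's chosen token, at T=1.
    ce = -math.log(softmax(student_logits)[target_id])
    # T^2 keeps the KL gradient magnitude comparable across temperatures
    # (assumed convention, see lead-in).
    return kl_weight * (temperature ** 2) * kl + ce_weight * ce

teacher = [2.0, 1.0, 0.5, -1.0]
student_good = [1.9, 1.1, 0.4, -0.9]  # matches the whole distribution
student_bad = [-1.0, 2.0, 0.1, 0.5]   # top-1 disagrees with the teacher
loss_good = distill_loss(student_good, teacher, target_id=0)
loss_bad = distill_loss(student_bad, teacher, target_id=0)
print(loss_good < loss_bad)  # True: matching the full distribution is rewarded
```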
+ ## Training Details
+
+ - **Teacher**: Qwen3-8B-HF (base)
+ - **Student**: Qwen3-0.6B-HF (base, LoRA r=16, alpha=32)
+ - **Data**: 10,000 teacher-generated samples (prose, code, Q&A)
+ - **Loss**: 80% KL divergence + 20% cross-entropy, temperature=1.5
+ - **Training**: 3 epochs, batch=4, grad_accum=4 (1,875 optimizer steps)
+ - **Final KL**: 0.339 (21% lower than the 0.43 of the 1K-sample run)
+ - **Hardware**: 2x NVIDIA RTX PRO 6000 Blackwell (teacher on GPU 0, student on GPU 1)
+ - **Export**: MNN Q4 quantization (quant_bit=4, quant_block=128)
+
+ ## Optimal Draft Config
+
+ The draft model performs best with `thread_num: 2` and `power: high`:
+
+ ```json
+ {
+   "backend_type": "cpu",
+   "thread_num": 2,
+   "precision": "low",
+   "memory": "low",
+   "sampler_type": "greedy",
+   "power": "high"
+ }
+ ```
+
+ **Why thread_num: 2?** On Android's WALT CPU governor, using too many threads (4+) can cause the scheduler to spread work across efficiency cores running at low frequency. Two threads stay on the performance cores at high clock speeds.
+
+ ## Compatible Target Models
+ - Qwen3-4B (MNN)
+ - Qwen3-8B (MNN) — primary test target
+ - Qwen3-14B (MNN)
+
+ **NOT compatible** with Qwen3.5 models (different architecture: LinearAttention vs full MHA).
+
+ ## SoC Compatibility
+
+ | SoC | GPU | Uplift | Notes |
+ |---|---|---|---|
+ | SM8850 (S26 Ultra) | Adreno 840 | **+41%** | Primary target |
+ | SM8650 (S24/Lenovo) | Adreno 750 | Testing | May regress on 9B |
+ | SM8635 (Xiaomi) | Adreno 735 | Testing | Limited by RAM |

## Usage
+ Download via the TokForge app: Settings > Spec Decode > Download Acceleration Pack
+
+ ## Version History
+
+ | Version | Samples | KL | Uplift | Date |
+ |---|---|---|---|---|
+ | v1 (abliterated) | — | — | +20% | 2026-03-19 |
+ | v2 (KL 1K) | 1,024 | 0.43 | +17% | 2026-03-21 |
+ | **v3 (KL 10K)** | **10,000** | **0.339** | **+41%** | **2026-03-21** |
+
+ ---
+
+ **License:** Apache 2.0
+ **Source:** KL-distilled from [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) using [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) as teacher
+ **Built with:** [TokForge](https://tokforge.ai)
config.json
CHANGED
@@ -7,4 +7,4 @@
  "memory": "low",
  "sampler_type": "penalty",
  "penalty": 1.1
- }
+ }
config_cpu.json
CHANGED
@@ -2,8 +2,9 @@
  "llm_model": "llm.mnn",
  "llm_weight": "llm.mnn.weight",
  "backend_type": "cpu",
- "thread_num":
+ "thread_num": 2,
  "precision": "low",
- "
- "sampler_type": "greedy"
+ "memory": "low",
+ "sampler_type": "greedy",
+ "power": "high"
  }
draft_config_cpu.json
CHANGED
@@ -2,8 +2,9 @@
  "llm_model": "llm.mnn",
  "llm_weight": "llm.mnn.weight",
  "backend_type": "cpu",
- "thread_num":
+ "thread_num": 2,
  "precision": "low",
- "
- "sampler_type": "greedy"
+ "memory": "low",
+ "sampler_type": "greedy",
+ "power": "high"
  }
llm.mnn
CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:
+ oid sha256:4386b519389eacd94d2262c457f81776b87ad2742f886c38f5516097a7f15d30
  size 503616
llm.mnn.weight
CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:
+ oid sha256:3f8042b4185570328c6ef5fac5a2aaf49f9c3d81b25693436d286b13f2a568f1
- size
+ size 335769842
llm_config.json
CHANGED
@@ -5,14 +5,14 @@
  "attention_type": "full",
  "is_mrope": false,
  "jinja": {
- "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0].role == 'system' %}\n {{- messages[0].content + '\\n\\n' }}\n {%- endif %}\n {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0].role == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0].content + '<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n {%- set index = (messages|length - 1) - loop.index0 %}\n {%- if ns.multi_step_tool and message.role == \"user\" and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}\n {%- set ns.multi_step_tool = false %}\n {%- set ns.last_query_index = index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n {{- '<|im_start|>' + message.role + '\\n' +
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0].role == 'system' %}\n {{- messages[0].content + '\\n\\n' }}\n {%- endif %}\n {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0].role == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0].content + '<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n {%- set index = (messages|length - 1) - loop.index0 %}\n {%- if ns.multi_step_tool and message.role == \"user\" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}\n {%- set ns.multi_step_tool = false %}\n {%- set ns.last_query_index = index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if message.content is string %}\n {%- set content = message.content %}\n {%- else %}\n {%- set content = '' %}\n {%- endif %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n {{- '<|im_start|>' + message.role + '\\n' + content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {%- set reasoning_content = '' %}\n {%- if message.reasoning_content is string %}\n {%- set reasoning_content = message.reasoning_content %}\n {%- else %}\n {%- if '</think>' in content %}\n {%- set reasoning_content = content.split('</think>')[0].rstrip('\\n').split('<think>')[-1].lstrip('\\n') %}\n {%- set content = content.split('</think>')[-1].lstrip('\\n') %}\n {%- endif %}\n {%- endif %}\n {%- if loop.index0 > ns.last_query_index %}\n {%- if loop.last or (not loop.last and reasoning_content) %}\n {{- '<|im_start|>' + message.role + '\\n<think>\\n' + reasoning_content.strip('\\n') + '\\n</think>\\n\\n' + content.lstrip('\\n') }}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and content) or (not loop.first) %}\n {{- '\\n' }}\n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {%- if tool_call.arguments is string %}\n {{- tool_call.arguments }}\n {%- else %}\n {{- tool_call.arguments | tojson }}\n {%- endif %}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {%- endif %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n {%- if enable_thinking is defined and enable_thinking is false %}\n {{- '<think>\\n\\n</think>\\n\\n' }}\n {%- endif %}\n{%- endif %}",
  "eos": "<|im_end|>"
  },
  "tie_embeddings": [
-
-
-
+ 248254706,
+ 326045938,
+ 9723904,
  4,
-
+ 128
  ]
  }
tokenizer.txt
CHANGED
The diff for this file is too large to render. See raw diff.