darkmaniac7 committed · Commit 1f8c190 · verified · 1 Parent(s): 51a33fe

v3: KL-distilled 0.6B from Qwen3-8B (10K samples, KL=0.339, +41% uplift on SM8850)

Files changed (8)
  1. README.md +75 -30
  2. config.json +1 -1
  3. config_cpu.json +4 -3
  4. draft_config_cpu.json +4 -3
  5. llm.mnn +1 -1
  6. llm.mnn.weight +2 -2
  7. llm_config.json +5 -5
  8. tokenizer.txt +0 -0
README.md CHANGED
@@ -1,46 +1,91 @@
1
- ---
2
- license: apache-2.0
3
- base_model: huihui-ai/Huihui-Qwen3-0.6B-abliterated-v2
4
- tags:
5
- - mnn
6
- - abliterated
7
- - uncensored
8
- - draft-model
9
- - spec-decode
10
- - tokforge
11
- ---
12
 
13
- # TokForge Acceleration Pack — Qwen3 Draft Model
14
-
15
- Abliterated Qwen3-0.6B draft model for speculative decoding in [TokForge](https://tokforge.ai).
16
 
17
  ## What This Is
18
 
19
- A small (0.6B) abliterated model that runs on CPU alongside your main GPU model, predicting tokens in parallel. The main model batch-verifies predictions, accepting correct ones instantly — resulting in **+41% to +63% decode speed** on Qwen3 4B/8B/14B targets.
20
 
21
- ## Performance (RedMagic SM8850)
22
 
23
- | Target Model | Baseline | With Draft | Uplift |
24
- |-------------|----------|------------|--------|
25
- | Qwen3-4B | 16.5 tok/s | 23.2 tok/s | **+41%** |
26
- | Qwen3-8B | 11.0 tok/s | 15.6 tok/s | **+50%** |
27
- | Qwen3-14B | 6.4 tok/s | 10.4 tok/s | **+63%** |
28
 
29
- ## Why Abliterated?
30
 
31
- This draft model is abliterated (refusal behavior removed) for +5.5% better token acceptance rate compared to the censored version. It works equally well with both censored and uncensored target models the abliterated draft simply predicts more accurately across all content types.
32
 
33
- ## Compatible Target Models
34
 
35
- - All Qwen3 MNN models (4B, 8B, 14B) — censored or uncensored
36
- - **NOT compatible** with Qwen3.5 models (different tokenizer, 303K vs 495K vocab)
37
 
38
- For Qwen3.5 targets, use [TokForge-AccelerationPack-Qwen35-Draft](https://huggingface.co/darkmaniac7/TokForge-AccelerationPack-Qwen35-Draft).
39
 
40
- ## Source
41
 
42
- Converted from [huihui-ai/Huihui-Qwen3-0.6B-abliterated-v2](https://huggingface.co/huihui-ai/Huihui-Qwen3-0.6B-abliterated-v2) using MNN 4-bit HQQ quantization (quant_block=64).
43
 
44
  ## Usage
45
 
46
- Download via TokForge app → Settings → Spec Decode → Download Acceleration Pack.
1
+ # TokForge Acceleration Pack — Qwen3 KL-Distilled Draft Model
2
 
3
+ ## Overview
4
+ KL-distilled Qwen3-0.6B draft model for speculative decoding in TokForge. Trained on 10,000 teacher samples from Qwen3-8B, achieving **+41% decode speed** on SM8850 (S26 Ultra).
 
5
 
6
  ## What This Is
7
+ A small (0.6B) draft model that:
8
+ - Runs on CPU alongside your main GPU model
9
+ - Predicts tokens in parallel — main model batch-verifies and accepts correct ones
10
+ - KL distillation matches the teacher's full logit distribution, not just top-1 tokens
11
+ - Results in significantly higher acceptance rates than stock or abliterated drafts
12
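The draft-then-verify loop above can be sketched in a few lines of Python. This is an illustrative toy, not the TokForge/MNN implementation: `draft_top1` and `target_top1` are hypothetical stand-ins for "argmax next token" calls on the draft and target models, and the batched verification is modeled as a simple loop.

```python
def speculative_step(target_top1, draft_top1, prompt, k=4):
    """One speculative-decoding step with greedy acceptance.

    target_top1 / draft_top1: callables mapping a token sequence to the
    argmax next token (hypothetical stand-ins for the real models).
    The draft proposes k tokens; the target verifies those positions
    (in the real system, in one batched forward pass) and keeps the
    longest matching prefix, plus one corrected token of its own.
    """
    # 1) Draft proposes k tokens autoregressively (cheap, runs on CPU).
    proposal = []
    seq = list(prompt)
    for _ in range(k):
        t = draft_top1(seq)
        proposal.append(t)
        seq.append(t)

    # 2) Target checks each proposed position; greedy sampling means
    #    acceptance requires an exact top-1 match.
    accepted = []
    seq = list(prompt)
    for t in proposal:
        if target_top1(seq) == t:
            accepted.append(t)
            seq.append(t)
        else:
            break

    # 3) The target always contributes one token (the first mismatch's
    #    correction, or one extra token if everything was accepted), so
    #    decode never falls below autoregressive speed in tokens/step.
    accepted.append(target_top1(seq))
    return accepted
```

When the draft agrees with the target on all k positions, one step emits k+1 tokens for a single (batched) target pass; that headroom is where the decode-speed uplift comes from.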
 
13
+ ## Performance (Samsung S26 Ultra, SM8850)
14
 
15
+ | Draft Model | Qwen3-8B tok/s | vs AR Baseline | Stability |
16
+ |---|---|---|---|
17
+ | No draft (AR) | 11.57 | — | Stable |
18
+ | Stock 0.6B | 8.78 | -24% | Unstable |
19
+ | KL v1 (1K samples) | 13.50 | +17% | Stable |
20
+ | **KL v2 (10K samples)** | **16.37** | **+41%** | **Very stable** |
21
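The uplift column follows directly from the tok/s figures; a quick check of the arithmetic:

```python
baseline = 11.57  # autoregressive decode, no draft (tok/s)

def uplift(toks_per_s, baseline=baseline):
    """Percent change vs the AR baseline, rounded to a whole percent."""
    return round((toks_per_s / baseline - 1) * 100)

print(uplift(8.78))   # stock 0.6B draft
print(uplift(13.50))  # KL v1, 1K samples
print(uplift(16.37))  # KL v2, 10K samples
```

Note that a bad draft is not free: the stock 0.6B actually regresses below baseline, because rejected proposals still cost verification passes.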
 
22
+ ## Why KL Distillation?
23
 
24
+ | Approach | How it trains | Result |
25
+ |---|---|---|
26
+ | Stock (no training) | Uses base weights | Low acceptance, often regresses |
27
+ | Abliterated | Removes refusal behavior | +5.5% acceptance, still limited |
28
+ | SFT (supervised) | Trains on hard labels (top-1 token) | Draft learns to copy text, not predict |
29
+ | **KL Distillation** | **Trains on full logit distribution** | **Draft learns WHICH tokens are likely** |
30
 
31
+ KL divergence loss teaches the draft model to match the teacher's probability distribution across ALL tokens, not just the most likely one. This is critical because MNN's greedy sampler needs the draft's top-1 to match the teacher's top-1, and KL training optimizes exactly for this.
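A minimal single-token sketch of this kind of objective, in pure Python. This is illustrative only (the actual training code is not part of this repo); it assumes the 80% KL + 20% cross-entropy split and temperature 1.5 listed under Training Details.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(l / T for l in logits)
    exps = [math.exp(l / T - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, hard_label,
                 T=1.5, kl_weight=0.8):
    """Combined distillation loss for one token position.

    kl_weight * KL(teacher || student) at temperature T, plus
    (1 - kl_weight) * cross-entropy on the hard (top-1) label.
    """
    p = softmax(teacher_logits, T)   # softened teacher distribution
    q = softmax(student_logits, T)   # softened student distribution
    # KL(p || q): penalizes the student wherever it misses teacher mass.
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Standard cross-entropy on the hard label at T = 1.
    q1 = softmax(student_logits, 1.0)
    ce = -math.log(q1[hard_label])
    return kl_weight * kl + (1 - kl_weight) * ce
```

The KL term is minimized only when the student reproduces the teacher's full distribution, whereas pure cross-entropy is indifferent to how probability is spread among the non-label tokens.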
32
 
33
+ ## Training Details
34
+
35
+ - **Teacher**: Qwen3-8B-HF (base)
36
+ - **Student**: Qwen3-0.6B-HF (base, LoRA r=16, alpha=32)
37
+ - **Data**: 10,000 teacher-generated samples (prose, code, Q&A)
38
+ - **Loss**: 80% KL divergence + 20% cross-entropy, temperature=1.5
39
+ - **Training**: 3 epochs, batch=4, grad_accum=4 (1,875 optimizer steps)
40
+ - **Final KL**: 0.339 (21% lower than v1's 0.43 trained on 1K samples)
41
+ - **Hardware**: 2x NVIDIA RTX PRO 6000 Blackwell (teacher GPU 0, student GPU 1)
42
+ - **Export**: MNN Q4 quantization (quant_bit=4, quant_block=128)
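The step count quoted above follows from the data and batch settings:

```python
samples = 10_000
batch = 4
grad_accum = 4
epochs = 3

# One optimizer step consumes batch * grad_accum = 16 samples.
steps_per_epoch = samples // (batch * grad_accum)   # 625
total_steps = steps_per_epoch * epochs              # 1875
print(total_steps)
```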
43
+
44
+ ## Optimal Draft Config
45
+
46
+ The draft model performs best with `thread_num: 2` and `power: high`:
47
 
48
+ ```json
49
+ {
50
+ "backend_type": "cpu",
51
+ "thread_num": 2,
52
+ "precision": "low",
53
+ "memory": "low",
54
+ "sampler_type": "greedy",
55
+ "power": "high"
56
+ }
57
+ ```
58
 
59
+ **Why thread_num: 2?** On Android's WALT CPU governor, using too many threads (4+) can cause the scheduler to spread work across efficiency cores at low frequency. 2 threads stay on performance cores at high clock speeds.
60
 
61
+ ## Compatible Target Models
62
+ - Qwen3-4B (MNN)
63
+ - Qwen3-8B (MNN) — primary test target
64
+ - Qwen3-14B (MNN)
65
+
66
+ **NOT compatible** with Qwen3.5 models (different architecture: LinearAttention vs full MHA).
67
 
68
+ ## SoC Compatibility
69
+
70
+ | SoC | GPU | Uplift | Notes |
71
+ |---|---|---|---|
72
+ | SM8850 (S26 Ultra) | Adreno 840 | **+41%** | Primary target |
73
+ | SM8650 (S24/Lenovo) | Adreno 750 | Testing | May regress on 9B |
74
+ | SM8635 (Xiaomi) | Adreno 735 | Testing | Limited by RAM |
75
 
76
  ## Usage
77
+ Download via TokForge app: Settings > Spec Decode > Download Acceleration Pack
78
+
79
+ ## Version History
80
+
81
+ | Version | Samples | KL | Uplift | Date |
82
+ |---|---|---|---|---|
83
+ | v1 (abliterated) | — | — | +20% | 2026-03-19 |
84
+ | v2 (KL 1K) | 1,024 | 0.43 | +17% | 2026-03-21 |
85
+ | **v3 (KL 10K)** | **10,000** | **0.339** | **+41%** | **2026-03-21** |
86
+
87
+ ---
88
 
89
+ **License:** Apache 2.0
90
+ **Source:** KL-distilled from [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) using [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) as teacher
91
+ **Built with:** [TokForge](https://tokforge.ai)
config.json CHANGED
@@ -7,4 +7,4 @@
7
  "memory": "low",
8
  "sampler_type": "penalty",
9
  "penalty": 1.1
10
- }
 
7
  "memory": "low",
8
  "sampler_type": "penalty",
9
  "penalty": 1.1
10
+ }
config_cpu.json CHANGED
@@ -2,8 +2,9 @@
2
  "llm_model": "llm.mnn",
3
  "llm_weight": "llm.mnn.weight",
4
  "backend_type": "cpu",
5
- "thread_num": 1,
6
  "precision": "low",
7
- "power": "high",
8
- "sampler_type": "greedy"
 
9
  }
 
2
  "llm_model": "llm.mnn",
3
  "llm_weight": "llm.mnn.weight",
4
  "backend_type": "cpu",
5
+ "thread_num": 2,
6
  "precision": "low",
7
+ "memory": "low",
8
+ "sampler_type": "greedy",
9
+ "power": "high"
10
  }
draft_config_cpu.json CHANGED
@@ -2,8 +2,9 @@
2
  "llm_model": "llm.mnn",
3
  "llm_weight": "llm.mnn.weight",
4
  "backend_type": "cpu",
5
- "thread_num": 1,
6
  "precision": "low",
7
- "power": "high",
8
- "sampler_type": "greedy"
 
9
  }
 
2
  "llm_model": "llm.mnn",
3
  "llm_weight": "llm.mnn.weight",
4
  "backend_type": "cpu",
5
+ "thread_num": 2,
6
  "precision": "low",
7
+ "memory": "low",
8
+ "sampler_type": "greedy",
9
+ "power": "high"
10
  }
llm.mnn CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:9fda2182b1d37d5349c8834723d0d9d6f28a7edb16353f8c40240e435a670986
3
  size 503616
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4386b519389eacd94d2262c457f81776b87ad2742f886c38f5516097a7f15d30
3
  size 503616
llm.mnn.weight CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:70d766afcf050c83dcae506e122ca7d15ae352e6c6205c280351c8d9a5d48bd5
3
- size 373018866
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3f8042b4185570328c6ef5fac5a2aaf49f9c3d81b25693436d286b13f2a568f1
3
+ size 335769842
llm_config.json CHANGED
@@ -5,14 +5,14 @@
5
  "attention_type": "full",
6
  "is_mrope": false,
7
  "jinja": {
8
- "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0].role == 'system' %}\n {{- messages[0].content + '\\n\\n' }}\n {%- endif %}\n {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0].role == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0].content + '<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n {%- set index = (messages|length - 1) - loop.index0 %}\n {%- if ns.multi_step_tool and message.role == \"user\" and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}\n {%- set ns.multi_step_tool = false %}\n {%- set ns.last_query_index = index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {%- set content = message.content %}\n {%- set reasoning_content = '' %}\n {%- if message.reasoning_content is defined and message.reasoning_content is not none %}\n {%- set reasoning_content = message.reasoning_content %}\n {%- else %}\n {%- if '</think>' in message.content %}\n {%- set content = message.content.split('</think>')[-1].lstrip('\\n') %}\n {%- set reasoning_content = 
message.content.split('</think>')[0].rstrip('\\n').split('<think>')[-1].lstrip('\\n') %}\n {%- endif %}\n {%- endif %}\n {%- if loop.index0 > ns.last_query_index %}\n {%- if loop.last or (not loop.last and reasoning_content) %}\n {{- '<|im_start|>' + message.role + '\\n<think>\\n' + reasoning_content.strip('\\n') + '\\n</think>\\n\\n' + content.lstrip('\\n') }}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and content) or (not loop.first) %}\n {{- '\\n' }}\n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {%- if tool_call.arguments is string %}\n {{- tool_call.arguments }}\n {%- else %}\n {{- tool_call.arguments | tojson }}\n {%- endif %}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {%- endif %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n {%- if enable_thinking is defined and enable_thinking is false %}\n {{- '<think>\\n\\n</think>\\n\\n' }}\n {%- endif %}\n{%- endif %}",
9
  "eos": "<|im_end|>"
10
  },
11
  "tie_embeddings": [
12
- 275779826,
13
- 353571058,
14
- 19447808,
15
  4,
16
- 64
17
  ]
18
  }
 
5
  "attention_type": "full",
6
  "is_mrope": false,
7
  "jinja": {
8
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0].role == 'system' %}\n {{- messages[0].content + '\\n\\n' }}\n {%- endif %}\n {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0].role == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0].content + '<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n {%- set index = (messages|length - 1) - loop.index0 %}\n {%- if ns.multi_step_tool and message.role == \"user\" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}\n {%- set ns.multi_step_tool = false %}\n {%- set ns.last_query_index = index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if message.content is string %}\n {%- set content = message.content %}\n {%- else %}\n {%- set content = '' %}\n {%- endif %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n {{- '<|im_start|>' + message.role + '\\n' + content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {%- set reasoning_content = '' %}\n {%- if message.reasoning_content is string %}\n {%- set reasoning_content = message.reasoning_content %}\n {%- else %}\n {%- if '</think>' in content %}\n {%- set reasoning_content = 
content.split('</think>')[0].rstrip('\\n').split('<think>')[-1].lstrip('\\n') %}\n {%- set content = content.split('</think>')[-1].lstrip('\\n') %}\n {%- endif %}\n {%- endif %}\n {%- if loop.index0 > ns.last_query_index %}\n {%- if loop.last or (not loop.last and reasoning_content) %}\n {{- '<|im_start|>' + message.role + '\\n<think>\\n' + reasoning_content.strip('\\n') + '\\n</think>\\n\\n' + content.lstrip('\\n') }}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and content) or (not loop.first) %}\n {{- '\\n' }}\n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {%- if tool_call.arguments is string %}\n {{- tool_call.arguments }}\n {%- else %}\n {{- tool_call.arguments | tojson }}\n {%- endif %}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {%- endif %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n {%- if enable_thinking is defined and enable_thinking is false %}\n {{- '<think>\\n\\n</think>\\n\\n' }}\n {%- endif %}\n{%- endif %}",
9
  "eos": "<|im_end|>"
10
  },
11
  "tie_embeddings": [
12
+ 248254706,
13
+ 326045938,
14
+ 9723904,
15
  4,
16
+ 128
17
  ]
18
  }
tokenizer.txt CHANGED
The diff for this file is too large to render. See raw diff