darkmaniac7 commited on
Commit
51a33fe
·
verified ·
1 Parent(s): af3c044

docs: update README — abliterated draft with benchmarks

Files changed (1)
  1. README.md +30 -50
README.md CHANGED
@@ -1,66 +1,46 @@
- # TokForge Acceleration Pack — Draft Model
-
- **Qwen3-0.6B (MNN)** draft model for speculative decoding in [TokForge](https://tokforge.ai).
-
- ## What is this?
-
- This is a curated, ready-to-use 0.6B parameter draft model packaged for TokForge's speculative decoding pipeline. Speculative decoding uses a small, fast "draft" model to propose candidate tokens, which are then verified in parallel by the larger "target" model. This amortizes the cost of target model inference and delivers significant speed improvements.
-
- ## Performance
-
- Tested on Snapdragon 8 Elite 2 (SM8850 / Adreno 840):
-
- | Target Model | Baseline (tok/s) | With Spec-Decode (tok/s) | Improvement |
- |---|---|---|---|
- | Qwen3-8B (OpenCL) | 11.41 | 19.60 | **+72%** |
- | Qwen3.5-9B (CPU) | 9.55 | 16.67 | **+75%** |
-
- Improvements vary by device and target model. Typical range: **+18-75%** decode speed on flagship SoCs (SM8850, SM8650).
-
- ## Compatible Target Models
-
- - Qwen3-4B (MNN, OpenCL or CPU)
- - Qwen3-8B (MNN, OpenCL)
- - Qwen3-14B (MNN, OpenCL, 24GB+ RAM devices)
- - Qwen3.5-9B (MNN, CPU)
-
- ## Installation
-
- ### Automatic (Recommended)
-
- In TokForge: **Settings > Advanced > Speculative Decoding > Download Acceleration Pack**
-
- The app will download and configure the draft model automatically.
-
- ### Manual
-
- 1. Download all files from this repository
- 2. Copy them to: `<device>/Android/data/dev.tokforge/files/models/spec-decode-draft/`
- 3. Enable speculative decoding in TokForge Settings > Advanced
-
- ## Files
-
- | File | Size | Description |
- |---|---|---|
- | `llm.mnn` | ~450 KB | MNN model graph |
- | `llm.mnn.weight` | ~430 MB | Model weights |
- | `tokenizer.txt` | ~3 MB | Tokenizer vocabulary |
- | `config.json` | <1 KB | Model configuration |
- | `llm_config.json` | ~5 KB | LLM inference configuration |
- | `config_cpu.json` | <1 KB | CPU backend configuration for draft inference |
- | `draft_config_cpu.json` | <1 KB | Draft-specific CPU configuration |
-
- ## Technical Details
-
- - **Architecture**: Qwen3-0.6B (28 transformer layers, 1024 hidden dim)
- - **Format**: MNN (Mobile Neural Network by Alibaba)
- - **Backend**: CPU with single-thread, low-precision inference for minimal draft overhead (~21ms/token)
- - **Optimal draft length**: d=2 for 8B targets, d=3 for 9B targets
-
  ## Source
-
- Based on [taobao-mnn/Qwen3-0.6B-MNN](https://huggingface.co/taobao-mnn/Qwen3-0.6B-MNN), repackaged with TokForge-specific draft configurations.
-
- ## License
-
- This model inherits the license from [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B). See the original model card for license terms.
+ ---
+ license: apache-2.0
+ base_model: huihui-ai/Huihui-Qwen3-0.6B-abliterated-v2
+ tags:
+ - mnn
+ - abliterated
+ - uncensored
+ - draft-model
+ - spec-decode
+ - tokforge
+ ---
+ # TokForge Acceleration Pack — Qwen3 Draft Model
+
+ Abliterated Qwen3-0.6B draft model for speculative decoding in [TokForge](https://tokforge.ai).
+
+ ## What This Is
+
+ A small (0.6B) abliterated model that runs on CPU alongside your main GPU model, proposing candidate tokens ahead of it. The main model verifies those candidates in a single batched pass, accepting correct ones at almost no cost — resulting in **+41% to +63% decode speed** on Qwen3 4B/8B/14B targets.
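The draft/verify loop can be sketched as follows. This is a minimal toy illustration with stand-in models (the `draft_next`/`target_next` functions are hypothetical), not TokForge's actual API:

```python
import random

random.seed(0)

# Toy stand-ins for the real models (hypothetical; in TokForge the
# Qwen3-0.6B draft runs on CPU and the large target on GPU).
def draft_next(ctx):
    return (ctx[-1] * 31 + 7) % 100               # fast, sometimes wrong

def target_next(ctx):
    # Agrees with the draft ~80% of the time in this toy setup.
    if random.random() < 0.8:
        return (ctx[-1] * 31 + 7) % 100
    return random.randrange(100)

def speculative_step(ctx, d=3):
    """Draft d tokens, then have the target verify them in one batch."""
    proposals, draft_ctx = [], list(ctx)
    for _ in range(d):
        tok = draft_next(draft_ctx)
        proposals.append(tok)
        draft_ctx.append(tok)

    emitted, verify_ctx = [], list(ctx)
    for tok in proposals:                         # conceptually one batched pass
        expected = target_next(verify_ctx)
        if tok == expected:
            emitted.append(tok)                   # accepted "for free"
            verify_ctx.append(tok)
        else:
            emitted.append(expected)              # first mismatch: keep target's token
            break
    else:
        emitted.append(target_next(verify_ctx))   # all accepted: one bonus token
    return emitted

out = speculative_step([1], d=3)
print(out)
```

Each target pass thus yields between 1 and d+1 tokens; the higher the draft's acceptance rate, the closer the average gets to d+1.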
+ ## Performance (RedMagic SM8850)
+
+ | Target Model | Baseline | With Draft | Uplift |
+ |---|---|---|---|
+ | Qwen3-4B | 16.5 tok/s | 23.2 tok/s | **+41%** |
+ | Qwen3-8B | 11.0 tok/s | 15.6 tok/s | **+42%** |
+ | Qwen3-14B | 6.4 tok/s | 10.4 tok/s | **+63%** |
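The uplift column is simply `with_draft / baseline - 1`. Recomputing it from the table's tok/s figures:

```python
# Baseline and with-draft decode speeds (tok/s) from the table above.
results = {
    "Qwen3-4B": (16.5, 23.2),
    "Qwen3-8B": (11.0, 15.6),
    "Qwen3-14B": (6.4, 10.4),
}

# Percentage uplift, one decimal place.
uplifts = {name: round((spec / base - 1) * 100, 1)
           for name, (base, spec) in results.items()}

for name, pct in uplifts.items():
    print(f"{name}: +{pct}%")
```

This gives +40.6%, +41.8%, and +62.5% respectively, matching the rounded figures in the table.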
+ ## Why Abliterated?
+
+ This draft model is abliterated (refusal behavior removed), which yields a +5.5% higher token acceptance rate than the censored version. It works equally well with both censored and uncensored target models — the abliterated draft simply predicts more accurately across all content types.
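Why a small acceptance bump matters: under the independence assumption commonly used in speculative-decoding analyses, a target pass with draft length d and per-token acceptance probability a emits an expected sum of a^k for k = 0..d tokens. The acceptance rates below are illustrative, not measured values from this card:

```python
def expected_tokens_per_pass(a, d):
    """Expected tokens emitted per target verification pass, assuming each
    drafted token is accepted independently with probability a."""
    # Closed form: (1 - a**(d + 1)) / (1 - a)
    return sum(a ** k for k in range(d + 1))

# Illustrative acceptance rates (hypothetical, chosen to show the effect
# of a modest bump like the +5.5% mentioned above).
base, bumped = 0.60, 0.655
print(expected_tokens_per_pass(base, 3))    # ~2.18 tokens per pass
print(expected_tokens_per_pass(bumped, 3))  # ~2.37 tokens per pass
```

Because accepted tokens compound across the drafted sequence, even a few points of extra acceptance translate into a measurable decode-speed gain.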
+ ## Compatible Target Models
+
+ - All Qwen3 MNN models (4B, 8B, 14B) — censored or uncensored
+ - **NOT compatible** with Qwen3.5 models (different tokenizer, 303K vs 495K vocab)
+
+ For Qwen3.5 targets, use [TokForge-AccelerationPack-Qwen35-Draft](https://huggingface.co/darkmaniac7/TokForge-AccelerationPack-Qwen35-Draft).
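The incompatibility is a hard constraint: the target can only verify token IDs drawn from its own vocabulary. A trivial guard, using the card's rounded vocabulary figures (a size match is only a cheap proxy for a true tokenizer match):

```python
def drafts_compatible(draft_vocab: int, target_vocab: int) -> bool:
    # Speculative decoding requires draft and target to share a tokenizer,
    # so a token ID proposed by the draft means the same thing to the
    # target. Equal vocab size is a necessary (not sufficient) check.
    return draft_vocab == target_vocab

QWEN3_VOCAB, QWEN35_VOCAB = 303_000, 495_000   # rounded figures from this card
print(drafts_compatible(QWEN3_VOCAB, QWEN3_VOCAB))    # True
print(drafts_compatible(QWEN3_VOCAB, QWEN35_VOCAB))   # False
```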
 
 
 
+
  ## Source
+
+ Converted from [huihui-ai/Huihui-Qwen3-0.6B-abliterated-v2](https://huggingface.co/huihui-ai/Huihui-Qwen3-0.6B-abliterated-v2) using MNN 4-bit HQQ quantization (quant_block=64).
+
+ ## Usage
+
+ Download via the TokForge app: **Settings > Spec Decode > Download Acceleration Pack**.
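For intuition about the 4-bit, quant_block=64 scheme, here is a minimal min-max block quantizer in pure Python. This illustrates block-wise 4-bit quantization in general; it is not MNN's actual HQQ implementation, which optimizes scales and zero-points rather than taking the raw block min/max:

```python
import random

random.seed(0)

def quantize_block4(w, block=64):
    """Min-max asymmetric 4-bit quantization per block of `block` values.
    Illustrative only: real HQQ solves for better scales/zero-points."""
    out = []
    for i in range(0, len(w), block):
        chunk = w[i:i + block]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / 15.0 or 1.0          # 4 bits -> 16 levels; guard flat blocks
        q = [min(15, max(0, round((x - lo) / scale))) for x in chunk]
        out.append((q, scale, lo))               # store codes + per-block params
    return out

def dequantize_block4(blocks):
    return [qi * scale + lo for q, scale, lo in blocks for qi in q]

w = [random.gauss(0.0, 1.0) for _ in range(256)]  # stand-in weight tensor
blocks = quantize_block4(w)
err = max(abs(a - b) for a, b in zip(dequantize_block4(blocks), w))
print(f"max abs reconstruction error: {err:.4f}")
```

Per-block error is bounded by half the block's scale, which is why small blocks like 64 keep quantization error low even on weight tensors with outliers.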