docs: update README — abliterated draft with benchmarks

README.md (changed)

@@ -1,66 +1,46 @@
-| Target Model | Baseline
-|---|---|---|---|
-| Qwen3-
-| Qwen3
-
-- Qwen3-4B (MNN, OpenCL or CPU)
-- Qwen3-8B (MNN, OpenCL)
-- Qwen3-14B (MNN, OpenCL, 24GB+ RAM devices)
-- Qwen3.5-9B (MNN, CPU)
-
-## Installation
-
-### Automatic (Recommended)
-
-In TokForge: **Settings > Advanced > Speculative Decoding > Download Acceleration Pack**
-
-### Manual
-
-1. Download all files from this repository
-2. Copy them to: `<device>/Android/data/dev.tokforge/files/models/spec-decode-draft/`
-3. Enable speculative decoding in TokForge Settings > Advanced
-
-## Files
-
-| File | Size | Description |
-|---|---|---|
-| `llm.mnn` | ~450 KB | MNN model graph |
-| `llm.mnn.weight` | ~430 MB | Model weights |
-| `tokenizer.txt` | ~3 MB | Tokenizer vocabulary |
-| `config.json` | <1 KB | Model configuration |
-| `llm_config.json` | ~5 KB | LLM inference configuration |
-| `config_cpu.json` | <1 KB | CPU backend configuration for draft inference |
-| `draft_config_cpu.json` | <1 KB | Draft-specific CPU configuration |
-
-- **Format**: MNN (Mobile Neural Network by Alibaba)
-- **Backend**: CPU with single-thread, low-precision inference for minimal draft overhead (~21ms/token)
-- **Optimal draft length**: d=2 for 8B targets, d=3 for 9B targets
-
-## Source
-
-##
+---
+license: apache-2.0
+base_model: huihui-ai/Huihui-Qwen3-0.6B-abliterated-v2
+tags:
+- mnn
+- abliterated
+- uncensored
+- draft-model
+- spec-decode
+- tokforge
+---

+# TokForge Acceleration Pack — Qwen3 Draft Model

+Abliterated Qwen3-0.6B draft model for speculative decoding in [TokForge](https://tokforge.ai).

+## What This Is

+A small (0.6B) abliterated model that runs on CPU alongside your main GPU model, predicting tokens in parallel. The main model batch-verifies predictions, accepting correct ones instantly — resulting in **+41% to +63% decode speed** on Qwen3 4B/8B/14B targets.
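The propose-then-batch-verify loop described above can be sketched as follows. This is a minimal illustration, not TokForge's actual API: `draft_next` and `target_next` are toy deterministic stand-ins for the real CPU draft and GPU target models.

```python
# Sketch of greedy speculative decoding with a toy draft/target pair.
# Both functions map a context (list of token IDs) to the next token.

def target_next(ctx):
    # Toy target model: counts up deterministically.
    return ctx[-1] + 1

def draft_next(ctx):
    # Toy draft model: usually agrees with the target, diverges on
    # multiples of 5 to simulate imperfect acceptance.
    n = ctx[-1] + 1
    return n + 1 if n % 5 == 0 else n

def spec_decode_step(ctx, d=3):
    """Propose d draft tokens, then batch-verify with the target.

    Returns the tokens emitted this step: the accepted prefix plus
    one token the target supplies itself."""
    # 1) Draft proposes d tokens autoregressively (cheap, on CPU).
    proposed, work = [], list(ctx)
    for _ in range(d):
        t = draft_next(work)
        proposed.append(t)
        work.append(t)
    # 2) Target checks all d positions (one batched forward pass in a
    #    real engine; simulated here call-by-call) and keeps the
    #    longest matching prefix.
    emitted, work = [], list(ctx)
    for t in proposed:
        expect = target_next(work)
        if t == expect:
            emitted.append(t)        # draft was right: token is "free"
            work.append(t)
        else:
            emitted.append(expect)   # first mismatch: keep target's token
            return emitted
    # All d accepted; the verify pass also yields one bonus token.
    emitted.append(target_next(work))
    return emitted

ctx = [0]
while len(ctx) < 12:
    ctx.extend(spec_decode_step(ctx, d=3))
```

Because every emitted token is either confirmed or supplied by the target, the output matches what the target alone would generate; the draft only changes how many expensive target passes are needed.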
+## Performance (RedMagic SM8850)

+| Target Model | Baseline | With Draft | Uplift |
+|-------------|----------|------------|--------|
+| Qwen3-4B | 16.5 tok/s | 23.2 tok/s | **+41%** |
+| Qwen3-8B | 11.0 tok/s | 15.6 tok/s | **+50%** |
+| Qwen3-14B | 6.4 tok/s | 10.4 tok/s | **+63%** |

+## Why Abliterated?

+This draft model is abliterated (refusal behavior removed), which yields a +5.5% higher token acceptance rate than the censored version. It works equally well with censored and uncensored target models — the abliterated draft simply predicts more accurately across all content types.
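Why a few points of acceptance rate matter: under the standard speculative-decoding analysis, if each drafted token is accepted independently with probability `alpha` and the draft length is `d`, a single target verification pass emits `1 + alpha + alpha^2 + ... + alpha^d` tokens in expectation. A quick sketch, using illustrative acceptance values rather than figures measured for this model:

```python
# Expected tokens emitted per target verification pass, assuming each
# drafted token is accepted independently with probability `alpha`.
# This is the textbook speculative-decoding estimate, not a
# TokForge-specific API.

def expected_tokens_per_pass(alpha: float, d: int) -> float:
    # Geometric sum 1 + alpha + ... + alpha^d: one guaranteed token
    # from the target, plus one per consecutively accepted draft token.
    return sum(alpha ** k for k in range(d + 1))

# Illustrative values only: a ~5.5-point bump in acceptance rate means
# measurably more tokens per (expensive) target pass.
base = expected_tokens_per_pass(0.80, 3)
better = expected_tokens_per_pass(0.855, 3)
```

The same formula explains why the optimal draft length depends on the target: a longer `d` only pays off when `alpha` is high enough that the extra drafted tokens are usually accepted.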
+## Compatible Target Models

+- All Qwen3 MNN models (4B, 8B, 14B) — censored or uncensored
+- **NOT compatible** with Qwen3.5 models (different tokenizer: 303K vs 495K vocab)

+For Qwen3.5 targets, use [TokForge-AccelerationPack-Qwen35-Draft](https://huggingface.co/darkmaniac7/TokForge-AccelerationPack-Qwen35-Draft).
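The vocabulary mismatch is exactly what a pre-flight compatibility check would catch: speculative decoding compares raw token IDs, so draft and target must share a tokenizer. A minimal sketch, with a hypothetical helper and the vocab sizes taken from the bullet above:

```python
# Hypothetical pre-flight check: equal vocabulary size is a necessary
# (though not sufficient) condition for a shared tokenizer. The
# 303K/495K figures are the ones quoted in this README.

VOCAB_SIZE = {
    "Qwen3-0.6B-draft": 303_000,
    "Qwen3-8B": 303_000,
    "Qwen3.5-9B": 495_000,
}

def draft_compatible(draft: str, target: str) -> bool:
    # A real check would compare the tokenizer files themselves,
    # not just the vocabulary sizes.
    return VOCAB_SIZE[draft] == VOCAB_SIZE[target]
```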
## Source

+Converted from [huihui-ai/Huihui-Qwen3-0.6B-abliterated-v2](https://huggingface.co/huihui-ai/Huihui-Qwen3-0.6B-abliterated-v2) using MNN 4-bit HQQ quantization (quant_block=64).

+## Usage

+Download via TokForge app → Settings → Spec Decode → Download Acceleration Pack.