darkmaniac7 commited on
Commit
51a33fe
·
verified ·
1 Parent(s): af3c044

docs: update README — abliterated draft with benchmarks

Files changed (1)
  1. README.md +30 -50
README.md CHANGED
@@ -1,66 +1,46 @@
- # TokForge Acceleration Pack — Draft Model
-
- **Qwen3-0.6B (MNN)** draft model for speculative decoding in [TokForge](https://tokforge.ai).
-
- ## What is this?
-
- This is a curated, ready-to-use 0.6B parameter draft model packaged for TokForge's speculative decoding pipeline. Speculative decoding uses a small, fast "draft" model to propose candidate tokens, which are then verified in parallel by the larger "target" model. This amortizes the cost of target model inference and delivers significant speed improvements.
-
- ## Performance
-
- Tested on Snapdragon 8 Elite 2 (SM8850 / Adreno 840):
-
- | Target Model | Baseline (tok/s) | With Spec-Decode (tok/s) | Improvement |
- |---|---|---|---|
- | Qwen3-8B (OpenCL) | 11.41 | 19.60 | **+72%** |
- | Qwen3.5-9B (CPU) | 9.55 | 16.67 | **+75%** |
-
- Improvements vary by device and target model. Typical range: **+18-75%** decode speed on flagship SoCs (SM8850, SM8650).
-
- ## Compatible Target Models
-
- - Qwen3-4B (MNN, OpenCL or CPU)
- - Qwen3-8B (MNN, OpenCL)
- - Qwen3-14B (MNN, OpenCL, 24GB+ RAM devices)
- - Qwen3.5-9B (MNN, CPU)
-
- ## Installation
-
- ### Automatic (Recommended)
-
- In TokForge: **Settings > Advanced > Speculative Decoding > Download Acceleration Pack**
-
- The app will download and configure the draft model automatically.
-
- ### Manual
-
- 1. Download all files from this repository
- 2. Copy them to: `<device>/Android/data/dev.tokforge/files/models/spec-decode-draft/`
- 3. Enable speculative decoding in TokForge Settings > Advanced
-
- ## Files
-
- | File | Size | Description |
- |---|---|---|
- | `llm.mnn` | ~450 KB | MNN model graph |
- | `llm.mnn.weight` | ~430 MB | Model weights |
- | `tokenizer.txt` | ~3 MB | Tokenizer vocabulary |
- | `config.json` | <1 KB | Model configuration |
- | `llm_config.json` | ~5 KB | LLM inference configuration |
- | `config_cpu.json` | <1 KB | CPU backend configuration for draft inference |
- | `draft_config_cpu.json` | <1 KB | Draft-specific CPU configuration |
-
- ## Technical Details
-
- - **Architecture**: Qwen3-0.6B (28 transformer layers, 1024 hidden dim)
- - **Format**: MNN (Mobile Neural Network by Alibaba)
- - **Backend**: CPU with single-thread, low-precision inference for minimal draft overhead (~21ms/token)
- - **Optimal draft length**: d=2 for 8B targets, d=3 for 9B targets
-
  ## Source
-
- Based on [taobao-mnn/Qwen3-0.6B-MNN](https://huggingface.co/taobao-mnn/Qwen3-0.6B-MNN), repackaged with TokForge-specific draft configurations.
-
- ## License
-
- This model inherits the license from [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B). See the original model card for license terms.
+ ---
+ license: apache-2.0
+ base_model: huihui-ai/Huihui-Qwen3-0.6B-abliterated-v2
+ tags:
+ - mnn
+ - abliterated
+ - uncensored
+ - draft-model
+ - spec-decode
+ - tokforge
+ ---
+ # TokForge Acceleration Pack — Qwen3 Draft Model
+
+ Abliterated Qwen3-0.6B draft model for speculative decoding in [TokForge](https://tokforge.ai).
+
+ ## What This Is
+
+ A small (0.6B) abliterated model that runs on CPU alongside your main GPU model, proposing candidate tokens ahead of it. The main model verifies those candidates in a single batched pass, accepting correct ones at almost no cost — resulting in **+41% to +63% decode speed** on Qwen3 4B/8B/14B targets.
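The draft/verify loop can be sketched as follows. This is a minimal toy illustration with stand-in models (the `draft_next`/`target_next` functions are hypothetical), not TokForge's actual API:

```python
import random

random.seed(0)

# Toy stand-ins for the real models (hypothetical; in TokForge the
# Qwen3-0.6B draft runs on CPU and the large target on GPU).
def draft_next(ctx):
    return (ctx[-1] * 31 + 7) % 100               # fast, sometimes wrong

def target_next(ctx):
    # Agrees with the draft ~80% of the time in this toy setup.
    if random.random() < 0.8:
        return (ctx[-1] * 31 + 7) % 100
    return random.randrange(100)

def speculative_step(ctx, d=3):
    """Draft d tokens, then have the target verify them in one batch."""
    proposals, draft_ctx = [], list(ctx)
    for _ in range(d):
        tok = draft_next(draft_ctx)
        proposals.append(tok)
        draft_ctx.append(tok)

    emitted, verify_ctx = [], list(ctx)
    for tok in proposals:                         # conceptually one batched pass
        expected = target_next(verify_ctx)
        if tok == expected:
            emitted.append(tok)                   # accepted "for free"
            verify_ctx.append(tok)
        else:
            emitted.append(expected)              # first mismatch: keep target's token
            break
    else:
        emitted.append(target_next(verify_ctx))   # all accepted: one bonus token
    return emitted

out = speculative_step([1], d=3)
print(out)
```

Each target pass thus yields between 1 and d+1 tokens; the higher the draft's acceptance rate, the closer the average gets to d+1.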
+ ## Performance (RedMagic SM8850)
+
+ | Target Model | Baseline | With Draft | Uplift |
+ |---|---|---|---|
+ | Qwen3-4B | 16.5 tok/s | 23.2 tok/s | **+41%** |
+ | Qwen3-8B | 11.0 tok/s | 15.6 tok/s | **+42%** |
+ | Qwen3-14B | 6.4 tok/s | 10.4 tok/s | **+63%** |
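The uplift column is simply `with_draft / baseline - 1`. Recomputing it from the table's tok/s figures:

```python
# Baseline and with-draft decode speeds (tok/s) from the table above.
results = {
    "Qwen3-4B": (16.5, 23.2),
    "Qwen3-8B": (11.0, 15.6),
    "Qwen3-14B": (6.4, 10.4),
}

# Percentage uplift, one decimal place.
uplifts = {name: round((spec / base - 1) * 100, 1)
           for name, (base, spec) in results.items()}

for name, pct in uplifts.items():
    print(f"{name}: +{pct}%")
```

This gives +40.6%, +41.8%, and +62.5% respectively, matching the rounded figures in the table.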
+ ## Why Abliterated?
+
+ This draft model is abliterated (refusal behavior removed), which yields a +5.5% higher token acceptance rate than the censored version. It works equally well with both censored and uncensored target models — the abliterated draft simply predicts more accurately across all content types.
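Why a small acceptance bump matters: under the independence assumption commonly used in speculative-decoding analyses, a target pass with draft length d and per-token acceptance probability a emits an expected sum of a^k for k = 0..d tokens. The acceptance rates below are illustrative, not measured values from this card:

```python
def expected_tokens_per_pass(a, d):
    """Expected tokens emitted per target verification pass, assuming each
    drafted token is accepted independently with probability a."""
    # Closed form: (1 - a**(d + 1)) / (1 - a)
    return sum(a ** k for k in range(d + 1))

# Illustrative acceptance rates (hypothetical, chosen to show the effect
# of a modest bump like the +5.5% mentioned above).
base, bumped = 0.60, 0.655
print(expected_tokens_per_pass(base, 3))    # ~2.18 tokens per pass
print(expected_tokens_per_pass(bumped, 3))  # ~2.37 tokens per pass
```

Because accepted tokens compound across the drafted sequence, even a few points of extra acceptance translate into a measurable decode-speed gain.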
+ ## Compatible Target Models
+
+ - All Qwen3 MNN models (4B, 8B, 14B) — censored or uncensored
+ - **NOT compatible** with Qwen3.5 models (different tokenizer, 303K vs 495K vocab)
+
+ For Qwen3.5 targets, use [TokForge-AccelerationPack-Qwen35-Draft](https://huggingface.co/darkmaniac7/TokForge-AccelerationPack-Qwen35-Draft).
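The incompatibility is a hard constraint: the target can only verify token IDs drawn from its own vocabulary. A trivial guard, using the card's rounded vocabulary figures (a size match is only a cheap proxy for a true tokenizer match):

```python
def drafts_compatible(draft_vocab: int, target_vocab: int) -> bool:
    # Speculative decoding requires draft and target to share a tokenizer,
    # so a token ID proposed by the draft means the same thing to the
    # target. Equal vocab size is a necessary (not sufficient) check.
    return draft_vocab == target_vocab

QWEN3_VOCAB, QWEN35_VOCAB = 303_000, 495_000   # rounded figures from this card
print(drafts_compatible(QWEN3_VOCAB, QWEN3_VOCAB))    # True
print(drafts_compatible(QWEN3_VOCAB, QWEN35_VOCAB))   # False
```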
 
 
 
+
  ## Source
+
+ Converted from [huihui-ai/Huihui-Qwen3-0.6B-abliterated-v2](https://huggingface.co/huihui-ai/Huihui-Qwen3-0.6B-abliterated-v2) using MNN 4-bit HQQ quantization (quant_block=64).
+
+ ## Usage
+
+ Download via the TokForge app: **Settings > Spec Decode > Download Acceleration Pack**.
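For intuition about the 4-bit, quant_block=64 scheme, here is a minimal min-max block quantizer in pure Python. This illustrates block-wise 4-bit quantization in general; it is not MNN's actual HQQ implementation, which optimizes scales and zero-points rather than taking the raw block min/max:

```python
import random

random.seed(0)

def quantize_block4(w, block=64):
    """Min-max asymmetric 4-bit quantization per block of `block` values.
    Illustrative only: real HQQ solves for better scales/zero-points."""
    out = []
    for i in range(0, len(w), block):
        chunk = w[i:i + block]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / 15.0 or 1.0          # 4 bits -> 16 levels; guard flat blocks
        q = [min(15, max(0, round((x - lo) / scale))) for x in chunk]
        out.append((q, scale, lo))               # store codes + per-block params
    return out

def dequantize_block4(blocks):
    return [qi * scale + lo for q, scale, lo in blocks for qi in q]

w = [random.gauss(0.0, 1.0) for _ in range(256)]  # stand-in weight tensor
blocks = quantize_block4(w)
err = max(abs(a - b) for a, b in zip(dequantize_block4(blocks), w))
print(f"max abs reconstruction error: {err:.4f}")
```

Per-block error is bounded by half the block's scale, which is why small blocks like 64 keep quantization error low even on weight tensors with outliers.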