Add MNN Q4 conversion for TokForge mobile inference
- .gitattributes +2 -0
- README.md +99 -0
- config.json +10 -0
- embeddings_bf16.bin +3 -0
- export_args.json +42 -0
- llm.mnn +3 -0
- llm.mnn.weight +3 -0
- llm_config.json +12 -0
- tokenizer.txt +0 -0
.gitattributes
CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+llm.mnn filter=lfs diff=lfs merge=lfs -text
+llm.mnn.weight filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,99 @@
---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
base_model: mistralai/Mistral-Small-24B-Instruct-2501
tags:
- mnn
- mistral
- mobile
- on-device
- tokforge
- uncensored
- abliterated
---

# Mistral-Small-24B-Instruct-MNN

Pre-converted [Mistral Small 24B Instruct](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501) in MNN format for on-device inference with [TokForge](https://tokforge.ai).

> **Original model by [Mistral AI](https://huggingface.co/mistralai)**, converted to MNN Q4 for mobile deployment.

## Model Details

| | |
|---|---|
| **Architecture** | Mistral (full attention, 40 layers) |
| **Parameters** | 24B (4-bit quantized) |
| **Format** | MNN (Alibaba Mobile Neural Network) |
| **Quantization** | W4A16 (4-bit weights, block size 128) |
| **Vocab** | 131,072 tokens |
| **Source** | [mistralai/Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501) |

## Description

Mistral AI's Small 24B is a knowledge-dense model that fits on a single high-end GPU and excels at complex reasoning, function calling, and multi-step tasks. It runs on flagship phones with 24GB of RAM and is the most capable model in this collection.

## Files

| File | Description |
|------|-------------|
| `llm.mnn` | Model computation graph |
| `llm.mnn.weight` | Quantized weight data (Q4, block=128) |
| `embeddings_bf16.bin` | Token embedding table (bf16, exported separately) |
| `llm_config.json` | Model config with Jinja chat template |
| `tokenizer.txt` | Tokenizer vocabulary |
| `config.json` | MNN runtime config |
| `export_args.json` | Arguments used for the `llmexport` conversion |

## Usage with TokForge

This model is optimized for **[TokForge](https://tokforge.ai)**, a free Android app for private, on-device LLM inference.

1. Download [TokForge from the Play Store](https://tokforge.ai)
2. Open the app → Models → Download this model
3. Start chatting; everything runs 100% locally, no internet required

### Recommended Settings

| Setting | Value |
|---------|-------|
| Backend | OpenCL (Qualcomm) / Vulkan (MediaTek) / CPU (fallback) |
| Precision | Low |
| Threads | 4 |
| Thinking | Off (or On for thinking-capable models) |

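These app-level settings mirror keys in the bundled `config.json` (shown later in this commit). For anyone driving the MNN runtime directly rather than through TokForge, here is a minimal sketch of deriving a GPU-flavored variant of that config; the `"opencl"` backend string is an assumption based on MNN's backend naming, not a TokForge-verified value:

```python
# Minimal sketch: derive a GPU-flavored runtime config from the shipped
# config.json. Keys mirror the file in this repo; the "opencl" value is
# an assumed MNN backend name, so verify it against your MNN build.
import json

with open("config.json") as f:
    cfg = json.load(f)

cfg["backend_type"] = "opencl"  # shipped default is "cpu"
cfg["thread_num"] = 4           # matches the recommended thread count
cfg["precision"] = "low"        # maps to the "Precision: Low" setting

with open("config.opencl.json", "w") as f:
    json.dump(cfg, f, indent=2)
```
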
## Performance

Actual speed varies by device, thermal state, and generation length. Typical ranges for this model size:

| Device | SoC | Backend | tok/s |
|---|---|---|---|
| RedMagic 11 Pro (24GB) | SM8850 | OpenCL | ~5-6 |

> **Note:** Requires 24GB+ RAM; best suited to flagship phones with minimal background apps.

## Attribution

This is an MNN conversion of **[Mistral Small 24B Instruct](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501)** by **[Mistral AI](https://huggingface.co/mistralai)**. All credit for the model architecture, training, and fine-tuning goes to the original authors. This conversion only changes the runtime format for mobile deployment.

## Limitations

- Intended for TokForge / MNN on-device inference on Android
- This is a runtime bundle, not a standard Transformers training checkpoint
- Quantization (Q4) may slightly reduce quality compared to the full-precision original
- Abliterated/uncensored models have had safety filters removed; **use responsibly**

## Community

- **Website:** [tokforge.ai](https://tokforge.ai)
- **Discord:** [Join our Discord](https://discord.gg/Acv3CBtfVm)
- **GitHub:** [TokForge on GitHub](https://github.com/darkmaniac7/Elysium)

## Export Details

Converted using MNN's `llmexport` pipeline:

```bash
python llmexport.py --path mistralai/Mistral-Small-24B-Instruct-2501 --export mnn --quant_bit 4 --quant_block 128
```
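After export, a quick way to smoke-test the bundle off-device is MNN's Python bindings. A sketch, assuming a pymnn build that ships the `MNN.llm` module; the create/load/response calls follow MNN's documented LLM API, but verify the exact signatures against your installed version:

```python
# Hypothetical smoke test of the exported bundle via pymnn's llm module
# (pip install MNN). Paths and API details are assumptions to verify.
import MNN.llm as llm

model = llm.create("Mistral-Small-24B-Instruct-MNN/config.json")
model.load()
print(model.response("Summarize the Mistral Small 3 release in one sentence."))
```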
config.json
ADDED
@@ -0,0 +1,10 @@
{
  "llm_model": "llm.mnn",
  "llm_weight": "llm.mnn.weight",
  "backend_type": "cpu",
  "thread_num": 4,
  "precision": "low",
  "memory": "low",
  "sampler_type": "penalty",
  "penalty": 1.1
}
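In this runtime config, `"sampler_type": "penalty"` with `"penalty": 1.1` selects repetition-penalized sampling. A sketch of the standard repetition-penalty rule (Keskar et al., CTRL) that this value parameterizes; MNN's exact variant is an assumption:

```python
# Sketch of the common repetition-penalty rule that "penalty": 1.1
# configures: logits of tokens already generated are pushed down,
# discouraging loops. MNN's exact implementation may differ.
def apply_repetition_penalty(logits: list[float], generated: list[int],
                             penalty: float = 1.1) -> list[float]:
    for tok in set(generated):
        # dividing positive logits (and multiplying negative ones)
        # always lowers the token's score when penalty > 1
        logits[tok] = logits[tok] / penalty if logits[tok] > 0 else logits[tok] * penalty
    return logits
```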
embeddings_bf16.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:be028c1aa62cb6cb18d5c0cfdb938061c95f816f50460103dc7f17633092515a
size 1342177280
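The pointer's size is a useful consistency check: with the 5,120 hidden size from `llm_config.json` below and 2 bytes per bf16 value, 1,342,177,280 bytes works out to exactly 131,072 embedding rows, i.e. the vocabulary size quoted in the model card above.

```python
# Consistency check: infer the vocab size from the embedding table size.
size_bytes = 1_342_177_280   # "size" field in the LFS pointer above
hidden_size = 5120           # "hidden_size" in llm_config.json
print(size_bytes // (hidden_size * 2))  # bf16 = 2 bytes -> 131072 rows
```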
export_args.json
ADDED
@@ -0,0 +1,42 @@
{
  "path": "/root/models/hf_convert_queue/Mistral-Small-24B-Instruct",
  "type": null,
  "tokenizer_path": "/root/models/hf_convert_queue/Mistral-Small-24B-Instruct",
  "eagle_path": null,
  "lora_path": null,
  "gptq_path": null,
  "dst_path": "/root/models/hf_uploads/Mistral-Small-24B-Instruct-MNN",
  "verbose": false,
  "test": null,
  "export": "mnn",
  "onnx_slim": false,
  "quant_bit": 4,
  "quant_block": 128,
  "visual_quant_bit": null,
  "visual_quant_block": null,
  "lm_quant_bit": 4,
  "lm_quant_block": 128,
  "mnnconvert": "../../../build/MNNConvert",
  "ppl": false,
  "awq": false,
  "hqq": false,
  "omni": false,
  "transformer_fuse": false,
  "group_conv_native": false,
  "smooth": false,
  "sym": false,
  "visual_sym": false,
  "seperate_embed": true,
  "lora_split": false,
  "calib_data": null,
  "act_bit": 16,
  "embed_bit": 16,
  "act_sym": false,
  "quant_config": null,
  "generate_for_npu": false,
  "skip_weight": false,
  "omni_epochs": 20,
  "omni_lr": 0.005,
  "omni_wd": 0.0001,
  "tie_word_embeddings": false
}
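`quant_bit: 4` with `quant_block: 128` and asymmetric quantization (`sym: false`) packs each weight into half a byte plus per-block scale/zero-point metadata. A back-of-envelope check lands close to the 12.9 GB `llm.mnn.weight` below; the 4-byte-per-block metadata figure is an assumed layout, not MNN's documented packing:

```python
# Back-of-envelope size check for W4 quantization at block size 128.
# A 2-byte scale + 2-byte zero point per block is an assumption
# (sym=false implies a zero point); MNN's exact packing may differ.
n_params = 24e9                     # ~24B quantized weights
payload = n_params * 4 / 8          # 4 bits per weight -> 12.0 GB
overhead = (n_params / 128) * 4     # per-block metadata -> 0.75 GB
print(f"~{(payload + overhead) / 1e9:.2f} GB")  # ~12.75 GB vs 12.89 GB actual
```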
llm.mnn
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2c55cc27d7e48c7c81513d6ec72f41df4cca38fc73cb4d97705bcc808d370e30
size 695576
llm.mnn.weight
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a0721394f97c7e276177e35b96182903cae4d237876600e84a1bd5cfd6643760
size 12885080106
llm_config.json
ADDED
@@ -0,0 +1,12 @@
{
  "model_type": "mistral",
  "hidden_size": 5120,
  "attention_mask": "float",
  "attention_type": "full",
  "is_mrope": false,
  "jinja": {
    "chat_template": "{%- set today = strftime_now(\"%Y-%m-%d\") %}\n{%- set default_system_message = \"You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.\\nYour knowledge base was last updated on 2023-10-01. The current date is \" + today + \".\\n\\nWhen you're not sure about some information, you say that you don't have the information and don't make up anything.\\nIf the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. \\\"What are some good restaurants around me?\\\" => \\\"Where are you?\\\" or \\\"When is the next flight to Tokyo\\\" => \\\"Where do you travel from?\\\")\" %}\n\n{{- bos_token }}\n\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content'] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set system_message = default_system_message %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{{- '[SYSTEM_PROMPT]' + system_message + '[/SYSTEM_PROMPT]' }}\n\n{%- for message in loop_messages %}\n {%- if message['role'] == 'user' %}\n {{- '[INST]' + message['content'] + '[/INST]' }}\n {%- elif message['role'] == 'system' %}\n {{- '[SYSTEM_PROMPT]' + message['content'] + '[/SYSTEM_PROMPT]' }}\n {%- elif message['role'] == 'assistant' %}\n {{- message['content'] + eos_token }}\n {%- else %}\n {{- raise_exception('Only user, system and assistant roles are supported!') }}\n {%- endif %}\n{%- endfor %}",
    "bos": "<s>",
    "eos": "</s>"
  }
}
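To see what this template actually produces, a short render with the `jinja2` package: `strftime_now` and `raise_exception` are Mistral-template extensions rather than Jinja built-ins, so the sketch stubs them in; whether MNN exposes the same globals is an assumption.

```python
# Render the bundled chat template to inspect the raw prompt format.
# strftime_now / raise_exception are not jinja2 built-ins, so we supply
# our own, matching what the Mistral template expects.
import datetime
import json
from jinja2 import Environment

cfg = json.load(open("llm_config.json"))


def raise_exception(msg):
    raise ValueError(msg)


env = Environment()
env.globals["strftime_now"] = lambda fmt: datetime.datetime.now().strftime(fmt)
env.globals["raise_exception"] = raise_exception

prompt = env.from_string(cfg["jinja"]["chat_template"]).render(
    messages=[{"role": "user", "content": "Hello!"}],
    bos_token=cfg["jinja"]["bos"],
    eos_token=cfg["jinja"]["eos"],
)
print(prompt)  # <s>[SYSTEM_PROMPT]...[/SYSTEM_PROMPT][INST]Hello![/INST]
```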
tokenizer.txt
ADDED
The diff for this file is too large to render.