darkmaniac7 committed on
Commit
bf29b79
·
verified ·
1 Parent(s): 2d83c93

Add MNN Q4 conversion for TokForge mobile inference

Files changed (9)
  1. .gitattributes +2 -0
  2. README.md +99 -0
  3. config.json +10 -0
  4. embeddings_bf16.bin +3 -0
  5. export_args.json +42 -0
  6. llm.mnn +3 -0
  7. llm.mnn.weight +3 -0
  8. llm_config.json +12 -0
  9. tokenizer.txt +0 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ llm.mnn filter=lfs diff=lfs merge=lfs -text
+ llm.mnn.weight filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,99 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ pipeline_tag: text-generation
+ base_model: mistralai/Mistral-Small-24B-Instruct-2501
+ tags:
+ - mnn
+ - mistral
+ - mobile
+ - on-device
+ - tokforge
+ - uncensored
+ - abliterated
+ ---
+
+ # Mistral-Small-24B-Instruct-MNN
+
+ Pre-converted [Mistral Small 24B Instruct](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501) in MNN format for on-device inference with [TokForge](https://tokforge.ai).
+
+ > **Original model by [Mistral AI](https://huggingface.co/mistralai)** — converted to MNN Q4 for mobile deployment.
+
+ ## Model Details
+
+ | | |
+ |---|---|
+ | **Architecture** | Mistral (full attention, 40 layers) |
+ | **Parameters** | 24B (4-bit quantized) |
+ | **Format** | MNN (Alibaba Mobile Neural Network) |
+ | **Quantization** | W4A16 (4-bit weights, block size 128) |
+ | **Vocab** | 131,072 tokens |
+ | **Source** | [mistralai/Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501) |
+
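+ As a rough sketch of what "W4A16, block size 128" means: weights are stored as 4-bit integers with one scale and zero point per 128-value block, while activations stay 16-bit. Below is a minimal asymmetric round trip (the asymmetric form follows `"sym": false` in `export_args.json`; the actual on-disk packing MNN uses may differ):
+
+ ```python
+ import numpy as np
+
+ def quantize_block(w: np.ndarray) -> tuple[np.ndarray, float, float]:
+     """Asymmetric 4-bit quantization of one 128-value weight block."""
+     lo, hi = float(w.min()), float(w.max())
+     scale = (hi - lo) / 15.0 or 1.0          # 4 bits -> q in [0, 15]
+     zero = -lo / scale
+     q = np.clip(np.round(w / scale + zero), 0, 15).astype(np.uint8)
+     return q, scale, zero
+
+ def dequantize_block(q: np.ndarray, scale: float, zero: float) -> np.ndarray:
+     """Recover approximate weights: w ~= scale * (q - zero)."""
+     return scale * (q.astype(np.float32) - zero)
+
+ w = np.random.randn(128).astype(np.float32)
+ q, scale, zero = quantize_block(w)
+ err = np.abs(dequantize_block(q, scale, zero) - w).max()  # at most ~scale/2
+ ```
+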
+ ## Description
+
+ Mistral AI's Small 24B — a knowledge-dense 24B model that fits on a single high-end GPU. Excellent for complex reasoning, function calling, and multi-step tasks. It runs on flagship phones with 24GB of RAM and is the most capable model in this collection.
+
+ ## Files
+
+ | File | Description |
+ |------|-------------|
+ | `llm.mnn` | Model computation graph |
+ | `llm.mnn.weight` | Quantized weight data (Q4, block=128) |
+ | `embeddings_bf16.bin` | Token embedding table (bf16, exported separately via `seperate_embed`) |
+ | `llm_config.json` | Model config with Jinja chat template |
+ | `tokenizer.txt` | Tokenizer vocabulary |
+ | `config.json` | MNN runtime config |
+
+ ## Usage with TokForge
+
+ This model is optimized for **[TokForge](https://tokforge.ai)** — a free Android app for private, on-device LLM inference.
+
+ 1. Download [TokForge from the Play Store](https://tokforge.ai)
+ 2. Open the app → Models → Download this model
+ 3. Start chatting — runs 100% locally, no internet required
+
+ ### Recommended Settings
+
+ | Setting | Value |
+ |---------|-------|
+ | Backend | OpenCL (Qualcomm) / Vulkan (MediaTek) / CPU (fallback) |
+ | Precision | Low |
+ | Threads | 4 |
+ | Thinking | Off (or On for thinking-capable models) |
+
+ These settings correspond to fields in the bundled `config.json`; a sketch of applying them is shown below.
+
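+ A minimal sketch of editing the bundled `config.json` before loading. The field names come from this repo's config; the backend strings such as `"opencl"` and `"vulkan"` are assumptions about MNN's backend naming and should be verified against your MNN build:
+
+ ```python
+ import json
+
+ with open("config.json") as f:       # runtime config shipped with this model
+     cfg = json.load(f)
+
+ cfg["backend_type"] = "opencl"       # assumed value; "vulkan" or "cpu" likewise
+ cfg["thread_num"] = 4
+ cfg["precision"] = "low"
+
+ with open("config.json", "w") as f:
+     json.dump(cfg, f, indent=2)
+ ```
+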
+ ## Performance
+
+ Actual speed varies by device, thermal state, and generation length. Typical range for this model size:
+
+ | Device | SoC | Backend | tok/s |
+ |---|---|---|---|
+ | RedMagic 11 Pro (24GB) | SM8850 | OpenCL | ~5-6 |
+
+ > **Note:** Requires a flagship phone with 24GB+ RAM and minimal background apps.
+
+ ## Attribution
+
+ This is an MNN conversion of **[Mistral Small 24B Instruct](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501)** by **[Mistral AI](https://huggingface.co/mistralai)**. All credit for the model architecture, training, and fine-tuning goes to the original authors. This conversion only changes the runtime format for mobile deployment.
+
+ ## Limitations
+
+ - Intended for TokForge / MNN on-device inference on Android
+ - This is a runtime bundle, not a standard Transformers training checkpoint
+ - Quantization (Q4) may slightly reduce quality compared to the full-precision original
+ - Abliterated/uncensored models have had safety filters removed — **use responsibly**
+
+ ## Community
+
+ - **Website:** [tokforge.ai](https://tokforge.ai)
+ - **Discord:** [Join our Discord](https://discord.gg/Acv3CBtfVm)
+ - **GitHub:** [TokForge on GitHub](https://github.com/darkmaniac7/Elysium)
+
+ ## Export Details
+
+ Converted using MNN's `llmexport` pipeline:
+
+ ```bash
+ python llmexport.py --path mistralai/Mistral-Small-24B-Instruct-2501 --export mnn --quant_bit 4 --quant_block 128
+ ```
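+
+ Outside the TokForge app, the bundle can also be fetched with `huggingface_hub`. A minimal sketch; the repo id below is inferred from this model card and is an assumption:
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Repo id inferred from this page; adjust if the repository moves.
+ local_dir = snapshot_download(
+     repo_id="darkmaniac7/Mistral-Small-24B-Instruct-MNN",
+     local_dir="Mistral-Small-24B-Instruct-MNN",
+ )
+ print(local_dir)  # folder containing llm.mnn, llm.mnn.weight, config.json, ...
+ ```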
config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "llm_model": "llm.mnn",
+   "llm_weight": "llm.mnn.weight",
+   "backend_type": "cpu",
+   "thread_num": 4,
+   "precision": "low",
+   "memory": "low",
+   "sampler_type": "penalty",
+   "penalty": 1.1
+ }
embeddings_bf16.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:be028c1aa62cb6cb18d5c0cfdb938061c95f816f50460103dc7f17633092515a
+ size 1342177280
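The pointer's size doubles as a sanity check on the vocabulary figure in the README: with bf16 entries (2 bytes each) and `hidden_size` 5120 from `llm_config.json`, the byte count corresponds to exactly 131,072 embedding rows.

```python
# 1,342,177,280 bytes / (5120 dims * 2 bytes per bf16 value) = 131,072 rows
assert 1342177280 // (5120 * 2) == 131072
```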
export_args.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "path": "/root/models/hf_convert_queue/Mistral-Small-24B-Instruct",
+   "type": null,
+   "tokenizer_path": "/root/models/hf_convert_queue/Mistral-Small-24B-Instruct",
+   "eagle_path": null,
+   "lora_path": null,
+   "gptq_path": null,
+   "dst_path": "/root/models/hf_uploads/Mistral-Small-24B-Instruct-MNN",
+   "verbose": false,
+   "test": null,
+   "export": "mnn",
+   "onnx_slim": false,
+   "quant_bit": 4,
+   "quant_block": 128,
+   "visual_quant_bit": null,
+   "visual_quant_block": null,
+   "lm_quant_bit": 4,
+   "lm_quant_block": 128,
+   "mnnconvert": "../../../build/MNNConvert",
+   "ppl": false,
+   "awq": false,
+   "hqq": false,
+   "omni": false,
+   "transformer_fuse": false,
+   "group_conv_native": false,
+   "smooth": false,
+   "sym": false,
+   "visual_sym": false,
+   "seperate_embed": true,
+   "lora_split": false,
+   "calib_data": null,
+   "act_bit": 16,
+   "embed_bit": 16,
+   "act_sym": false,
+   "quant_config": null,
+   "generate_for_npu": false,
+   "skip_weight": false,
+   "omni_epochs": 20,
+   "omni_lr": 0.005,
+   "omni_wd": 0.0001,
+   "tie_word_embeddings": false
+ }
llm.mnn ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2c55cc27d7e48c7c81513d6ec72f41df4cca38fc73cb4d97705bcc808d370e30
+ size 695576
llm.mnn.weight ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a0721394f97c7e276177e35b96182903cae4d237876600e84a1bd5cfd6643760
+ size 12885080106
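Once the real payloads are pulled (via Git LFS or `snapshot_download`), they can be checked against these pointers with the standard library alone; the oids below are copied from the pointer files in this commit:

```python
import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 without loading it into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# oids copied from the LFS pointers above
expected = {
    "llm.mnn": "2c55cc27d7e48c7c81513d6ec72f41df4cca38fc73cb4d97705bcc808d370e30",
    "llm.mnn.weight": "a0721394f97c7e276177e35b96182903cae4d237876600e84a1bd5cfd6643760",
}
for name, oid in expected.items():
    assert sha256_of(name) == oid, f"{name} is corrupt or still an LFS pointer"
```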
llm_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "model_type": "mistral",
+   "hidden_size": 5120,
+   "attention_mask": "float",
+   "attention_type": "full",
+   "is_mrope": false,
+   "jinja": {
+     "chat_template": "{%- set today = strftime_now(\"%Y-%m-%d\") %}\n{%- set default_system_message = \"You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.\\nYour knowledge base was last updated on 2023-10-01. The current date is \" + today + \".\\n\\nWhen you're not sure about some information, you say that you don't have the information and don't make up anything.\\nIf the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. \\\"What are some good restaurants around me?\\\" => \\\"Where are you?\\\" or \\\"When is the next flight to Tokyo\\\" => \\\"Where do you travel from?\\\")\" %}\n\n{{- bos_token }}\n\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content'] %}\n {%- set loop_messages = messages[1:] %}\n{%- else %}\n {%- set system_message = default_system_message %}\n {%- set loop_messages = messages %}\n{%- endif %}\n{{- '[SYSTEM_PROMPT]' + system_message + '[/SYSTEM_PROMPT]' }}\n\n{%- for message in loop_messages %}\n {%- if message['role'] == 'user' %}\n {{- '[INST]' + message['content'] + '[/INST]' }}\n {%- elif message['role'] == 'system' %}\n {{- '[SYSTEM_PROMPT]' + message['content'] + '[/SYSTEM_PROMPT]' }}\n {%- elif message['role'] == 'assistant' %}\n {{- message['content'] + eos_token }}\n {%- else %}\n {{- raise_exception('Only user, system and assistant roles are supported!') }}\n {%- endif %}\n{%- endfor %}",
+     "bos": "<s>",
+     "eos": "</s>"
+   }
+ }
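To inspect the exact prompt string this template produces, it can be rendered standalone with the `jinja2` package (TokForge/MNN apply it internally; `strftime_now` and `raise_exception` are helpers the template expects the host to supply):

```python
import json
from datetime import datetime
from jinja2 import Environment

with open("llm_config.json") as f:
    cfg = json.load(f)

env = Environment()
env.globals["strftime_now"] = lambda fmt: datetime.now().strftime(fmt)

def raise_exception(message):
    raise ValueError(message)

env.globals["raise_exception"] = raise_exception

template = env.from_string(cfg["jinja"]["chat_template"])
prompt = template.render(
    messages=[{"role": "user", "content": "Hello!"}],
    bos_token=cfg["jinja"]["bos"],
    eos_token=cfg["jinja"]["eos"],
)
# Prints: <s>[SYSTEM_PROMPT]...default system message...[/SYSTEM_PROMPT][INST]Hello![/INST]
print(prompt)
```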
tokenizer.txt ADDED
The diff for this file is too large to render. See raw diff