This model [TOTORONG/LongCat-Flash-3.5bits](https://huggingface.co/TOTORONG/LongCat-Flash-3.5bits) was converted to MLX format from [meituan-longcat/LongCat-Flash-Chat](https://huggingface.co/meituan-longcat/LongCat-Flash-Chat) using mlx-lm version **0.27.1**.
## Quantization policy (by module)

| Module / Tensor name pattern | Bits | Notes |
|---|---|---|
| LayerNorms: `*layernorm*`, `input_layernorm`, `post_attention_layernorm` | fp16 (not quantized) | Kept full precision for stability; negligible size share. |
| Router: `mlp.router.classifier.*` | 8b | Conservative to preserve expert routing fidelity. |
| Embeddings: `embed_tokens.*` | 8b | Vocabulary quality & calibration. |
| LM head: `lm_head.*` | 8b | Output logits stability & calibration. |
| Self-Attention Q/K/V: `.self_attn.(q_a\|q_b\|kv_a(_with_mqa)?` | 3b → 4b on selected layers | |
| Self-Attention O-proj: `.self_attn.o_proj.weight` | 4b → 6b on selected layers | Higher precision on early/late/periodic layers to reduce accumulation error. |
| Switch-MLP experts: `.mlp.switch_mlp.(up\|gate\|down)_proj.weight` | 3b | Uniform across all layers. |
| Experts (per-block): `.mlps.<idx>.(up\|gate\|down)_proj.weight` | 2b → 3b on selected layers | |
| Everything else | low_bits fallback | Uses the converter's low_bits default if not matched above. |
## "Selected layers" (the precision bump mask)

A layer is considered early/late/periodic if its index `i` (from `model.layers.i`) satisfies any of:

- `i < num_layers // 8`, or
- `i >= 7 * num_layers // 8`, or
- `(i - num_layers // 8) % 3 == 2`

These layers receive:

- Q/K/V: 3b → 4b
- O-proj: 4b → 6b
- Experts (`.mlps.<idx>.*`): 2b → 3b

Switch-MLP remains 3b across all layers.

This mask preserves prompt sensitivity (front) and output stability (tail), with a periodic boost to reduce worst-case error accumulation.
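The three rules can be sketched as a single predicate (function name is illustrative; `num_layers` is the model's total layer count):

```python
def is_selected_layer(i: int, num_layers: int) -> bool:
    """Early/late/periodic mask from the rules above."""
    front = num_layers // 8
    return (
        i < front                      # early block (prompt sensitivity)
        or i >= 7 * num_layers // 8    # late block (output stability)
        or (i - front) % 3 == 2       # periodic boost in the middle
    )

# For a hypothetical 64-layer model: layers 0-7, 56-63, plus every
# third middle layer where (i - 8) % 3 == 2 (10, 13, 16, ...).
selected = [i for i in range(64) if is_selected_layer(i, 64)]
```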

## Use with mlx

```bash