This model [TOTORONG/LongCat-Flash-3.5bits](https://huggingface.co/TOTORONG/LongCat-Flash-3.5bits) was converted to MLX format from [meituan-longcat/LongCat-Flash-Chat](https://huggingface.co/meituan-longcat/LongCat-Flash-Chat) using mlx-lm version **0.27.1**.
Quantized to 3.516 bits per weight to fit an M3 Ultra with 256 GB of memory.

## “Selected layers” (the precision-bump mask)

A layer is considered early/late/periodic if its index `i` (from `model.layers.i`) satisfies:

- `i < num_layers // 8`, or
- `i >= 7 * num_layers // 8`, or
- `(i - num_layers // 8) % 3 == 2`

These layers receive:

- Q/K/V: 3-bit → 4-bit
- O-proj: 4-bit → 6-bit
- Experts (`.mlps.<idx>.*`): 2-bit → 3-bit

Switch-MLP remains 3-bit across all layers.

This mask preserves prompt sensitivity (front) and output stability (tail), with a periodic boost to reduce worst-case error accumulation.
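The selection rule and per-component bit widths above can be sketched in plain Python. This is a minimal illustration of the mask as described, not the conversion script itself; the function names and the `num_layers` value are assumptions for the example.

```python
def is_selected(i: int, num_layers: int) -> bool:
    """True if layer i is early, late, or on the periodic boost schedule."""
    front = num_layers // 8
    return (
        i < front                      # early block: prompt-sensitive layers
        or i >= 7 * num_layers // 8    # late block: output-stability layers
        or (i - front) % 3 == 2        # periodic boost through the middle
    )


def bits_for(i: int, num_layers: int) -> dict:
    """Per-component bit widths for layer i (base vs. boosted precision)."""
    boosted = is_selected(i, num_layers)
    return {
        "qkv": 4 if boosted else 3,        # Q/K/V: 3b -> 4b on selected layers
        "o_proj": 6 if boosted else 4,     # O-proj: 4b -> 6b on selected layers
        "experts": 3 if boosted else 2,    # .mlps.<idx>.*: 2b -> 3b
        "switch_mlp": 3,                   # unchanged across all layers
    }


if __name__ == "__main__":
    n = 64  # hypothetical layer count for illustration
    print([i for i in range(n) if is_selected(i, n)])
```

For `num_layers = 64` the mask covers layers 0–7, every third layer starting at 10 (10, 13, 16, …), and layers 56–63.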
## Use with mlx