spicyneuron committed on
Commit cefeab8 · verified · 1 Parent(s): cea20af

Update README.md

Files changed (1)
  1. README.md +17 -9
README.md CHANGED
@@ -10,7 +10,13 @@ tags:
  - mlx
  ---
 
- [GLM 5.1](https://huggingface.co/zai-org/GLM-5.1) optimized for MLX. A mixed-precision quant that balances speed, memory, and accuracy.
+ [GLM 5.1](https://huggingface.co/zai-org/GLM-5.1) optimized to run _comfortably_
+ on a Mac Studio M3 512. This is the smaller, compact version. Quality-first
+ version [here](https://huggingface.co/spicyneuron/GLM-5.1-MLX-3.6bit).
+
+ - A mixed-precision quant that balances speed, memory, and accuracy.
+ - 3-bit baseline with important layers at 4, 8 and BF16.
+ - Fits into ~280 GB memory, leaving plenty of room to run parallel models (ex: Minimax M2.7, Qwen 3.6 35B).
 
  # Usage
 
@@ -22,14 +28,6 @@ uvx --from mlx-lm mlx_lm.server \
  --model spicyneuron/GLM-5.1-MLX-2.9bit
  ```
 
- # Methodology
-
- Quantized with a [mlx-lm fork](https://github.com/ml-explore/mlx-lm/pull/922), drawing inspiration from Unsloth/AesSedai/ubergarm style mixed-precision GGUFs.
- MLX quantization options differ than llama.cpp, but the principles are the same:
-
- - Sensitive layers like MoE routing, attention, and output embeddings get higher precision
- - More tolerant layers like MoE experts get lower precision
-
  # Benchmarks
 
  metric | baa-ai/GLM-5.1-RAM-270GB-MLX | 2.9bit (this model)
@@ -52,3 +50,13 @@ mlx_lm.evaluate --tasks hellaswag --seed 123 --num-shots 0 --limit 2000
  mlx_lm.evaluate --tasks piqa --seed 123 --num-shots 0 --limit 2000
  mlx_lm.evaluate --tasks winogrande --seed 123 --num-shots 0 --limit 2000
  ```
+
+ # Methodology
+
+ Quantized with a [mlx-lm fork](https://github.com/ml-explore/mlx-lm/pull/922),
+ drawing inspiration from Unsloth/AesSedai/ubergarm style mixed-precision GGUFs.
+ MLX quantization options differ from llama.cpp, but the principles are the
+ same:
+
+ - Sensitive layers like MoE routing, attention, and output embeddings get higher precision
+ - More tolerant layers like MoE experts get lower precision
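
The two rules in the Methodology section can be sketched as a small per-layer bit-width chooser. This is only an illustration, not the recipe used for this model. The layer-name patterns (`mlp.gate`, `lm_head`, `self_attn`, `experts`), the helper `choose_bits`, and the mapping of each layer type to a specific width are all assumptions. Only the set of widths (a 3-bit baseline with some layers at 4, 8 and BF16) comes from the README.

```python
import re

# Hypothetical layer-name patterns and bit widths, for illustration only.
# None means "leave the layer unquantized" (e.g. keep it in BF16).
RULES = [
    (re.compile(r"\.mlp\.gate$"), None),   # MoE router: most sensitive
    (re.compile(r"lm_head$"), 8),          # output embeddings
    (re.compile(r"\.self_attn\."), 4),     # attention projections
    (re.compile(r"\.experts\."), 3),       # MoE experts: most tolerant
]
DEFAULT_BITS = 3  # baseline for anything not matched above


def choose_bits(layer_path):
    """Return the bit width for a layer path, or None to skip quantization."""
    for pattern, bits in RULES:
        if pattern.search(layer_path):
            return bits
    return DEFAULT_BITS


if __name__ == "__main__":
    for path in [
        "model.layers.5.mlp.gate",
        "model.layers.5.self_attn.q_proj",
        "model.layers.5.mlp.experts.3.down_proj",
        "lm_head",
    ]:
        print(path, "->", choose_bits(path))
```

The first matching rule wins, so the more specific, more sensitive patterns are listed first. A function of this shape could be adapted to whatever per-layer hook the quantization tool exposes.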