spicyneuron commited on
Commit
5c5484d
·
verified ·
1 Parent(s): 346b1f8

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +64 -0
README.md ADDED
@@ -0,0 +1,64 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language:
3
+ - en
4
+ - zh
5
+ library_name: mlx
6
+ license: mit
7
+ pipeline_tag: text-generation
8
+ base_model: zai-org/GLM-5.1
9
+ tags:
10
+ - mlx
11
+ ---
12
+
13
+ (Uploading...)
14
+
15
+ [GLM 5.1](https://huggingface.co/zai-org/GLM-5.1) optimized to run _comfortably_
16
+ on a Mac Studio M3 512. This is the quality-first version. Smaller, compact
17
+ version [here](https://huggingface.co/spicyneuron/GLM-5.1-MLX-2.9bit).
18
+
19
+ - A mixed-precision quant that balances speed, memory, and accuracy.
20
+ - 3-bit baseline with important layers at 4, 8 and BF16.
21
+ - Fits into ~350 GB memory, leaving plenty of room to run parallel models (ex: Minimax M2.7, Qwen 3.6 35B).
22
+
23
+ # Usage
24
+
25
+ ```sh
26
+ # Start server at http://localhost:8080/chat/completions
27
+ uvx --from mlx-lm mlx_lm.server \
28
+ --host 127.0.0.1 \
29
+ --port 8080 \
30
+ --model spicyneuron/GLM-5.1-MLX-3.6bit
31
+ ```
32
+
33
+ # Benchmarks
34
+
35
+ metric | baa-ai/GLM-5.1-RAM-270GB-MLX | 2.9bit
36
+ --- | --- | ---
37
+ bpw | 3.1096 | 2.9064
38
+ peak memory (1024/512) | 291.257 | 272.358
39
+ prompt tok/s (1024) | 194.958 ± 0.075 | 194.216 ± 0.167
40
+ gen tok/s (512) | 21.381 ± 0.050 | 19.527 ± 0.035
41
+ perplexity | 4.780 ± 0.020 | 4.118 ± 0.016
42
+ hellaswag | 0.546 ± 0.011 | 0.59 ± 0.011
43
+ piqa | 0.776 ± 0.01 | 0.794 ± 0.009
44
+ winogrande | 0.668 ± 0.013 | 0.695 ± 0.013
45
+
46
+ Tested on a Mac Studio M3 Ultra with:
47
+
48
+ ```
49
+ mlx_lm.perplexity --sequence-length 2048 --seed 123
50
+ mlx_lm.benchmark --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
51
+ mlx_lm.evaluate --tasks hellaswag --seed 123 --num-shots 0 --limit 2000
52
+ mlx_lm.evaluate --tasks piqa --seed 123 --num-shots 0 --limit 2000
53
+ mlx_lm.evaluate --tasks winogrande --seed 123 --num-shots 0 --limit 2000
54
+ ```
55
+
56
+ # Methodology
57
+
58
+ Quantized with a [mlx-lm fork](https://github.com/ml-explore/mlx-lm/pull/922),
59
+ drawing inspiration from Unsloth/AesSedai/ubergarm style mixed-precision GGUFs.
60
+ MLX quantization options differ from llama.cpp, but the principles are the
61
+ same:
62
+
63
+ - Sensitive layers like MoE routing, attention, and output embeddings get higher precision
64
+ - More tolerant layers like MoE experts get lower precision