spicyneuron commited on
Commit
19d5dd6
·
verified ·
1 Parent(s): 2e4b0f2

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +34 -0
README.md CHANGED
@@ -9,3 +9,37 @@ base_model: zai-org/GLM-5.1
9
  tags:
10
  - mlx
11
  ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9
  tags:
10
  - mlx
11
  ---
12
+
13
+ [GLM 5.1](https://huggingface.co/zai-org/GLM-5.1) optimized for MLX.
14
+
15
+ # Usage
16
+
17
+ ```sh
18
+ # Start server at http://localhost:8080/chat/completions
19
+ uvx --from mlx-lm mlx_vlm.server \
20
+ --host 127.0.0.1 \
21
+ --port 8080 \
22
+ --model spicyneuron/GLM-5.1-MLX-2.9bit
23
+ ```
24
+
25
+ # Methodology
26
+
27
+ Quantized using a custom script inspired by Unsloth/AesSedai/ubergarm style mixed-precision GGUFs.
28
+ MLX quantization options differ than llama.cpp, but the principles are the same:
29
+
30
+ - Sensitive layers like MoE routing, attention, and output embeddings get higher precision
31
+ - More tolerant layers like MoE experts get lower precision
32
+
33
+ # Benchmarks
34
+
35
+ (WIP)
36
+
37
+ Tested with:
38
+
39
+ ```
40
+ mlx_lm.perplexity --sequence-length 2048 --seed 123
41
+ mlx_lm.benchmark --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
42
+ mlx_lm.evaluate --tasks hellaswag --seed 123 --num-shots 0 --limit 2000
43
+ mlx_lm.evaluate --tasks piqa --seed 123 --num-shots 0 --limit 2000
44
+ mlx_lm.evaluate --tasks winogrande --seed 123 --num-shots 0 --limit 2000
45
+ ```