spicyneuron committed on
Commit 1fd8a74 · verified · 1 Parent(s): 7b96841

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -10,13 +10,13 @@ tags:
  - mlx
  ---
 
- [GLM 5.1](https://huggingface.co/zai-org/GLM-5.1) optimized for MLX.
+ [GLM 5.1](https://huggingface.co/zai-org/GLM-5.1) optimized for MLX. Best balance of speed, memory, and quality.
 
  # Usage
 
  ```sh
  # Start server at http://localhost:8080/chat/completions
- uvx --from mlx-lm mlx_vlm.server \
+ uvx --from mlx-lm mlx_lm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/GLM-5.1-MLX-2.9bit
@@ -24,7 +24,7 @@ uvx --from mlx-lm mlx_vlm.server \
 
  # Methodology
 
- Quantized using a custom script inspired by Unsloth/AesSedai/ubergarm style mixed-precision GGUFs.
+ Quantized with a [mlx-lm fork](https://github.com/ml-explore/mlx-lm/pull/922), drawing inspiration from Unsloth/AesSedai/ubergarm style mixed-precision GGUFs.
  MLX quantization options differ than llama.cpp, but the principles are the same:
 
  - Sensitive layers like MoE routing, attention, and output embeddings get higher precision