Commit 353e5e7 (verified) by spicyneuron · 1 Parent(s): 1dd92a9

Update README.md

Files changed (1): README.md +12 -13
README.md CHANGED
@@ -9,10 +9,9 @@ tags:
9
  base_model: MiniMaxAI/MiniMax-M2.7
10
  ---
11
 
12
- [MiniMax-M2.7](MiniMaxAI/MiniMax-M2.7) optimized for MLX. A mixed-precision quant that balances speed, memory, and accuracy.
13
-
14
- - 4 bit baseline with important layers at 5, 6, 8, and BF16.
15
- -
16
 
17
  # Usage
18
 
@@ -24,14 +23,6 @@ uvx --from mlx-lm mlx_lm.server \
24
  --model spicyneuron/MiniMax-M2.7-MLX-4.9bit
25
  ```
26
 
27
- # Methodology
28
-
29
- Quantized with a [mlx-lm fork](https://github.com/ml-explore/mlx-lm/pull/922), drawing inspiration from Unsloth/AesSedai/ubergarm style mixed-precision GGUFs.
30
- MLX quantization options differ from llama.cpp's, but the principles are the same:
31
-
32
- - Sensitive layers like MoE routing, attention, and output embeddings get higher precision
33
- - More tolerant layers like MoE experts get lower precision
34
-
35
  # Benchmarks
36
 
37
  metric | mlx-community_MiniMax-M2.7-4bit | baa-ai_MiniMax-M2.7-RAM-155GB-MLX | 4.9 bit (this model)
@@ -53,4 +44,12 @@ mlx_lm.benchmark --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
53
  mlx_lm.evaluate --tasks hellaswag --seed 123 --num-shots 0 --limit 2000
54
  mlx_lm.evaluate --tasks piqa --seed 123 --num-shots 0 --limit 2000
55
  mlx_lm.evaluate --tasks winogrande --seed 123 --num-shots 0 --limit 2000
56
- ```
 
9
  base_model: MiniMaxAI/MiniMax-M2.7
10
  ---
11
 
12
+ [MiniMax-M2.7](MiniMaxAI/MiniMax-M2.7) optimized for MLX.
13
+ A mixed-precision quant that balances speed, memory, and accuracy.
14
+ 4 bit baseline with important layers at 5, 6, 8, and BF16.
 
15
 
16
  # Usage
17
 
 
23
  --model spicyneuron/MiniMax-M2.7-MLX-4.9bit
24
  ```
25
 
 
26
  # Benchmarks
27
 
28
  metric | mlx-community_MiniMax-M2.7-4bit | baa-ai_MiniMax-M2.7-RAM-155GB-MLX | 4.9 bit (this model)
 
44
  mlx_lm.evaluate --tasks hellaswag --seed 123 --num-shots 0 --limit 2000
45
  mlx_lm.evaluate --tasks piqa --seed 123 --num-shots 0 --limit 2000
46
  mlx_lm.evaluate --tasks winogrande --seed 123 --num-shots 0 --limit 2000
47
+ ```
48
+
49
+ # Methodology
50
+
51
+ Quantized with a [mlx-lm fork](https://github.com/ml-explore/mlx-lm/pull/922), drawing inspiration from Unsloth/AesSedai/ubergarm style mixed-precision GGUFs.
52
+ MLX quantization options differ from llama.cpp's, but the principles are the same:
53
+
54
+ - Sensitive layers like MoE routing, attention, and output embeddings get higher precision
55
+ - More tolerant layers like MoE experts get lower precision
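
The two principles above can be sketched as a small lookup from module path to precision. This is only an illustration, not the recipe used for this model: the path patterns (`.experts.`, `.gate`, `.self_attn.`, `lm_head`), the specific bit assigned to each group, and the helper name `layer_precision` are all assumptions, and wiring such a function into mlx-lm or the fork is not shown.

```python
import re

# Hypothetical module-path patterns, checked in order (first match wins).
# Real MiniMax-M2.7 layer names and the author's actual bit choices may differ.
RULES = [
    (re.compile(r"\.experts\."), 4),      # tolerant: MoE expert weights
    (re.compile(r"\.gate$"), "bf16"),     # sensitive: MoE router, left unquantized
    (re.compile(r"\.self_attn\."), 6),    # sensitive: attention projections
    (re.compile(r"^lm_head$"), 8),        # sensitive: output embeddings
]
BASELINE_BITS = 4  # the 4 bit baseline for anything not matched above


def layer_precision(path):
    """Return the bit width (or "bf16") assigned to a module path."""
    for pattern, precision in RULES:
        if pattern.search(path):
            return precision
    return BASELINE_BITS


if __name__ == "__main__":
    for name in [
        "model.layers.0.block_sparse_moe.experts.3.w1",
        "model.layers.0.block_sparse_moe.gate",
        "model.layers.0.self_attn.q_proj",
        "lm_head",
    ]:
        print(name, "->", layer_precision(name))
```

Putting the expert rule first keeps any expert sub-module out of the router rule, so the large expert weights stay at the baseline while the small, sensitive layers get the extra bits.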