waltgrace committed (verified)
Commit ae909ec · Parent: 8320889

Add Gemma 4 benchmark results

README.md CHANGED (+15 -0)
@@ -118,6 +118,21 @@ The cache was the experiment. madvise was the answer.
  | madvise prefetch | 0.57 | Explicit kernel prefetch hints |
  | LRU cache (5 GB) | 0.24 | Duplicate data in user-space heap |
 
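The "explicit kernel prefetch hints" in the madvise row can be sketched with Python's `mmap.madvise` (3.8+, on platforms exposing `MADV_WILLNEED`). This is a minimal, hypothetical illustration, not the repo's actual implementation; the file and sizes are stand-ins:

```python
import mmap
import os
import tempfile

# Stand-in for a model shard: 1 MiB of zeros in a temp file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (1 << 20))
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
    # MADV_WILLNEED asks the kernel to start paging this range in now,
    # instead of faulting it in lazily on first access.
    mm.madvise(mmap.MADV_WILLNEED, 0, 1 << 20)
    first = mm[0]  # by now this is likely a warm page-cache hit
    mm.close()
finally:
    os.close(fd)
    os.unlink(path)
```

The point of the hint is latency hiding: the kernel can overlap the read-ahead with other work, so the first touch of the mapped expert weights does not stall on a cold page fault.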
+ ## Gemma 4-26B-A4B — MoE Sparsity Benchmark
+
+ Google Gemma 4 has 128 experts with top-8 routing (4B active of 26B total). Tested at multiple quantization levels on Apple Silicon:
+
+ | Hardware | Quant | Model size | RAM | Speed | Notes |
+ |----------|-------|------------|-----|-------|-------|
+ | M2 MacBook Air | IQ2_M | 9.3 GB | 8 GB | **1.37 tok/s** | Model exceeds RAM; MoE sparsity prevents thrashing |
+ | M4 Mac Mini | IQ2_M | 9.3 GB | 16 GB | **36.5 tok/s** | Fits in RAM, full GPU speed |
+ | M4 Mac Mini | Q4_K_M | 16.9 GB | 16 GB | **5.18 tok/s** | Exceeds RAM, still runs smoothly |
+ | M4 Mac Mini | Q8_0 | ~27 GB | 16 GB | *testing* | 1.7x oversubscription |
+
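The top-8-of-128 routing behind these numbers can be sketched as a toy router. Everything here is illustrative (the expert count and k come from the text above; the softmax-over-top-k weighting is a common MoE convention, not necessarily Gemma's exact scheme):

```python
import math
import random

NUM_EXPERTS, TOP_K = 128, 8  # from the model description above

def route(logits):
    """Pick the TOP_K highest-scoring experts and softmax their logits
    into mixing weights. Only these experts' weights need to be resident
    for this token -- the source of the MoE sparsity discussed above."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:TOP_K]
    z = [math.exp(logits[i]) for i in top]
    s = sum(z)
    return top, [w / s for w in z]

random.seed(0)
experts, weights = route([random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)])
```

Because each token touches only 8 of 128 expert blocks, the pages backing the other 120 can stay evicted without stalling generation, which is why the oversubscribed configs above degrade gracefully instead of thrashing.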
+ All results: stock llama.cpp with mmap, no madvise. Canberra verified on all configs.
+
+ **Finding:** Gemma 4's low activation ratio (15.4%) lets the OS page cache absorb the memory pressure without explicit madvise hints. The madvise sniper pays off most on denser MoE models (e.g. Qwen 35B), where the per-token working set overwhelms the page cache.
+
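A back-of-envelope check of the activation-ratio claim, using only the numbers quoted above (this ignores shared layers and KV cache, which also consume residency in practice):

```python
# Rough arithmetic behind the 15.4% figure. Assumes expert weights dominate
# the file size and page residency tracks the active-parameter fraction.
total_params_b = 26     # total parameters, billions (from the text)
active_params_b = 4     # active per token, billions (top-8 routing)
ratio = active_params_b / total_params_b        # ~0.154

iq2m_file_gb = 9.3      # IQ2_M file size from the table
hot_set_gb = iq2m_file_gb * ratio               # rough hot set per token
print(f"activation ratio: {ratio:.1%}, hot set: ~{hot_set_gb:.2f} GB")
```

At IQ2_M the hot set comes out around 1.4 GB, comfortably inside even the 8 GB M2 Air's page cache, which is consistent with the "no madvise needed" finding above.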
  ## Related
 
  - **MLX Expert Sniper** (Apple Silicon, 5.4 tok/s on 35B): [huggingface.co/waltgrace/mlx-expert-sniper](https://huggingface.co/waltgrace/mlx-expert-sniper)