waltgrace committed (verified)
Commit ae909ec · Parent: 8320889

Add Gemma 4 benchmark results

README.md CHANGED (+15 -0)
@@ -118,6 +118,21 @@ The cache was the experiment. madvise was the answer.
  | madvise prefetch | 0.57 | Explicit kernel prefetch hints |
  | LRU cache (5 GB) | 0.24 | Duplicate data in user-space heap |
 
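The "explicit kernel prefetch hints" in the madvise row can be sketched with Python's `mmap.madvise` (3.8+, on platforms exposing `MADV_WILLNEED`). This is a minimal, hypothetical illustration, not the repo's actual implementation; the file and sizes are stand-ins:

```python
import mmap
import os
import tempfile

# Stand-in for a model shard: 1 MiB of zeros in a temp file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (1 << 20))
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
    # MADV_WILLNEED asks the kernel to start paging this range in now,
    # instead of faulting it in lazily on first access.
    mm.madvise(mmap.MADV_WILLNEED, 0, 1 << 20)
    first = mm[0]  # by now this is likely a warm page-cache hit
    mm.close()
finally:
    os.close(fd)
    os.unlink(path)
```

The point of the hint is latency hiding: the kernel can overlap the read-ahead with other work, so the first touch of the mapped expert weights does not stall on a cold page fault.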
+ ## Gemma 4-26B-A4B — MoE Sparsity Benchmark
+
+ Google Gemma 4 has 128 experts with top-8 routing (4B active of 26B total). Tested at multiple quantization levels on Apple Silicon:
+
+ | Hardware | Quant | Model size | RAM | Speed | Notes |
+ |----------|-------|------------|-----|-------|-------|
+ | M2 MacBook Air | IQ2_M | 9.3 GB | 8 GB | **1.37 tok/s** | Model exceeds RAM; MoE sparsity prevents thrashing |
+ | M4 Mac Mini | IQ2_M | 9.3 GB | 16 GB | **36.5 tok/s** | Fits in RAM, full GPU speed |
+ | M4 Mac Mini | Q4_K_M | 16.9 GB | 16 GB | **5.18 tok/s** | Exceeds RAM, still runs smoothly |
+ | M4 Mac Mini | Q8_0 | ~27 GB | 16 GB | *testing* | 1.7x oversubscription |
+
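The top-8-of-128 routing behind these numbers can be sketched as a toy router. Everything here is illustrative (the expert count and k come from the text above; the softmax-over-top-k weighting is a common MoE convention, not necessarily Gemma's exact scheme):

```python
import math
import random

NUM_EXPERTS, TOP_K = 128, 8  # from the model description above

def route(logits):
    """Pick the TOP_K highest-scoring experts and softmax their logits
    into mixing weights. Only these experts' weights need to be resident
    for this token -- the source of the MoE sparsity discussed above."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:TOP_K]
    z = [math.exp(logits[i]) for i in top]
    s = sum(z)
    return top, [w / s for w in z]

random.seed(0)
experts, weights = route([random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)])
```

Because each token touches only 8 of 128 expert blocks, the pages backing the other 120 can stay evicted without stalling generation, which is why the oversubscribed configs above degrade gracefully instead of thrashing.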
+ All results: stock llama.cpp with mmap, no madvise. Canberra verified on all configs.
+
+ **Finding:** Gemma 4's low activation ratio (15.4%) lets the OS page cache absorb the memory pressure without explicit madvise hints. The madvise sniper pays off most on denser MoE models (e.g. Qwen 35B), where the per-token working set overwhelms the page cache.
+
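A back-of-envelope check of the activation-ratio claim, using only the numbers quoted above (this ignores shared layers and KV cache, which also consume residency in practice):

```python
# Rough arithmetic behind the 15.4% figure. Assumes expert weights dominate
# the file size and page residency tracks the active-parameter fraction.
total_params_b = 26     # total parameters, billions (from the text)
active_params_b = 4     # active per token, billions (top-8 routing)
ratio = active_params_b / total_params_b        # ~0.154

iq2m_file_gb = 9.3      # IQ2_M file size from the table
hot_set_gb = iq2m_file_gb * ratio               # rough hot set per token
print(f"activation ratio: {ratio:.1%}, hot set: ~{hot_set_gb:.2f} GB")
```

At IQ2_M the hot set comes out around 1.4 GB, comfortably inside even the 8 GB M2 Air's page cache, which is consistent with the "no madvise needed" finding above.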
  ## Related
 
  - **MLX Expert Sniper** (Apple Silicon, 5.4 tok/s on 35B): [huggingface.co/waltgrace/mlx-expert-sniper](https://huggingface.co/waltgrace/mlx-expert-sniper)