---
library_name: gguf
tags:
  - moe
  - expert-prefetch
  - madvise
  - llama-cpp
  - apple-silicon
  - cuda
  - on-device
license: mit
pipeline_tag: text-generation
---

# llama.cpp Expert Sniper: madvise prefetch for MoE inference

~65 lines of C++ that let llama.cpp run MoE models larger than RAM.

Stock llama.cpp thrashes indefinitely. This build generates tokens.

## Results

| Hardware | RAM | Model | Stock llama.cpp | madvise build |
|----------|-----|-------|-----------------|---------------|
| M2 MacBook Air | 8 GB | Qwen3.5-35B-A3B IQ2_M (10.6 GB) | 0 tok/s (thrash) | **0.57 tok/s** |
| M2 MacBook Air | 8 GB | Same model, no-op callback only | 0 tok/s (thrash) | 0.46 tok/s |

On machines with abundant RAM (A100 box with 251 GB, RTX 3090 box with 31 GB), stock llama.cpp is faster: the OS page cache handles everything on its own. madvise helps specifically when **system RAM < model size** and **all layers run on CPU (`-ngl 0`)**.

## How it works

MoE models activate 8 of 128+ experts per token, and consecutive tokens share ~87% of their active experts. Stock llama.cpp uses mmap, but the OS has no idea which expert pages are hot: it evicts them blindly, causing a page-fault storm.
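
Back-of-envelope, using the two figures above: with top-8 routing and ~87% expert overlap between consecutive tokens, only about one expert slice per MoE layer is newly needed each token, which is a churn rate that prefetch hints can stay ahead of.

```python
# Rough arithmetic from the numbers quoted above:
# 8 active experts per token, ~87% overlap with the previous token.
active_experts = 8
overlap = 0.87

# Expert slices that must be newly paged in, per token, per MoE layer
new_experts_per_token = active_experts * (1 - overlap)
print(f"~{new_experts_per_token:.2f} new expert slices per token per layer")
```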

Our patch hooks llama.cpp's eval callback, intercepts every `ggml_mul_mat_id` operation, reads the router's top-k expert selection from `t->src[2]`, and calls `madvise(MADV_WILLNEED)` on each active expert's memory range. This tells the kernel which pages to prefetch before the compute needs them.

Zero allocation. Zero memcpy. Zero mutex. One syscall per expert slice.
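
A minimal sketch of the core trick, in plain Python rather than the actual C++ callback (the file layout, slice size, and chosen expert indices here are invented for illustration): page-align each selected expert's byte range, then hand the kernel a `MADV_WILLNEED` hint for it.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

def aligned_range(offset, nbytes):
    """Round an expert slice's byte range out to page boundaries,
    since madvise operates on whole pages."""
    start = (offset // PAGE) * PAGE
    end = -(-(offset + nbytes) // PAGE) * PAGE  # ceil to page boundary
    return start, end - start

# Fake "weight file": 8 expert slices of 4 pages each.
slice_bytes = 4 * PAGE
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * (8 * slice_bytes))
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    for expert in (2, 5, 7):  # pretend the router picked these
        start, length = aligned_range(expert * slice_bytes, slice_bytes)
        # One syscall per expert slice: ask the kernel to prefetch it.
        mm.madvise(mmap.MADV_WILLNEED, start, length)
    mm.close()
os.unlink(path)
```

The hint is advisory, so even a wrong guess costs nothing but the syscall; the real patch issues the same call from inside the eval callback, before the expert matmul runs.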

## What we learned

**1. madvise beats an LRU cache everywhere.** We first built a 460-line LRU cache. It was 2.4x slower than 15 lines of madvise (0.24 vs 0.57 tok/s on the 8 GB MacBook Air). The cache stole 5 GB from the OS page cache to hold duplicate data. Don't fight the OS page cache; coach it.

**2. Even a no-op callback prevents thrashing.** Just hooking the eval callback and inspecting tensor pointers (without any madvise) produces 0.46 tok/s where stock produces zero. The callback inadvertently warms mmap pages through pointer inspection.

**3. Device pointer bug.** All prior GPU benchmarks were silently invalid: the callback dereferenced `t->src[2]->data` without checking `ggml_backend_buffer_is_host()`. On GPU-offloaded layers that pointer is a CUDA device pointer. Fixed.

**4. On abundant RAM, do nothing.** The OS page cache is remarkably good when it has enough room. Any intervention is pure overhead when RAM exceeds model size.
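
Lesson 4 suggests gating the whole mechanism. A sketch of such a gate (the function names and policy are ours, not part of the patch): only emit hints when the model exceeds physical RAM and every layer runs on CPU.

```python
import os

def total_ram_bytes():
    """Physical RAM via POSIX sysconf (works on Linux and macOS)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")

def should_prefetch(model_bytes, ram_bytes, ngl):
    """Hint only when the model cannot fit in RAM and all layers are
    on CPU (-ngl 0); with headroom, do nothing and let the page
    cache work."""
    return ngl == 0 and model_bytes > ram_bytes
```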

## Build

```bash
git clone https://github.com/walter-grace/mac-code
cd mac-code/research/expert-sniper/llama-cpp

# macOS (Metal)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu) --target llama-server

# NVIDIA GPU (CUDA)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc) --target llama-server
```

## Usage

```bash
# Enable madvise prefetch
./build/bin/llama-server \
  -m Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
  -ngl 0 \
  --expert-cache-size 1 \
  --port 8201

# Test no-op mode (isolate callback overhead)
EXPERT_CACHE_NOOP=1 ./build/bin/llama-server \
  -m model.gguf \
  -ngl 0 \
  --expert-cache-size 1 \
  --port 8201
```

## Files changed vs stock llama.cpp

**New files (~430 lines):**

| File | Purpose |
|------|---------|
| `src/llama-expert-cache-ctx.cpp` | Eval callback, madvise prefetch, tensor identification |
| `src/llama-expert-cache-ctx.h` | Context struct and declarations |
| `src/llama-expert-cache.cpp` | LRU cache (deprecated, retained for reference) |
| `src/llama-expert-cache.h` | Cache class definition |

**Patched files (~30 lines across 5 files):**

| File | Change |
|------|--------|
| `src/CMakeLists.txt` | Added new source files to build |
| `src/llama-context.h` | Added expert cache context member |
| `common/common.h` | Added `expert_cache_size` parameter |
| `common/common.cpp` | Cache init + eval callback registration |
| `common/arg.cpp` | `--expert-cache-size` CLI flag |

## The research journey

```
460-line LRU cache → 0.24 tok/s (stole RAM from OS page cache)
15-line madvise    → 0.57 tok/s (coached the OS page cache)
no-op callback     → 0.46 tok/s (accidental page warming)

The cache was the experiment. madvise was the answer.
```

## Full three-way benchmark (8 GB MacBook Air)

| Config | tok/s | Mechanism |
|--------|-------|-----------|
| Stock llama.cpp | 0 (thrash) | OS's blind LRU eviction, no domain knowledge |
| No-op callback | 0.46 | Accidental page warming from tensor inspection |
| madvise prefetch | 0.57 | Explicit kernel prefetch hints |
| LRU cache (5 GB) | 0.24 | Duplicate data in user-space heap |

## Gemma 4-26B-A4B β€” MoE Sparsity Benchmark

Google Gemma 4 has 128 experts with top-8 routing (4B active of 26B total). Tested at multiple quantization levels on Apple Silicon:

| Hardware | Quant | Model size | RAM | Speed | Notes |
|----------|-------|-----------|-----|-------|-------|
| M2 MacBook Air | IQ2_M | 9.3 GB | 8 GB | **1.37 tok/s** | Model exceeds RAM, MoE sparsity prevents thrash |
| M4 Mac Mini | IQ2_M | 9.3 GB | 16 GB | **36.5 tok/s** | Fits in RAM, full GPU speed |
| M4 Mac Mini | Q4_K_M | 16.9 GB | 16 GB | **5.18 tok/s** | Exceeds RAM, still runs smoothly |
| M4 Mac Mini | Q8_0 | 26.9 GB | 16 GB | **0 tok/s (thrash)** | CPU_REPACK doubles memory to 51 GB, can't load |

All results: stock llama.cpp with mmap, no madvise. Coherent output verified on all configs.

**Finding:** Gemma 4's low activation ratio (15.4%) lets the OS page cache handle memory pressure without explicit madvise. The madvise sniper is most valuable for denser MoE models (Qwen 35B) where the per-token working set overwhelms the page cache.
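
A back-of-envelope check on the activation ratio cited above, using the quant sizes from the table (this ignores shared, non-expert weights, so it slightly overstates the per-token expert working set):

```python
# "4B active of 26B total" from the section above
active_params, total_params = 4e9, 26e9
ratio = active_params / total_params
print(f"activation ratio: {ratio:.1%}")  # 15.4%

# Approximate expert bytes touched per token at each quant level
for quant, size_gb in [("IQ2_M", 9.3), ("Q4_K_M", 16.9), ("Q8_0", 26.9)]:
    print(f"{quant}: ~{size_gb * ratio:.1f} GB of expert weights hot per token")
```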

## Related

- **MLX Expert Sniper** (Apple Silicon, 5.4 tok/s on 35B): [huggingface.co/waltgrace/mlx-expert-sniper](https://huggingface.co/waltgrace/mlx-expert-sniper)
- **Full research + code**: [github.com/walter-grace/mac-code/tree/main/research/expert-sniper](https://github.com/walter-grace/mac-code/tree/main/research/expert-sniper)