waltgrace committed
Commit 7400275 · verified · 1 Parent(s): 3f56a7b

docs: initial README

Files changed (1)
  1. README.md +138 -117
README.md CHANGED
@@ -1,157 +1,178 @@
  ---
- library_name: mlx
  tags:
- - moe
- - apple-silicon
- - expert-sniping
  - mlx
  - inference
- license: mit
  ---
 
- # CLI Agent — `mlx-expert-sniper`
-
- A pip-installable CLI that wraps the Expert Sniper research into a production tool.
- Run MoE models larger than your RAM on Apple Silicon.

- ## Verified Results (M4 Mac Mini, 16 GB)

- | Model | Size | Experts | Standard mlx_lm | Sniper tok/s | Cache hit | RAM |
- |-------|------|---------|-----------------|--------------|-----------|-----|
- | Qwen3.5-35B-A3B | 19.5 GB | 256/layer | OOM | **5.37 tok/s** | 92.0% | 8.7 GB |
- | Qwen3-30B-A3B | 17.2 GB | 128/layer | OOM | **4.29 tok/s** | 90.4% | 8.7 GB |
- | **Gemma 4-26B-A4B** | 15.6 GB | 128/layer | OOM | **4.15 tok/s** | 95.8% | 7.8 GB |

- All benchmarks: M4 Mac Mini 16 GB, 5 varied prompts, greedy decoding.
-
- **35B** (6.9x speedup from software alone):
- - Cache-aware routing bias=1.0 (biases the router toward cached experts before the softmax; sketched under "Routing Bias" below)
- - Co-activation predictive prefetch (predicts the next layer's experts; sketched just after this section)
- - Right-sized LRU cache (4000 experts, 8.7 GB)
- - TTFT: 2.9 s; all answers verified correct from a cold start
- - REAP dead-expert masking tested (45% dead) but NOT used — static masking breaks factual knowledge on topics outside the calibration set
-
- **30B**: right-sized LRU + co-activation prefetch. REAP/bias not yet applied.
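-
- Co-activation prefetch in a nutshell (a minimal sketch with illustrative names, not the engine's actual code): count which experts in layer L+1 historically fire after each expert in layer L, then start loading the most likely next-layer experts as soon as layer L routes.
-
- ```python
- import numpy as np
-
- class CoActivationPrefetcher:
-     """Sketch: predict the next layer's experts from routing history."""
-
-     def __init__(self, num_layers: int, num_experts: int):
-         # counts[l][i, j] = times expert i (layer l) co-fired with expert j (layer l+1)
-         self.counts = [np.zeros((num_experts, num_experts), dtype=np.int64)
-                        for _ in range(num_layers - 1)]
-
-     def observe(self, layer, experts_here, experts_next):
-         for i in experts_here:                    # update co-activation counts
-             self.counts[layer][i, experts_next] += 1
-
-     def predict(self, layer, experts_here, top=8):
-         scores = self.counts[layer][experts_here].sum(axis=0)
-         return np.argsort(-scores)[:top]          # expert ids to prefetch for layer+1
- ```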
-
- **Gemma 4-26B-A4B** (NEW):
- - Custom Gemma 4 model class (sliding/full attention hybrid, layer scalars, dual layernorms)
- - Mixed quantization: experts 4-bit, dense MLP and router 8-bit (matches the mlx-community format)
- - Cache-aware routing bias=1.5 + co-activation prefetch (95.8% hit rate)
- - Source: `mlx-community/gemma-4-26b-a4b-it-4bit`
-
- ## Supported Models
-
- | Model | Size | Experts | tok/s (M4 16 GB) | Status |
- |-------|------|---------|------------------|--------|
- | Qwen3.5-35B-A3B | 19.5 GB | 256/layer | 5.37 tok/s | Verified |
- | Qwen3-30B-A3B | 17.2 GB | 128/layer | 4.29 tok/s | Verified |
- | **Gemma 4-26B-A4B** | 15.6 GB | 128/layer | **4.15 tok/s** | Verified |
-
- More models coming. To request a model, open an issue on [GitHub](https://github.com/walter-grace/mac-code).
-
- ### Memory Bandwidth Scaling
-
- MoE inference is bandwidth-bound. Expected speeds on different Apple Silicon Macs:
-
- | Mac | Memory BW | Qwen 35B est. | Gemma 4-26B est. |
- |-----|-----------|---------------|------------------|
- | M2 Mac Mini | 100 GB/s | ~4.5 tok/s | ~3.5 tok/s |
- | **M4 Mac Mini** | **120 GB/s** | **5.37 tok/s** ✓ | **4.15 tok/s** ✓ |
- | M2 Pro Mac Mini | 200 GB/s | ~8-10 tok/s | ~7-8 tok/s |
- | M4 Pro Mac Mini | 273 GB/s | ~12-14 tok/s | ~10-11 tok/s |
- | M2 Max Studio | 400 GB/s | ~16-20 tok/s | ~14-17 tok/s |
- | M4 Max MacBook Pro | 546 GB/s | ~22-28 tok/s | ~18-23 tok/s |
- | M2 Ultra Studio | 800 GB/s | ~30-40 tok/s | ~25-32 tok/s |
-
- ### Hardware Requirements
-
- | Mac | RAM | What you can run |
- |-----|-----|------------------|
- | Any Apple Silicon | 8 GB | llama.cpp path only (0.57 tok/s) |
- | M1/M2/M3/M4 | 16 GB | Qwen3.5-35B-A3B at 5.4 tok/s |
- | M1/M2/M3/M4 Pro/Max | 32 GB+ | Larger models, faster speeds |
-
- ### Routing Bias: Universal Sweet Spot
-
- Tested across both models — bias=1.0 is the safe maximum regardless of expert count:
-
- | Model | Experts | No bias | bias=1.0 | Speedup | bias=1.5 |
- |-------|---------|---------|----------|---------|----------|
- | Qwen3.5-35B-A3B | 256/layer | 2.42 tok/s | **5.37 tok/s** | 2.2x | Quality degrades |
- | Qwen3-30B-A3B | 128/layer | 3.34 tok/s | **4.29 tok/s** | 1.3x | Quality degrades |
-
- At bias=1.5, both models fail the same question ("What is the capital of Australia?" — answers "no capital" instead of "Canberra"). The safe threshold is a property of MoE routing, not expert count.
-
- `mlx-sniper calibrate` finds this automatically.
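-
- A minimal sketch of what the bias does (illustrative, not the engine's actual code): add a constant to the router logits of experts that are already cached, before top-k selection, so routing prefers experts that cost no SSD read.
-
- ```python
- import mlx.core as mx
-
- def biased_topk(router_logits, cached_mask, bias=1.0, k=8):
-     """Nudge the router toward cached experts before softmax/top-k.
-
-     router_logits: (num_experts,) raw router scores
-     cached_mask:   (num_experts,) 1.0 where the expert is already in RAM
-     """
-     biased = router_logits + bias * cached_mask   # favor free-to-use experts
-     idx = mx.argpartition(-biased, k)[:k]         # top-k expert ids
-     weights = mx.softmax(biased[idx])             # renormalized gate weights
-     return idx, weights
- ```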
-
- ### Limitations
- - MoE architectures only (no dense models)
- - Qwen 3 and Gemma 4 architectures only for now (architecture-specific engines)
- - Apple Silicon only for the MLX path (the llama.cpp path also works on CUDA)
-
- ## Install
-
- ```bash
- cd research/expert-sniper/cli-agent
- pip install -e .
  ```
-
- ## Usage
-
- ```bash
- # Preprocess model (one-time, ~17 GB on disk)
- mlx-sniper preprocess <hf-model-dir> -o ~/models/qwen3-30b
-
- # Or use the streaming preprocessor (downloads one shard at a time):
- python3 stream_preprocess.py
-
- # Generate
- mlx-sniper run ~/models/qwen3-30b -p "What is 2+2?" -v
-
- # Interactive chat
- mlx-sniper chat ~/models/qwen3-30b
-
- # OpenAI-compatible server
- mlx-sniper server ~/models/qwen3-30b --port 8899
-
- # Profile performance
- mlx-sniper profile ~/models/qwen3-30b --tokens 100
-
- # Show model info
- mlx-sniper info ~/models/qwen3-30b
  ```
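-
- Because the server is OpenAI-compatible, any OpenAI client can point at it. A sketch (the model id is a placeholder; use whatever your server reports):
-
- ```python
- from openai import OpenAI
-
- # Talk to the local sniper server through the standard OpenAI client.
- client = OpenAI(base_url="http://localhost:8899/v1", api_key="not-needed")
-
- resp = client.chat.completions.create(
-     model="qwen3-30b",  # placeholder model id
-     messages=[{"role": "user", "content": "What is 2+2?"}],
- )
- print(resp.choices[0].message.content)
- ```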
-
- ## Files
-
- | File | Purpose |
- |------|---------|
- | `src/mlx_expert_sniper/sniper.py` | Core engine — the proven forward pass with expert sniping |
- | `src/mlx_expert_sniper/cache.py` | Per-expert LRU cache + pread/F_NOCACHE binary reader |
- | `src/mlx_expert_sniper/config.py` | `SniperConfig` dataclass |
- | `src/mlx_expert_sniper/generate.py` | High-level `generate` / `stream_generate` |
- | `src/mlx_expert_sniper/preprocess.py` | Convert HuggingFace model → sniper binary format |
- | `src/mlx_expert_sniper/server.py` | OpenAI-compatible HTTP server |
- | `src/mlx_expert_sniper/profile.py` | Per-token profiling tools |
- | `src/mlx_expert_sniper/cli.py` | `mlx-sniper` CLI entry point |
- | `stream_preprocess.py` | Streaming preprocessor (downloads one shard at a time) |
-
- ## How It Relates to the Research
-
- This is a packaged version of the same forward pass in `../qwen3_agent.py`:
- - `cache.py` = extracted from `../expert_io.py`
- - `sniper.py:forward_token()` = same loop as `Qwen3SniperEngine.forward_token()`
- - `preprocess.py` = extracted from `../convert_qwen3_30b.py`
- - `server.py` = extracted from `../sniper_server.py`
-
- No algorithmic changes — same pread + F_NOCACHE + per-expert LRU + gather_qmm.
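-
- The I/O trick in one sketch (illustrative, not the actual `cache.py`): open the weight file with F_NOCACHE so reads bypass macOS's unified buffer cache, then use positioned reads to pull one expert at a time without evicting the rest of the working set.
-
- ```python
- import fcntl
- import os
-
- def read_expert(path: str, offset: int, size: int) -> bytes:
-     """Read one expert's weights, bypassing the page cache (macOS)."""
-     fd = os.open(path, os.O_RDONLY)
-     try:
-         fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)  # don't pollute the buffer cache
-         return os.pread(fd, size, offset)    # positioned read, no seek needed
-     finally:
-         os.close(fd)
- ```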
-
- ## Python API
-
- ```python
- from mlx_expert_sniper import SniperEngine
-
- engine = SniperEngine.from_dir("~/models/qwen3-30b")
-
- for token in engine.generate("Write a haiku about AI"):
-     print(token, end="", flush=True)
  ```
 
  ---
+ license: apache-2.0
  tags:
  - mlx
+ - apple-silicon
+ - moe
+ - mixture-of-experts
+ - vision-language
+ - gemma
+ - falcon-perception
  - inference
+ language:
+ - en
+ library_name: mlx
+ pipeline_tag: image-text-to-text
  ---

+ # MLX Expert Sniper

+ **Run 26B+ MoE vision-language models on a 16 GB Apple Silicon Mac.**

+ A single-machine inference engine that streams Mixture-of-Experts weights from SSD on demand. The active resident set stays under 4 GB even though the full model is 13–17 GB. Confirmed working with Gemma 4-26B-A4B (vision) + Falcon Perception (grounded segmentation) on a 16 GB M4 Mac Mini.

+ ```
+ ┌────────────────────────────────────────────────┐
+ │ http://localhost:8500    web chat + REST API   │
+ ├────────────────────────────────────────────────┤
+ │ /api/chat_vision   chained Gemma → Falcon      │
+ │ /api/falcon        direct grounded segm        │
+ │ /api/turbo_chat    Gemma vision → Qwen brain   │
+ └─────────────┬──────────────────────────────────┘
+               │
+          ┌────┴──────────────────┐
+          ▼                       ▼
+ ┌──────────────────┐  ┌────────────────────────┐
+ │ Gemma 4-26B      │  │ Falcon Perception      │
+ │ A4B vision       │  │ 0.6B segmentation      │
+ │ ~3 GB resident   │  │ ~1.5 GB resident       │
+ │ via Sniper       │  │ via mlx-vlm            │
+ │ (SSD streaming)  │  │                        │
+ └──────────────────┘  └────────────────────────┘
+ ```

+ ## Why this exists

+ MoE models only activate ~3–15% of their parameters per token. The other 85–97% sit idle in RAM. **Expert Sniper unloads cold experts to SSD and pages them in on demand**, so a 26B model creates only a 4B model's worth of memory pressure.

+ | Setup | Resident RAM | Quality | Speed (M4 Mac Mini) |
+ |---|---|---|---|
+ | Gemma 4-26B 4-bit, vanilla MLX | ~13 GB | full 4-bit | 4 tok/s — but OOMs on 16 GB Macs |
+ | **Gemma 4-26B 4-bit + Sniper** | **~3 GB** | **full 4-bit** | **4.15 tok/s** |
+ | Gemma 4-26B 2-bit (hypothetical) | ~6.5 GB | -5–10% perplexity | n/a |

+ The sniper-streamed 4-bit needs less RAM than vanilla 2-bit would, *with no quality loss*.
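
+ The core idea in a minimal sketch (illustrative names, not the actual `mac_tensor` code): keep a small LRU of hot experts in RAM and read cold ones from the split weight files on demand.

+ ```python
+ from collections import OrderedDict
+ import numpy as np
+
+ class ExpertPager:
+     """Sketch: LRU-cache hot experts, page cold ones in from SSD."""
+
+     def __init__(self, bin_path, offsets, max_resident=512):
+         self.bin_path = bin_path          # e.g. a bin/layer_XX.bin from split_gemma4.py
+         self.offsets = offsets            # expert_id -> (byte_offset, byte_size)
+         self.max_resident = max_resident  # cap on experts held in RAM
+         self.cache = OrderedDict()        # expert_id -> raw weight buffer
+
+     def get(self, expert_id):
+         if expert_id in self.cache:
+             self.cache.move_to_end(expert_id)    # mark most recently used
+             return self.cache[expert_id]
+         offset, size = self.offsets[expert_id]
+         with open(self.bin_path, "rb") as f:     # page in from SSD on demand
+             f.seek(offset)
+             buf = np.frombuffer(f.read(size), dtype=np.uint8)
+         self.cache[expert_id] = buf
+         if len(self.cache) > self.max_resident:  # evict the coldest expert
+             self.cache.popitem(last=False)
+         return buf
+ ```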

+ ## Install (from scratch on a fresh Apple Silicon Mac)

+ Requires macOS 14+, M1/M2/M3/M4 (16 GB+ RAM recommended), Python 3.10+, ~30 GB free disk.

+ ```bash
+ # 1. Clone the deploy repo
+ git clone https://github.com/walter-grace/mac-code
+ cd mac-code/research/expert-sniper/distributed
+ python3 -m venv venv && source venv/bin/activate
+ pip install -e .
+ pip install mlx mlx-vlm fastapi uvicorn pillow huggingface_hub
+
+ # 2. Download stock Gemma 4 (one time, ~13 GB)
+ huggingface-cli download mlx-community/gemma-4-26b-a4b-it-4bit \
+     --local-dir ~/models/gemma4-source
+
+ # 3. Split for SSD streaming (one time, ~5 minutes)
+ python3 split_gemma4.py \
+     --input ~/models/gemma4-source \
+     --output ~/models/gemma4-stream
+ # Produces: ~/models/gemma4-stream/{pinned.safetensors, bin/layer_XX.bin}
+
+ # 4. Falcon Perception downloads automatically on first run
+ #    (~1.5 GB, from tiiuae/Falcon-Perception)
+
+ # 5. Launch
+ python3 -m mac_tensor.cli ui --vision --falcon \
+     --stream-dir ~/models/gemma4-stream \
+     --source-dir ~/models/gemma4-source \
+     --port 8500
+ ```

+ Open `http://localhost:8500` in a browser. Drop an image, ask Gemma to describe it, then click **Ground** for Falcon to outline objects precisely.

+ ## Three modes — pick the right flag

+ | Flag combo | What loads | Resident RAM | Use for |
+ |---|---|---|---|
+ | `--vision --falcon` | Gemma 4 + Falcon | ~5 GB | Full chained vision agent |
+ | `--vision` | Gemma 4 only | ~3 GB | Vision chat without segmentation |
+ | `--falcon-only` | Falcon only, no Gemma | ~1.5 GB | Batch labeling pipelines |
+ | `--nodes …` | Distributed text-only | ~1.5 GB (coordinator) | Multi-Mac MoE chat |

+ ## REST endpoints

+ | Endpoint | Purpose |
+ |---|---|
+ | `GET /api/info` | `{model, vision, falcon, swarm_leader}` capability discovery |
+ | `POST /api/chat_vision` | Chained Gemma → Falcon vision agent (multipart, SSE) |
+ | `POST /api/falcon` | Direct grounded segmentation (multipart) |
+ | `POST /api/turbo_chat` | Gemma vision encode + small-LLM reasoning chain |
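
+ A quick capability check from Python (a sketch: the endpoint and response keys come from the table above; treating `vision`/`falcon` as booleans is an assumption):

+ ```python
+ import json
+ from urllib.request import urlopen
+
+ # Ask the local server what it has loaded before sending vision requests.
+ with urlopen("http://localhost:8500/api/info") as resp:
+     info = json.load(resp)
+
+ print(info["model"])                   # loaded model name
+ if info["vision"] and info["falcon"]:  # assumed booleans -> chained agent available
+     print("chained vision agent ready: POST /api/chat_vision")
+ ```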

+ ## Verified hardware

+ | Mac | Cold start | First vision call | Notes |
+ |---|---|---|---|
+ | **M4 Mac Mini, 16 GB** | ~30 s loading | ~14 s | Confirmed working in production |
+ | M2 Mac Mini, 16 GB | ~45 s | ~20 s | Slower memory bandwidth |
+ | M3/M4 Pro/Max | faster | faster | Untested; expected to scale linearly with bandwidth |

+ ## The chained vision agent (the killer feature)

+ When you load both Gemma 4 + Falcon Perception, the server exposes a chained reasoning loop:

+ ```
+ You: "Is the blue player offside in this image?"
+   │
+   ▼
+ Gemma: "I need to find players, identify the second-to-last defender,
+         and compare the attacker's position. Let me ground the players."
+   │ tool_call
+   ▼
+ Falcon: → returns 22 player bboxes with centroids
+   │
+   ▼
+ Gemma: "Sorting by x-coordinate... the second-to-last defender is at x=0.65,
+         the attacker is at x=0.71. The attacker is past the defender → offside."
  ```

+ This pattern works for any open-ended visual reasoning: the VLM picks what to look for, the segmenter measures it precisely.
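
+ The loop in a compact sketch (illustrative helper and method names; the real implementation lives in `mac_tensor/agent.py`):

+ ```python
+ import json
+ import re
+
+ def parse_tool_call(text):
+     """Extract a JSON tool call like {"ground": "players"}, if present."""
+     m = re.search(r'\{"ground":.*?\}', text)
+     return json.loads(m.group(0)) if m else None
+
+ def chained_vision_agent(image, question, gemma, falcon, max_steps=4):
+     """The VLM decides what to ground; the segmenter returns precise geometry."""
+     history = [question]
+     reply = ""
+     for _ in range(max_steps):
+         reply = gemma.generate(image, "\n".join(history))  # VLM reasons
+         call = parse_tool_call(reply)
+         if call is None:                                   # no tool call -> final answer
+             return reply
+         boxes = falcon.segment(image, call["ground"])      # bboxes + centroids
+         history += [reply, f"TOOL RESULT: {json.dumps(boxes)}"]
+     return reply
+ ```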

+ ## Architecture

+ ```
+ research/expert-sniper/distributed/
+ ├── mac_tensor/              ← the deploy package
+ │   ├── cli.py               ← `mac-tensor` CLI
+ │   ├── server.py            ← FastAPI app
+ │   ├── vision_engine.py     ← Gemma 4 sniper wrapper
+ │   ├── falcon_perception.py ← Falcon mlx-vlm wrapper
+ │   ├── agent.py             ← chained vision agent loop
+ │   └── static/chat.html     ← web UI
+ ├── split_gemma4.py          ← weight-splitting tool
+ ├── README.md                ← this file
+ └── ...
  ```

+ ## Credits

+ - **Gemma 4-26B-A4B** by [Google DeepMind](https://huggingface.co/google) — Gemma Terms of Use
+ - **Falcon Perception** by [TII](https://huggingface.co/tiiuae/Falcon-Perception) — Apache 2.0
+ - **MLX** by [Apple Machine Learning Research](https://github.com/ml-explore/mlx) — MIT
+ - **mlx-vlm** by [Prince Canuma](https://github.com/Blaizzy/mlx-vlm) — MIT
+ - Expert Sniper streaming engine by Walter Grace

+ ## License

+ Apache 2.0 for the deploy code. Model weights remain subject to their respective licenses (Gemma Terms of Use, Falcon Apache 2.0).

+ ## Star the GitHub repo

+ 🌟 https://github.com/walter-grace/mac-code

+ ## Citation

+ ```bibtex
+ @software{mlx_expert_sniper_2026,
+   author = {Walter Grace},
+   title  = {MLX Expert Sniper: Streaming MoE inference for Apple Silicon},
+   year   = {2026},
+   url    = {https://github.com/walter-grace/mac-code},
+ }
  ```