docs: initial README
README.md
CHANGED
@@ -1,157 +1,178 @@
---
tags:
- moe
- apple-silicon
- expert-sniping
- mlx
- inference
---

#

Pip-installable CLI that wraps the Expert Sniper research into a production tool.
Run MoE models larger than your RAM on Apple Silicon.

|-------|------|---------|-----------------|--------------|-----------|-----|
| Qwen3.5-35B-A3B | 19.5 GB | 256/layer | OOM | **5.37 tok/s** | 92.0% | 8.7 GB |
| Qwen3-30B-A3B | 17.2 GB | 128/layer | OOM | **4.29 tok/s** | 90.4% | 8.7 GB |
| **Gemma 4-26B-A4B** | 15.6 GB | 128/layer | OOM | **4.15 tok/s** | 95.8% | 7.8 GB |

- Custom Gemma 4 model class (sliding/full attention hybrid, layer scalars, dual layernorms)
- Mixed quantization: experts 4-bit, dense MLP and router 8-bit (matches mlx-community format)
- Cache-aware routing bias=1.5 + co-activation prefetch (95.8% hit rate)
- Source: `mlx-community/gemma-4-26b-a4b-it-4bit`

##

|-----|-----|-----------------|
| Any Apple Silicon | 8 GB | llama.cpp path only (0.57 tok/s) |
| M1/M2/M3/M4 | 16 GB | Qwen3.5-35B-A3B at 5.4 tok/s |
| M1/M2/M3/M4 Pro/Max | 32 GB+ | Larger models, faster speeds |

##

- MoE architectures only (no dense models)
- Qwen models only for now (architecture-specific engine)
- Apple Silicon only for MLX path (llama.cpp path works on CUDA too)

```bash
# Preprocess model (one-time, ~17 GB on disk)
mlx-sniper preprocess <hf-model-dir> -o ~/models/qwen3-30b

# Or use the streaming preprocessor (downloads one shard at a time):
python3 stream_preprocess.py

# Generate
mlx-sniper run ~/models/qwen3-30b -p "What is 2+2?" -v

# Interactive chat
mlx-sniper chat ~/models/qwen3-30b

# OpenAI-compatible server
mlx-sniper server ~/models/qwen3-30b --port 8899

#
mlx-sniper profile ~/models/qwen3-30b --tokens 100
```

##

| File | Purpose |
|------|---------|
| `src/mlx_expert_sniper/sniper.py` | Core engine – the proven forward pass with expert sniping |
| `src/mlx_expert_sniper/cache.py` | Per-expert LRU cache + pread/F_NOCACHE binary reader |
| `src/mlx_expert_sniper/config.py` | SniperConfig dataclass |
| `src/mlx_expert_sniper/generate.py` | High-level generate / stream_generate |
| `src/mlx_expert_sniper/preprocess.py` | Convert HuggingFace model → sniper binary format |
| `src/mlx_expert_sniper/server.py` | OpenAI-compatible HTTP server |
| `src/mlx_expert_sniper/profile.py` | Per-token profiling tools |
| `src/mlx_expert_sniper/cli.py` | `mlx-sniper` CLI entry point |
| `stream_preprocess.py` | Streaming preprocessor (downloads one shard at a time) |

- `cache.py` = extracted from `../expert_io.py`
- `sniper.py:forward_token()` = same loop as `Qwen3SniperEngine.forward_token()`
- `preprocess.py` = extracted from `../convert_qwen3_30b.py`
- `server.py` = extracted from `../sniper_server.py`

##

```
from mlx_expert_sniper import SniperEngine
```
---
license: apache-2.0
tags:
- mlx
- apple-silicon
- moe
- mixture-of-experts
- vision-language
- gemma
- falcon-perception
- inference
language:
- en
library_name: mlx
pipeline_tag: image-text-to-text
---

# MLX Expert Sniper

**Run 26B+ MoE vision-language models on a 16 GB Apple Silicon Mac.**

A single-machine inference engine that streams Mixture-of-Experts weights from SSD on demand. The active resident set stays under 4 GB even though the full model is 13–17 GB. Confirmed working with Gemma 4-26B-A4B (vision) + Falcon Perception (grounded segmentation) on a 16 GB M4 Mac Mini.

```
┌──────────────────────────────────────────────┐
│ http://localhost:8500   web chat + REST API  │
├──────────────────────────────────────────────┤
│ /api/chat_vision   chained Gemma → Falcon    │
│ /api/falcon        direct grounded segm      │
│ /api/turbo_chat    Gemma vision → Qwen brain │
└──────────────┬───────────────────────────────┘
               │
        ┌──────┴──────────────┐
        ▼                     ▼
┌──────────────────┐   ┌────────────────────────┐
│ Gemma 4-26B      │   │ Falcon Perception      │
│ A4B vision       │   │ 0.6B segmentation      │
│ ~3 GB resident   │   │ ~1.5 GB resident       │
│ via Sniper       │   │ via mlx-vlm            │
│ (SSD streaming)  │   │                        │
└──────────────────┘   └────────────────────────┘
```

## Why this exists

MoE models only activate ~3–15% of their parameters per token. The other 85–97% sit idle in RAM. **Expert Sniper unloads cold experts to SSD and pages them in on demand**, so a 26B model behaves like a 4B model for memory pressure.

| Setup | Resident RAM | Quality | Speed (M4 Mac Mini) |
|---|---|---|---|
| Gemma 4-26B 4-bit, vanilla MLX | ~13 GB | full 4-bit | 4 tok/s, but OOMs on 16 GB Macs |
| **Gemma 4-26B 4-bit + Sniper** | **~3 GB** | **full 4-bit** | **4.15 tok/s** |
| Gemma 4-26B 2-bit (hypothetical) | ~6.5 GB | -5–10% perplexity | n/a |

The sniper-streamed 4-bit gives you a smaller RAM footprint than vanilla 2-bit *with no quality loss*.

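To make the paging concrete, here is a minimal sketch of the idea in Python: an LRU cache of per-expert weight blocks backed by the split `bin/layer_XX.bin` files. It is illustrative only; the class name, the fixed-size per-expert layout, and the eviction policy are assumptions for the example, not the actual `mac_tensor` implementation.

```python
# Minimal sketch of the idea (not the real mac_tensor code): an LRU cache of
# per-expert weight blocks, paged in on demand from an SSD-resident bin file.
from collections import OrderedDict
import numpy as np

class ExpertCache:
    """Assumes a fixed-size packed block per expert inside one layer_XX.bin."""

    def __init__(self, bin_path, expert_nbytes, max_resident=256):
        self._f = open(bin_path, "rb")        # e.g. bin/layer_07.bin (layout assumed)
        self.expert_nbytes = expert_nbytes    # size of one packed expert, in bytes
        self.max_resident = max_resident      # cache budget, counted in experts
        self._cache = OrderedDict()           # expert_id -> raw weight bytes

    def get(self, expert_id):
        # Hot path: already resident, mark as most recently used.
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)
            return self._cache[expert_id]
        # Cold path: read just this expert's block from SSD.
        self._f.seek(expert_id * self.expert_nbytes)
        block = np.frombuffer(self._f.read(self.expert_nbytes), dtype=np.uint8)
        # Evict the least recently used expert once over budget.
        if len(self._cache) >= self.max_resident:
            self._cache.popitem(last=False)
        self._cache[expert_id] = block
        return block
```

Only the experts the router actually selects for a token ever get read; everything else stays on SSD.
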
## Install (from scratch on a fresh Apple Silicon Mac)

Requires macOS 14+, M1/M2/M3/M4 (16 GB+ RAM recommended), Python 3.10+, ~30 GB free disk.

```bash
# 1. Clone the deploy repo
git clone https://github.com/walter-grace/mac-code
cd mac-code/research/expert-sniper/distributed
python3 -m venv venv && source venv/bin/activate
pip install -e .
pip install mlx mlx-vlm fastapi uvicorn pillow huggingface_hub

# 2. Download stock Gemma 4 (one time, ~13 GB)
huggingface-cli download mlx-community/gemma-4-26b-a4b-it-4bit \
  --local-dir ~/models/gemma4-source

# 3. Split for SSD streaming (one time, ~5 minutes)
python3 split_gemma4.py \
  --input ~/models/gemma4-source \
  --output ~/models/gemma4-stream
# Produces: ~/models/gemma4-stream/{pinned.safetensors, bin/layer_XX.bin}

# 4. Falcon Perception downloads automatically on first run
#    (~1.5 GB, from tiiuae/Falcon-Perception)

# 5. Launch
python3 -m mac_tensor.cli ui --vision --falcon \
  --stream-dir ~/models/gemma4-stream \
  --source-dir ~/models/gemma4-source \
  --port 8500
```

Open `http://localhost:8500` in a browser. Drop an image, ask Gemma to describe it, then click **Ground** for Falcon to outline objects precisely.

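If you would rather script the post-launch check than open the browser, you can poll the capability endpoint documented below. A small sketch, assuming only that the server from step 5 is listening on port 8500 and that `/api/info` reflects which engines loaded (how those fields are encoded beyond the documented keys is an assumption here):

```python
# Quick post-launch smoke test (sketch): query the capability-discovery
# endpoint and confirm both engines came up.
import requests

info = requests.get("http://localhost:8500/api/info", timeout=10).json()
print(info)  # README documents keys: model, vision, falcon, swarm_leader

# Assumption for this example: the vision/falcon fields are truthy when loaded.
if not info.get("vision"):
    print("Gemma vision engine is not loaded (did you pass --vision?)")
if not info.get("falcon"):
    print("Falcon Perception is not loaded (did you pass --falcon?)")
```
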
## Three modes – pick the right flag

| Flag combo | What loads | Resident RAM | Use for |
|---|---|---|---|
| `--vision --falcon` | Gemma 4 + Falcon | ~5 GB | Full chained vision agent |
| `--vision` | Gemma 4 only | ~3 GB | Vision chat without segmentation |
| `--falcon-only` | Falcon only, no Gemma | ~1.5 GB | Batch labeling pipelines |
| `--nodes …` | Distributed text-only | ~1.5 GB (coordinator) | Multi-Mac MoE chat |

## REST endpoints

| Endpoint | Purpose |
|---|---|
| `GET /api/info` | `{model, vision, falcon, swarm_leader}` capability discovery |
| `POST /api/chat_vision` | Chained Gemma → Falcon vision agent (multipart, SSE) |
| `POST /api/falcon` | Direct grounded segmentation (multipart) |
| `POST /api/turbo_chat` | Gemma vision encode + small-LLM reasoning chain |

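For scripted pipelines (for example the `--falcon-only` batch-labeling mode above), the endpoints can be called directly over HTTP. A hedged sketch of a direct `/api/falcon` call follows; the multipart field names `image` and `prompt` and the response shape are assumptions made for the example, so check the FastAPI app in `mac_tensor/server.py` for the real signature.

```python
# Sketch: direct grounded segmentation via POST /api/falcon (multipart form).
# Field names and response shape are assumptions; only the route and the
# multipart transport come from the endpoint table above.
import requests

with open("frame.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8500/api/falcon",
        files={"image": ("frame.jpg", f, "image/jpeg")},
        data={"prompt": "outline every player"},
        timeout=120,
    )
resp.raise_for_status()
print(resp.json())  # expected: boxes / masks for the grounded objects
```
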
## Verified hardware

| Mac | Cold start | First vision call | Notes |
|---|---|---|---|
| **M4 Mac Mini, 16 GB** | ~30 s loading | ~14 s | Confirmed working in production |
| M2 Mac Mini, 16 GB | ~45 s | ~20 s | Slower memory bandwidth |
| M3/M4 Pro/Max | Faster | Faster | Untested, but expected to scale with memory bandwidth |

## The chained vision agent (the killer feature)

When you load both Gemma 4 and Falcon Perception, the server exposes a chained reasoning loop:

```
You: "Is the blue player offside in this image?"
        │
        ▼
Gemma: "I need to find players, identify the second-to-last defender,
        and compare the attacker's position. Let me ground the players."
        │ tool_call
        ▼
Falcon: → returns 22 player bboxes with centroids
        │
        ▼
Gemma: "Sorting by x-coordinate... the second-to-last defender is at x=0.65,
        the attacker is at x=0.71. The attacker is past the defender → offside."
```

This pattern works for any open-ended visual reasoning: the VLM picks what to look for, and the segmenter measures it precisely.

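Stripped of the models, the loop above is a simple pattern: the VLM decides what needs grounding, the segmenter measures it, and the measurements are fed back into the VLM's context. A minimal sketch of that control flow follows; the helpers `gemma_chat` and `falcon_ground` are hypothetical stand-ins for the two engines, not the actual `mac_tensor/agent.py` code.

```python
# Sketch of the chained vision-agent control flow. `gemma_chat` and
# `falcon_ground` are hypothetical callables standing in for the two engines;
# the real loop lives in mac_tensor/agent.py.
def chained_vision_agent(image, question, gemma_chat, falcon_ground, max_steps=4):
    history = [{"role": "user", "content": question}]
    reply = {"text": ""}
    for _ in range(max_steps):
        reply = gemma_chat(image, history)    # VLM reasons about the image
        if not reply.get("tool_call"):        # no more grounding requested
            break                             # reply["text"] is the final answer
        # The VLM asked to ground something: run precise segmentation.
        boxes = falcon_ground(image, reply["tool_call"]["prompt"])
        history.append({"role": "assistant", "content": reply["text"]})
        history.append({"role": "tool", "content": str(boxes)})  # feed results back
    return reply["text"]
```
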
## Architecture

```
research/expert-sniper/distributed/
├── mac_tensor/              ← the deploy package
│   ├── cli.py               ← `mac-tensor` CLI
│   ├── server.py            ← FastAPI app
│   ├── vision_engine.py     ← Gemma 4 sniper wrapper
│   ├── falcon_perception.py ← Falcon mlx-vlm wrapper
│   ├── agent.py             ← chained vision agent loop
│   └── static/chat.html     ← web UI
├── split_gemma4.py          ← weight-splitting tool
├── README.md                ← this file
└── ...
```

## Credits

- **Gemma 4-26B-A4B** by [Google DeepMind](https://huggingface.co/google) – Apache 2.0
- **Falcon Perception** by [TII](https://huggingface.co/tiiuae/Falcon-Perception) – Apache 2.0
- **MLX** by [Apple Machine Learning Research](https://github.com/ml-explore/mlx) – MIT
- **mlx-vlm** by [Prince Canuma](https://github.com/Blaizzy/mlx-vlm) – MIT
- Expert Sniper streaming engine by Walter Grace

## License

Apache 2.0 (the deploy code). Model weights are subject to their respective licenses (Gemma Terms of Use, Falcon Apache 2.0).

## Star the GitHub repo

⭐ https://github.com/walter-grace/mac-code

## Citation

```bibtex
@software{mlx_expert_sniper_2026,
  author = {Walter Grace},
  title  = {MLX Expert Sniper: Streaming MoE inference for Apple Silicon},
  year   = {2026},
  url    = {https://github.com/walter-grace/mac-code},
}
```