MLX Expert Sniper
Run 26B+ MoE vision-language models on a 16 GB Apple Silicon Mac.
A single-machine inference engine that streams Mixture-of-Experts weights from SSD on demand. The active resident set stays under 4 GB even though the full model is 13–17 GB. Confirmed working with Gemma 4-26B-A4B (vision) + Falcon Perception (grounded segmentation) on a 16 GB M4 Mac Mini.
```
┌──────────────────────────────────────────────────┐
│ http://localhost:8500   web chat + REST API      │
├──────────────────────────────────────────────────┤
│ /api/chat_vision    chained Gemma → Falcon       │
│ /api/falcon         direct grounded segmentation │
│ /api/turbo_chat     Gemma vision → Qwen brain    │
└──────────────┬───────────────────────────────────┘
               │
        ┌──────┴──────┐
        ▼             ▼
┌──────────────────┐  ┌──────────────────────┐
│ Gemma 4-26B      │  │ Falcon Perception    │
│ A4B vision       │  │ 0.6B segmentation    │
│ ~3 GB resident   │  │ ~1.5 GB resident     │
│ via Sniper       │  │ via mlx-vlm          │
│ (SSD streaming)  │  │                      │
└──────────────────┘  └──────────────────────┘
```
Why this exists
MoE models only activate ~3–15% of their parameters per token; the other 85–97% sit idle in RAM. Expert Sniper offloads cold experts to SSD and pages them back in on demand, so a 26B model behaves like a 4B model under memory pressure.
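The paging discipline can be sketched as an LRU cache over per-expert weight files. This is an illustrative sketch, not the real Sniper engine: `ExpertCache`, its loader callback, and the byte budget are hypothetical stand-ins for the actual resident-set accounting.

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache that keeps only hot expert weights resident in RAM.

    Cold experts live on SSD and are paged in on first use; when the
    resident-set budget is exceeded, the least recently used expert
    is evicted. (Illustrative sketch, not the real Sniper engine.)
    """

    def __init__(self, loader, budget_bytes):
        self.loader = loader            # callable: expert_id -> (weights, nbytes)
        self.budget = budget_bytes
        self.resident = OrderedDict()   # expert_id -> (weights, nbytes)
        self.used = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark as recently used
            return self.resident[expert_id][0]
        weights, nbytes = self.loader(expert_id)   # page in from SSD
        self.resident[expert_id] = (weights, nbytes)
        self.used += nbytes
        # Evict the coldest experts until we fit the budget again.
        while self.used > self.budget and len(self.resident) > 1:
            _, (_, evicted_bytes) = self.resident.popitem(last=False)
            self.used -= evicted_bytes
        return weights
```

The real engine additionally has to overlap SSD reads with compute to hide paging latency; the cache above only captures the residency policy.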
| Setup | Resident RAM | Quality | Speed (M4 Mac Mini) |
|---|---|---|---|
| Gemma 4-26B 4-bit, vanilla MLX | ~13 GB | full 4-bit | 4 tok/s, but OOMs on 16 GB Macs |
| Gemma 4-26B 4-bit + Sniper | ~3 GB | full 4-bit | 4.15 tok/s |
| Gemma 4-26B 2-bit (hypothetical) | ~6.5 GB | 5–10% worse perplexity | n/a |

Sniper-streamed 4-bit therefore uses less resident RAM than even a hypothetical 2-bit build, with none of the 2-bit quality loss.
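The back-of-envelope math behind those numbers (illustrative; a real resident set also includes the KV cache, pinned weights, and runtime overhead):

```python
TOTAL_PARAMS = 26e9    # Gemma 4-26B total parameters
ACTIVE_PARAMS = 4e9    # "A4B": roughly 4B parameters active per token
BITS = 4               # 4-bit quantization

total_gb = TOTAL_PARAMS * BITS / 8 / 1e9    # full checkpoint, streamed from SSD
active_gb = ACTIVE_PARAMS * BITS / 8 / 1e9  # hot experts that must stay resident

print(f"full model:  ~{total_gb:.0f} GB")   # ~13 GB on disk
print(f"active set:  ~{active_gb:.0f} GB")  # ~2 GB weights + overhead -> ~3 GB RAM
```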
Install (from scratch on a fresh Apple Silicon Mac)
Requires macOS 14+, M1/M2/M3/M4 (16 GB+ RAM recommended), Python 3.10+, ~30 GB free disk.
```bash
# 1. Clone the deploy repo
git clone https://github.com/walter-grace/mac-code
cd mac-code/research/expert-sniper/distributed
python3 -m venv venv && source venv/bin/activate
pip install -e .
pip install mlx mlx-vlm fastapi uvicorn pillow huggingface_hub

# 2. Download stock Gemma 4 (one time, ~13 GB)
huggingface-cli download mlx-community/gemma-4-26b-a4b-it-4bit \
  --local-dir ~/models/gemma4-source

# 3. Split for SSD streaming (one time, ~5 minutes)
python3 split_gemma4.py \
  --input ~/models/gemma4-source \
  --output ~/models/gemma4-stream
# Produces: ~/models/gemma4-stream/{pinned.safetensors, bin/layer_XX.bin}

# 4. Falcon Perception downloads automatically on first run
#    (~1.5 GB, from tiiuae/Falcon-Perception)

# 5. Launch
python3 -m mac_tensor.cli ui --vision --falcon \
  --stream-dir ~/models/gemma4-stream \
  --source-dir ~/models/gemma4-source \
  --port 8500
```
Open http://localhost:8500 in a browser. Drop an image, ask Gemma to describe it, then click Ground for Falcon to outline objects precisely.
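If the server fails to start, a quick sanity check of the split output helps. A minimal sketch, assuming the `pinned.safetensors` + `bin/layer_XX.bin` layout that step 3 reports; `check_stream_dir` is a hypothetical helper, not part of the CLI:

```python
from pathlib import Path

def check_stream_dir(stream_dir):
    """Sanity-check split output before launching the server.

    Expects the layout produced by split_gemma4.py:
      stream_dir/pinned.safetensors   (always-resident weights)
      stream_dir/bin/layer_XX.bin     (streamable per-layer expert shards)
    (Filename pattern inferred from step 3 above; adjust if your split differs.)
    """
    root = Path(stream_dir)
    layers = sorted((root / "bin").glob("layer_*.bin"))
    problems = []
    if not (root / "pinned.safetensors").is_file():
        problems.append("missing pinned.safetensors")
    if not layers:
        problems.append("no layer_*.bin shards under bin/")
    return problems, len(layers)
```

Run it against `~/models/gemma4-stream` before passing that path to `--stream-dir`.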
Four modes: pick the right flag
| Flag combo | What loads | Resident RAM | Use for |
|---|---|---|---|
| `--vision --falcon` | Gemma 4 + Falcon | ~5 GB | Full chained vision agent |
| `--vision` | Gemma 4 only | ~3 GB | Vision chat without segmentation |
| `--falcon-only` | Falcon only, no Gemma | ~1.5 GB | Batch labeling pipelines |
| `--nodes …` | Distributed text-only | ~1.5 GB coordinator | Multi-Mac MoE chat |
REST endpoints
| Endpoint | Purpose |
|---|---|
| `GET /api/info` | `{model, vision, falcon, swarm_leader}` capability discovery |
| `POST /api/chat_vision` | Chained Gemma → Falcon vision agent (multipart, SSE) |
| `POST /api/falcon` | Direct grounded segmentation (multipart) |
| `POST /api/turbo_chat` | Gemma vision encode + small-LLM reasoning chain |
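A client can call `GET /api/info` first and route to the richest endpoint available. A minimal routing sketch, assuming the response carries the boolean capability flags listed above; `pick_endpoint` and its preference order are illustrative, not part of the API:

```python
def pick_endpoint(info):
    """Choose an endpoint from /api/info capability flags.

    `info` is the parsed JSON from GET /api/info, e.g.
    {"model": "...", "vision": True, "falcon": True, "swarm_leader": False}.
    (Field names come from the endpoint table; the routing preference
    itself is an illustrative choice.)
    """
    if info.get("vision") and info.get("falcon"):
        return "/api/chat_vision"   # full chained vision agent
    if info.get("falcon"):
        return "/api/falcon"        # segmentation only
    if info.get("vision"):
        return "/api/turbo_chat"    # vision encode + small-LLM reasoning
    return None                     # text-only node, nothing vision-capable
```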
Verified hardware
| Mac | Cold start | First vision call | Notes |
|---|---|---|---|
| M4 Mac Mini, 16 GB | ~30 s | ~14 s | Confirmed working in production |
| M2 Mac Mini, 16 GB | ~45s | ~20s | Slower memory bandwidth |
| M3/M4 Pro/Max | faster | faster | Untested but expected to scale linearly with bandwidth |
The chained vision agent (the killer feature)
When you load both Gemma 4 + Falcon Perception, the server exposes a chained reasoning loop:
```
You: "Is the blue player offside in this image?"
        │
        ▼
Gemma: "I need to find players, identify the second-to-last defender,
        and compare the attacker's position. Let me ground the players."
        │ tool_call
        ▼
Falcon: returns 22 player bboxes with centroids
        │
        ▼
Gemma: "Sorting by x-coordinate... the second-to-last defender is at x=0.65,
        the attacker is at x=0.71. The attacker is past the defender → offside."
```
This pattern works for any open-ended visual reasoning task: the VLM decides what to look for, and the segmenter measures it precisely.
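The measuring half of that loop is plain geometry once the boxes come back. A minimal sketch of the offside check, using a made-up box schema (`team`/`cx` dicts) rather than Falcon's real output format:

```python
def is_offside(player_boxes):
    """Decide offside from grounded player boxes (illustrative sketch).

    player_boxes: list of dicts like
      {"team": "defender" | "attacker", "cx": normalized x-centroid}
    with attack direction toward larger x. The schema is made up for
    this example, not Falcon's actual output.
    """
    defenders = sorted(b["cx"] for b in player_boxes if b["team"] == "defender")
    attackers = [b["cx"] for b in player_boxes if b["team"] == "attacker"]
    if len(defenders) < 2 or not attackers:
        return False
    # The deepest defender is usually the keeper; the offside line is
    # the second-to-last defender.
    second_last_defender = defenders[-2]
    return max(attackers) > second_last_defender
```

This matches the transcript above: with the second-to-last defender at x=0.65, an attacker at x=0.71 is past the line.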
Architecture
```
research/expert-sniper/distributed/
├── mac_tensor/               ← the deploy package
│   ├── cli.py                ← `mac-tensor` CLI
│   ├── server.py             ← FastAPI app
│   ├── vision_engine.py      ← Gemma 4 sniper wrapper
│   ├── falcon_perception.py  ← Falcon mlx-vlm wrapper
│   ├── agent.py              ← chained vision agent loop
│   └── static/chat.html      ← web UI
├── split_gemma4.py           ← weight-splitting tool
├── README.md                 ← this file
└── ...
```
Credits
- Gemma 4-26B-A4B by Google DeepMind (Gemma Terms of Use)
- Falcon Perception by TII (Apache 2.0)
- MLX by Apple Machine Learning Research (MIT)
- mlx-vlm by Prince Canuma (MIT)
- Expert Sniper streaming engine by Walter Grace
License
Apache 2.0 (the deploy code). Model weights are subject to their respective licenses (Gemma Terms of Use, Falcon Apache 2.0).
Star the GitHub repo
https://github.com/walter-grace/mac-code
Citation
```bibtex
@software{mlx_expert_sniper_2026,
  author = {Walter Grace},
  title  = {MLX Expert Sniper: Streaming MoE inference for Apple Silicon},
  year   = {2026},
  url    = {https://github.com/walter-grace/mac-code},
}
```