MLX Expert Sniper
Run 26B+ MoE vision-language models on a 16 GB Apple Silicon Mac.
A single-machine inference engine that streams Mixture-of-Experts weights from SSD on demand. The active resident set stays under 4 GB even though the full model is 13–17 GB. Confirmed working with Gemma 4-26B-A4B (vision) + Falcon Perception (grounded segmentation) on a 16 GB M4 Mac Mini.
```
┌──────────────────────────────────────────────────┐
│ http://localhost:8500   web chat + REST API      │
├──────────────────────────────────────────────────┤
│ /api/chat_vision    chained Gemma → Falcon       │
│ /api/falcon         direct grounded segmentation │
│ /api/turbo_chat     Gemma vision → Qwen brain    │
└──────────────┬───────────────────────────────────┘
               │
        ┌──────┴──────┐
        ▼             ▼
┌──────────────────┐  ┌──────────────────────┐
│ Gemma 4-26B      │  │ Falcon Perception    │
│ A4B vision       │  │ 0.6B segmentation    │
│ ~3 GB resident   │  │ ~1.5 GB resident     │
│ via Sniper       │  │ via mlx-vlm          │
│ (SSD streaming)  │  │                      │
└──────────────────┘  └──────────────────────┘
```
Why this exists
MoE models only activate ~3–15% of their parameters per token; the other 85–97% sit idle in RAM. Expert Sniper offloads cold experts to SSD and pages them back in on demand, so a 26B model behaves like a 4B model under memory pressure.
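The paging discipline can be sketched as an LRU cache over per-expert weight files. This is an illustrative sketch, not the real Sniper engine: `ExpertCache`, its loader callback, and the byte budget are hypothetical stand-ins for the actual resident-set accounting.

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache that keeps only hot expert weights resident in RAM.

    Cold experts live on SSD and are paged in on first use; when the
    resident-set budget is exceeded, the least recently used expert
    is evicted. (Illustrative sketch, not the real Sniper engine.)
    """

    def __init__(self, loader, budget_bytes):
        self.loader = loader            # callable: expert_id -> (weights, nbytes)
        self.budget = budget_bytes
        self.resident = OrderedDict()   # expert_id -> (weights, nbytes)
        self.used = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark as recently used
            return self.resident[expert_id][0]
        weights, nbytes = self.loader(expert_id)   # page in from SSD
        self.resident[expert_id] = (weights, nbytes)
        self.used += nbytes
        # Evict the coldest experts until we fit the budget again.
        while self.used > self.budget and len(self.resident) > 1:
            _, (_, evicted_bytes) = self.resident.popitem(last=False)
            self.used -= evicted_bytes
        return weights
```

The real engine additionally has to overlap SSD reads with compute to hide paging latency; the cache above only captures the residency policy.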
| Setup | Resident RAM | Quality | Speed (M4 Mac Mini) |
|---|---|---|---|
| Gemma 4-26B 4-bit, vanilla MLX | ~13 GB | full 4-bit | 4 tok/s, but OOMs on 16 GB Macs |
| Gemma 4-26B 4-bit + Sniper | ~3 GB | full 4-bit | 4.15 tok/s |
| Gemma 4-26B 2-bit (hypothetical) | ~6.5 GB | 5–10% worse perplexity | n/a |

Sniper-streamed 4-bit therefore uses less resident RAM than even a hypothetical 2-bit build, with none of the 2-bit quality loss.
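The back-of-envelope math behind those numbers (illustrative; a real resident set also includes the KV cache, pinned weights, and runtime overhead):

```python
TOTAL_PARAMS = 26e9    # Gemma 4-26B total parameters
ACTIVE_PARAMS = 4e9    # "A4B": roughly 4B parameters active per token
BITS = 4               # 4-bit quantization

total_gb = TOTAL_PARAMS * BITS / 8 / 1e9    # full checkpoint, streamed from SSD
active_gb = ACTIVE_PARAMS * BITS / 8 / 1e9  # hot experts that must stay resident

print(f"full model:  ~{total_gb:.0f} GB")   # ~13 GB on disk
print(f"active set:  ~{active_gb:.0f} GB")  # ~2 GB weights + overhead -> ~3 GB RAM
```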
Install (from scratch on a fresh Apple Silicon Mac)
Requires macOS 14+, M1/M2/M3/M4 (16 GB+ RAM recommended), Python 3.10+, ~30 GB free disk.
```bash
# 1. Clone the deploy repo
git clone https://github.com/walter-grace/mac-code
cd mac-code/research/expert-sniper/distributed
python3 -m venv venv && source venv/bin/activate
pip install -e .
pip install mlx mlx-vlm fastapi uvicorn pillow huggingface_hub

# 2. Download stock Gemma 4 (one time, ~13 GB)
huggingface-cli download mlx-community/gemma-4-26b-a4b-it-4bit \
  --local-dir ~/models/gemma4-source

# 3. Split for SSD streaming (one time, ~5 minutes)
python3 split_gemma4.py \
  --input ~/models/gemma4-source \
  --output ~/models/gemma4-stream
# Produces: ~/models/gemma4-stream/{pinned.safetensors, bin/layer_XX.bin}

# 4. Falcon Perception downloads automatically on first run
#    (~1.5 GB, from tiiuae/Falcon-Perception)

# 5. Launch
python3 -m mac_tensor.cli ui --vision --falcon \
  --stream-dir ~/models/gemma4-stream \
  --source-dir ~/models/gemma4-source \
  --port 8500
```
Open http://localhost:8500 in a browser. Drop an image, ask Gemma to describe it, then click Ground for Falcon to outline objects precisely.
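If the server fails to start, a quick sanity check of the split output helps. A minimal sketch, assuming the `pinned.safetensors` + `bin/layer_XX.bin` layout that step 3 reports; `check_stream_dir` is a hypothetical helper, not part of the CLI:

```python
from pathlib import Path

def check_stream_dir(stream_dir):
    """Sanity-check split output before launching the server.

    Expects the layout produced by split_gemma4.py:
      stream_dir/pinned.safetensors   (always-resident weights)
      stream_dir/bin/layer_XX.bin     (streamable per-layer expert shards)
    (Filename pattern inferred from step 3 above; adjust if your split differs.)
    """
    root = Path(stream_dir)
    layers = sorted((root / "bin").glob("layer_*.bin"))
    problems = []
    if not (root / "pinned.safetensors").is_file():
        problems.append("missing pinned.safetensors")
    if not layers:
        problems.append("no layer_*.bin shards under bin/")
    return problems, len(layers)
```

Run it against `~/models/gemma4-stream` before passing that path to `--stream-dir`.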
Four modes: pick the right flag
| Flag combo | What loads | Resident RAM | Use for |
|---|---|---|---|
| `--vision --falcon` | Gemma 4 + Falcon | ~5 GB | Full chained vision agent |
| `--vision` | Gemma 4 only | ~3 GB | Vision chat without segmentation |
| `--falcon-only` | Falcon only, no Gemma | ~1.5 GB | Batch labeling pipelines |
| `--nodes …` | Distributed text-only | ~1.5 GB coordinator | Multi-Mac MoE chat |
REST endpoints
| Endpoint | Purpose |
|---|---|
| `GET /api/info` | `{model, vision, falcon, swarm_leader}` capability discovery |
| `POST /api/chat_vision` | Chained Gemma → Falcon vision agent (multipart, SSE) |
| `POST /api/falcon` | Direct grounded segmentation (multipart) |
| `POST /api/turbo_chat` | Gemma vision encode + small-LLM reasoning chain |
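A client can call `GET /api/info` first and route to the richest endpoint available. A minimal routing sketch, assuming the response carries the boolean capability flags listed above; `pick_endpoint` and its preference order are illustrative, not part of the API:

```python
def pick_endpoint(info):
    """Choose an endpoint from /api/info capability flags.

    `info` is the parsed JSON from GET /api/info, e.g.
    {"model": "...", "vision": True, "falcon": True, "swarm_leader": False}.
    (Field names come from the endpoint table; the routing preference
    itself is an illustrative choice.)
    """
    if info.get("vision") and info.get("falcon"):
        return "/api/chat_vision"   # full chained vision agent
    if info.get("falcon"):
        return "/api/falcon"        # segmentation only
    if info.get("vision"):
        return "/api/turbo_chat"    # vision encode + small-LLM reasoning
    return None                     # text-only node, nothing vision-capable
```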
Verified hardware
| Mac | Cold start | First vision call | Notes |
|---|---|---|---|
| M4 Mac Mini, 16 GB | ~30 s | ~14 s | Confirmed working in production |
| M2 Mac Mini, 16 GB | ~45s | ~20s | Slower memory bandwidth |
| M3/M4 Pro/Max | faster | faster | Untested but expected to scale linearly with bandwidth |
The chained vision agent (the killer feature)
When you load both Gemma 4 + Falcon Perception, the server exposes a chained reasoning loop:
```
You: "Is the blue player offside in this image?"
        │
        ▼
Gemma: "I need to find players, identify the second-to-last defender,
        and compare the attacker's position. Let me ground the players."
        │ tool_call
        ▼
Falcon: returns 22 player bboxes with centroids
        │
        ▼
Gemma: "Sorting by x-coordinate... the second-to-last defender is at x=0.65,
        the attacker is at x=0.71. The attacker is past the defender → offside."
```
This pattern works for any open-ended visual reasoning task: the VLM decides what to look for, and the segmenter measures it precisely.
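The measuring half of that loop is plain geometry once the boxes come back. A minimal sketch of the offside check, using a made-up box schema (`team`/`cx` dicts) rather than Falcon's real output format:

```python
def is_offside(player_boxes):
    """Decide offside from grounded player boxes (illustrative sketch).

    player_boxes: list of dicts like
      {"team": "defender" | "attacker", "cx": normalized x-centroid}
    with attack direction toward larger x. The schema is made up for
    this example, not Falcon's actual output.
    """
    defenders = sorted(b["cx"] for b in player_boxes if b["team"] == "defender")
    attackers = [b["cx"] for b in player_boxes if b["team"] == "attacker"]
    if len(defenders) < 2 or not attackers:
        return False
    # The deepest defender is usually the keeper; the offside line is
    # the second-to-last defender.
    second_last_defender = defenders[-2]
    return max(attackers) > second_last_defender
```

This matches the transcript above: with the second-to-last defender at x=0.65, an attacker at x=0.71 is past the line.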
Architecture
```
research/expert-sniper/distributed/
├── mac_tensor/               ← the deploy package
│   ├── cli.py                ← `mac-tensor` CLI
│   ├── server.py             ← FastAPI app
│   ├── vision_engine.py      ← Gemma 4 sniper wrapper
│   ├── falcon_perception.py  ← Falcon mlx-vlm wrapper
│   ├── agent.py              ← chained vision agent loop
│   └── static/chat.html      ← web UI
├── split_gemma4.py           ← weight-splitting tool
├── README.md                 ← this file
└── ...
```
Credits
- Gemma 4-26B-A4B by Google DeepMind (Gemma Terms of Use)
- Falcon Perception by TII (Apache 2.0)
- MLX by Apple Machine Learning Research (MIT)
- mlx-vlm by Prince Canuma (MIT)
- Expert Sniper streaming engine by Walter Grace
License
Apache 2.0 (the deploy code). Model weights are subject to their respective licenses (Gemma Terms of Use, Falcon Apache 2.0).
Star the GitHub repo
https://github.com/walter-grace/mac-code
Citation
```bibtex
@software{mlx_expert_sniper_2026,
  author = {Walter Grace},
  title  = {MLX Expert Sniper: Streaming MoE inference for Apple Silicon},
  year   = {2026},
  url    = {https://github.com/walter-grace/mac-code},
}
```