MLX Expert Sniper

Run 26B+ MoE vision-language models on a 16 GB Apple Silicon Mac.

A single-machine inference engine that streams Mixture-of-Experts weights from SSD on demand. The active resident set stays under 4 GB even though the full model is 13–17 GB. Confirmed working with Gemma 4-26B-A4B (vision) + Falcon Perception (grounded segmentation) on a 16 GB M4 Mac Mini.

┌───────────────────────────────────────────────────┐
│  http://localhost:8500   web chat + REST API      │
├───────────────────────────────────────────────────┤
│  /api/chat_vision   chained Gemma → Falcon        │
│  /api/falcon        direct grounded segmentation  │
│  /api/turbo_chat    Gemma vision → Qwen brain     │
└─────────────┬─────────────────────────────────────┘
              │
    ┌─────────┴───────────┐
    ▼                     ▼
┌────────────────┐  ┌──────────────────────┐
│ Gemma 4-26B    │  │ Falcon Perception    │
│ A4B vision     │  │ 0.6B segmentation    │
│ ~3 GB resident │  │ ~1.5 GB resident     │
│ via Sniper     │  │ via mlx-vlm          │
│ (SSD streaming)│  │                      │
└────────────────┘  └──────────────────────┘

Why this exists

MoE models only activate ~3–15% of their parameters per token. The other 85–97% sit idle in RAM. Expert Sniper unloads cold experts to SSD and pages them in on demand, so a 26B model behaves like a 4B model for memory pressure.
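The paging idea can be sketched as a plain LRU cache over per-expert weight blobs; `ExpertCache` and `load_fn` below are illustrative names for the sketch, not the actual Sniper internals:

```python
from collections import OrderedDict

class ExpertCache:
    """Hypothetical sketch: keep at most `capacity` hot experts in RAM,
    paging cold ones back in from SSD on demand, evicting in LRU order."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn        # reads one expert's weights from SSD
        self.cache = OrderedDict()    # expert_id -> weights, LRU-ordered
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(expert_id)   # mark most recently used
            return self.cache[expert_id]
        self.misses += 1
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        weights = self.load_fn(expert_id)       # the SSD read happens here
        self.cache[expert_id] = weights
        return weights
```

Because only a few experts fire per token and routing is bursty, most lookups hit the cache and the SSD reads stay off the critical path.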

| Setup | Resident RAM | Quality | Speed (M4 Mac Mini) |
|---|---|---|---|
| Gemma 4-26B 4-bit, vanilla MLX | ~13 GB | full 4-bit | 4 tok/s (OOMs on 16 GB Macs) |
| Gemma 4-26B 4-bit + Sniper | ~3 GB | full 4-bit | 4.15 tok/s |
| Gemma 4-26B 2-bit (hypothetical) | ~6.5 GB | -5–10% perplexity | n/a |

Sniper-streamed 4-bit therefore needs less resident RAM than even a hypothetical vanilla 2-bit quant, with none of the quality loss.
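The table's RAM figures follow from straightforward quantization arithmetic (decimal GB, weights only; the gap between the ~2 GB of active weights and the ~3 GB resident figure is presumably KV cache and runtime overhead):

```python
def quantized_gb(params_billion: float, bits: int) -> float:
    """Approximate weight size in decimal GB at a given quantization width."""
    return params_billion * 1e9 * bits / 8 / 1e9

full_4bit   = quantized_gb(26, 4)   # all 26B params resident -> 13.0 GB
active_4bit = quantized_gb(4, 4)    # only the ~4B active params -> 2.0 GB
full_2bit   = quantized_gb(26, 2)   # hypothetical 2-bit -> 6.5 GB
```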

Install (from scratch on a fresh Apple Silicon Mac)

Requires macOS 14+, M1/M2/M3/M4 (16 GB+ RAM recommended), Python 3.10+, ~30 GB free disk.

# 1. Clone the deploy repo
git clone https://github.com/walter-grace/mac-code
cd mac-code/research/expert-sniper/distributed
python3 -m venv venv && source venv/bin/activate
pip install -e .
pip install mlx mlx-vlm fastapi uvicorn pillow huggingface_hub

# 2. Download stock Gemma 4 (one time, ~13 GB)
huggingface-cli download mlx-community/gemma-4-26b-a4b-it-4bit \
    --local-dir ~/models/gemma4-source

# 3. Split for SSD streaming (one time, ~5 minutes)
python3 split_gemma4.py \
    --input ~/models/gemma4-source \
    --output ~/models/gemma4-stream
# Produces: ~/models/gemma4-stream/{pinned.safetensors, bin/layer_XX.bin}

# 4. Falcon Perception downloads automatically on first run
#    (~1.5 GB, from tiiuae/Falcon-Perception)

# 5. Launch
python3 -m mac_tensor.cli ui --vision --falcon \
    --stream-dir ~/models/gemma4-stream \
    --source-dir ~/models/gemma4-source \
    --port 8500

Open http://localhost:8500 in a browser. Drop an image, ask Gemma to describe it, then click Ground for Falcon to outline objects precisely.

Three modes β€” pick the right flag

| Flag combo | What loads | Resident RAM | Use for |
|---|---|---|---|
| `--vision --falcon` | Gemma 4 + Falcon | ~5 GB | Full chained vision agent |
| `--vision` | Gemma 4 only | ~3 GB | Vision chat without segmentation |
| `--falcon-only` | Falcon only, no Gemma | ~1.5 GB | Batch labeling pipelines |
| `--nodes …` | Distributed text-only | ~1.5 GB (coordinator) | Multi-Mac MoE chat |

REST endpoints

| Endpoint | Purpose |
|---|---|
| `GET /api/info` | Capability discovery: returns `{model, vision, falcon, swarm_leader}` |
| `POST /api/chat_vision` | Chained Gemma → Falcon vision agent (multipart, SSE) |
| `POST /api/falcon` | Direct grounded segmentation (multipart) |
| `POST /api/turbo_chat` | Gemma vision encode + small-LLM reasoning chain |
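A minimal client sketch for the direct segmentation endpoint. The multipart field names and the response schema below are assumptions for illustration, not the documented wire format:

```python
import json

def parse_falcon_response(payload: str):
    """Pull (label, bbox) pairs out of a segmentation response, assuming a
    {"objects": [{"label": ..., "bbox": [...]}]} shape (an assumption)."""
    data = json.loads(payload)
    return [(obj["label"], tuple(obj["bbox"])) for obj in data.get("objects", [])]

# Posting an image to a running server (field names are assumptions):
# import requests
# with open("frame.jpg", "rb") as f:
#     resp = requests.post("http://localhost:8500/api/falcon",
#                          files={"image": f},
#                          data={"prompt": "outline every player"})
# boxes = parse_falcon_response(resp.text)
```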

Verified hardware

| Mac | Cold start | First vision call | Notes |
|---|---|---|---|
| M4 Mac Mini, 16 GB | ~30 s | ~14 s | Confirmed working in production |
| M2 Mac Mini, 16 GB | ~45 s | ~20 s | Slower memory bandwidth |
| M3/M4 Pro/Max | faster | faster | Untested; expected to scale linearly with memory bandwidth |

The chained vision agent (the killer feature)

When you load both Gemma 4 and Falcon Perception, the server exposes a chained reasoning loop:

You: "Is the blue player offside in this image?"
       │
       ▼
Gemma:  "I need to find players, identify the second-to-last defender,
         and compare the attacker's position. Let me ground the players."
       │ tool_call
       ▼
Falcon: → returns 22 player bboxes with centroids
       │
       ▼
Gemma:  "Sorting by x-coordinate... the second-to-last defender is at x=0.65,
         the attacker is at x=0.71. The attacker is past the defender → offside."

This pattern works for any open-ended visual reasoning β€” the VLM picks what to look for, the segmenter measures it precisely.
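The loop above can be sketched in a few lines. The function names and the message schema (`tool_call`, `tool_result`, `answer`) are illustrative assumptions, not the real API in `agent.py`:

```python
def chained_vision_agent(question, gemma_step, falcon_ground, max_turns=4):
    """Sketch of the Gemma -> Falcon loop: `gemma_step` returns either
    {"answer": ...} or {"tool_call": "ground", "targets": [...]}."""
    context = [question]
    for _ in range(max_turns):
        reply = gemma_step(context)
        if reply.get("tool_call") == "ground":
            # Falcon measures precisely what Gemma decided to look for
            context.append({"tool_result": falcon_ground(reply["targets"])})
        else:
            return reply["answer"]
    return None  # give up after max_turns grounding rounds
```

The key design point is that the VLM stays in charge of the reasoning while the segmenter is invoked as a tool, so the loop generalizes to any question that reduces to "find things, then measure them".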

Architecture

research/expert-sniper/distributed/
├── mac_tensor/                          ← the deploy package
│   ├── cli.py                           ← `mac-tensor` CLI
│   ├── server.py                        ← FastAPI app
│   ├── vision_engine.py                 ← Gemma 4 sniper wrapper
│   ├── falcon_perception.py             ← Falcon mlx-vlm wrapper
│   ├── agent.py                         ← chained vision agent loop
│   └── static/chat.html                 ← web UI
├── split_gemma4.py                      ← weight-splitting tool
├── README.md                            ← this file
└── ...

Credits

License

Apache 2.0 (the deploy code). Model weights are subject to their respective licenses (Gemma Terms of Use, Falcon Apache 2.0).

Star the GitHub repo

🌟 https://github.com/walter-grace/mac-code

Citation

@software{mlx_expert_sniper_2026,
  author = {Walter Grace},
  title  = {MLX Expert Sniper: Streaming MoE inference for Apple Silicon},
  year   = {2026},
  url    = {https://github.com/walter-grace/mac-code},
}