waltgrace committed
Commit 7400275 · verified · 1 Parent(s): 3f56a7b

docs: initial README

Files changed (1)
  1. README.md +138 -117
README.md CHANGED
@@ -1,157 +1,178 @@
  ---
- library_name: mlx
  tags:
- - moe
- - apple-silicon
- - expert-sniping
  - mlx
  - inference
- license: mit
  ---
 
- # CLI Agent — `mlx-expert-sniper`
-
- A pip-installable CLI that wraps the Expert Sniper research into a production tool.
- Run MoE models larger than your RAM on Apple Silicon.

- ## Verified Results (M4 Mac Mini, 16 GB)

- | Model | Size | Experts | Standard mlx_lm | Sniper tok/s | Cache hit | RAM |
- |-------|------|---------|-----------------|--------------|-----------|-----|
- | Qwen3.5-35B-A3B | 19.5 GB | 256/layer | OOM | **5.37 tok/s** | 92.0% | 8.7 GB |
- | Qwen3-30B-A3B | 17.2 GB | 128/layer | OOM | **4.29 tok/s** | 90.4% | 8.7 GB |
- | **Gemma 4-26B-A4B** | 15.6 GB | 128/layer | OOM | **4.15 tok/s** | 95.8% | 7.8 GB |

- All benchmarks: M4 Mac Mini 16 GB, 5 varied prompts, greedy decoding.
-
- **35B** (6.9x speedup from software alone):
- - Cache-aware routing bias=1.0 (biases the router toward cached experts before the softmax; sketched under "Routing Bias" below)
- - Co-activation predictive prefetch (predicts the next layer's experts; sketched just after this section)
- - Right-sized LRU cache (4000 experts, 8.7 GB)
- - TTFT: 2.9 s; all answers verified correct from a cold start
- - REAP dead-expert masking tested (45% dead) but NOT used — static masking breaks factual knowledge on topics outside the calibration set
-
- **30B**: right-sized LRU + co-activation prefetch. REAP/bias not yet applied.
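-
- Co-activation prefetch in a nutshell (a minimal sketch with illustrative names, not the engine's actual code): count which experts in layer L+1 historically fire after each expert in layer L, then start loading the most likely next-layer experts as soon as layer L routes.
-
- ```python
- import numpy as np
-
- class CoActivationPrefetcher:
-     """Sketch: predict the next layer's experts from routing history."""
-
-     def __init__(self, num_layers: int, num_experts: int):
-         # counts[l][i, j] = times expert i (layer l) co-fired with expert j (layer l+1)
-         self.counts = [np.zeros((num_experts, num_experts), dtype=np.int64)
-                        for _ in range(num_layers - 1)]
-
-     def observe(self, layer, experts_here, experts_next):
-         for i in experts_here:                    # update co-activation counts
-             self.counts[layer][i, experts_next] += 1
-
-     def predict(self, layer, experts_here, top=8):
-         scores = self.counts[layer][experts_here].sum(axis=0)
-         return np.argsort(-scores)[:top]          # expert ids to prefetch for layer+1
- ```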
-
- **Gemma 4-26B-A4B** (NEW):
- - Custom Gemma 4 model class (sliding/full attention hybrid, layer scalars, dual layernorms)
- - Mixed quantization: experts 4-bit, dense MLP and router 8-bit (matches the mlx-community format)
- - Cache-aware routing bias=1.5 + co-activation prefetch (95.8% hit rate)
- - Source: `mlx-community/gemma-4-26b-a4b-it-4bit`
-
- ## Supported Models
-
- | Model | Size | Experts | tok/s (M4 16 GB) | Status |
- |-------|------|---------|------------------|--------|
- | Qwen3.5-35B-A3B | 19.5 GB | 256/layer | 5.37 tok/s | Verified |
- | Qwen3-30B-A3B | 17.2 GB | 128/layer | 4.29 tok/s | Verified |
- | **Gemma 4-26B-A4B** | 15.6 GB | 128/layer | **4.15 tok/s** | Verified |
-
- More models coming. To request a model, open an issue on [GitHub](https://github.com/walter-grace/mac-code).
-
- ### Memory Bandwidth Scaling
-
- MoE inference is bandwidth-bound. Expected speeds on different Apple Silicon Macs:
-
- | Mac | Memory BW | Qwen 35B est. | Gemma 4-26B est. |
- |-----|-----------|---------------|------------------|
- | M2 Mac Mini | 100 GB/s | ~4.5 tok/s | ~3.5 tok/s |
- | **M4 Mac Mini** | **120 GB/s** | **5.37 tok/s** ✓ | **4.15 tok/s** ✓ |
- | M2 Pro Mac Mini | 200 GB/s | ~8-10 tok/s | ~7-8 tok/s |
- | M4 Pro Mac Mini | 273 GB/s | ~12-14 tok/s | ~10-11 tok/s |
- | M2 Max Studio | 400 GB/s | ~16-20 tok/s | ~14-17 tok/s |
- | M4 Max MacBook Pro | 546 GB/s | ~22-28 tok/s | ~18-23 tok/s |
- | M2 Ultra Studio | 800 GB/s | ~30-40 tok/s | ~25-32 tok/s |
-
- ### Hardware Requirements
-
- | Mac | RAM | What you can run |
- |-----|-----|------------------|
- | Any Apple Silicon | 8 GB | llama.cpp path only (0.57 tok/s) |
- | M1/M2/M3/M4 | 16 GB | Qwen3.5-35B-A3B at 5.4 tok/s |
- | M1/M2/M3/M4 Pro/Max | 32 GB+ | Larger models, faster speeds |
-
- ### Routing Bias: Universal Sweet Spot
-
- Tested across both models — bias=1.0 is the safe maximum regardless of expert count:
-
- | Model | Experts | No bias | bias=1.0 | Speedup | bias=1.5 |
- |-------|---------|---------|----------|---------|----------|
- | Qwen3.5-35B-A3B | 256/layer | 2.42 tok/s | **5.37 tok/s** | 2.2x | Quality degrades |
- | Qwen3-30B-A3B | 128/layer | 3.34 tok/s | **4.29 tok/s** | 1.3x | Quality degrades |
-
- At bias=1.5, both models fail the same question ("What is the capital of Australia?" — answers "no capital" instead of "Canberra"). The safe threshold is a property of MoE routing, not expert count.
-
- `mlx-sniper calibrate` finds this automatically.
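-
- A minimal sketch of what the bias does (illustrative, not the engine's actual code): add a constant to the router logits of experts that are already cached, before top-k selection, so routing prefers experts that cost no SSD read.
-
- ```python
- import mlx.core as mx
-
- def biased_topk(router_logits, cached_mask, bias=1.0, k=8):
-     """Nudge the router toward cached experts before softmax/top-k.
-
-     router_logits: (num_experts,) raw router scores
-     cached_mask:   (num_experts,) 1.0 where the expert is already in RAM
-     """
-     biased = router_logits + bias * cached_mask   # favor free-to-use experts
-     idx = mx.argpartition(-biased, k)[:k]         # top-k expert ids
-     weights = mx.softmax(biased[idx])             # renormalized gate weights
-     return idx, weights
- ```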
-
- ### Limitations
- - MoE architectures only (no dense models)
- - Qwen 3 and Gemma 4 architectures only for now (architecture-specific engines)
- - Apple Silicon only for the MLX path (the llama.cpp path also works on CUDA)
-
- ## Install
-
- ```bash
- cd research/expert-sniper/cli-agent
- pip install -e .
  ```
-
- ## Usage
-
- ```bash
- # Preprocess model (one-time, ~17 GB on disk)
- mlx-sniper preprocess <hf-model-dir> -o ~/models/qwen3-30b
-
- # Or use the streaming preprocessor (downloads one shard at a time):
- python3 stream_preprocess.py
-
- # Generate
- mlx-sniper run ~/models/qwen3-30b -p "What is 2+2?" -v
-
- # Interactive chat
- mlx-sniper chat ~/models/qwen3-30b
-
- # OpenAI-compatible server
- mlx-sniper server ~/models/qwen3-30b --port 8899
-
- # Profile performance
- mlx-sniper profile ~/models/qwen3-30b --tokens 100
-
- # Show model info
- mlx-sniper info ~/models/qwen3-30b
  ```
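-
- Because the server is OpenAI-compatible, any OpenAI client can point at it. A sketch (the model id is a placeholder; use whatever your server reports):
-
- ```python
- from openai import OpenAI
-
- # Talk to the local sniper server through the standard OpenAI client.
- client = OpenAI(base_url="http://localhost:8899/v1", api_key="not-needed")
-
- resp = client.chat.completions.create(
-     model="qwen3-30b",  # placeholder model id
-     messages=[{"role": "user", "content": "What is 2+2?"}],
- )
- print(resp.choices[0].message.content)
- ```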
-
- ## Files
-
- | File | Purpose |
- |------|---------|
- | `src/mlx_expert_sniper/sniper.py` | Core engine — the proven forward pass with expert sniping |
- | `src/mlx_expert_sniper/cache.py` | Per-expert LRU cache + pread/F_NOCACHE binary reader |
- | `src/mlx_expert_sniper/config.py` | `SniperConfig` dataclass |
- | `src/mlx_expert_sniper/generate.py` | High-level `generate` / `stream_generate` |
- | `src/mlx_expert_sniper/preprocess.py` | Convert HuggingFace model → sniper binary format |
- | `src/mlx_expert_sniper/server.py` | OpenAI-compatible HTTP server |
- | `src/mlx_expert_sniper/profile.py` | Per-token profiling tools |
- | `src/mlx_expert_sniper/cli.py` | `mlx-sniper` CLI entry point |
- | `stream_preprocess.py` | Streaming preprocessor (downloads one shard at a time) |
-
- ## How It Relates to the Research
-
- This is a packaged version of the same forward pass in `../qwen3_agent.py`:
- - `cache.py` = extracted from `../expert_io.py`
- - `sniper.py:forward_token()` = same loop as `Qwen3SniperEngine.forward_token()`
- - `preprocess.py` = extracted from `../convert_qwen3_30b.py`
- - `server.py` = extracted from `../sniper_server.py`
-
- No algorithmic changes — same pread + F_NOCACHE + per-expert LRU + gather_qmm.
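-
- The I/O trick in one sketch (illustrative, not the actual `cache.py`): open the weight file with F_NOCACHE so reads bypass macOS's unified buffer cache, then use positioned reads to pull one expert at a time without evicting the rest of the working set.
-
- ```python
- import fcntl
- import os
-
- def read_expert(path: str, offset: int, size: int) -> bytes:
-     """Read one expert's weights, bypassing the page cache (macOS)."""
-     fd = os.open(path, os.O_RDONLY)
-     try:
-         fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)  # don't pollute the buffer cache
-         return os.pread(fd, size, offset)    # positioned read, no seek needed
-     finally:
-         os.close(fd)
- ```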
-
- ## Python API
-
- ```python
- from mlx_expert_sniper import SniperEngine
-
- engine = SniperEngine.from_dir("~/models/qwen3-30b")
-
- for token in engine.generate("Write a haiku about AI"):
-     print(token, end="", flush=True)
  ```
 
  ---
+ license: apache-2.0
  tags:
  - mlx
+ - apple-silicon
+ - moe
+ - mixture-of-experts
+ - vision-language
+ - gemma
+ - falcon-perception
  - inference
+ language:
+ - en
+ library_name: mlx
+ pipeline_tag: image-text-to-text
  ---

+ # MLX Expert Sniper

+ **Run 26B+ MoE vision-language models on a 16 GB Apple Silicon Mac.**

+ A single-machine inference engine that streams Mixture-of-Experts weights from SSD on demand. The active resident set stays under 4 GB even though the full model is 13–17 GB. Confirmed working with Gemma 4-26B-A4B (vision) + Falcon Perception (grounded segmentation) on a 16 GB M4 Mac Mini.

+ ```
+ ┌────────────────────────────────────────────────┐
+ │ http://localhost:8500    web chat + REST API   │
+ ├────────────────────────────────────────────────┤
+ │ /api/chat_vision   chained Gemma → Falcon      │
+ │ /api/falcon        direct grounded segm        │
+ │ /api/turbo_chat    Gemma vision → Qwen brain   │
+ └─────────────┬──────────────────────────────────┘
+               │
+          ┌────┴──────────────────┐
+          ▼                       ▼
+ ┌──────────────────┐  ┌────────────────────────┐
+ │ Gemma 4-26B      │  │ Falcon Perception      │
+ │ A4B vision       │  │ 0.6B segmentation      │
+ │ ~3 GB resident   │  │ ~1.5 GB resident       │
+ │ via Sniper       │  │ via mlx-vlm            │
+ │ (SSD streaming)  │  │                        │
+ └──────────────────┘  └────────────────────────┘
+ ```

+ ## Why this exists

+ MoE models only activate ~3–15% of their parameters per token. The other 85–97% sit idle in RAM. **Expert Sniper unloads cold experts to SSD and pages them in on demand**, so a 26B model creates only a 4B model's worth of memory pressure.

+ | Setup | Resident RAM | Quality | Speed (M4 Mac Mini) |
+ |---|---|---|---|
+ | Gemma 4-26B 4-bit, vanilla MLX | ~13 GB | full 4-bit | 4 tok/s — but OOMs on 16 GB Macs |
+ | **Gemma 4-26B 4-bit + Sniper** | **~3 GB** | **full 4-bit** | **4.15 tok/s** |
+ | Gemma 4-26B 2-bit (hypothetical) | ~6.5 GB | -5–10% perplexity | n/a |

+ The sniper-streamed 4-bit needs less RAM than vanilla 2-bit would, *with no quality loss*.
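
+ The core idea in a minimal sketch (illustrative names, not the actual `mac_tensor` code): keep a small LRU of hot experts in RAM and read cold ones from the split weight files on demand.

+ ```python
+ from collections import OrderedDict
+ import numpy as np
+
+ class ExpertPager:
+     """Sketch: LRU-cache hot experts, page cold ones in from SSD."""
+
+     def __init__(self, bin_path, offsets, max_resident=512):
+         self.bin_path = bin_path          # e.g. a bin/layer_XX.bin from split_gemma4.py
+         self.offsets = offsets            # expert_id -> (byte_offset, byte_size)
+         self.max_resident = max_resident  # cap on experts held in RAM
+         self.cache = OrderedDict()        # expert_id -> raw weight buffer
+
+     def get(self, expert_id):
+         if expert_id in self.cache:
+             self.cache.move_to_end(expert_id)    # mark most recently used
+             return self.cache[expert_id]
+         offset, size = self.offsets[expert_id]
+         with open(self.bin_path, "rb") as f:     # page in from SSD on demand
+             f.seek(offset)
+             buf = np.frombuffer(f.read(size), dtype=np.uint8)
+         self.cache[expert_id] = buf
+         if len(self.cache) > self.max_resident:  # evict the coldest expert
+             self.cache.popitem(last=False)
+         return buf
+ ```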

+ ## Install (from scratch on a fresh Apple Silicon Mac)

+ Requires macOS 14+, M1/M2/M3/M4 (16 GB+ RAM recommended), Python 3.10+, ~30 GB free disk.

+ ```bash
+ # 1. Clone the deploy repo
+ git clone https://github.com/walter-grace/mac-code
+ cd mac-code/research/expert-sniper/distributed
+ python3 -m venv venv && source venv/bin/activate
+ pip install -e .
+ pip install mlx mlx-vlm fastapi uvicorn pillow huggingface_hub
+
+ # 2. Download stock Gemma 4 (one time, ~13 GB)
+ huggingface-cli download mlx-community/gemma-4-26b-a4b-it-4bit \
+     --local-dir ~/models/gemma4-source
+
+ # 3. Split for SSD streaming (one time, ~5 minutes)
+ python3 split_gemma4.py \
+     --input ~/models/gemma4-source \
+     --output ~/models/gemma4-stream
+ # Produces: ~/models/gemma4-stream/{pinned.safetensors, bin/layer_XX.bin}
+
+ # 4. Falcon Perception downloads automatically on first run
+ #    (~1.5 GB, from tiiuae/Falcon-Perception)
+
+ # 5. Launch
+ python3 -m mac_tensor.cli ui --vision --falcon \
+     --stream-dir ~/models/gemma4-stream \
+     --source-dir ~/models/gemma4-source \
+     --port 8500
+ ```

+ Open `http://localhost:8500` in a browser. Drop an image, ask Gemma to describe it, then click **Ground** for Falcon to outline objects precisely.

+ ## Three modes — pick the right flag

+ | Flag combo | What loads | Resident RAM | Use for |
+ |---|---|---|---|
+ | `--vision --falcon` | Gemma 4 + Falcon | ~5 GB | Full chained vision agent |
+ | `--vision` | Gemma 4 only | ~3 GB | Vision chat without segmentation |
+ | `--falcon-only` | Falcon only, no Gemma | ~1.5 GB | Batch labeling pipelines |
+ | `--nodes …` | Distributed text-only | ~1.5 GB (coordinator) | Multi-Mac MoE chat |

+ ## REST endpoints

+ | Endpoint | Purpose |
+ |---|---|
+ | `GET /api/info` | `{model, vision, falcon, swarm_leader}` capability discovery |
+ | `POST /api/chat_vision` | Chained Gemma → Falcon vision agent (multipart, SSE) |
+ | `POST /api/falcon` | Direct grounded segmentation (multipart) |
+ | `POST /api/turbo_chat` | Gemma vision encode + small-LLM reasoning chain |
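
+ A quick capability check from Python (a sketch: the endpoint and response keys come from the table above; treating `vision`/`falcon` as booleans is an assumption):

+ ```python
+ import json
+ from urllib.request import urlopen
+
+ # Ask the local server what it has loaded before sending vision requests.
+ with urlopen("http://localhost:8500/api/info") as resp:
+     info = json.load(resp)
+
+ print(info["model"])                   # loaded model name
+ if info["vision"] and info["falcon"]:  # assumed booleans -> chained agent available
+     print("chained vision agent ready: POST /api/chat_vision")
+ ```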

+ ## Verified hardware

+ | Mac | Cold start | First vision call | Notes |
+ |---|---|---|---|
+ | **M4 Mac Mini, 16 GB** | ~30 s loading | ~14 s | Confirmed working in production |
+ | M2 Mac Mini, 16 GB | ~45 s | ~20 s | Slower memory bandwidth |
+ | M3/M4 Pro/Max | faster | faster | Untested; expected to scale linearly with bandwidth |

+ ## The chained vision agent (the killer feature)

+ When you load both Gemma 4 + Falcon Perception, the server exposes a chained reasoning loop:

+ ```
+ You: "Is the blue player offside in this image?"
+   │
+   ▼
+ Gemma: "I need to find players, identify the second-to-last defender,
+         and compare the attacker's position. Let me ground the players."
+   │ tool_call
+   ▼
+ Falcon: → returns 22 player bboxes with centroids
+   │
+   ▼
+ Gemma: "Sorting by x-coordinate... the second-to-last defender is at x=0.65,
+         the attacker is at x=0.71. The attacker is past the defender → offside."
  ```

+ This pattern works for any open-ended visual reasoning: the VLM picks what to look for, the segmenter measures it precisely.
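
+ The loop in a compact sketch (illustrative helper and method names; the real implementation lives in `mac_tensor/agent.py`):

+ ```python
+ import json
+ import re
+
+ def parse_tool_call(text):
+     """Extract a JSON tool call like {"ground": "players"}, if present."""
+     m = re.search(r'\{"ground":.*?\}', text)
+     return json.loads(m.group(0)) if m else None
+
+ def chained_vision_agent(image, question, gemma, falcon, max_steps=4):
+     """The VLM decides what to ground; the segmenter returns precise geometry."""
+     history = [question]
+     reply = ""
+     for _ in range(max_steps):
+         reply = gemma.generate(image, "\n".join(history))  # VLM reasons
+         call = parse_tool_call(reply)
+         if call is None:                                   # no tool call -> final answer
+             return reply
+         boxes = falcon.segment(image, call["ground"])      # bboxes + centroids
+         history += [reply, f"TOOL RESULT: {json.dumps(boxes)}"]
+     return reply
+ ```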

+ ## Architecture

+ ```
+ research/expert-sniper/distributed/
+ ├── mac_tensor/              ← the deploy package
+ │   ├── cli.py               ← `mac-tensor` CLI
+ │   ├── server.py            ← FastAPI app
+ │   ├── vision_engine.py     ← Gemma 4 sniper wrapper
+ │   ├── falcon_perception.py ← Falcon mlx-vlm wrapper
+ │   ├── agent.py             ← chained vision agent loop
+ │   └── static/chat.html     ← web UI
+ ├── split_gemma4.py          ← weight-splitting tool
+ ├── README.md                ← this file
+ └── ...
  ```

+ ## Credits

+ - **Gemma 4-26B-A4B** by [Google DeepMind](https://huggingface.co/google) — Gemma Terms of Use
+ - **Falcon Perception** by [TII](https://huggingface.co/tiiuae/Falcon-Perception) — Apache 2.0
+ - **MLX** by [Apple Machine Learning Research](https://github.com/ml-explore/mlx) — MIT
+ - **mlx-vlm** by [Prince Canuma](https://github.com/Blaizzy/mlx-vlm) — MIT
+ - Expert Sniper streaming engine by Walter Grace

+ ## License

+ Apache 2.0 for the deploy code. Model weights remain subject to their respective licenses (Gemma Terms of Use, Falcon Apache 2.0).

+ ## Star the GitHub repo

+ 🌟 https://github.com/walter-grace/mac-code

+ ## Citation

+ ```bibtex
+ @software{mlx_expert_sniper_2026,
+   author = {Walter Grace},
+   title  = {MLX Expert Sniper: Streaming MoE inference for Apple Silicon},
+   year   = {2026},
+   url    = {https://github.com/walter-grace/mac-code},
+ }
  ```