MetroLLM-Bench Mac Probe

Slim distribution of the MetroLLM-Bench hardware-envelope tools for Apple Silicon. It pulls GGUF weights from continker/Qwen3.5-{2B,4B,9B}-metro-v23, runs llama-server locally on Metal, executes a 15-case stratified MARTA probe (plus an optional sustained-load thermal curve), and emits JSON telemetry: decode tok/s, TTFT, peak RAM, and Tier-1 deterministic accuracy plus a Tier-2 LLM-judge composite.

This repo exists so the bench can be git cloned onto corporate or ephemeral Macs that lack VPN access to the private project repo.

Prerequisites (one-time)

brew install uv llama.cpp
export ANTHROPIC_API_KEY=sk-ant-...   # for the Tier-2 LLM judge

If brew is unavailable, uv can be installed with curl -LsSf https://astral.sh/uv/install.sh | sh, and llama.cpp publishes official Apple Silicon release binaries on GitHub.

Run a 15-case probe (~10-30 min depending on the Mac)

git clone https://huggingface.co/continker/metrollm-bench-mac /tmp/mac-bench
cd /tmp/mac-bench
uv sync
bash scripts/mac_bench/run_probe.sh 2b
cat results/mac_bench/*-2b-probe/telemetry.json

Output captures:

  • tier1_composite, metrollm_composite: bench scores (deterministic + judge)
  • decode_tok_s_median / _p10 / _p90: single-stream Metal decode throughput
  • ttft_ms_median: end-to-end first-token latency (HTTP + decode)
  • peak_rss_gb: max RSS of llama-server during decode
  • runner_wallclock_s: total wall time
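
To consume the probe results programmatically, a minimal sketch along these lines works; it assumes the fields above are top-level keys in telemetry.json, so adjust the glob and key names if the schema nests them differently:

import json
from pathlib import Path

# Pick the most recent 2B probe run; assumes flat top-level keys in telemetry.json.
run_dir = sorted(Path("results/mac_bench").glob("*-2b-probe"))[-1]
telemetry = json.loads((run_dir / "telemetry.json").read_text())

for key in ("tier1_composite", "metrollm_composite", "decode_tok_s_median",
            "ttft_ms_median", "peak_rss_gb", "runner_wallclock_s"):
    print(f"{key:24s} {telemetry.get(key)}")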

Run a sustained-load thermal curve (fanless Macs only, ~45 min)

bash scripts/mac_bench/run_thermal.sh 2b --duration 45m

Replays MARTA cases in a loop while thermal_sampler.py records tok/s + RSS every 30 s. Captures cold → sustained → throttle behaviour. Output: results/mac_bench/<chip>-<ram>gb-2b-thermal/thermal_curve.{csv,json}.
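
To quantify the cold-to-sustained drop from the CSV, a rough sketch like the following can be used; the column names (elapsed_s, decode_tok_s) are assumptions, so check the header row that thermal_sampler.py actually writes:

import csv
from pathlib import Path
from statistics import median

# Assumed CSV columns: elapsed_s, decode_tok_s (verify against the real header).
run_dir = sorted(Path("results/mac_bench").glob("*-2b-thermal"))[-1]
with open(run_dir / "thermal_curve.csv") as f:
    rows = [(float(r["elapsed_s"]), float(r["decode_tok_s"])) for r in csv.DictReader(f)]

cold = median(t for s, t in rows if s < 300)                # first 5 minutes
hot = median(t for s, t in rows if s > rows[-1][0] - 300)   # last 5 minutes
print(f"cold {cold:.1f} tok/s -> sustained {hot:.1f} tok/s ({(1 - hot / cold) * 100:.0f}% drop)")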

Cleanup

rm -rf /tmp/mac-bench

(Optionally run brew uninstall uv llama.cpp if you don't want to keep them around.)

Per-Mac context-size requirements

llama.cpp allocates the full KV cache up front at server start. The defaults already cover the measured p99 conversation length for each model:

Size   Default ctx   KV memory   Total RAM
2B     32,768        1.21 GB     ~4 GB
4B     16,384        2.42 GB     ~6.5 GB
9B     16,384        2.42 GB     ~9.2 GB (tight on 16 GB Macs)

Override with --ctx N (e.g. bash scripts/mac_bench/run_probe.sh 9b --ctx 8192).
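
The KV-cache footprint scales linearly with --ctx, so a lower value buys back RAM almost one-for-one. A back-of-the-envelope sketch of the usual fp16 KV sizing formula follows; the layer/head numbers are illustrative placeholders chosen to roughly reproduce the 2B row above, not the actual Qwen3.5-metro configs:

# fp16 KV cache: 2 tensors (K and V) * layers * kv_heads * head_dim * ctx * 2 bytes.
# n_layers / n_kv_heads / head_dim below are placeholders, not the real model configs.
def kv_cache_gb(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1e9

for n_ctx in (8192, 16384, 32768):
    print(f"ctx {n_ctx:6d}: ~{kv_cache_gb(n_ctx, n_layers=36, n_kv_heads=2, head_dim=128):.2f} GB")

Halving --ctx roughly halves the KV figure in the table, at the cost of truncating any conversation longer than the new window.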

License & attribution

Apache 2.0. See parent project for full citation.
