MetroLLM-Bench Mac Probe
Slim distribution of the MetroLLM-Bench
hardware-envelope tools for Apple Silicon. Pulls GGUF weights from
continker/Qwen3.5-{2B,4B,9B}-metro-v23, runs llama-server locally on
Metal, executes a 15-case stratified MARTA probe (and optional sustained-load
thermal curve), and emits JSON telemetry: decode tok/s, TTFT, peak RAM,
Tier-1 deterministic accuracy, and the Tier-2 LLM-judge composite.
This repo exists so corporate or ephemeral Macs can git clone the bench
without VPN access to the private project repo.
Prerequisites (one-time)
brew install uv llama.cpp
export ANTHROPIC_API_KEY=sk-ant-... # for the Tier-2 LLM judge
If brew is unavailable, uv can be installed with curl -LsSf https://astral.sh/uv/install.sh | sh,
and llama.cpp ships official Apple Silicon release binaries on GitHub.
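Before the first run it can help to confirm that the prerequisites are in place. The following is a minimal sketch, not part of the repo: the function name preflight is hypothetical, and it assumes the llama.cpp install puts a llama-server binary on PATH (the binary name used elsewhere in this README).

```python
import os
import shutil


def preflight(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready.

    `env` and `which` are injectable so the check can be exercised without
    touching the real environment (an illustrative design choice).
    """
    env = os.environ if env is None else env
    problems = []
    # Tools the README installs via brew (or the documented fallbacks).
    for tool in ("uv", "llama-server"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    # The Tier-2 LLM judge needs an API key, per the prerequisites above.
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set (needed for the Tier-2 judge)")
    return problems
```

Calling preflight() with no arguments checks the current shell environment; printing the returned list shows what still needs to be installed or exported.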
Run a 15-case probe (~10-30 min depending on Mac)
git clone https://huggingface.co/continker/metrollm-bench-mac /tmp/mac-bench
cd /tmp/mac-bench
uv sync
bash scripts/mac_bench/run_probe.sh 2b
cat results/mac_bench/*-2b-probe/telemetry.json
Output captures:
- tier1_composite, metrollm_composite: bench scores (deterministic + judge)
- decode_tok_s_median / _p10 / _p90: single-stream Metal decode throughput
- ttft_ms_median: first-token latency end-to-end (HTTP + decode)
- peak_rss_gb: max RSS of llama-server during decode
- runner_wallclock_s: total wall time
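To compare several runs side by side, the fields listed above can be pulled out of each telemetry.json. This is a rough sketch under one assumption that the README does not confirm: that these fields sit at the top level of the JSON object. The helper names summarize and load_runs are hypothetical.

```python
import json
from pathlib import Path

# Field names as listed in this README; a flat top-level layout is assumed.
FIELDS = [
    "tier1_composite",
    "metrollm_composite",
    "decode_tok_s_median",
    "ttft_ms_median",
    "peak_rss_gb",
    "runner_wallclock_s",
]


def summarize(telemetry):
    """Return the headline fields, with None for any that are absent."""
    return {name: telemetry.get(name) for name in FIELDS}


def load_runs(results_dir="results/mac_bench"):
    """Map each probe directory name to its summarized telemetry."""
    runs = {}
    for path in sorted(Path(results_dir).glob("*-probe/telemetry.json")):
        runs[path.parent.name] = summarize(json.loads(path.read_text()))
    return runs
```

If the real file nests these values under another key, summarize would return None for them, which makes a layout mismatch easy to spot.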
Run a sustained-load thermal curve (fanless Macs only, ~45 min)
bash scripts/mac_bench/run_thermal.sh 2b --duration 45m
Replays MARTA cases on a loop while thermal_sampler.py records tok/s + RSS
every 30 s. Captures cold → sustained → throttle behaviour. Output:
results/mac_bench/<chip>-<ram>gb-2b-thermal/thermal_curve.{csv,json}.
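One way to condense the curve into a single number is to compare late-run throughput with early-run throughput. The sketch below is illustrative only: the column name tok_s is a hypothetical placeholder (check the header of your thermal_curve.csv), and the five-sample window is an arbitrary choice, not something the bench defines.

```python
import csv
from statistics import median


def throttle_ratio(tok_s, window=5):
    """Ratio of late-run to early-run median throughput.

    A value near 1.0 suggests no throttling; lower values mean the
    sustained rate fell below the cold-start rate.
    """
    if len(tok_s) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    cold = median(tok_s[:window])
    sustained = median(tok_s[-window:])
    return sustained / cold


def read_tok_s(csv_path, column="tok_s"):
    # "tok_s" is a hypothetical column name, not confirmed by the README.
    with open(csv_path, newline="") as fh:
        return [float(row[column]) for row in csv.DictReader(fh)]
```

With samples taken every 30 s, a window of 5 covers the first and last 2.5 minutes of the run.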
Cleanup
rm -rf /tmp/mac-bench
(Optionally run brew uninstall uv llama.cpp if you don't want to keep them installed.)
Per-Mac context-size requirements
llama.cpp allocates the full KV cache upfront at server start. Defaults
already cover the measured p99 conversation length per model:
| Size | Default ctx | KV memory | Total RAM |
|---|---|---|---|
| 2B | 32 768 | 1.21 GB | ~4 GB |
| 4B | 16 384 | 2.42 GB | ~6.5 GB |
| 9B | 16 384 | 2.42 GB | ~9.2 GB (tight on 16 GB Macs) |
Override with --ctx N (e.g. bash scripts/mac_bench/run_probe.sh 9b --ctx 8192).
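To judge whether a smaller --ctx will fit a given Mac, the table can be rescaled. This is a back-of-the-envelope sketch, assuming the KV cache grows linearly with context length and that the rest of the footprint (weights, buffers) stays fixed; the totals in the table are themselves approximate, so treat the result as a rough guide. The function name estimate_ram is hypothetical.

```python
# Defaults copied from the table above: (default ctx, KV GB, approx total GB).
DEFAULTS = {
    "2b": (32768, 1.21, 4.0),
    "4b": (16384, 2.42, 6.5),
    "9b": (16384, 2.42, 9.2),
}


def estimate_ram(size, ctx):
    """Return (KV GB, total GB) for a given --ctx.

    Assumes linear KV scaling with ctx and a fixed non-KV footprint.
    """
    default_ctx, kv_gb, total_gb = DEFAULTS[size]
    new_kv = kv_gb * ctx / default_ctx
    return new_kv, total_gb - kv_gb + new_kv
```

For example, estimate_ram("9b", 8192) halves the 9B KV cache to about 1.21 GB and brings the estimated total to roughly 8 GB.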
License & attribution
Apache 2.0. See parent project for full citation.