MetroLLM-Bench Mac Probe

Slim distribution of the MetroLLM-Bench hardware-envelope tools for Apple Silicon. It pulls GGUF weights from continker/Qwen3.5-{2B,4B,9B}-metro-v23, runs llama-server locally on Metal, executes a 15-case stratified MARTA probe (plus an optional sustained-load thermal curve), and emits JSON telemetry: decode tok/s, TTFT, peak RAM, and Tier-1 deterministic accuracy plus a Tier-2 LLM-judge composite.

This repo exists so the bench can be git cloned onto corporate or ephemeral Macs that lack VPN access to the private project repo.

Prerequisites (one-time)

brew install uv llama.cpp
export ANTHROPIC_API_KEY=sk-ant-...   # for the Tier-2 LLM judge

If brew is unavailable, uv can be installed with curl -LsSf https://astral.sh/uv/install.sh | sh, and llama.cpp publishes official Apple Silicon release binaries on GitHub.

Run a 15-case probe (~10-30 min depending on the Mac)

git clone https://huggingface.co/continker/metrollm-bench-mac /tmp/mac-bench
cd /tmp/mac-bench
uv sync
bash scripts/mac_bench/run_probe.sh 2b
cat results/mac_bench/*-2b-probe/telemetry.json

Output captures:

  • tier1_composite, metrollm_composite: bench scores (deterministic + judge)
  • decode_tok_s_median / _p10 / _p90: single-stream Metal decode throughput
  • ttft_ms_median: end-to-end first-token latency (HTTP + decode)
  • peak_rss_gb: max RSS of llama-server during decode
  • runner_wallclock_s: total wall time
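
To consume the probe results programmatically, a minimal sketch along these lines works; it assumes the fields above are top-level keys in telemetry.json, so adjust the glob and key names if the schema nests them differently:

import json
from pathlib import Path

# Pick the most recent 2B probe run; assumes flat top-level keys in telemetry.json.
run_dir = sorted(Path("results/mac_bench").glob("*-2b-probe"))[-1]
telemetry = json.loads((run_dir / "telemetry.json").read_text())

for key in ("tier1_composite", "metrollm_composite", "decode_tok_s_median",
            "ttft_ms_median", "peak_rss_gb", "runner_wallclock_s"):
    print(f"{key:24s} {telemetry.get(key)}")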

Run a sustained-load thermal curve (fanless Macs only, ~45 min)

bash scripts/mac_bench/run_thermal.sh 2b --duration 45m

Replays MARTA cases in a loop while thermal_sampler.py records tok/s + RSS every 30 s. Captures cold → sustained → throttle behaviour. Output: results/mac_bench/<chip>-<ram>gb-2b-thermal/thermal_curve.{csv,json}.
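
To quantify the cold-to-sustained drop from the CSV, a rough sketch like the following can be used; the column names (elapsed_s, decode_tok_s) are assumptions, so check the header row that thermal_sampler.py actually writes:

import csv
from pathlib import Path
from statistics import median

# Assumed CSV columns: elapsed_s, decode_tok_s (verify against the real header).
run_dir = sorted(Path("results/mac_bench").glob("*-2b-thermal"))[-1]
with open(run_dir / "thermal_curve.csv") as f:
    rows = [(float(r["elapsed_s"]), float(r["decode_tok_s"])) for r in csv.DictReader(f)]

cold = median(t for s, t in rows if s < 300)                # first 5 minutes
hot = median(t for s, t in rows if s > rows[-1][0] - 300)   # last 5 minutes
print(f"cold {cold:.1f} tok/s -> sustained {hot:.1f} tok/s ({(1 - hot / cold) * 100:.0f}% drop)")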

Cleanup

rm -rf /tmp/mac-bench

(Optionally run brew uninstall uv llama.cpp if you don't want to keep them around.)

Per-Mac context-size requirements

llama.cpp allocates the full KV cache up front at server start. The defaults already cover the measured p99 conversation length for each model:

Size   Default ctx   KV memory   Total RAM
2B     32,768        1.21 GB     ~4 GB
4B     16,384        2.42 GB     ~6.5 GB
9B     16,384        2.42 GB     ~9.2 GB (tight on 16 GB Macs)

Override with --ctx N (e.g. bash scripts/mac_bench/run_probe.sh 9b --ctx 8192).
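
The KV-cache footprint scales linearly with --ctx, so a lower value buys back RAM almost one-for-one. A back-of-the-envelope sketch of the usual fp16 KV sizing formula follows; the layer/head numbers are illustrative placeholders chosen to roughly reproduce the 2B row above, not the actual Qwen3.5-metro configs:

# fp16 KV cache: 2 tensors (K and V) * layers * kv_heads * head_dim * ctx * 2 bytes.
# n_layers / n_kv_heads / head_dim below are placeholders, not the real model configs.
def kv_cache_gb(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1e9

for n_ctx in (8192, 16384, 32768):
    print(f"ctx {n_ctx:6d}: ~{kv_cache_gb(n_ctx, n_layers=36, n_kv_heads=2, head_dim=128):.2f} GB")

Halving --ctx roughly halves the KV figure in the table, at the cost of truncating any conversation longer than the new window.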

License & attribution

Apache 2.0. See parent project for full citation.
