| --- |
| license: apache-2.0 |
| language: |
| - en |
| tags: |
| - transit |
| - kiosk |
| - benchmark |
| - metrollm-bench |
| - apple-silicon |
| - mac-bench |
| --- |
| |
| # MetroLLM-Bench Mac Probe |
|
|
| Slim distribution of the [MetroLLM-Bench](https://github.com/continker/metrollm-bench) |
| hardware-envelope tools for Apple Silicon. Pulls GGUF weights from |
| `continker/Qwen3.5-{2B,4B,9B}-metro-v23`, runs `llama-server` locally on |
| Metal, executes a 15-case stratified MARTA probe (and optional sustained-load |
| thermal curve), and emits JSON telemetry β decode tok/s, TTFT, peak RAM, |
| Tier-1 deterministic accuracy + Tier-2 LLM-judge composite. |
|
|
| This repo exists so corporate or ephemeral Macs can `git clone` the bench |
| without VPN access to the private project repo. |
|
|
| ## Prerequisites (one-time) |
|
|
| ```bash |
| brew install uv llama.cpp |
| export ANTHROPIC_API_KEY=sk-ant-... # for the Tier-2 LLM judge |
| ``` |
|
|
| If brew is unavailable: `uv` has a `curl -LsSf https://astral.sh/uv/install.sh | sh` |
| fallback; `llama.cpp` ships official Apple Silicon release binaries on GitHub. |
|
|
| ## Run a 15-case probe (~10-30 min depending on Mac) |
|
|
| ```bash |
| git clone https://huggingface.co/continker/metrollm-bench-mac /tmp/mac-bench |
| cd /tmp/mac-bench |
| uv sync |
| bash scripts/mac_bench/run_probe.sh 2b |
| cat results/mac_bench/*-2b-probe/telemetry.json |
| ``` |
|
|
| Output captures: |
| - `tier1_composite`, `metrollm_composite` β bench scores (deterministic + judge) |
| - `decode_tok_s_median` / `_p10` / `_p90` β single-stream Metal decode throughput |
| - `ttft_ms_median` β first-token latency end-to-end (HTTP + decode) |
| - `peak_rss_gb` β max RSS of `llama-server` during decode |
| - `runner_wallclock_s` β total wall time |
|
|
| ## Run a sustained-load thermal curve (fanless Macs only, ~45 min) |
|
|
| ```bash |
| bash scripts/mac_bench/run_thermal.sh 2b --duration 45m |
| ``` |
|
|
| Replays MARTA cases on a loop while `thermal_sampler.py` records tok/s + RSS |
| every 30 s. Captures cold β sustained β throttle behaviour. Output: |
| `results/mac_bench/<chip>-<ram>gb-2b-thermal/thermal_curve.{csv,json}`. |
|
|
| ## Cleanup |
|
|
| ```bash |
| rm -rf /tmp/mac-bench |
| ``` |
|
|
| (Optionally `brew uninstall uv llama.cpp` if you don't keep them around.) |
|
|
| ## Per-Mac context-size requirements |
|
|
| `llama.cpp` allocates the full KV cache upfront at server start. Defaults |
| already cover the measured p99 conversation length per model: |
|
|
| | Size | Default ctx | KV memory | Total RAM | |
| |---|---:|---:|---:| |
| | 2B | 32 768 | 1.21 GB | ~4 GB | |
| | 4B | 16 384 | 2.42 GB | ~6.5 GB | |
| | 9B | 16 384 | 2.42 GB | ~9.2 GB (tight on 16 GB Macs) | |
|
|
| Override with `--ctx N` (e.g. `bash scripts/mac_bench/run_probe.sh 9b --ctx 8192`). |
|
|
| ## License & attribution |
|
|
| Apache 2.0. See parent project for full citation. |
|
|