metrollm-bench-mac / README.md
Remco Hendriks
Update Mac bench dist
2d05890 verified
---
license: apache-2.0
language:
- en
tags:
- transit
- kiosk
- benchmark
- metrollm-bench
- apple-silicon
- mac-bench
---
# MetroLLM-Bench Mac Probe
Slim distribution of the [MetroLLM-Bench](https://github.com/continker/metrollm-bench)
hardware-envelope tools for Apple Silicon. Pulls GGUF weights from
`continker/Qwen3.5-{2B,4B,9B}-metro-v23`, runs `llama-server` locally on
Metal, executes a 15-case stratified MARTA probe (and optional sustained-load
thermal curve), and emits JSON telemetry β€” decode tok/s, TTFT, peak RAM,
Tier-1 deterministic accuracy + Tier-2 LLM-judge composite.
This repo exists so corporate or ephemeral Macs can `git clone` the bench
without VPN access to the private project repo.
## Prerequisites (one-time)
```bash
brew install uv llama.cpp
export ANTHROPIC_API_KEY=sk-ant-... # for the Tier-2 LLM judge
```
If brew is unavailable: `uv` has a `curl -LsSf https://astral.sh/uv/install.sh | sh`
fallback; `llama.cpp` ships official Apple Silicon release binaries on GitHub.
## Run a 15-case probe (~10-30 min depending on Mac)
```bash
git clone https://huggingface.co/continker/metrollm-bench-mac /tmp/mac-bench
cd /tmp/mac-bench
uv sync
bash scripts/mac_bench/run_probe.sh 2b
cat results/mac_bench/*-2b-probe/telemetry.json
```
Output captures:
- `tier1_composite`, `metrollm_composite` β€” bench scores (deterministic + judge)
- `decode_tok_s_median` / `_p10` / `_p90` β€” single-stream Metal decode throughput
- `ttft_ms_median` β€” first-token latency end-to-end (HTTP + decode)
- `peak_rss_gb` β€” max RSS of `llama-server` during decode
- `runner_wallclock_s` β€” total wall time
## Run a sustained-load thermal curve (fanless Macs only, ~45 min)
```bash
bash scripts/mac_bench/run_thermal.sh 2b --duration 45m
```
Replays MARTA cases on a loop while `thermal_sampler.py` records tok/s + RSS
every 30 s. Captures cold β†’ sustained β†’ throttle behaviour. Output:
`results/mac_bench/<chip>-<ram>gb-2b-thermal/thermal_curve.{csv,json}`.
## Cleanup
```bash
rm -rf /tmp/mac-bench
```
(Optionally `brew uninstall uv llama.cpp` if you don't keep them around.)
## Per-Mac context-size requirements
`llama.cpp` allocates the full KV cache upfront at server start. Defaults
already cover the measured p99 conversation length per model:
| Size | Default ctx | KV memory | Total RAM |
|---|---:|---:|---:|
| 2B | 32 768 | 1.21 GB | ~4 GB |
| 4B | 16 384 | 2.42 GB | ~6.5 GB |
| 9B | 16 384 | 2.42 GB | ~9.2 GB (tight on 16 GB Macs) |
Override with `--ctx N` (e.g. `bash scripts/mac_bench/run_probe.sh 9b --ctx 8192`).
## License & attribution
Apache 2.0. See parent project for full citation.