---
license: apache-2.0
language:
- en
tags:
- transit
- kiosk
- benchmark
- metrollm-bench
- apple-silicon
- mac-bench
---

# MetroLLM-Bench Mac Probe

Slim distribution of the [MetroLLM-Bench](https://github.com/continker/metrollm-bench) hardware-envelope tools for Apple Silicon. Pulls GGUF weights from `continker/Qwen3.5-{2B,4B,9B}-metro-v23`, runs `llama-server` locally on Metal, executes a 15-case stratified MARTA probe (and an optional sustained-load thermal curve), and emits JSON telemetry — decode tok/s, TTFT, peak RAM, and the Tier-1 deterministic accuracy + Tier-2 LLM-judge composite.

This repo exists so corporate or ephemeral Macs can `git clone` the bench without VPN access to the private project repo.

## Prerequisites (one-time)

```bash
brew install uv llama.cpp
export ANTHROPIC_API_KEY=sk-ant-...   # for the Tier-2 LLM judge
```

If Homebrew is unavailable: `uv` can be installed with `curl -LsSf https://astral.sh/uv/install.sh | sh`, and `llama.cpp` ships official Apple Silicon release binaries on GitHub.
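Before the first run, a quick preflight check catches the common setup gaps. This is a sketch, not part of the bench scripts; it assumes the Homebrew `llama.cpp` formula put `llama-server` on your `PATH`:

```bash
# Hypothetical preflight check; not part of the repo's scripts.
command -v uv >/dev/null           || echo "missing: uv"
command -v llama-server >/dev/null || echo "missing: llama-server (from llama.cpp)"
[ -n "${ANTHROPIC_API_KEY:-}" ]    || echo "unset: ANTHROPIC_API_KEY (Tier-2 judge will fail)"
```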
## Run a 15-case probe (~10-30 min depending on Mac)

```bash
git clone https://huggingface.co/continker/metrollm-bench-mac /tmp/mac-bench
cd /tmp/mac-bench
uv sync
bash scripts/mac_bench/run_probe.sh 2b
cat results/mac_bench/*-2b-probe/telemetry.json
```

Output captures:

- `tier1_composite`, `metrollm_composite` — bench scores (deterministic + judge)
- `decode_tok_s_median` / `_p10` / `_p90` — single-stream Metal decode throughput
- `ttft_ms_median` — first-token latency end-to-end (HTTP + decode)
- `peak_rss_gb` — max RSS of `llama-server` during decode
- `runner_wallclock_s` — total wall time
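To pull just the headline numbers from a finished probe, something like the following works (a sketch; it assumes `jq` is installed and that the fields above sit at the top level of `telemetry.json`):

```bash
# Sketch: extract the headline metrics; assumes top-level fields and jq on PATH
jq '{metrollm_composite, decode_tok_s_median, ttft_ms_median, peak_rss_gb}' \
  results/mac_bench/*-2b-probe/telemetry.json
```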
## Run a sustained-load thermal curve (fanless Macs only, ~45 min)

```bash
bash scripts/mac_bench/run_thermal.sh 2b --duration 45m
```

Replays MARTA cases in a loop while `thermal_sampler.py` records tok/s + RSS every 30 s, capturing cold → sustained → throttle behaviour. Output: `results/mac_bench/*-2b-thermal/thermal_curve.{csv,json}`.
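To eyeball the throttle ratio without plotting, a rough cut over the CSV can work. This is a sketch under assumptions not stated above: that `thermal_curve.csv` has a header row and that tok/s is the second column; check the actual header first.

```bash
# Sketch: compare first vs. last sample. ASSUMES tok/s is column 2 after a header row.
awk -F, 'NR==2 {cold=$2} NR>=2 {last=$2} END {printf "cold %.1f -> sustained %.1f tok/s (%.0f%% retained)\n", cold, last, 100*last/cold}' \
  results/mac_bench/*-2b-thermal/thermal_curve.csv
```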
## Cleanup

```bash
rm -rf /tmp/mac-bench
```

(Optionally `brew uninstall uv llama.cpp` if you don't keep them around.)

## Per-Mac context-size requirements

`llama.cpp` allocates the full KV cache upfront at server start. The defaults already cover the measured p99 conversation length per model:

| Size | Default ctx | KV memory | Total RAM |
|---|---:|---:|---:|
| 2B | 32 768 | 1.21 GB | ~4 GB |
| 4B | 16 384 | 2.42 GB | ~6.5 GB |
| 9B | 16 384 | 2.42 GB | ~9.2 GB (tight on 16 GB Macs) |

Override with `--ctx N` (e.g. `bash scripts/mac_bench/run_probe.sh 9b --ctx 8192`); a sketch for picking a value by installed RAM follows.
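A hypothetical wrapper for that decision on the 9B, using macOS's `sysctl hw.memsize` to read installed RAM in bytes; the 16 GB threshold follows the table above, and 8192 is the example override, not a tuned recommendation:

```bash
# Sketch: shrink the 9B context on 16 GB Macs. Threshold and ctx value are illustrative.
ram_gb=$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))
if [ "$ram_gb" -le 16 ]; then
  bash scripts/mac_bench/run_probe.sh 9b --ctx 8192
else
  bash scripts/mac_bench/run_probe.sh 9b
fi
```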
## License & attribution

Apache 2.0. See parent project for full citation.