Remco Hendriks

Update Mac bench dist

2d05890 verified 23 days ago


	---
	license: apache-2.0
	language:
	- en
	tags:
	- transit
	- kiosk
	- benchmark
	- metrollm-bench
	- apple-silicon
	- mac-bench
	---

	# MetroLLM-Bench Mac Probe

	Slim distribution of the [MetroLLM-Bench](https://github.com/continker/metrollm-bench)
	hardware-envelope tools for Apple Silicon. Pulls GGUF weights from
	`continker/Qwen3.5-{2B,4B,9B}-metro-v23`, runs `llama-server` locally on
	Metal, executes a 15-case stratified MARTA probe (and optional sustained-load
	thermal curve), and emits JSON telemetry — decode tok/s, TTFT, peak RAM,
	Tier-1 deterministic accuracy + Tier-2 LLM-judge composite.

	This repo exists so corporate or ephemeral Macs can `git clone` the bench
	without VPN access to the private project repo.

	## Prerequisites (one-time)

	```bash
	brew install uv llama.cpp
	export ANTHROPIC_API_KEY=sk-ant-... # for the Tier-2 LLM judge
	```

	If brew is unavailable: `uv` has a `curl -LsSf https://astral.sh/uv/install.sh | sh`
	fallback; `llama.cpp` ships official Apple Silicon release binaries on GitHub.
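
	Before a long run it can help to confirm that the prerequisites are reachable. The following is a minimal sketch, assuming the `llama.cpp` install put `llama-server` on `PATH`; the `missing_tools` helper is hypothetical and not part of this repo:

	```bash
	# Hypothetical helper (not shipped in this repo): print each named command
	# that cannot be found on PATH, one per line.
	missing_tools() {
		for tool in "$@"; do
			command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
		done
	}

	missing=$(missing_tools uv llama-server)
	if [ -n "$missing" ]; then
		echo "missing tools:" $missing
	fi
	# The Tier-2 judge needs the API key exported in the current shell.
	if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
		echo "ANTHROPIC_API_KEY is not set (needed for the Tier-2 LLM judge)"
	fi
	```

	The sketch only reports problems; it does not install anything or stop the shell.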

	## Run a 15-case probe (~10-30 min depending on Mac)

	```bash
	git clone https://huggingface.co/continker/metrollm-bench-mac /tmp/mac-bench
	cd /tmp/mac-bench
	uv sync
	bash scripts/mac_bench/run_probe.sh 2b
	cat results/mac_bench/*-2b-probe/telemetry.json
	```

	Output captures:
	- `tier1_composite`, `metrollm_composite` — bench scores (deterministic + judge)
	- `decode_tok_s_median` / `_p10` / `_p90` — single-stream Metal decode throughput
	- `ttft_ms_median` — first-token latency end-to-end (HTTP + decode)
	- `peak_rss_gb` — max RSS of `llama-server` during decode
	- `runner_wallclock_s` — total wall time
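
	To compare several runs, the fields listed above can be pulled out of each `telemetry.json`. This is a rough sketch, assuming each key maps directly to a plain number (the exact JSON layout is not documented here); `telemetry_field` is a hypothetical helper, and a real JSON tool such as `jq` would be more robust:

	```bash
	# Hypothetical helper: print the numeric value stored under one key.
	# Assumes the key maps directly to a number; this is a text match, not a JSON parser.
	# $1 = path to telemetry.json, $2 = key name
	telemetry_field() {
		sed -n "s/.*\"$2\"[[:space:]]*:[[:space:]]*\([-+0-9.eE]*\).*/\1/p" "$1" | head -n 1
	}

	for f in results/mac_bench/*-probe/telemetry.json; do
		[ -f "$f" ] || continue   # skip when the glob matches nothing
		echo "$f"
		for key in tier1_composite metrollm_composite decode_tok_s_median ttft_ms_median peak_rss_gb; do
			printf '  %-22s %s\n' "$key" "$(telemetry_field "$f" "$key")"
		done
	done
	```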

	## Run a sustained-load thermal curve (fanless Macs only, ~45 min)

	```bash
	bash scripts/mac_bench/run_thermal.sh 2b --duration 45m
	```

	Replays MARTA cases on a loop while `thermal_sampler.py` records tok/s + RSS
	every 30 s. Captures cold → sustained → throttle behaviour. Output:
	`results/mac_bench/<chip>-<ram>gb-2b-thermal/thermal_curve.{csv,json}`.
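
	One way to summarise the curve is to compare throughput at the start of the run with throughput at the end. The sketch below assumes the CSV has a header row and a throughput column named `tok_s`; that column name and the `throttle_ratio` helper are hypothetical, so check the actual header first:

	```bash
	# Hypothetical helper: ratio of mean throughput over the last N samples
	# to mean throughput over the first N samples (1.00 = no throttling).
	# $1 = CSV path, $2 = column name (assumed: tok_s), $3 = window in samples
	throttle_ratio() {
		awk -F, -v col="${2:-tok_s}" -v n="${3:-10}" '
			NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
			c { v[++rows] = $c }
			END {
				if (!c || rows < 2 * n) { print "n/a"; exit }
				for (i = 1; i <= n; i++) { cold += v[i]; hot += v[rows - n + i] }
				if (cold == 0) { print "n/a"; exit }
				printf "%.2f\n", hot / cold
			}
		' "$1"
	}

	for f in results/mac_bench/*-thermal/thermal_curve.csv; do
		[ -f "$f" ] || continue   # skip when the glob matches nothing
		echo "$f: sustained/cold = $(throttle_ratio "$f")"
	done
	```

	With one sample every 30 s, the default window of 10 samples covers the first and last five minutes of the run.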

	## Cleanup

	```bash
	rm -rf /tmp/mac-bench
	```

	(Optionally run `brew uninstall uv llama.cpp` if you don't want to keep them installed.)

	## Per-Mac context-size requirements

	`llama.cpp` allocates the full KV cache upfront at server start. Defaults
	already cover the measured p99 conversation length per model:

	| Size | Default ctx | KV memory | Total RAM |
	|---|---:|---:|---:|
	| 2B | 32 768 | 1.21 GB | ~4 GB |
	| 4B | 16 384 | 2.42 GB | ~6.5 GB |
	| 9B | 16 384 | 2.42 GB | ~9.2 GB (tight on 16 GB Macs) |

	Override with `--ctx N` (e.g. `bash scripts/mac_bench/run_probe.sh 9b --ctx 8192`).
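
	To gauge how much memory an override saves, the table values can be scaled to the new context size. This is a back-of-the-envelope sketch that assumes KV memory grows in proportion to context length; that usually holds for a standard attention KV cache but has not been verified for these models. `estimate_kv_gb` is a hypothetical helper:

	```bash
	# Hypothetical helper: scale a measured KV-cache size to a new context length.
	# Assumes KV memory is proportional to the context size.
	# $1 = KV memory in GB at the default ctx, $2 = default ctx, $3 = new ctx
	estimate_kv_gb() {
		awk -v kv="$1" -v base="$2" -v ctx="$3" 'BEGIN { printf "%.2f\n", kv * ctx / base }'
	}

	# 9B row from the table above: 2.42 GB at 16 384 tokens, overridden to 8192.
	estimate_kv_gb 2.42 16384 8192   # prints 1.21
	```

	Under that assumption, halving the 9B context to 8192 would free roughly 1.2 GB of the ~9.2 GB total.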

	## License & attribution

	Apache 2.0. See parent project for full citation.