File size: 2,644 Bytes
2d05890
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
---
license: apache-2.0
language:
- en
tags:
- transit
- kiosk
- benchmark
- metrollm-bench
- apple-silicon
- mac-bench
---

# MetroLLM-Bench Mac Probe

Slim distribution of the [MetroLLM-Bench](https://github.com/continker/metrollm-bench)
hardware-envelope tools for Apple Silicon. Pulls GGUF weights from
`continker/Qwen3.5-{2B,4B,9B}-metro-v23`, runs `llama-server` locally on
Metal, executes a 15-case stratified MARTA probe (and optional sustained-load
thermal curve), and emits JSON telemetry — decode tok/s, TTFT, peak RAM,
Tier-1 deterministic accuracy + Tier-2 LLM-judge composite.

This repo exists so corporate or ephemeral Macs can `git clone` the bench
without VPN access to the private project repo.

## Prerequisites (one-time)

```bash
brew install uv llama.cpp
export ANTHROPIC_API_KEY=sk-ant-...   # for the Tier-2 LLM judge
```

If brew is unavailable: `uv` has a `curl -LsSf https://astral.sh/uv/install.sh | sh`
fallback; `llama.cpp` ships official Apple Silicon release binaries on GitHub.

## Run a 15-case probe (~10-30 min depending on Mac)

```bash
git clone https://huggingface.co/continker/metrollm-bench-mac /tmp/mac-bench
cd /tmp/mac-bench
uv sync
bash scripts/mac_bench/run_probe.sh 2b
cat results/mac_bench/*-2b-probe/telemetry.json
```

Output captures:
- `tier1_composite`, `metrollm_composite` — bench scores (deterministic + judge)
- `decode_tok_s_median` / `_p10` / `_p90` — single-stream Metal decode throughput
- `ttft_ms_median` — first-token latency end-to-end (HTTP + decode)
- `peak_rss_gb` — max RSS of `llama-server` during decode
- `runner_wallclock_s` — total wall time

## Run a sustained-load thermal curve (fanless Macs only, ~45 min)

```bash
bash scripts/mac_bench/run_thermal.sh 2b --duration 45m
```

Replays MARTA cases on a loop while `thermal_sampler.py` records tok/s + RSS
every 30 s. Captures cold → sustained → throttle behaviour. Output:
`results/mac_bench/<chip>-<ram>gb-2b-thermal/thermal_curve.{csv,json}`.

## Cleanup

```bash
rm -rf /tmp/mac-bench
```

(Optionally `brew uninstall uv llama.cpp` if you don't keep them around.)

## Per-Mac context-size requirements

`llama.cpp` allocates the full KV cache upfront at server start. Defaults
already cover the measured p99 conversation length per model:

| Size | Default ctx | KV memory | Total RAM |
|---|---:|---:|---:|
| 2B | 32 768 | 1.21 GB | ~4 GB |
| 4B | 16 384 | 2.42 GB | ~6.5 GB |
| 9B | 16 384 | 2.42 GB | ~9.2 GB (tight on 16 GB Macs) |

Override with `--ctx N` (e.g. `bash scripts/mac_bench/run_probe.sh 9b --ctx 8192`).

## License & attribution

Apache 2.0. See parent project for full citation.