FoolDev Claude Opus 4.7 committed on
Commit d344201 · 1 Parent(s): 6fa50d3

Add bench.sh + make bench for measured tok/s

The README's hardware table called itself estimates ("Numbers above are
estimates" in Known limitations). This wires up Ollama's authoritative
eval_count / eval_duration metadata into a repeatable benchmark so
anyone can replace estimates with measurements for their own machine.

The script:
- warms up first (model-load cost shouldn't pollute the first sample)
- mixes short / medium / long output prompts (single shape would lock
the average onto one decode pattern)
- reads timings from Ollama's response, not a client stopwatch (no
network or jq overhead in the number)
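
As a minimal sketch of the arithmetic behind the last bullet: Ollama reports `eval_count` as a token count and `eval_duration` in nanoseconds, so tok/s is the count divided by the duration in seconds. The helper name `tok_per_s` and the sample values are hypothetical, not taken from the commit, and `awk` is swapped in here for the `jq` arithmetic the script itself uses.

```shell
#!/usr/bin/env bash
# Sketch only: convert Ollama-style timing metadata into tok/s.
# tok_per_s is a hypothetical helper; the committed script does this with jq.
tok_per_s() {
  local eval_count="$1" eval_ns="$2"
  # A zero duration would divide by zero, so report failure instead.
  if [ "$eval_ns" -eq 0 ]; then
    echo "0.00"
    return 1
  fi
  # Nanoseconds -> seconds, then truncate to two decimals (as the script's floor does).
  awk -v c="$eval_count" -v n="$eval_ns" \
    'BEGIN { r = c / (n / 1e9); printf "%.2f\n", int(r * 100) / 100 }'
}

# Hypothetical sample, not a measurement: 90 tokens generated in 20 s.
tok_per_s 90 20000000000   # prints 4.50
```

Because both inputs come from the server's own response, the result excludes HTTP round-trip time and client-side JSON parsing.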

Reference data point added inline to the hardware section: Ryzen AI
Max+ 395 / Radeon 8060S iGPU at Q3_K_S clocks ~4.5 tok/s — sits
between the CPU-only 1-3 tok/s row and the 24 GB discrete card row,
which matches expectation for an iGPU with unified memory.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Files changed (4)
  1. CHANGELOG.md +8 -0
  2. Makefile +4 -1
  3. README.md +10 -0
  4. scripts/bench.sh +98 -0
CHANGELOG.md CHANGED
@@ -25,6 +25,14 @@ and documentation**, not the underlying base model.
   user turn. Reproducible via `make smoke` before/after.
 
 ### Added
+- `scripts/bench.sh` + `make bench`: tok/s benchmark using Ollama's
+  `eval_count` / `eval_duration` response metadata (avoids client-side
+  stopwatch noise). Mixes short / medium / long prompts so the average
+  doesn't lock onto a single output shape, and warms up first so the
+  model-load cost doesn't pollute the first sample. Reference data:
+  Ryzen AI Max+ 395 / Radeon 8060S iGPU clocks ~4.5 tok/s at Q3_K_S.
+  README hardware section now points readers at `make bench` to measure
+  their own rather than trusting the estimated table.
 - `scripts/smoke_test.sh`: token-leakage guard. The previous round-trip
   check passed on any non-empty response — including the broken EOS-bleed
   output that motivated the stop-token fix above. The smoke test now
Makefile CHANGED
@@ -25,7 +25,7 @@ MODEL ?= $(TAG)
 
 PRECISION ?= F16
 
-.PHONY: help build smoke check hooks mmproj clean
+.PHONY: help build smoke bench check hooks mmproj clean
 
 help: ## Show this help.
 	@awk 'BEGIN {FS = ":.*##"; printf "Targets:\n"} /^[a-zA-Z_-]+:.*?##/ { printf " \033[36m%-12s\033[0m %s\n", $$1, $$2 }' $(MAKEFILE_LIST)
@@ -42,6 +42,9 @@ build: ## Download GGUF (if needed) and run 'ollama create'.
 smoke: ## Verify the model is reachable and round-trips.
 	MODEL=$(MODEL) ./scripts/smoke_test.sh
 
+bench: ## Measure tok/s using Ollama's eval timing (3 prompts).
+	MODEL=$(MODEL) ./scripts/bench.sh
+
 mmproj: ## Fetch the vision projector for llama.cpp (Ollama vision is broken upstream).
 	./scripts/fetch_mmproj.sh $(PRECISION)
 
README.md CHANGED
@@ -107,6 +107,7 @@ The 27B is **dense**: every parameter participates in every forward pass. It's s
 | `examples/` | Ready-to-run Python clients for Ollama, Transformers, and llama-cpp-python |
 | `scripts/build.sh` | One-shot helper: pulls a GGUF and runs `ollama create` for you |
 | `scripts/smoke_test.sh` | Verifies an Ollama daemon + model, runs a round-trip, and asserts no chat-template tokens leak into the response |
+| `scripts/bench.sh` | Measures real tok/s using Ollama's `eval_count` / `eval_duration` metadata over a 3-prompt mix (run `make bench`) |
 | `scripts/fetch_mmproj.sh` | Pulls the vision projector for llama.cpp (Ollama vision is broken upstream — see [Vision](#vision)) |
 | `scripts/check.sh` | Local lint: `bash -n`, `pyflakes`, `py_compile`, footgun-grep |
 | `scripts/install-hooks.sh` | Installs `check.sh` as a git pre-commit hook |
@@ -257,6 +258,15 @@ The dense 27B is the easier of the two Janus models to deploy.
 | Mac Studio M2/M3 32 GB+ unified | Works, ~15-25 tok/s |
 | 32 GB unified-memory laptops (Mac M-series, Ryzen AI Max+, etc.) | Borderline at Q4. Drop to Q3_K_S (~12 GB) and trim `num_ctx` for headroom. |
 
+Most numbers in this table are estimates from comparable models; the
+gradient is right but the absolute values will move ±20% with prompt
+shape, KV cache type, and parallel-request count. Measure your own
+machine with `make bench` (3-prompt mix, reports tok/s from Ollama's
+`eval_count` / `eval_duration` so it's not stopwatch-noisy). Reference
+data point: a Ryzen AI Max+ 395 / Radeon 8060S iGPU at Q3_K_S clocks
+**~4.5 tok/s**, sitting between CPU-only and a 24 GB discrete card as
+expected.
+
 ## Chat template
 
 Standard Qwen 3.x ChatML with `<|im_start|>` / `<|im_end|>` role markers,
scripts/bench.sh ADDED
@@ -0,0 +1,98 @@
+#!/usr/bin/env bash
+# Janus-27B — tok/s benchmark via Ollama.
+#
+# Reads timing from Ollama's /api/chat response metadata (eval_count and
+# eval_duration are authoritative — no client-side stopwatch noise) and
+# averages over a handful of prompts that vary in output length so the
+# number generalises a bit beyond a single shape.
+#
+# Usage:
+#   ./scripts/bench.sh # uses MODEL=janus-27b
+#   MODEL=janus-27b NPROMPTS=5 ./scripts/bench.sh
+#   HOST=http://localhost:11434 ./scripts/bench.sh
+#
+# Requires: curl, jq, a running Ollama daemon with the model created.
+set -euo pipefail
+
+MODEL="${MODEL:-janus-27b}"
+HOST="${HOST:-http://localhost:11434}"
+
+red() { printf "\033[31m%s\033[0m\n" "$*"; }
+green() { printf "\033[32m%s\033[0m\n" "$*"; }
+blue() { printf "\033[34m%s\033[0m\n" "$*"; }
+
+for dep in curl jq; do
+  if ! command -v "$dep" >/dev/null 2>&1; then
+    red "[!] missing dependency: $dep"; exit 1
+  fi
+done
+
+if ! curl -fsS "${HOST}/api/tags" >/dev/null; then
+  red "[!] Ollama not reachable at ${HOST}"
+  exit 1
+fi
+if ! curl -fsS "${HOST}/api/tags" | jq -e --arg m "${MODEL}" '.models[] | select(.name | startswith($m))' >/dev/null; then
+  red "[!] Model '${MODEL}' not found. Build it first: ./scripts/build.sh"
+  exit 1
+fi
+
+# Mix of short / medium / long output lengths — single shape would skew
+# the average toward whatever the model decides to do for that prompt.
+PROMPTS=(
+  "Reply with only the word OK."
+  "Explain the time complexity of mergesort in one short paragraph."
+  "Write a 120-word explanation of what a Bloom filter is and when to use it."
+)
+
+blue "[*] host: ${HOST}"
+blue "[*] model: ${MODEL}"
+blue "[*] prompts: ${#PROMPTS[@]}"
+echo
+
+# Warmup — first call pays the model-load cost; we don't want that in
+# the average. Result is discarded.
+blue "[*] warmup..."
+curl -fsS "${HOST}/api/chat" \
+  -H 'Content-Type: application/json' \
+  -d "$(jq -n --arg m "${MODEL}" '{
+    model: $m,
+    messages: [{role:"user", content:"warmup"}],
+    stream: false
+  }')" >/dev/null
+
+TOTAL_TOKENS=0
+TOTAL_NS=0
+
+printf "%-4s %8s %12s %8s\n" "#" "tokens" "eval_ms" "tok/s"
+printf "%-4s %8s %12s %8s\n" "----" "--------" "------------" "--------"
+
+for i in "${!PROMPTS[@]}"; do
+  prompt="${PROMPTS[$i]}"
+  resp="$(curl -fsS "${HOST}/api/chat" \
+    -H 'Content-Type: application/json' \
+    -d "$(jq -n --arg m "${MODEL}" --arg p "$prompt" '{
+      model: $m,
+      messages: [{role:"user", content:$p}],
+      stream: false
+    }')")"
+
+  eval_count="$(jq -r '.eval_count // 0' <<<"$resp")"
+  eval_ns="$(jq -r '.eval_duration // 0' <<<"$resp")"
+
+  if [[ "$eval_count" -eq 0 || "$eval_ns" -eq 0 ]]; then
+    red "[!] prompt $((i+1)) returned no timing data"
+    echo "$resp" | jq -r '.message.content // .' | head -3
+    exit 1
+  fi
+
+  eval_ms=$(( eval_ns / 1000000 ))
+  toks_per_s="$(jq -n --argjson c "$eval_count" --argjson n "$eval_ns" '($c / ($n / 1000000000)) | . * 100 | floor / 100')"
+  printf "%-4s %8s %12s %8s\n" "$((i+1))" "$eval_count" "$eval_ms" "$toks_per_s"
+
+  TOTAL_TOKENS=$(( TOTAL_TOKENS + eval_count ))
+  TOTAL_NS=$(( TOTAL_NS + eval_ns ))
+done
+
+echo
+avg="$(jq -n --argjson c "$TOTAL_TOKENS" --argjson n "$TOTAL_NS" '($c / ($n / 1000000000)) | . * 100 | floor / 100')"
+green "[+] aggregate: ${TOTAL_TOKENS} tokens / $(( TOTAL_NS / 1000000 )) ms = ${avg} tok/s"
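
The final line of the script divides total tokens by total eval time, which is a token-weighted aggregate: long replies contribute more than short ones. A rough sketch of how that can differ from an unweighted mean of per-prompt rates follows. The function name `summarise_rates` and the three samples are hypothetical, and `awk` stands in for the script's `jq` arithmetic.

```shell
#!/usr/bin/env bash
# Sketch only: token-weighted aggregate vs. unweighted mean of per-prompt rates.
# Input on stdin: one "eval_count eval_duration_ns" pair per line.
summarise_rates() {
  awk '
    NF == 2 && $2 > 0 {
      tokens += $1; ns += $2            # totals, as the script accumulates them
      rate_sum += $1 / ($2 / 1e9); n++  # per-prompt tok/s, for the plain mean
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "aggregate=%.2f mean_of_rates=%.2f\n", tokens / (ns / 1e9), rate_sum / n
    }'
}

# Hypothetical samples (not measurements): a short, a medium, and a long reply.
printf '%s\n' "2 1000000000" "80 16000000000" "160 32000000000" | summarise_rates
# prints: aggregate=4.94 mean_of_rates=4.00
```

In this made-up case the two-token reply drags the unweighted mean down to 4.00, while the aggregate stays near the rate of the longer replies. That suggests why summing tokens and nanoseconds is the steadier summary for a mixed-length prompt set.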