M4 Mac mini benchmark + UI freeze diagnosis (reasoning trace buffering)

#1
by John-Williams-ATL - opened


Ran a structured diagnostic session on the Space today — March 20 2026 — on an Apple M4 Mac mini 24GB. Sharing findings because the model card is currently empty and other Mac users are going to hit the same things.


Hardware / browser

  • Apple M4 Mac mini, 24GB unified memory
  • macOS Sequoia 15
  • Chrome 123 (WebGPU enabled)
  • No competing GPU processes during clean benchmark run

Benchmark result

~25 tok/s on WebGPU, uncontested GPU.

JS heap stays flat at ~16–17MB throughout — correct behaviour, model weights live in GPU memory outside the JS heap. Not a memory leak.


The freeze — diagnosed

The model is not crashing. Here is what is actually happening.

Switching the console context to the Space's iframe (webml-community-nemotron-3-nano-webgpu.static.hf.space) and re-running the JSON chart prompt showed Streamed output: entries arriving continuously from index-CgnC9ASC.js:35, containing the full reasoning trace followed by the correct final answer:

Streamed output: <think>
... [full reasoning trace] ...
Streamed output: </think>
Streamed output: {
Streamed output:   "labels": ["January", "February", "March", "April", "May"],
Streamed output:   "values": [500, 600, 550, 700, 800],
Streamed output:   "title": "Monthly Sales Performance"
Streamed output: }
Streamed output: <|im_end|>

The worker is completing generation successfully. The UI chat bubble never renders the output: it appears to buffer the full <think>...</think> trace before displaying anything, and whatever render step should fire once the trace completes never does.

Workaround (works immediately): prefix your prompt with /no_think — this suppresses the reasoning trace and the response renders instantly in the UI.
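The buffering behaviour described above can be reproduced with a minimal sketch. The Space's actual handler is inside the minified index-CgnC9ASC.js bundle, so the function and callback names here are mine and the shape of the logic is an assumption, not the real code:

```javascript
// Hypothetical sketch of the suspected UI-side buffering: tokens are
// accumulated until the closing </think> tag appears, then everything
// after the trace is flushed. If the flush condition never fires, the
// bubble stays empty even though the worker finished generating.
function makeBufferedRenderer(pushToBubble) {
  let buffer = "";
  let pastTrace = false;
  return function onToken(token) {
    if (pastTrace) {
      pushToBubble(token); // stream normally once the trace is done
      return;
    }
    buffer += token;
    const end = buffer.indexOf("</think>");
    if (end !== -1) {
      pastTrace = true;
      pushToBubble(buffer.slice(end + "</think>".length));
    }
  };
}
```

With /no_think the trace never starts, so a renderer like this would flush from the first token, which matches the observed behaviour of the workaround.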


Safari issue

macOS Sequoia (15) Safari throws Error: Unsupported device: "webgpu". Should be one of: wasm immediately on Load Model. WebGPU in Safari requires macOS Tahoe 26. Use Chrome on current macOS.
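For anyone scripting against the model directly rather than through the Space UI, a defensive device pick avoids the hard error. This is a sketch under my own naming (pickDevice is not a library function); navigator.gpu is the standard WebGPU entry point, and Safari on macOS 15 does not expose it, which is what triggers the error above:

```javascript
// Return "webgpu" when the browser exposes the WebGPU API, else fall
// back to "wasm". Takes the navigator object as a parameter so the
// logic is testable outside a browser.
function pickDevice(nav) {
  return nav && "gpu" in nav && nav.gpu ? "webgpu" : "wasm";
}

// Usage in the browser: const device = pickDevice(navigator);
// then pass the resulting device string to the model loader.
```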


Suggested UI fix

The Space needs to either:

  1. Stream tokens directly to the chat bubble as they arrive rather than buffering post-</think>, or
  2. Strip the <think>...</think> block before rendering the chat bubble

Option 2 is closest to a one-liner: strip everything up to and including the closing </think> tag before pushing to the UI. Better still, render both sections and make the thinking trace collapsible.
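Option 2 can be sketched as follows (the helper name is mine, not the Space's actual code):

```javascript
// Remove a complete <think>...</think> block from the assembled output
// before pushing it to the chat bubble. The "s" flag lets "." match the
// newlines inside the reasoning trace; trailing whitespace after the
// tag is trimmed too.
function stripThink(text) {
  return text.replace(/<think>.*?<\/think>\s*/s, "");
}
```

Text with no trace passes through unchanged, so the same code path works for /no_think responses.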


For the model card README

Happy to contribute a full README with these benchmarks, browser compat table, known issues, and usage examples. The onnx-community model card is currently empty and this model is getting downloads. Let me know if a PR is the right path or if the webml team will author it directly.

— John Williams / fxops.ai / John-Williams-ATL
