M4 Mac mini benchmark + UI freeze diagnosis (reasoning trace buffering)
Ran a structured diagnostic session on the Space today (March 20, 2026) on an Apple M4 Mac mini with 24GB unified memory. Sharing findings because the model card is currently empty and other Mac users will hit the same things.
Hardware / browser
- Apple M4 Mac mini, 24GB unified memory
- macOS Sequoia 15
- Chrome 123 (WebGPU enabled)
- No competing GPU processes during clean benchmark run
Benchmark result
~25 tok/s on WebGPU, uncontested GPU.
JS heap stays flat at ~16–17MB throughout — correct behaviour, model weights live in GPU memory outside the JS heap. Not a memory leak.
The freeze — diagnosed
The model is not crashing. Here is what is actually happening.
Switching to the iframe's console context (webml-community-nemotron-3-nano-webgpu.static.hf.space) and running the JSON chart prompt revealed Streamed output: entries arriving continuously and correctly from index-CgnC9ASC.js:35. The full reasoning trace is followed by the correct final answer:
Streamed output: <think>
... [full reasoning trace] ...
Streamed output: </think>
Streamed output: {
Streamed output: "labels": ["January", "February", "March", "April", "May"],
Streamed output: "values": [500, 600, 550, 700, 800],
Streamed output: "title": "Monthly Sales Performance"
Streamed output: }
Streamed output: <|im_end|>
The worker is completing generation successfully. The UI chat bubble never renders the output because it appears to buffer the full <think>...</think> trace before displaying, and whatever render step should fire once the trace completes never does.
Workaround (works immediately): prefix your prompt with /no_think — this suppresses the reasoning trace and the response renders instantly in the UI.
Safari issue
macOS Sequoia (15) Safari throws Error: Unsupported device: "webgpu". Should be one of: wasm immediately on Load Model. WebGPU in Safari requires macOS Tahoe 26. Use Chrome on current macOS.
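Rather than letting the Safari error surface at Load Model time, the Space could feature-detect up front. A minimal sketch of that check (the function name and the idea of injecting a navigator-like object are my assumptions, not the Space's actual code):

```javascript
// Pick "webgpu" only when the browser exposes a usable adapter,
// otherwise fall back to "wasm". `nav` is injected for testability;
// in the browser you would pass the real `navigator`.
async function pickDevice(nav) {
  if (nav && nav.gpu) {
    try {
      // requestAdapter() resolves to null when WebGPU is advertised
      // by the browser but no usable adapter is available.
      const adapter = await nav.gpu.requestAdapter();
      if (adapter) return "webgpu";
    } catch {
      // fall through to wasm
    }
  }
  return "wasm";
}
```

In the browser this would be `const device = await pickDevice(navigator);`, and Safari on Sequoia (no `navigator.gpu`) would silently get the wasm path instead of throwing.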
Suggested UI fix
The Space needs to either:
- Stream tokens directly into the chat bubble as they arrive rather than buffering until after </think> closes, or
- Strip the <think>...</think> block before rendering the chat bubble
Option 2 is the one-liner fix: watch the streamed output for the closing </think> tag and only push what follows it to the UI. Better still, render both sections, with the thinking trace collapsible.
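The Option 2 filter can be sketched as follows (function name and stream shape are my assumptions, not the Space's actual code). It also works on partial streams: while </think> has not arrived yet, the answer part stays empty, so nothing is pushed to the bubble prematurely.

```javascript
// Split the accumulated stream into the reasoning trace and the
// visible answer. `buffered` is everything streamed so far.
function splitThinking(buffered) {
  const close = "</think>";
  const end = buffered.indexOf(close);
  if (end === -1) {
    // Either still inside the reasoning trace, or (/no_think) no trace at all.
    const hasTrace = buffered.includes("<think>");
    return {
      thinking: hasTrace ? buffered : "",
      answer: hasTrace ? "" : buffered,
    };
  }
  return {
    thinking: buffered.slice(0, end + close.length),
    // Drop the end-of-turn marker before display.
    answer: buffered.slice(end + close.length).replace("<|im_end|>", "").trim(),
  };
}
```

On each streamed chunk: append it to the buffer, call splitThinking, render answer in the bubble, and optionally render thinking inside a collapsible element such as `<details>`.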
For the model card README
Happy to contribute a full README with these benchmarks, browser compat table, known issues, and usage examples. The onnx-community model card is currently empty and this model is getting downloads. Let me know if a PR is the right path or if the webml team will author it directly.
— John Williams / fxops.ai / John-Williams-ATL