M4 Mac mini benchmark + UI freeze diagnosis (reasoning trace buffering)
Ran a structured diagnostic session on the Space today (March 20, 2026) on an Apple M4 Mac mini with 24GB unified memory. Sharing findings because the model card is currently empty and other Mac users will hit the same things.
Hardware / browser
- Apple M4 Mac mini, 24GB unified memory
- macOS Sequoia 15
- Chrome 123 (WebGPU enabled)
- No competing GPU processes during clean benchmark run
Benchmark result
~25 tok/s on WebGPU, uncontested GPU.
JS heap stays flat at ~16–17MB throughout — correct behaviour, model weights live in GPU memory outside the JS heap. Not a memory leak.
The freeze — diagnosed
The model is not crashing. Here is what is actually happening.
Switching to the iframe's console context (webml-community-nemotron-3-nano-webgpu.static.hf.space) and running the JSON chart prompt revealed Streamed output: entries arriving continuously and correctly from index-CgnC9ASC.js:35. The full reasoning trace is followed by the correct final answer:
Streamed output: <think>
... [full reasoning trace] ...
Streamed output: </think>
Streamed output: {
Streamed output: "labels": ["January", "February", "March", "April", "May"],
Streamed output: "values": [500, 600, 550, 700, 800],
Streamed output: "title": "Monthly Sales Performance"
Streamed output: }
Streamed output: <|im_end|>
The worker is completing generation successfully. The UI chat bubble never renders the output because it appears to buffer the full <think>...</think> trace before displaying, and whatever render step should fire once the trace completes never does.
Workaround (works immediately): prefix your prompt with /no_think — this suppresses the reasoning trace and the response renders instantly in the UI.
Safari issue
macOS Sequoia (15) Safari throws Error: Unsupported device: "webgpu". Should be one of: wasm immediately on Load Model. WebGPU in Safari requires macOS Tahoe 26. Use Chrome on current macOS.
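Rather than letting the Safari error surface at Load Model time, the Space could feature-detect up front. A minimal sketch of that check (the function name and the idea of injecting a navigator-like object are my assumptions, not the Space's actual code):

```javascript
// Pick "webgpu" only when the browser exposes a usable adapter,
// otherwise fall back to "wasm". `nav` is injected for testability;
// in the browser you would pass the real `navigator`.
async function pickDevice(nav) {
  if (nav && nav.gpu) {
    try {
      // requestAdapter() resolves to null when WebGPU is advertised
      // by the browser but no usable adapter is available.
      const adapter = await nav.gpu.requestAdapter();
      if (adapter) return "webgpu";
    } catch {
      // fall through to wasm
    }
  }
  return "wasm";
}
```

In the browser this would be `const device = await pickDevice(navigator);`, and Safari on Sequoia (no `navigator.gpu`) would silently get the wasm path instead of throwing.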
Suggested UI fix
The Space needs to either:
- Stream tokens directly into the chat bubble as they arrive rather than buffering until after </think> closes, or
- Strip the <think>...</think> block before rendering the chat bubble
Option 2 is the one-liner fix: watch the streamed output for the closing </think> tag and only push what follows it to the UI. Better still, render both sections, with the thinking trace collapsible.
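The Option 2 filter can be sketched as follows (function name and stream shape are my assumptions, not the Space's actual code). It also works on partial streams: while </think> has not arrived yet, the answer part stays empty, so nothing is pushed to the bubble prematurely.

```javascript
// Split the accumulated stream into the reasoning trace and the
// visible answer. `buffered` is everything streamed so far.
function splitThinking(buffered) {
  const close = "</think>";
  const end = buffered.indexOf(close);
  if (end === -1) {
    // Either still inside the reasoning trace, or (/no_think) no trace at all.
    const hasTrace = buffered.includes("<think>");
    return {
      thinking: hasTrace ? buffered : "",
      answer: hasTrace ? "" : buffered,
    };
  }
  return {
    thinking: buffered.slice(0, end + close.length),
    // Drop the end-of-turn marker before display.
    answer: buffered.slice(end + close.length).replace("<|im_end|>", "").trim(),
  };
}
```

On each streamed chunk: append it to the buffer, call splitThinking, render answer in the bubble, and optionally render thinking inside a collapsible element such as `<details>`.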
For the model card README
Happy to contribute a full README with these benchmarks, browser compat table, known issues, and usage examples. The onnx-community model card is currently empty and this model is getting downloads. Let me know if a PR is the right path or if the webml team will author it directly.
— John Williams / fxops.ai / John-Williams-ATL