WebGPU Bench

Run a benchmark

How to use this page — quick start, run modes, and configuration knobs

Quick start

  1. The three cards above show your device, its capabilities, and the safe model budget (largest variant that should fit without exhausting GPU/WASM memory).
  2. Pick variants in the Models panel below — or use the Quick set / All fit selection buttons.
  3. Click Run study for a curated comparison set, Run benchmarks for whatever you've checked, or Download selected to just cache files for later.

Run modes

Run study
Curated, opinionated set: Llama-3.2-1B-Instruct at Q2_K / Q4_K_M / Q8_0 / F16, plus every other model at Q4_K_M, filtered to what fits this device. Use this for apples-to-apples comparison across machines.
Run benchmarks
Runs whatever variants you've explicitly checked in the Models panel. Use this when you want to drill into a specific model or quant family.
Download selected
Caches the checked variants to OPFS without running them. Useful for pre-warming before a flight or running offline later.
Abort
Cancels the in-flight worker AND any concurrent download. Aborted variants get an "aborted" status; subsequent variants do not run.
Purge OPFS cache
Wipes every cached GGUF from browser storage. Only shown on hosted surfaces; helpful when the device is low on disk.
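
The Run study selection rule described above can be sketched as a single filter. This is only an illustration: the `Variant` shape, the `selectStudySet` name, and the byte-based budget are assumptions, since the page's actual data model is not shown here.

```typescript
// Hypothetical shape for one downloadable variant (not the page's real type).
interface Variant {
  model: string;     // e.g. "Llama-3.2-1B-Instruct"
  quant: string;     // e.g. "Q4_K_M"
  sizeBytes: number; // size of the GGUF file
}

const ANCHOR_MODEL = "Llama-3.2-1B-Instruct";
const ANCHOR_QUANTS = new Set(["Q2_K", "Q4_K_M", "Q8_0", "F16"]);
const DEFAULT_QUANT = "Q4_K_M";

// Keep the anchor model at its four quants, every other model at Q4_K_M,
// then drop anything larger than the device budget.
function selectStudySet(all: Variant[], budgetBytes: number): Variant[] {
  return all.filter((v) => {
    const wanted =
      v.model === ANCHOR_MODEL
        ? ANCHOR_QUANTS.has(v.quant)
        : v.quant === DEFAULT_QUANT;
    return wanted && v.sizeBytes <= budgetBytes;
  });
}
```

Because the rule depends only on the catalog and the budget, two machines with the same budget would pick the same set, which is what makes the study comparable across devices.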

What each run does

By default each variant runs the GPU perf pass only. The CPU baseline is opt-in via the Run toggles below — flip them on if you want consistency or CPU-vs-GPU comparison numbers.

  1. GPU pass (always): runs a llama-bench-style perf sweep of one warmup plus N timed reps (N is the Reps setting) for both prompt processing (pp) and token generation (tg). A 1-second cooldown between reps lets the GPU clock recover, so reps stay comparable instead of decaying across the sweep.
  2. CPU baseline (opt-in): when Run: Consistency or Run: CPU perf is on, the CPU pass runs first to capture reference token IDs (for the GPU forced-decode agreement check), a 1-rep CPU perf sample (for the dashboard's CPU vs GPU cells), or both.
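
The shape of the GPU sweep (one warmup, then timed reps separated by a cooldown) can be sketched as below. This is a minimal illustration, not the page's implementation: `planSweep`, `runSweep`, and the `runOnce` callback are hypothetical names, and placing the cooldown only between timed reps (not after the warmup) is an assumption.

```typescript
type Step =
  | { kind: "warmup" }
  | { kind: "rep"; index: number }
  | { kind: "cooldown"; ms: number };

// Build the ordered list of steps for one test (pp or tg).
function planSweep(reps: number, cooldownMs = 1000): Step[] {
  const steps: Step[] = [{ kind: "warmup" }];
  for (let i = 0; i < reps; i++) {
    // Assumption: cooldown sits between timed reps only.
    if (i > 0) steps.push({ kind: "cooldown", ms: cooldownMs });
    steps.push({ kind: "rep", index: i });
  }
  return steps;
}

// Execute the plan. `runOnce` is a hypothetical callback that resolves to
// the tokens/second measured for a single pass.
async function runSweep(
  runOnce: () => Promise<number>,
  reps: number,
  cooldownMs = 1000,
): Promise<number[]> {
  const samples: number[] = [];
  for (const step of planSweep(reps, cooldownMs)) {
    if (step.kind === "warmup") {
      await runOnce(); // warmup result is discarded
    } else if (step.kind === "cooldown") {
      await new Promise<void>((resolve) => setTimeout(resolve, step.ms));
    } else {
      samples.push(await runOnce());
    }
  }
  return samples;
}
```

Separating the plan from the execution keeps the ordering easy to inspect: with the default 5 reps the plan has one warmup, five timed reps, and four cooldowns.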

Configuration

Search / Hide
Filter the Models panel. Hide toggles drop UD (Unsloth dynamic), IQ (i-quants), and BF16/F16 high-precision variants — useful when you only care about the "standard" K-quant lineup.
Select: Quick set / All fit / None
Quick set = the same variants Run study uses. All fit = every variant under your device budget. None = clear the selection.
Prompt tokens (-p)
How many synthetic tokens go through prompt processing for the pp test. Default 512. Larger values make the test more compute-bound.
Gen tokens (-n)
How many tokens are generated for the tg test. Default 128. Larger values stress sustained decode bandwidth.
Reps (-r)
Timed repetitions of each test (after warmup). Default 5. The reported figure is mean ± stddev across the reps.
Run: Consistency
Off by default. Turns on the CPU-baseline + GPU-forced-decode agreement check. Opt in when you want to verify CPU and GPU produce matching tokens for a given variant.
Run: CPU perf
Off by default. Turns on a 1-rep CPU perf baseline alongside the GPU pass so the dashboard's CPU vs GPU cells have data. Opt in when you want apples-to-apples CPU↔GPU numbers; leave off to keep runs short.
Evict cached GGUFs after each run
Frees OPFS storage as the queue moves through variants. Only evicts files this session downloaded — files you cached before the run are preserved.
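
The eviction rule above (remove only what this session downloaded) could be tracked with a simple set of file names. The sketch below is an assumption about how such bookkeeping might look; `SessionCacheTracker`, `evictIfOwned`, and `DirLike` are hypothetical names. `DirLike` mirrors the `removeEntry` method of the browser's `FileSystemDirectoryHandle`.

```typescript
// Remembers which cached files were downloaded during this session.
class SessionCacheTracker {
  private downloadedThisSession = new Set<string>();

  markDownloaded(fileName: string): void {
    this.downloadedThisSession.add(fileName);
  }

  // Files cached before the session started are never marked, so they
  // are never eligible for eviction.
  shouldEvict(fileName: string): boolean {
    return this.downloadedThisSession.has(fileName);
  }

  forget(fileName: string): void {
    this.downloadedThisSession.delete(fileName);
  }
}

// Minimal slice of FileSystemDirectoryHandle used by this sketch.
interface DirLike {
  removeEntry(name: string): Promise<void>;
}

// Remove the file only if this session downloaded it; report what happened.
async function evictIfOwned(
  dir: DirLike,
  tracker: SessionCacheTracker,
  fileName: string,
): Promise<boolean> {
  if (!tracker.shouldEvict(fileName)) return false;
  await dir.removeEntry(fileName);
  tracker.forget(fileName);
  return true;
}
```

In a browser, the directory handle would typically come from `await navigator.storage.getDirectory()`, which returns the OPFS root.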

Reading the output

The Progress table updates per-variant with live status (download → cpu → gpu → done). Final per-variant numbers fill the pp tok/s and tg tok/s columns as mean ± stddev. The Output textarea below contains the full JSON record — copy it, download it, or (on the HF Space) sign in to submit to the leaderboard.
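
For reference, a mean ± stddev figure can be derived from the per-rep samples as sketched below. The helper names are hypothetical, and the use of the sample standard deviation (dividing by n - 1) and two decimal places are assumptions rather than something this page specifies.

```typescript
// Mean and sample standard deviation of the timed reps.
function meanStddev(samples: number[]): { mean: number; stddev: number } {
  const n = samples.length;
  if (n === 0) throw new Error("need at least one sample");
  const mean = samples.reduce((acc, x) => acc + x, 0) / n;
  if (n === 1) return { mean, stddev: 0 }; // spread is undefined for one rep
  const sumSq = samples.reduce((acc, x) => acc + (x - mean) ** 2, 0);
  return { mean, stddev: Math.sqrt(sumSq / (n - 1)) };
}

// Render the figure the way a tok/s column might show it.
function formatTokPerSec(samples: number[]): string {
  const { mean, stddev } = meanStddev(samples);
  return `${mean.toFixed(2)} ± ${stddev.toFixed(2)}`;
}
```

A large stddev relative to the mean usually means the reps were not comparable (for example thermal throttling), so the mean alone should be read with caution.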

Known issues

Safari Private Browsing
Don't use it. Private mode caps OPFS storage at a few hundred MB per tab, denies navigator.storage.persist(), and routes the file system through an ephemeral backend with stricter sync-handle limits. Downloads fail mid-stream with "operation failed for an unknown transient reason" and retries don't help — the cause isn't transient, it's the browsing mode. Switch to a regular Safari window.
Mobile tabs (iOS/Android)
iOS Jetsam and Android's low-memory killer reap tabs that approach the per-process memory ceiling, often silently reloading the page. The Run page applies tighter budgets and forced-eviction on mobile to mitigate, but variants near the budget edge can still crash the tab. Run from a desktop for stable numbers.
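
Storage limits like the ones described above can often be detected before a download starts. The sketch below is one possible pre-flight check, not something this page is known to do: `fitsInQuota`, `preflight`, and `StorageLike` are hypothetical names. `StorageLike` mirrors the `estimate()` and `persist()` methods of the browser's `StorageManager` (`navigator.storage`).

```typescript
interface EstimateLike {
  quota?: number; // bytes the origin may use
  usage?: number; // bytes already used
}

// Minimal slice of StorageManager used by this sketch.
interface StorageLike {
  estimate(): Promise<EstimateLike>;
  persist(): Promise<boolean>;
}

// True only when the reported free space covers the download.
function fitsInQuota(est: EstimateLike, neededBytes: number): boolean {
  // If the browser reports nothing, treat the download as unsafe.
  if (est.quota === undefined || est.usage === undefined) return false;
  return est.quota - est.usage >= neededBytes;
}

// Ask the browser for its quota and for persistent storage up front.
async function preflight(
  storage: StorageLike,
  neededBytes: number,
): Promise<{ fits: boolean; persisted: boolean }> {
  const est = await storage.estimate();
  const persisted = await storage.persist();
  return { fits: fitsInQuota(est, neededBytes), persisted };
}
```

In a browser this would be called as `preflight(navigator.storage, sizeInBytes)`. A result of `fits: false` with `persisted: false` is consistent with a restricted mode such as private browsing, though it does not prove it.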
Device: Browser, Platform, GPU
Capability: deviceMemory, WebGPU, llama.cpp
Model budget: Max size
Your machine — labels the auto-detected device data on submission. Saved between visits.

Required fields marked *. Defaults are filled in from your browser; edit anything that's wrong before running.

Progress

Output

Run log

        
WASM errors