Run — WebGPU Bench

How to use this page — quick start, run modes, and configuration knobs

Quick start

The three cards above show your device, its capabilities, and the safe model budget (largest variant that should fit without exhausting GPU/WASM memory).
Pick variants in the Models panel below — or use the Quick set / All fit selection buttons.
Click Run study for a curated comparison set, Run benchmarks for whatever you've checked, or Download selected to just cache files for later.

Run modes

Run study: Curated, opinionated set: Llama-3.2-1B-Instruct at Q2_K / Q4_K_M / Q8_0 / F16, plus every other model at Q4_K_M, filtered to what fits this device. Use this for apples-to-apples comparison across machines.
Run benchmarks: Runs whatever variants you've explicitly checked in the Models panel. Use this when you want to drill into a specific model or quant family.
Download selected: Caches the checked variants to OPFS without running them. Useful for pre-warming before a flight or running offline later.
Abort: Cancels the in-flight worker AND any concurrent download. Aborted variants get an "aborted" status; subsequent variants do not run.
Purge OPFS cache: Wipes every cached GGUF from browser storage. Only shown on hosted surfaces; helpful when the device is low on disk.
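
The Run study selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the page's actual code: the `Variant` shape, the `selectStudySet` function, and the `budgetBytes` parameter are hypothetical names, and the sketch assumes "fits this device" means the file size is at or under the device budget.

```typescript
// Hypothetical catalog entry; the page's real data model is not shown.
interface Variant {
  model: string;     // e.g. "Llama-3.2-1B-Instruct"
  quant: string;     // e.g. "Q4_K_M"
  sizeBytes: number; // GGUF file size in bytes
}

const STUDY_MODEL = "Llama-3.2-1B-Instruct";
const STUDY_QUANTS = new Set(["Q2_K", "Q4_K_M", "Q8_0", "F16"]);

// Keep the curated study variants, then drop anything over the device budget.
// Assumption: "fits" is a simple size comparison against the budget.
function selectStudySet(catalog: Variant[], budgetBytes: number): Variant[] {
  return catalog.filter((v) => {
    const curated =
      v.model === STUDY_MODEL
        ? STUDY_QUANTS.has(v.quant) // reference model: four quant levels
        : v.quant === "Q4_K_M";     // every other model: Q4_K_M only
    return curated && v.sizeBytes <= budgetBytes;
  });
}
```

Because the curated list is fixed and only the budget filter varies, two machines with similar budgets end up running the same variants, which is what makes the numbers comparable.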

What each run does

By default each variant runs the GPU perf pass only. The CPU baseline is opt-in via the Run toggles below — flip them on if you want consistency or CPU-vs-GPU comparison numbers.

GPU pass (always): Runs a llama-bench-style perf sweep: one warmup followed by the configured number of timed reps (Reps, -r) for both prompt processing (pp) and token generation (tg). A 1-second cooldown between reps lets the GPU clock recover, so reps stay comparable instead of decaying across the sweep.
CPU baseline (opt-in): When Run: Consistency or Run: CPU perf is on, the CPU pass runs first to capture reference token IDs (for the GPU forced-decode agreement check) and/or a 1-rep CPU perf sample (for the CPU vs GPU dashboard cells).
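
The warmup, timed reps, and cooldown in the GPU pass can be pictured with a loop like the one below. This is a sketch under stated assumptions, not the benchmark's implementation: `timedSweep`, `runOnce`, `sleep`, and the injectable `now` clock are hypothetical names, and the cooldown is applied only between timed reps, as the description suggests.

```typescript
// Hypothetical helper: resolves after `ms` milliseconds.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs one untimed warmup, then `reps` timed repetitions with a cooldown
// between them. `runOnce` is assumed to process `tokens` tokens per call.
// Returns one tokens-per-second sample per timed rep.
async function timedSweep(
  runOnce: () => Promise<void>,
  tokens: number,
  reps: number,
  cooldownMs: number = 1000,
  now: () => number = () => performance.now(),
): Promise<number[]> {
  await runOnce(); // warmup: result discarded
  const tokPerSec: number[] = [];
  for (let i = 0; i < reps; i++) {
    if (i > 0) await sleep(cooldownMs); // let GPU clocks recover between reps
    const start = now();
    await runOnce();
    const elapsedSec = (now() - start) / 1000;
    tokPerSec.push(tokens / elapsedSec);
  }
  return tokPerSec;
}
```

The clock is passed in as a parameter only so the loop can be exercised with a fake timer; in a browser the default `performance.now()` would be used.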

Configuration

Search / Hide: Filter the Models panel. Hide toggles drop UD (Unsloth dynamic), IQ (i-quants), and BF16/F16 high-precision variants — useful when you only care about the "standard" K-quant lineup.
Select: Quick set / All fit / None: Quick set = the same variants Run study uses. All fit = every variant under your device budget. None = clear the selection.
Prompt tokens (-p): How many synthetic tokens go through prompt processing for the pp test. Default 512. Larger values make the test more compute-bound.
Gen tokens (-n): How many tokens are generated for the tg test. Default 128. Larger values stress sustained decode bandwidth.
Reps (-r): Timed repetitions of each test (after warmup). Default 5. The reported figure is mean ± stddev across the reps.
Run: Consistency: Off by default. Turns on the CPU-baseline + GPU-forced-decode agreement check. Opt in when you want to verify CPU and GPU produce matching tokens for a given variant.
Run: CPU perf: Off by default. Turns on a 1-rep CPU perf baseline alongside the GPU pass so the dashboard's CPU vs GPU cells have data. Opt in when you want apples-to-apples CPU↔GPU numbers; leave off to keep runs short.
Evict cached GGUFs after each run: Frees OPFS storage as the queue moves through variants. Only evicts files this session downloaded — files you cached before the run are preserved.
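
For the Run: Consistency toggle, the agreement check can be thought of as a position-by-position comparison of token IDs. The sketch below is an assumption about how such a score might be computed, not the page's actual metric: `tokenAgreement` is a hypothetical name, and it treats any length mismatch between the two sequences as disagreement.

```typescript
// Fraction of positions where the GPU token ID matches the CPU reference.
// Assumption: positions present in only one sequence count as mismatches,
// and an empty comparison scores 0 because nothing was verified.
function tokenAgreement(cpuIds: number[], gpuIds: number[]): number {
  const total = Math.max(cpuIds.length, gpuIds.length);
  if (total === 0) return 0;
  let matches = 0;
  for (let i = 0; i < total; i++) {
    if (cpuIds[i] !== undefined && cpuIds[i] === gpuIds[i]) matches++;
  }
  return matches / total;
}
```

A score of 1 means the GPU reproduced every reference token. Lower values point at positions worth inspecting for that variant.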

Reading the output

The Progress table updates per-variant with live status (download → cpu → gpu → done). Final per-variant numbers fill the pp tok/s and tg tok/s columns as mean ± stddev. The Output textarea below contains the full JSON record — copy it, download it, or (on the HF Space) sign in to submit to the leaderboard.
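
The mean ± stddev figures in the pp tok/s and tg tok/s columns can be reproduced from per-rep samples as sketched below. The names `summarize` and `formatCell` are hypothetical, and the sketch assumes the sample standard deviation (dividing by n - 1). The page may use a different convention or rounding.

```typescript
// Mean and sample standard deviation of per-rep tok/s samples.
// Assumption: sample stddev (n - 1 denominator); a single rep reports 0.
function summarize(samples: number[]): { mean: number; stddev: number } {
  const n = samples.length;
  if (n === 0) throw new Error("no samples");
  const mean = samples.reduce((acc, x) => acc + x, 0) / n;
  if (n === 1) return { mean, stddev: 0 };
  const sumSq = samples.reduce((acc, x) => acc + (x - mean) * (x - mean), 0);
  return { mean, stddev: Math.sqrt(sumSq / (n - 1)) };
}

// Formats a table cell such as "12.00 ± 2.00".
function formatCell(samples: number[]): string {
  const { mean, stddev } = summarize(samples);
  return `${mean.toFixed(2)} ± ${stddev.toFixed(2)}`;
}
```

A large stddev relative to the mean usually means the reps were not comparable, for example because of thermal throttling, and the run is worth repeating.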

Known issues

Safari Private Browsing: Don't use it. Private mode caps OPFS storage at a few hundred MB per tab, denies navigator.storage.persist(), and routes the file system through an ephemeral backend with stricter sync-handle limits. Downloads fail mid-stream with "operation failed for an unknown transient reason" and retries don't help — the cause isn't transient, it's the browsing mode. Switch to a regular Safari window.
Mobile tabs (iOS/Android): iOS Jetsam and Android's low-memory killer reap tabs that approach the per-process memory ceiling, often silently reloading the page. The Run page applies tighter budgets and forced eviction on mobile to mitigate this, but variants near the budget edge can still crash the tab. Run from a desktop for stable numbers.
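
Storage caps like the ones above can sometimes be spotted before a download starts by comparing the file size against the browser's storage estimate. The following is a sketch, not something the Run page is known to do: `hasRoomFor`, `QuotaEstimate`, and the 64 MiB margin are illustrative assumptions. `navigator.storage.estimate()` is a standard browser API, but its numbers are approximate.

```typescript
// Minimal shape of the object navigator.storage.estimate() resolves to.
interface QuotaEstimate {
  quota?: number; // approximate bytes available to this origin
  usage?: number; // approximate bytes already used
}

// True when the estimated free space covers `neededBytes` plus a margin.
// Assumption: a missing quota or usage value is treated as "not enough room".
function hasRoomFor(
  est: QuotaEstimate,
  neededBytes: number,
  marginBytes: number = 64 * 1024 * 1024, // arbitrary safety margin
): boolean {
  if (est.quota === undefined || est.usage === undefined) return false;
  return est.quota - est.usage >= neededBytes + marginBytes;
}

// Browser usage (hypothetical variable name, not run here):
//   const est = await navigator.storage.estimate();
//   if (!hasRoomFor(est, ggufSizeBytes)) { /* warn or skip this variant */ }
```

A check like this cannot rescue a private-browsing session, but it can turn a mid-stream failure into an early, explicit warning.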

Your machine — labels the auto-detected device data on submission. Saved between visits.

Fields: Machine name, GPU name (optional), Browser, Operating system

Required fields marked *. Defaults are filled in from your browser; edit anything that's wrong before running.
