Run a benchmark
…How to use this page — quick start, run modes, and configuration knobs
Quick start
- The three cards above show your device, its capabilities, and the safe model budget (largest variant that should fit without exhausting GPU/WASM memory).
- Pick variants in the Models panel below — or use the Quick set / All fit selection buttons.
- Click Run study for a curated comparison set, Run benchmarks for whatever you've checked, or Download selected to just cache files for later.
Run modes
- Run study
- Curated, opinionated set: Llama-3.2-1B-Instruct at Q2_K/Q4_K_M/Q8_0/F16, plus every other model at Q4_K_M, filtered to what fits this device. Use this for apples-to-apples comparison across machines.
- Run benchmarks
- Runs whatever variants you've explicitly checked in the Models panel. Use this when you want to drill into a specific model or quant family.
- Download selected
- Caches the checked variants to OPFS without running them. Useful for pre-warming before a flight or running offline later.
- Abort
- Cancels the in-flight worker AND any concurrent download. Aborted variants get an "aborted" status; subsequent variants do not run.
- Purge OPFS cache
- Wipes every cached GGUF from browser storage. Only shown on hosted surfaces; helpful when the device is low on disk.
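The Run study selection rule described above can be sketched as a small filter. This is a minimal illustration, not the page's actual code: the Variant shape, the selectStudySet name, and the plain "file size <= budget" fit test are all assumptions (the real budget check may account for extra runtime overhead).

```typescript
// Hypothetical variant record; the field names are assumptions.
interface Variant {
  model: string;     // e.g. "Llama-3.2-1B-Instruct"
  quant: string;     // e.g. "Q4_K_M"
  sizeBytes: number; // size of the GGUF file
}

// The anchor model is compared across four quants; everything else at one.
const ANCHOR_MODEL = "Llama-3.2-1B-Instruct";
const ANCHOR_QUANTS = ["Q2_K", "Q4_K_M", "Q8_0", "F16"];
const SHARED_QUANT = "Q4_K_M";

// Keep the curated variants, then drop anything over the device budget.
// Assumption: "fits" is approximated as file size <= budget.
function selectStudySet(variants: Variant[], budgetBytes: number): Variant[] {
  return variants.filter((v) => {
    const curated =
      v.model === ANCHOR_MODEL
        ? ANCHOR_QUANTS.indexOf(v.quant) !== -1
        : v.quant === SHARED_QUANT;
    return curated && v.sizeBytes <= budgetBytes;
  });
}
```

Because the set is fixed by rule rather than by what the user checked, two machines with the same budget end up running the same variants, which is what makes the results comparable.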
What each run does
By default each variant runs the GPU perf pass only. The CPU baseline is opt-in via the Run toggles below — flip them on if you want consistency or CPU-vs-GPU comparison numbers.
- GPU pass (always): runs a llama-bench-style perf sweep of one warmup plus reps timed repetitions (reps is the Reps (-r) setting) for both prompt processing (pp) and token generation (tg). A 1-second cooldown between reps lets the GPU clock recover, so reps stay comparable instead of decaying across the sweep.
- CPU baseline (opt-in): when Run: Consistency or Run: CPU perf is on, the CPU pass runs first to capture reference token IDs (for the GPU forced-decode agreement check) and/or a 1-rep CPU perf sample (for the CPU vs GPU dashboard cells).
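One way to picture the agreement check is as a position-by-position comparison of token IDs. The sketch below assumes that forced decode means the GPU is fed the CPU reference tokens and its top pick at each position is recorded; the tokenAgreement name and the Agreement fields are hypothetical, and the page may score agreement differently.

```typescript
// Hypothetical result shape for the agreement check.
interface Agreement {
  matched: number;       // positions where both token IDs are equal
  total: number;         // positions compared
  rate: number;          // matched / total, or 0 when nothing was compared
  firstMismatch: number; // index of the first disagreement, -1 if none
}

// Compare CPU reference token IDs against the GPU's picks, position by position.
// Only the overlapping prefix is compared if the lengths differ.
function tokenAgreement(cpuIds: number[], gpuIds: number[]): Agreement {
  const total = Math.min(cpuIds.length, gpuIds.length);
  let matched = 0;
  let firstMismatch = -1;
  for (let i = 0; i < total; i++) {
    if (cpuIds[i] === gpuIds[i]) {
      matched++;
    } else if (firstMismatch === -1) {
      firstMismatch = i;
    }
  }
  const rate = total === 0 ? 0 : matched / total;
  return { matched, total, rate, firstMismatch };
}
```

Reporting the first mismatch index alongside the rate helps distinguish a single late divergence from a variant that disagrees from the first token.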
Configuration
- Search / Hide
- Filter the Models panel. Hide toggles drop UD (Unsloth dynamic), IQ (i-quants), and BF16/F16 high-precision variants — useful when you only care about the "standard" K-quant lineup.
- Select: Quick set / All fit / None
- Quick set = the same variants Run study uses. All fit = every variant under your device budget. None = clear the selection.
- Prompt tokens (-p)
- How many synthetic tokens go through prompt processing for the pp test. Default 512. Larger = more compute-bound.
- Gen tokens (-n)
- How many tokens are generated for the tg test. Default 128. Larger values stress sustained decode bandwidth.
- Reps (-r)
- Timed repetitions of each test (after warmup). Default 5. The reported figure is mean ± stddev across the reps.
- Run: Consistency
- Off by default. Turns on the CPU-baseline + GPU-forced-decode agreement check. Opt in when you want to verify CPU and GPU produce matching tokens for a given variant.
- Run: CPU perf
- Off by default. Turns on a 1-rep CPU perf baseline alongside the GPU pass so the dashboard's CPU vs GPU cells have data. Opt in when you want apples-to-apples CPU↔GPU numbers; leave off to keep runs short.
- Evict cached GGUFs after each run
- Frees OPFS storage as the queue moves through variants. Only evicts files this session downloaded — files you cached before the run are preserved.
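The eviction rule (only files this session downloaded are removed) amounts to intersecting two lists. A minimal sketch, assuming files are identified by name; filesToEvict and both parameter names are hypothetical.

```typescript
// Return the files that may be evicted after a run: those the run used AND
// that this session downloaded. Files cached before the session are kept.
function filesToEvict(
  usedByRun: string[],
  downloadedThisSession: string[],
): string[] {
  return usedByRun.filter(
    (name) => downloadedThisSession.indexOf(name) !== -1,
  );
}
```

In a browser, the actual deletion would plausibly go through the OPFS root from navigator.storage.getDirectory() and FileSystemDirectoryHandle.removeEntry(name); that wiring is omitted here because it depends on how the page lays out its cache.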
Reading the output
The Progress table updates per-variant with live status (download → cpu → gpu → done). Final per-variant numbers fill the pp tok/s and tg tok/s columns as mean ± stddev. The Output textarea below contains the full JSON record — copy it, download it, or (on the HF Space) sign in to submit to the leaderboard.
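For readers who want to reproduce a pp or tg cell from raw timings, the arithmetic is tokens divided by seconds per rep, then mean and standard deviation across reps. The sketch below assumes the sample standard deviation (dividing by n - 1); the RepSample shape and function names are hypothetical, and the page may compute or round the figure differently.

```typescript
// One timed rep: how many tokens were processed and how long it took.
interface RepSample {
  tokens: number;
  seconds: number;
}

// Mean and sample standard deviation of tokens/second across reps.
// Assumption: sample stddev (n - 1 denominator); a single rep yields 0.
function summarize(reps: RepSample[]): { mean: number; stddev: number } {
  if (reps.length === 0) {
    throw new Error("need at least one rep");
  }
  const rates = reps.map((r) => r.tokens / r.seconds);
  const n = rates.length;
  const mean = rates.reduce((acc, x) => acc + x, 0) / n;
  if (n < 2) {
    return { mean, stddev: 0 };
  }
  const sumSq = rates.reduce((acc, x) => acc + (x - mean) * (x - mean), 0);
  return { mean, stddev: Math.sqrt(sumSq / (n - 1)) };
}

// Render a table cell in the "mean ± stddev" form.
function formatCell(s: { mean: number; stddev: number }): string {
  return `${s.mean.toFixed(1)} ± ${s.stddev.toFixed(1)}`;
}

// Three reps at 10, 12, and 14 tok/s.
const cell = formatCell(
  summarize([
    { tokens: 120, seconds: 12 },
    { tokens: 120, seconds: 10 },
    { tokens: 140, seconds: 10 },
  ]),
);
// cell is "12.0 ± 2.0"
```

A large stddev relative to the mean usually means the reps were not comparable (for example thermal throttling), so the mean alone should be read with caution.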
Known issues
- Safari Private Browsing
- Don't use it. Private mode caps OPFS storage at a few hundred MB per tab, denies navigator.storage.persist(), and routes the file system through an ephemeral backend with stricter sync-handle limits. Downloads fail mid-stream with "operation failed for an unknown transient reason" and retries don't help, because the cause isn't transient: it's the browsing mode. Switch to a regular Safari window.
- Mobile tabs (iOS/Android)
- iOS Jetsam and Android's low-memory killer reap tabs that approach the per-process memory ceiling, often silently reloading the page. The Run page applies tighter budgets and forced-eviction on mobile to mitigate, but variants near the budget edge can still crash the tab. Run from a desktop for stable numbers.
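Storage-capped environments like the ones above can often be spotted before a download starts by comparing the reported quota with the size of the file. A minimal sketch, assuming the numbers come from navigator.storage.estimate(); the QuotaEstimate shape mirrors that result, while checkHeadroom and the Verdict type are hypothetical. Browser quota estimates are approximate, so treat the verdict as a hint.

```typescript
// Stand-in for the object navigator.storage.estimate() resolves to.
interface QuotaEstimate {
  quota?: number; // bytes the origin may use
  usage?: number; // bytes already used
}

type Verdict = "ok" | "insufficient" | "unknown";

// Decide whether a download of neededBytes is likely to fit.
function checkHeadroom(est: QuotaEstimate, neededBytes: number): Verdict {
  if (est.quota === undefined || est.usage === undefined) {
    return "unknown"; // the browser did not report numbers
  }
  const free = est.quota - est.usage;
  return free >= neededBytes ? "ok" : "insufficient";
}

// In a browser (not executed here), the inputs would come from:
//   const est = await navigator.storage.estimate();
//   const verdict = checkHeadroom(est, ggufBytes); // ggufBytes: hypothetical
```

An "insufficient" verdict for a variant that is well under the device's model budget is a reasonable cue to check whether the tab is in a private window.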
Device
Browser—
Platform—
GPU—
Capability
deviceMemory—
WebGPU—
llama.cpp—
Model budget
Max size—
Your machine — labels the auto-detected device data on submission. Saved between visits.
Required fields marked *. Defaults are filled in from your browser; edit anything that's wrong before running.