---
license: mit
library_name: skops
tags: [tabular-regression, xgboost, llm-inference, performance-prediction]
---
# FitCheck speed predictor
Predicts local-LLM **decode tokens/sec** from hardware + model features.
Part of [FitCheck](https://huggingface.co/spaces/build-small-hackathon/FitCheck),
the honest "what AI can your computer run" advisor.
## Method
Gradient-boosted regression (XGBoost), following the methodology of
**LLM-Pilot** (IBM, SC'24; [arXiv:2410.02425](https://arxiv.org/abs/2410.02425)),
which predicts LLM inference performance on unseen hardware. Validation is
**leave-one-accelerator-out**, so the errors reported below are measured on
hardware the model never saw in training.
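
To make the validation scheme concrete, here is a minimal sketch of a leave-one-accelerator-out split in plain Python. The function name and the accelerator labels are hypothetical, not taken from the FitCheck code; scikit-learn's `LeaveOneGroupOut` provides the same kind of split for real pipelines.

```python
from collections import defaultdict


def leave_one_accelerator_out(accelerators):
    """Yield (held_out, train_idx, test_idx) once per distinct accelerator.

    `accelerators` holds one accelerator name per benchmark measurement.
    """
    rows_by_accel = defaultdict(list)
    for i, name in enumerate(accelerators):
        rows_by_accel[name].append(i)

    for held_out, test_idx in rows_by_accel.items():
        held = set(test_idx)
        # Every row from the held-out accelerator is excluded from training.
        train_idx = [i for i in range(len(accelerators)) if i not in held]
        yield held_out, train_idx, test_idx


# Hypothetical labels, one per measurement row.
rows = ["RTX 4090", "M2 Max", "RTX 4090", "RX 7900 XTX", "M2 Max"]
splits = list(leave_one_accelerator_out(rows))
```

Each fold trains on every other accelerator and tests only on the held-out one, so no test row shares hardware with a training row.
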

Features: effective memory bandwidth, bytes read per token (weights + KV),
weights size, KV size, MoE active fraction, offload fraction, and the
analytical roofline prior (effective bandwidth / bytes read per token). Decode
is memory-bandwidth-bound, so the roofline gives an ideal upper estimate; the
model learns the residual between that ideal and measured reality.
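
The roofline prior described above can be sketched as a single division. The function names, units (GB and GB/s), and example numbers below are illustrative assumptions. The efficiency ratio is only one way to express the gap between ideal and measured speed, not necessarily the target the model is trained on.

```python
def roofline_tokens_per_sec(bandwidth_gb_s, weights_gb, kv_gb):
    """Ideal decode speed: every token reads the weights and the KV cache once."""
    bytes_per_token_gb = weights_gb + kv_gb
    if bytes_per_token_gb <= 0:
        raise ValueError("bytes read per token must be positive")
    return bandwidth_gb_s / bytes_per_token_gb


def roofline_efficiency(measured_tok_s, prior_tok_s):
    """Fraction of the roofline ideal actually achieved (illustrative only)."""
    return measured_tok_s / prior_tok_s


# Hypothetical hardware: 1000 GB/s bandwidth, 4.5 GB weights, 0.5 GB KV cache.
prior = roofline_tokens_per_sec(1000.0, 4.5, 0.5)
efficiency = roofline_efficiency(150.0, prior)
```

With these made-up numbers the prior is 200 tok/s, and a measured 150 tok/s corresponds to 75% of the ideal.
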
## Training data
6,633 real measurements across 595 distinct
accelerators (consumer CPUs, Apple Silicon, NVIDIA/AMD GPUs), from the
[LocalScore](https://www.localscore.ai) community benchmark (Mozilla Builders /
cjpais — thank you; data attributed, not owned, takedown requests honoured).
Trained 2026-06-10.
## Honest holdout results (leave-one-accelerator-out)
| metric | roofline baseline | this model |
|---|---|---|
| median APE (absolute percentage error), bandwidth-known hardware | 28.1% | 17.5% |
| median abs error (tok/s) | 11.63 | 9.55 |
| median APE, all hardware incl. CPUs (no baseline possible) | n/a | 23.6% |
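
For reference, the two metrics in the table can be computed as below. This is a small sketch with made-up numbers, not the evaluation script used for the results above.

```python
from statistics import median


def median_ape(actual, predicted):
    """Median absolute percentage error, in percent."""
    return 100.0 * median(abs(p - a) / a for a, p in zip(actual, predicted))


def median_abs_error(actual, predicted):
    """Median absolute error, in the same unit as the inputs (tok/s here)."""
    return median(abs(p - a) for a, p in zip(actual, predicted))


# Hypothetical measured vs. predicted decode speeds in tok/s.
actual = [100.0, 50.0, 20.0]
predicted = [120.0, 45.0, 25.0]
ape = median_ape(actual, predicted)
mae = median_abs_error(actual, predicted)
```

The median is less sensitive than the mean to a few badly mispredicted accelerators, which matters when each fold holds out an entire device.
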
**Shipping rule:** this model is only deployed because it beat the analytical
baseline on held-out hardware. If a retrain ever fails that gate, FitCheck
falls back to the labelled roofline estimate.
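
The shipping rule amounts to a simple gate. The sketch below assumes the comparison is made on holdout median APE and uses hypothetical names. The exact criterion FitCheck applies is not specified here.

```python
def choose_estimator(model_median_ape, baseline_median_ape):
    """Return which estimator to deploy after a retrain.

    The learned model ships only if it strictly beats the roofline baseline
    on held-out hardware; a tie falls back to the roofline estimate.
    """
    if model_median_ape < baseline_median_ape:
        return "xgboost"
    return "roofline"


# Holdout figures from the table above: model 17.5%, baseline 28.1%.
deployed = choose_estimator(17.5, 28.1)
```
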
## Limits (read this)
- Trained on **dense LLMs running fully on-device** (LocalScore's fixed grid:
1B / 8B / 14B at Q4_K_M, varied context). Generalisation to other model sizes
comes from the bytes-per-token feature, not from diversity in the training data.
- MoE and GPU-to-RAM offload are corrected analytically upstream, and the
corrected features are then passed to the model. Those corrections are
engineering estimates and are labelled as such.
- Does NOT cover vision/diffusion models (compute-bound, different physics).
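
As an illustration of what an analytical MoE and offload correction can look like, here is a sketch under stated assumptions: only the active fraction of the weights is read per token, and the offloaded share of the bytes is read at system-RAM bandwidth. This is a common back-of-envelope model, not FitCheck's actual correction, and all names and numbers are hypothetical.

```python
def bytes_per_token_gb(weights_gb, kv_gb, moe_active_fraction=1.0):
    """Assumed bytes read per token; dense models use an active fraction of 1.0."""
    return weights_gb * moe_active_fraction + kv_gb


def offload_tokens_per_sec(total_gb, offload_fraction, gpu_bw_gb_s, ram_bw_gb_s):
    """Assumed decode speed when part of the bytes is read from system RAM."""
    gpu_time = total_gb * (1.0 - offload_fraction) / gpu_bw_gb_s
    ram_time = total_gb * offload_fraction / ram_bw_gb_s
    # Time per token is the sum of both reads; speed is its reciprocal.
    return 1.0 / (gpu_time + ram_time)


# Hypothetical MoE model: 24 GB weights, 25% active per token, 1 GB KV cache.
moe_bytes = bytes_per_token_gb(24.0, 1.0, moe_active_fraction=0.25)
# Hypothetical half-offloaded 10 GB read: 1000 GB/s GPU, 50 GB/s system RAM.
offloaded = offload_tokens_per_sec(10.0, 0.5, 1000.0, 50.0)
```

In this model the slow RAM read dominates: offloading half of the bytes cuts the estimate from 100 tok/s to under 10 tok/s.
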