	---
	license: mit
	library_name: skops
	tags: [tabular-regression, xgboost, llm-inference, performance-prediction]
	---

	# FitCheck speed predictor

	Predicts local-LLM decode tokens/sec from hardware + model features.
	Part of [FitCheck](https://huggingface.co/spaces/build-small-hackathon/FitCheck),
	the honest "what AI can your computer run" advisor.

	## Method

	Gradient-boosted regression (XGBoost), following the methodology of
	LLM-Pilot (IBM, SC'24; [arXiv:2410.02425](https://arxiv.org/abs/2410.02425))
	for performance prediction of LLM inference on unseen hardware. Validation is
	leave-one-accelerator-out, so the errors reported below are measured on
	hardware the model never saw in training.

	Features: effective memory bandwidth, bytes read per token (weights + KV),
	weights size, KV size, MoE active fraction, offload fraction, and the
	analytical roofline prior (bandwidth / bytes). Decode is memory-bandwidth-bound;
	the model learns the residual between the roofline ideal and reality.
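A minimal sketch of the roofline prior described above, assuming the usual KV-cache size estimate (keys and values for every layer and cached token). The function names, the KV formula, and the example numbers are illustrative assumptions, not FitCheck's actual code or data.

```python
def kv_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2.0):
    """Approximate KV-cache size: keys + values for every layer and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def roofline_tokens_per_sec(bandwidth_gb_s, weight_bytes, kv_cache_bytes):
    """Ideal decode speed if each token streams weights + KV from memory once."""
    bytes_per_token = weight_bytes + kv_cache_bytes
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Illustrative numbers only (not taken from the training data):
# ~4.9 GB of quantised weights, 32 layers, 8 KV heads, head_dim 128,
# 4096 tokens of context, on a device with 400 GB/s effective bandwidth.
kv = kv_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=4096)
prior = roofline_tokens_per_sec(400.0, weight_bytes=4.9e9, kv_cache_bytes=kv)
print(f"KV cache: {kv / 1e9:.2f} GB, roofline prior: {prior:.1f} tok/s")
```

Real hardware rarely reaches this ideal, which is why the prior is used as one input feature and not as the final prediction.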

	## Training data

	6,633 real measurements across 595 distinct accelerators (consumer CPUs,
	Apple Silicon, NVIDIA/AMD GPUs) from the
	[LocalScore](https://www.localscore.ai) community benchmark (Mozilla Builders /
	cjpais, thank you). The data is attributed, not owned; takedown requests are
	honoured. Trained 2026-06-10.

	## Honest holdout results (leave-one-accelerator-out)

	| metric | roofline baseline | this model |
	|---|---|---|
	| median APE (bandwidth-known hardware) | 28.1% | 17.5% |
	| median abs error (tok/s) | 11.63 | 9.55 |
	| median APE, all hardware incl. CPUs (no baseline possible) | n/a | 23.6% |
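The evaluation protocol can be sketched in plain Python: hold out every measurement from one accelerator, fit on the rest, and score with median absolute percentage error (APE). The toy rows and the single "efficiency factor" model below are stand-ins invented for illustration; they are not the XGBoost model or the LocalScore data.

```python
from statistics import median


def leave_one_accelerator_out(rows):
    """Yield (held_out_name, train_rows, test_rows), one fold per accelerator."""
    for name in sorted({r["accelerator"] for r in rows}):
        train = [r for r in rows if r["accelerator"] != name]
        test = [r for r in rows if r["accelerator"] == name]
        yield name, train, test


def median_ape(actual, predicted):
    """Median absolute percentage error, in percent."""
    return 100.0 * median(abs(p - a) / a for a, p in zip(actual, predicted))


# Hypothetical measurements: roofline estimate vs. measured tok/s.
rows = [
    {"accelerator": "gpu-a", "roofline": 100.0, "measured": 70.0},
    {"accelerator": "gpu-a", "roofline": 50.0, "measured": 36.0},
    {"accelerator": "gpu-b", "roofline": 80.0, "measured": 60.0},
    {"accelerator": "gpu-b", "roofline": 40.0, "measured": 28.0},
    {"accelerator": "gpu-c", "roofline": 200.0, "measured": 130.0},
    {"accelerator": "gpu-c", "roofline": 120.0, "measured": 90.0},
]

actual, baseline_pred, model_pred = [], [], []
for name, train, test in leave_one_accelerator_out(rows):
    # Stand-in "model": one efficiency factor learned from the other accelerators.
    efficiency = median(r["measured"] / r["roofline"] for r in train)
    for r in test:
        actual.append(r["measured"])
        baseline_pred.append(r["roofline"])
        model_pred.append(efficiency * r["roofline"])

print(f"roofline baseline: {median_ape(actual, baseline_pred):.1f}% median APE")
print(f"stand-in model:    {median_ape(actual, model_pred):.1f}% median APE")
```

Because each fold's test rows come from an accelerator absent from its training rows, the reported error reflects unseen hardware and not interpolation between known devices.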

	Shipping rule: this model is only deployed because it beat the analytical
	baseline on held-out hardware. If a retrain ever fails that gate, FitCheck
	falls back to the labelled roofline estimate.
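The gate can be expressed as a one-line comparison. This sketch assumes the comparison is made on held-out median APE (the card does not say which metric the gate uses), and the function and label names are hypothetical.

```python
def choose_predictor(model_median_ape, baseline_median_ape):
    """Pick which estimate to ship; lower held-out median APE is better.

    A tie goes to the roofline fallback, since the model must beat the baseline.
    """
    if model_median_ape < baseline_median_ape:
        return "learned-model"
    return "roofline-fallback"


# Using the held-out figures reported in the table above.
print(choose_predictor(17.5, 28.1))
```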

	## Limits (read this)

	- Trained on dense LLMs running fully on-device (LocalScore's fixed grid:
	1B / 8B / 14B at Q4_K_M, varied context). Generalisation to other models
	therefore comes from the bytes-per-token feature, not from diversity in the
	training data.
	- MoE and GPU->RAM offload are corrected analytically upstream, and the
	corrected features are then fed to the model. Those corrections are
	engineering estimates and are labelled as such.
	- Does NOT cover vision/diffusion models (compute-bound, different physics).
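For intuition, one plausible shape for the analytical MoE and offload corrections is sketched below. The card does not give the actual formulas, so both functions are assumptions: MoE is modelled as reading only the active fraction of the weights per token, and offload as a time-weighted blend of GPU and system-RAM bandwidth.

```python
def effective_bytes_per_token(weight_bytes, kv_cache_bytes, moe_active_fraction=1.0):
    """Assumed MoE correction: only the active share of the weights is read per token."""
    return weight_bytes * moe_active_fraction + kv_cache_bytes


def effective_bandwidth(gpu_bw_gb_s, ram_bw_gb_s, offload_fraction):
    """Assumed offload correction: per-byte read times add across the two tiers."""
    time_per_byte = (1.0 - offload_fraction) / gpu_bw_gb_s + offload_fraction / ram_bw_gb_s
    return 1.0 / time_per_byte


# Illustrative numbers only: half the bytes offloaded from a 400 GB/s GPU
# to 50 GB/s system RAM, and a MoE model with 25% of weights active per token.
bw = effective_bandwidth(400.0, 50.0, offload_fraction=0.5)
bytes_per_token = effective_bytes_per_token(10e9, 1e9, moe_active_fraction=0.25)
print(f"effective bandwidth: {bw:.1f} GB/s, bytes per token: {bytes_per_token / 1e9:.2f} GB")
```

The blend is harmonic, so in this simplified picture the slower tier dominates even at modest offload fractions.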