---
license: mit
library_name: skops
tags: [tabular-regression, xgboost, llm-inference, performance-prediction]
---
# FitCheck speed predictor

Predicts local-LLM **decode tokens/sec** from hardware + model features.
Part of [FitCheck](https://huggingface.co/spaces/build-small-hackathon/FitCheck),
the honest "what AI can your computer run" advisor.
## Method

Gradient-boosted regression (XGBoost) following the methodology of
**LLM-Pilot** (IBM, SC'24; [arXiv:2410.02425](https://arxiv.org/abs/2410.02425)),
which addresses performance prediction for LLM inference on unseen hardware.
Validation is **leave-one-accelerator-out**, so the error below is measured on
hardware the model never saw in training.
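
A minimal sketch of what a leave-one-accelerator-out loop looks like, using only the standard library. The `fit_mean_efficiency` and `predict_scaled_roofline` functions are hypothetical stand-ins for the real XGBoost model, and the rows are made up for illustration; they are not LocalScore data.

```python
from collections import defaultdict
from statistics import mean, median


def leave_one_accelerator_out(rows, fit, predict):
    """Yield (accelerator, APEs) with each accelerator held out in turn.

    rows: list of dicts with at least 'accelerator' and 'tok_s' keys.
    fit / predict: callables standing in for the real model (assumption).
    """
    by_accel = defaultdict(list)
    for row in rows:
        by_accel[row["accelerator"]].append(row)

    for held_out, test_rows in by_accel.items():
        # Train only on rows from the other accelerators.
        train_rows = [r for r in rows if r["accelerator"] != held_out]
        model = fit(train_rows)
        apes = [abs(predict(model, r) - r["tok_s"]) / r["tok_s"] for r in test_rows]
        yield held_out, apes


# Toy stand-in model: scale the roofline prior by the mean efficiency seen in training.
def fit_mean_efficiency(train_rows):
    return mean(r["tok_s"] / r["roofline"] for r in train_rows)


def predict_scaled_roofline(efficiency, row):
    return efficiency * row["roofline"]


# Made-up rows for illustration only.
rows = [
    {"accelerator": "gpu-a", "roofline": 100.0, "tok_s": 70.0},
    {"accelerator": "gpu-a", "roofline": 50.0, "tok_s": 36.0},
    {"accelerator": "gpu-b", "roofline": 80.0, "tok_s": 60.0},
    {"accelerator": "cpu-c", "roofline": 10.0, "tok_s": 6.0},
]

all_apes = [
    ape
    for _, apes in leave_one_accelerator_out(rows, fit_mean_efficiency, predict_scaled_roofline)
    for ape in apes
]
print(f"median APE: {100 * median(all_apes):.1f}%")
```

Grouping by accelerator rather than by row is the point: a random row split would leak measurements from the same device into training and understate the error on new hardware.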

Features: effective memory bandwidth, bytes read per token (weights + KV),
weights size, KV size, MoE active fraction, offload fraction, and the
analytical roofline prior (effective bandwidth / bytes read per token).
Decode is memory-bandwidth-bound, so that ratio is the ideal tokens/sec;
the model learns the residual between the roofline ideal and reality.
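
The roofline prior can be sketched in a few lines. The function names and the example numbers are illustrative assumptions. The card also does not say whether the learned residual is a ratio or a difference, so the `efficiency` helper below shows only one possible reading.

```python
def roofline_tok_s(bandwidth_gb_s, weights_gb, kv_gb):
    """Ideal decode speed if every token reads weights + KV once at full bandwidth."""
    bytes_per_token_gb = weights_gb + kv_gb
    return bandwidth_gb_s / bytes_per_token_gb


def efficiency(measured_tok_s, prior_tok_s):
    """Assumption: express the roofline-vs-reality gap as a ratio."""
    return measured_tok_s / prior_tok_s


# Hypothetical device: 400 GB/s effective bandwidth, 4.5 GB weights, 0.5 GB KV.
prior = roofline_tok_s(400.0, 4.5, 0.5)   # 400 / 5 = 80 tok/s ideal
gap = efficiency(60.0, prior)             # a measured 60 tok/s is 0.75 of the ideal
print(prior, gap)
```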
## Training data

6,633 real measurements across 595 distinct
accelerators (consumer CPUs, Apple Silicon, NVIDIA/AMD GPUs), from the
[LocalScore](https://www.localscore.ai) community benchmark (Mozilla Builders /
cjpais). Thank you: the data is attributed, not owned, and takedown requests
are honoured.
Trained 2026-06-10.
## Honest holdout results (leave-one-accelerator-out)

| metric | roofline baseline | this model |
|---|---|---|
| median APE (absolute percentage error), bandwidth-known hardware | 28.1% | 17.5% |
| median abs error (tok/s) | 11.63 | 9.55 |
| median APE, all hardware incl. CPUs (no baseline possible) | n/a | 23.6% |

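
For reference, the two metrics in the table can be computed as below. This is a generic sketch of median absolute percentage error and median absolute error, and the sample values are made up rather than taken from the holdout set.

```python
from statistics import median


def median_ape(predicted, measured):
    """Median absolute percentage error, in percent."""
    return 100.0 * median(abs(p - m) / m for p, m in zip(predicted, measured))


def median_abs_error(predicted, measured):
    """Median absolute error, in the same unit as the inputs (tok/s here)."""
    return median(abs(p - m) for p, m in zip(predicted, measured))


# Made-up predictions and measurements in tok/s.
predicted = [90.0, 50.0, 12.0]
measured = [100.0, 40.0, 10.0]
print(median_ape(predicted, measured), median_abs_error(predicted, measured))
```

The median is less sensitive than the mean to a handful of badly mispredicted devices, which matters when the data spans hundreds of accelerators.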

**Shipping rule:** this model is only deployed because it beat the analytical
baseline on held-out hardware. If a retrain ever fails that gate, FitCheck
falls back to the labelled roofline estimate.
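
The gate can be pictured as a single comparison. The sketch below assumes the gate compares median APE on bandwidth-known hardware; the card does not say which metric is actually used, and `choose_predictor` is a hypothetical name.

```python
def choose_predictor(model_median_ape, roofline_median_ape):
    """Return which estimate to ship; ties go to the simpler roofline."""
    if model_median_ape < roofline_median_ape:
        return "model"
    return "roofline"


# With the holdout numbers reported above, the learned model passes the gate.
print(choose_predictor(17.5, 28.1))
```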
## Limits (read this)

- Trained on **dense LLMs running fully on-device** (LocalScore's fixed grid:
  1B / 8B / 14B at Q4_K_M, varied context). Generalisation to other model
  sizes and quantisations therefore relies on the bytes-per-token feature,
  not on diversity in the training data.
- MoE and GPU->RAM offload are corrected analytically upstream, then fed
  through. Those corrections are engineering estimates, labelled as such.
- Does NOT cover vision/diffusion models (compute-bound, different physics).
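
To make the second limit concrete, here is one way such an analytical correction could look. It is purely an illustrative assumption, not FitCheck's actual formula. It scales the weights read per token by the MoE active fraction and charges the offloaded share of the bytes at the slower memory's bandwidth.

```python
def corrected_roofline_tok_s(
    weights_gb,
    kv_gb,
    fast_bw_gb_s,
    slow_bw_gb_s,
    active_fraction=1.0,   # share of weights read per token (1.0 for dense models)
    offload_fraction=0.0,  # share of the bytes served from the slower memory
):
    """Hypothetical roofline with MoE and offload adjustments (assumed formula)."""
    read_gb = weights_gb * active_fraction + kv_gb
    slow_gb = read_gb * offload_fraction
    fast_gb = read_gb - slow_gb
    # Time per token is the sum of the time spent reading from each memory tier.
    seconds_per_token = fast_gb / fast_bw_gb_s + slow_gb / slow_bw_gb_s
    return 1.0 / seconds_per_token


# Dense, fully on-device: reduces to bandwidth / bytes.
print(corrected_roofline_tok_s(4.5, 0.5, 400.0, 50.0))
# Half the weights active, half the bytes offloaded to slower memory.
print(corrected_roofline_tok_s(8.0, 0.0, 400.0, 50.0, active_fraction=0.5, offload_fraction=0.5))
```

Under this assumed formula, even a modest offload fraction dominates the time per token, because the slow tier's bandwidth sits in the denominator of the larger term.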