JITServe: SLO-aware LLM Serving with Imprecise Request Information
This repository provides the pretrained QRF (Quantile Regression Forest) length predictor used by JITServe (NSDI '26) to estimate conservative upper bounds on LLM output lengths.
The predictor is released to ensure full reproducibility of the JITServe artifact.
This repository contains two components that must be used together:
qrf_model/
├── 0_qrf_lmsys_chat_llama3_8b.pkl
└── 0_qrf_lmsys_chat_qwen25_7b.pkl
qrf_vectorizer/
├── 0_qrf_lmsys_chat_llama3_8b.pkl
└── 0_qrf_lmsys_chat_qwen25_7b.pkl
These artifacts are consumed by JITServe at runtime.
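Since each model file shares its filename with its vectorizer, the pairing can be checked before anything is loaded. The following is a minimal sketch, not JITServe's own loader: the helper names `find_predictor_pairs` and `load_pair` are hypothetical, and it assumes the `.pkl` files are standard Python pickles (they may instead require another loader, and the libraries that define the pickled classes must be installed).

```python
import pickle
from pathlib import Path


def find_predictor_pairs(root):
    """Map each predictor name to (model_path, vectorizer_path).

    Only names present in BOTH qrf_model/ and qrf_vectorizer/ are returned.
    """
    root = Path(root)
    model_dir = root / "qrf_model"
    vec_dir = root / "qrf_vectorizer"
    pairs = {}
    for model_path in sorted(model_dir.glob("*.pkl")):
        vec_path = vec_dir / model_path.name  # same filename in the sibling folder
        if vec_path.is_file():
            pairs[model_path.stem] = (model_path, vec_path)
    return pairs


def load_pair(model_path, vectorizer_path):
    """Unpickle a model and its vectorizer (assumes plain pickle files).

    Unpickling executes code from the file, so only load files you trust.
    """
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(vectorizer_path, "rb") as f:
        vectorizer = pickle.load(f)
    return model, vectorizer
```

Calling `find_predictor_pairs(".")` from the repository root should list one entry per model variant. A missing entry means a model or vectorizer file is absent from its folder.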
Expected directory layout in the JITServe artifact:
assets/qrf/
├── qrf_model/
└── qrf_vectorizer/
After downloading this repository, place its contents under the path above.
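The placement step can be scripted. This is a minimal sketch under stated assumptions: `install_predictor`, `download_dir`, and `artifact_root` are hypothetical names, and `assets/qrf/` is assumed to be relative to the root of the JITServe artifact checkout.

```python
import shutil
from pathlib import Path

REQUIRED_SUBDIRS = ("qrf_model", "qrf_vectorizer")


def install_predictor(download_dir, artifact_root):
    """Copy qrf_model/ and qrf_vectorizer/ into <artifact_root>/assets/qrf/."""
    src = Path(download_dir)
    dest = Path(artifact_root) / "assets" / "qrf"
    # Fail early if the download is incomplete: both folders are needed.
    missing = [name for name in REQUIRED_SUBDIRS if not (src / name).is_dir()]
    if missing:
        raise FileNotFoundError(f"missing in {src}: {', '.join(missing)}")
    dest.mkdir(parents=True, exist_ok=True)
    for name in REQUIRED_SUBDIRS:
        # dirs_exist_ok (Python 3.8+) allows the copy to be re-run.
        shutil.copytree(src / name, dest / name, dirs_exist_ok=True)
    return dest
```

For example, `install_predictor("path/to/download", "path/to/jitserve")` would create `path/to/jitserve/assets/qrf/` with both subfolders. Both paths are placeholders.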
JITServe loads the predictor automatically during startup and does not require any additional configuration by default.
If you use these artifacts, please consider citing our paper:
@misc{zhang2025jitservesloawarellmserving,
title={JITServe: SLO-aware LLM Serving with Imprecise Request Information},
author={Wei Zhang and Zhiyu Wu and Yi Mu and Rui Ning and Banruo Liu and Nikhil Sarda and Myungjin Lee and Fan Lai},
year={2025},
eprint={2504.20068},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2504.20068},
}