Clairvoyant: LLM Response Length Predictor (ONNX)
XGBoost classifier that predicts, from the prompt alone, whether an LLM will produce a short, medium, or long response. Inference takes 0.029 ms and needs no tokeniser. Designed to power Shortest-Job-First (SJF) scheduling in serial LLM backends to reduce head-of-line blocking.
Paper: Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends
Code: github.com/Aravind0403/clairvoyant-scheduler
What This Is
Serial LLM backends (Ollama, llama.cpp in default mode) process one request at a time. A single long request blocks all shorter ones behind it: classic head-of-line blocking. Clairvoyant predicts response length at admission time and dispatches in SJF order, with a starvation timeout to protect long requests.
This repo contains the ONNX prediction models only. The Go sidecar proxy that uses them lives in the code repo above.
Scope:
- Serial/low-concurrency backends: Ollama, llama.cpp defaults, local enterprise servers
- Edge devices and privacy-sensitive deployments (CPU-only, no network calls)
- Not for vLLM, TGI, Orca, or any backend with native continuous batching (SJF adds no benefit there)
Models
Three models are provided, each trained on a different dataset. Model B is recommended for general use.
| File | Training Data | Ranking Acc (in-dist) | Class Acc (in-dist) | Recommended |
|---|---|---|---|---|
| predictor.onnx (Model A) | ShareGPT | 76.3% | 47.6% | General fallback |
| predictor_model_b.onnx (Model B) | LMSYS-Chat-1M | 95.6% | 66.8% | Default |
| predictor_oasst1.onnx (Model C) | OASST1 | 62.2% | 41.0% | Not recommended |
Cross-distribution ranking accuracy for all models: 52-66%. At about 50% the predicted ordering is no better than chance, so scheduling degrades gracefully toward FCFS.
Ranking accuracy (the operationally relevant metric) measures whether the model correctly orders two requests by length, which is what matters for SJF dispatch. Class accuracy measures exact 3-class prediction.
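The pairwise metric described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: the function name `ranking_accuracy` is hypothetical, pairs with equal true lengths are skipped, and a tie in the predicted scores counts as half correct so that an uninformative predictor scores 0.5.

```python
from itertools import combinations


def ranking_accuracy(true_lengths, predicted_scores):
    """Fraction of request pairs whose predicted order matches the true order.

    Assumptions (not taken from the paper): pairs with equal true lengths are
    skipped, and tied predicted scores count as half correct.
    """
    correct = 0.0
    total = 0
    for i, j in combinations(range(len(true_lengths)), 2):
        if true_lengths[i] == true_lengths[j]:
            continue  # no defined order for this pair
        total += 1
        true_order = true_lengths[i] < true_lengths[j]
        if predicted_scores[i] == predicted_scores[j]:
            correct += 0.5
        elif (predicted_scores[i] < predicted_scores[j]) == true_order:
            correct += 1.0
    return correct / total if total else float("nan")


# True response lengths (tokens) and a predicted "longness" score per request.
lengths = [50, 400, 1200, 90]
scores = [0.05, 0.40, 0.90, 0.45]
print(f"{ranking_accuracy(lengths, scores):.3f}")  # 5 of 6 pairs ordered correctly
```

Here only the pair (400, 90) is ordered incorrectly, because the 90-token request received the higher score.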
Model C is included for reproducibility. Its lower accuracy means it degrades toward FCFS behaviour rather than causing harm, but Model B is strictly better for all tested distributions.
Output Classes
| Class | Token range | Scheduling action |
|---|---|---|
| Short | < 200 tokens | Dispatch first |
| Medium | 200-800 tokens | Dispatch in order |
| Long | ≥ 800 tokens | Dispatch last (or defer until τ) |
Token count is approximated as len(response) // 4 (BPE proxy, ±10% for English prose).
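As a small sketch of how the class table and the `len(response) // 4` approximation combine, the labelling rule could look like this. The function names are hypothetical, and the boundary handling (exactly 800 tokens counts as Long, following the "≥ 800" row) is an assumption where the table's ranges touch.

```python
def approx_tokens(text: str) -> int:
    """Approximate token count: four characters per token."""
    return len(text) // 4


def length_class(response: str) -> str:
    """Map a response to Short / Medium / Long using the table's thresholds."""
    n = approx_tokens(response)
    if n < 200:
        return "Short"
    if n < 800:  # assumption: 800 itself belongs to Long
        return "Medium"
    return "Long"


print(length_class("x" * 400))   # 100 approx tokens
print(length_class("x" * 2000))  # 500 approx tokens
print(length_class("x" * 4000))  # 1000 approx tokens
```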
Features
19 lexical features extracted from the prompt, with no tokeniser and no model call:
| # | Feature | Type and description |
|---|---|---|
| 1 | prompt_token_len | Approx token count (len // 4) |
| 2 | has_code_keyword | Binary: code/function/sql/regex/etc. present |
| 3 | has_length_constraint | Binary: "in N words", "briefly", "tl;dr", etc. |
| 4 | ends_with_question | Binary: prompt ends with ? |
| 5 | has_format_keyword | Binary: json/csv/yaml/table/etc. present |
| 6 | clause_count | Count of clauses split by punctuation |
| 7-19 | verb_* | One-hot over 13 instruction verbs: what, write, explain, summarize, how, list, implement, compare, describe, generate, why, define, other |
Feature extraction is in feature_extractor.py.
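To make the table concrete, here is a rough sketch of how such a 19-element vector might be assembled. It is not the code in feature_extractor.py: the keyword lists contain only the examples named in the table, the regular expressions and the "first listed verb found in the prompt" rule are assumptions, and `sketch_features` is a hypothetical name. Its output should not be fed to the shipped models; use `extract_features` for that.

```python
import re

# Hypothetical keyword lists: only the examples named in the feature table.
CODE_WORDS = ("code", "function", "sql", "regex")
FORMAT_WORDS = ("json", "csv", "yaml", "table")
LENGTH_PATTERNS = (r"\bin \d+ words\b", r"\bbriefly\b", r"\btl;dr")
VERBS = ("what", "write", "explain", "summarize", "how", "list", "implement",
         "compare", "describe", "generate", "why", "define", "other")


def has_any_word(text: str, words) -> bool:
    """True if any keyword appears as a whole word in the text."""
    return any(re.search(rf"\b{re.escape(w)}\b", text) for w in words)


def sketch_features(prompt: str) -> list:
    text = prompt.lower().strip()
    words = re.findall(r"[a-z']+", text)
    # Assumption: the instruction verb is the first listed verb in the prompt.
    verb = next((w for w in words if w in VERBS[:-1]), "other")
    # Assumption: clauses are non-empty spans between punctuation marks.
    clauses = [c for c in re.split(r"[,.;:!?]+", text) if c.strip()]
    features = [
        float(len(prompt) // 4),                                     # prompt_token_len
        float(has_any_word(text, CODE_WORDS)),                       # has_code_keyword
        float(any(re.search(p, text) for p in LENGTH_PATTERNS)),     # has_length_constraint
        float(text.endswith("?")),                                   # ends_with_question
        float(has_any_word(text, FORMAT_WORDS)),                     # has_format_keyword
        float(len(clauses)),                                         # clause_count
    ]
    features += [float(verb == v) for v in VERBS]                    # verb_* one-hot
    return features


vec = sketch_features("Explain briefly what a SQL join is?")
print(len(vec), vec[:6])
```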
Quick Start
import onnxruntime as ort
import numpy as np
from feature_extractor import extract_features
# Load Model B (recommended)
session = ort.InferenceSession("predictor_model_b.onnx")
input_name = session.get_inputs()[0].name
prompt = "Write a React tic-tac-toe app with Redux and full unit tests."
features = extract_features(prompt)
probs = session.run(None, {input_name: np.array([features], dtype=np.float32)})[1][0]
p_short, p_medium, p_long = probs
print(f"P(Short)={p_short:.3f} P(Medium)={p_medium:.3f} P(Long)={p_long:.3f}")
# SJF dispatch key: a higher value means the request is dispatched later
priority = p_long  # queue requests by ascending p_long
print(f"Dispatch priority key: {priority:.3f} -> {'defer' if priority > 0.5 else 'dispatch now'}")
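One way to keep requests ordered by ascending p_long is a binary heap. The sketch below is a minimal Python illustration, not the Go sidecar's implementation; `SJFQueue` and its methods are hypothetical names, and breaking ties by arrival order is an assumption.

```python
import heapq
import itertools


class SJFQueue:
    """Min-heap keyed on p_long; ties fall back to arrival order."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # monotonically increasing arrival counter

    def push(self, request_id, p_long):
        heapq.heappush(self._heap, (p_long, next(self._seq), request_id))

    def pop(self):
        """Return the request with the lowest predicted P(Long)."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = SJFQueue()
for request_id, p_long in [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.1)]:
    queue.push(request_id, p_long)

order = [queue.pop() for _ in range(len(queue))]
print(order)  # ['b', 'd', 'c', 'a']
```

Requests "b" and "d" share the same key, so the earlier arrival is dispatched first.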
Starvation Timeout
To prevent long requests from waiting indefinitely, set a timeout:
tau = 3 * mean_short_latency # measure mean_short_latency on your hardware
# When a request has waited longer than tau, promote it to front of queue
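The promotion rule can be sketched as a selection function that checks for starved requests before falling back to SJF order. This is a simplified illustration under assumptions: `Pending` and `pick_next` are hypothetical names, times are plain floats in seconds, the oldest starved request is served first, and the linear scan is chosen for clarity rather than speed.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    request_id: str
    arrival: float  # arrival time in seconds
    p_long: float   # predicted P(Long) from the model


def pick_next(pending, now, tau):
    """Remove and return the next request to dispatch."""
    starved = [r for r in pending if now - r.arrival > tau]
    if starved:
        # A request has waited longer than tau: serve the oldest one first.
        chosen = min(starved, key=lambda r: r.arrival)
    else:
        # Otherwise plain SJF: lowest p_long, ties broken by arrival time.
        chosen = min(pending, key=lambda r: (r.p_long, r.arrival))
    pending.remove(chosen)
    return chosen


mean_short_latency = 2.0  # placeholder value; measure on your hardware
tau = 3 * mean_short_latency

pending = [
    Pending("long", arrival=0.0, p_long=0.9),
    Pending("short-1", arrival=5.0, p_long=0.1),
    Pending("short-2", arrival=6.0, p_long=0.2),
]
print(pick_next(pending, now=5.5, tau=tau).request_id)  # short-1 (nothing starved yet)
print(pick_next(pending, now=7.0, tau=tau).request_id)  # long (waited 7.0 s > tau)
print(pick_next(pending, now=8.0, tau=tau).request_id)  # short-2
```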
Benchmark Results
Evaluated on an RTX 4090 with Ollama (n=250 per cell, Poisson arrivals).
Steady-state (Poisson ρ = 0.74)
| Metric | Value |
|---|---|
| P50 reduction (short requests) | 17% |
| Practical ρ range | 0.55 ≲ ρ ≲ 0.80 |
| Predictor overhead | 0.029 ms / request |
The 17% figure is the representative real-world result at moderate load. Benefit scales with load imbalance: heavier burst workloads show larger gains.
Burst workload (mixed SHORT/LONG, high load)
| Model | Class | FCFS P50 | SJF P50 | Reduction |
|---|---|---|---|---|
| Gemma3:4b | SHORT | 229.5 s | 69.1 s | 70% |
| Llama3.1:8b | SHORT | 158.8 s | 38.0 s | 76% |
These represent the upper-bound case: synthetic burst workloads with heavy long-request interference.
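The Reduction column is consistent with the relative drop in P50 latency, (FCFS - SJF) / FCFS. A short check, assuming that definition (the helper name `p50_reduction` is hypothetical):

```python
def p50_reduction(fcfs_p50: float, sjf_p50: float) -> float:
    """Relative P50 latency reduction, in percent."""
    return 100.0 * (fcfs_p50 - sjf_p50) / fcfs_p50


# P50 values in seconds, taken from the burst-workload table.
for model, fcfs, sjf in [("Gemma3:4b", 229.5, 69.1), ("Llama3.1:8b", 158.8, 38.0)]:
    print(f"{model}: {p50_reduction(fcfs, sjf):.1f}%")
# Gemma3:4b: 69.9%
# Llama3.1:8b: 76.1%
```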
Installation
pip install onnxruntime numpy
Requires Python ≥ 3.9. No GPU needed; inference runs on CPU.
To verify the model works on your machine (runs against Model A):
python test_inference.py
Note: test_inference.py validates Model A (predictor.onnx) only. Expected probability ranges in examples.json are Model A-specific; Model B will produce different (generally better) scores for the same prompts.
Limitations
- English-only: lexical features assume English prompts; non-English prompts fall back to the other verb bucket
- Cross-distribution degradation: ranking accuracy drops to 52-66% across distributions, still above random (50%), and degrades toward FCFS at ~50%
- τ calibration required: starvation timeout must be measured on your specific hardware; config.json provides defaults
- Serial backends only: no benefit with native concurrency or continuous batching
- Token approximation: len // 4 is a BPE proxy; accuracy degrades for non-English or heavily formatted text
Citation
If you use Clairvoyant in your work, please cite:
@article{sundaresan2026clairvoyant,
title={Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends},
author={Sundaresan, Aravind},
journal={arXiv preprint arXiv:2606.07248},
year={2026}
}
Licence