Clairvoyant — LLM Response Length Predictor (ONNX)

XGBoost classifier that predicts whether an LLM will produce a short, medium, or long response — from the prompt alone, in 0.029 ms, no tokeniser required. Designed to power Shortest-Job-First (SJF) scheduling in serial LLM backends to reduce head-of-line blocking.

Paper: Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends
Code: github.com/Aravind0403/clairvoyant-scheduler

What This Is

Serial LLM backends (Ollama, llama.cpp in default mode) process one request at a time. A single long request blocks all shorter ones behind it — classic head-of-line blocking. Clairvoyant predicts response length at admission time and dispatches in SJF order, with a starvation timeout to protect long requests.
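
As a rough illustration of why dispatch order matters on a one-at-a-time backend, the sketch below compares the mean queueing delay under FCFS and SJF. The service times are made up, all requests are assumed to arrive together, and the lengths are assumed to be known exactly (an oracle, not the predictor).

```python
from typing import List


def mean_wait(service_times: List[float]) -> float:
    """Mean time a request waits before it starts on a serial backend."""
    waited, clock = 0.0, 0.0
    for s in service_times:
        waited += clock  # this request starts once everything ahead has finished
        clock += s
    return waited / len(service_times)


# Hypothetical service times in seconds; the long request happens to be first.
arrivals = [60.0, 2.0, 3.0, 2.0]

fcfs_wait = mean_wait(arrivals)         # serve in arrival order
sjf_wait = mean_wait(sorted(arrivals))  # serve shortest first

print(f"FCFS mean wait: {fcfs_wait:.2f} s")  # 46.75 s
print(f"SJF  mean wait: {sjf_wait:.2f} s")   # 3.25 s
```

The long request itself waits longer under SJF (7 s instead of 0 s here), which is the cost the starvation timeout is meant to bound.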

This repo contains the ONNX prediction models only. The Go sidecar proxy that uses them lives in the code repo above.

Scope:
✅ Serial/low-concurrency backends — Ollama, llama.cpp defaults, local enterprise servers
✅ Edge devices and privacy-sensitive deployments (CPU-only, no network calls)
❌ Not for vLLM, TGI, Orca, or any backend with native continuous batching (SJF adds no benefit there)

Models

Three models are provided, each trained on a different dataset. Model B is recommended for general use.

File	Training Data	Ranking Acc (in-dist)	Class Acc (in-dist)	Recommendation
`predictor.onnx` (Model A)	ShareGPT	76.3%	47.6%	General fallback
`predictor_model_b.onnx` (Model B)	LMSYS-Chat-1M	95.6%	66.8%	✅ Default
`predictor_oasst1.onnx` (Model C)	OASST1	62.2%	41.0%	Not recommended

Cross-distribution ranking accuracy for all models is 52–66%. At ~50% the predicted order is no better than random, so scheduling degrades gracefully toward FCFS rather than becoming worse than it.

Ranking accuracy (the operationally relevant metric) measures whether the model correctly orders two requests by length — what matters for SJF dispatch. Class accuracy measures exact 3-class prediction.
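
One way to compute a pairwise ranking accuracy of this kind is sketched below. The tie handling (pairs with equal true length are skipped, tied scores count as wrong) and the use of a single score per request are assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
from itertools import combinations
from typing import Sequence


def ranking_accuracy(scores: Sequence[float], true_lengths: Sequence[int]) -> float:
    """Fraction of request pairs whose predicted order matches the true order."""
    correct = total = 0
    for i, j in combinations(range(len(scores)), 2):
        if true_lengths[i] == true_lengths[j]:
            continue  # no true ordering to recover for this pair
        total += 1
        if scores[i] == scores[j]:
            continue  # tied prediction: counted as incorrect
        predicted_i_shorter = scores[i] < scores[j]
        actually_i_shorter = true_lengths[i] < true_lengths[j]
        correct += predicted_i_shorter == actually_i_shorter
    return correct / total if total else float("nan")


# Hypothetical scores (e.g. P(Long)) and true response lengths in tokens.
scores = [0.10, 0.80, 0.40, 0.30]
lengths = [50, 1200, 300, 900]
acc = ranking_accuracy(scores, lengths)
print(f"ranking accuracy: {acc:.3f}")  # 5 of 6 pairs ordered correctly
```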

Model C is included for reproducibility. Its lower accuracy means it degrades toward FCFS behaviour rather than causing harm, but Model B is strictly better for all tested distributions.

Output Classes

Class	Token range	Scheduling action
Short	< 200 tokens	Dispatch first
Medium	200–799 tokens	Dispatch in order
Long	≥ 800 tokens	Dispatch last (or defer until τ)

Token count is approximated as len(response) // 4 (BPE proxy, ±10% for English prose).
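
The labelling rule above can be sketched as follows. The function names are hypothetical, and the boundary handling (exactly 800 tokens counts as Long) follows the "≥ 800" row of the table.

```python
def approx_tokens(text: str) -> int:
    """Character-count proxy for BPE tokens: roughly 4 characters per token."""
    return len(text) // 4


def length_class(response: str) -> str:
    """Map a response to Short / Medium / Long using the thresholds in the table."""
    n = approx_tokens(response)
    if n < 200:
        return "Short"
    if n < 800:
        return "Medium"
    return "Long"


print(length_class("x" * 799))   # 199 tokens -> Short
print(length_class("x" * 800))   # 200 tokens -> Medium
print(length_class("x" * 3200))  # 800 tokens -> Long
```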

Features

19 lexical features extracted from the prompt — no tokeniser, no model call:

#	Feature	Description
1	`prompt_token_len`	Approx token count (`len // 4`)
2	`has_code_keyword`	Binary — code/function/sql/regex/etc. present
3	`has_length_constraint`	Binary — "in N words", "briefly", "tl;dr", etc.
4	`ends_with_question`	Binary — prompt ends with `?`
5	`has_format_keyword`	Binary — json/csv/yaml/table/etc. present
6	`clause_count`	Count of clauses split by punctuation
7–19	`verb_*`	One-hot over 13 instruction verbs: what, write, explain, summarize, how, list, implement, compare, describe, generate, why, define, other

Feature extraction is in feature_extractor.py.
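
To give a feel for how cheap these features are, here is a hypothetical sketch of a 19-element vector in the same layout. It is not the repo's feature_extractor.py: the keyword sets contain only the examples named in the table, and the verb rule (first matching word wins) and clause splitting are assumptions. Vectors from this sketch should not be fed to the shipped models; use extract_features for that.

```python
import re
from typing import List

VERBS = ["what", "write", "explain", "summarize", "how", "list", "implement",
         "compare", "describe", "generate", "why", "define", "other"]
CODE_WORDS = {"code", "function", "sql", "regex"}   # the table lists these, then "etc."
FORMAT_WORDS = {"json", "csv", "yaml", "table"}     # likewise incomplete
LENGTH_PATTERNS = [r"\bin \d+ words\b", r"\bbriefly\b", r"\btl;dr\b"]


def sketch_features(prompt: str) -> List[float]:
    """Illustrative 19-dim lexical vector (assumed rules, not the real extractor)."""
    text = prompt.lower()
    words = re.findall(r"[a-z']+", text)
    # Assumed rule: the first word that is a known verb wins, else "other".
    verb = next((w for w in words if w in VERBS[:-1]), "other")
    clauses = [c for c in re.split(r"[.,;:!?]+", text) if c.strip()]
    feats = [
        float(len(prompt) // 4),                                  # prompt_token_len
        float(any(w in CODE_WORDS for w in words)),               # has_code_keyword
        float(any(re.search(p, text) for p in LENGTH_PATTERNS)),  # has_length_constraint
        float(text.rstrip().endswith("?")),                       # ends_with_question
        float(any(w in FORMAT_WORDS for w in words)),             # has_format_keyword
        float(len(clauses)),                                      # clause_count
    ]
    feats += [1.0 if v == verb else 0.0 for v in VERBS]           # verb_* one-hot
    return feats


vec = sketch_features("Explain briefly how SQL joins work?")
print(len(vec), vec[:6])
```

Everything here is string matching and counting, which is consistent with the sub-millisecond overhead reported for the predictor.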

Quick Start

import onnxruntime as ort
import numpy as np
from feature_extractor import extract_features

# Load Model B (recommended)
session = ort.InferenceSession("predictor_model_b.onnx")
input_name = session.get_inputs()[0].name

prompt = "Write a React tic-tac-toe app with Redux and full unit tests."
features = extract_features(prompt)
# Output 0 is the predicted class label; output 1 holds the per-class probabilities.
probs = session.run(None, {input_name: np.array([features], dtype=np.float32)})[1][0]

p_short, p_medium, p_long = probs
print(f"P(Short)={p_short:.3f}  P(Medium)={p_medium:.3f}  P(Long)={p_long:.3f}")

# SJF dispatch key: a lower key is dispatched earlier, a higher key later
priority = p_long  # queue requests by ascending p_long
print(f"Dispatch priority key: {priority:.3f}  →  {'defer' if priority > 0.5 else 'dispatch now'}")

Starvation Timeout

To prevent long requests from waiting indefinitely, set a timeout:

tau = 3 * mean_short_latency  # measure mean_short_latency on your hardware
# When a request has waited longer than tau, promote it to front of queue
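
The promotion rule can be sketched as a small in-memory queue. This is an illustrative Python version, not the Go sidecar's implementation; the class and field names are hypothetical, and the choice to serve the oldest starved request first is an assumption.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Request:
    prompt: str
    p_long: float  # predicted P(Long), used as the SJF key
    enqueued_at: float = field(default_factory=time.monotonic)


class SJFQueue:
    """Ascending-p_long dispatch with a starvation timeout tau (seconds)."""

    def __init__(self, tau: float) -> None:
        self.tau = tau
        self._items: List[Request] = []

    def push(self, req: Request) -> None:
        self._items.append(req)

    def pop(self, now: Optional[float] = None) -> Request:
        if not self._items:
            raise IndexError("pop from empty SJFQueue")
        now = time.monotonic() if now is None else now
        starved = [r for r in self._items if now - r.enqueued_at > self.tau]
        if starved:
            # Assumed policy: the request that has waited longest jumps the queue.
            chosen = min(starved, key=lambda r: r.enqueued_at)
        else:
            chosen = min(self._items, key=lambda r: r.p_long)
        self._items.remove(chosen)
        return chosen


# Usage with explicit timestamps, assuming mean_short_latency = 2.0 s.
q = SJFQueue(tau=3 * 2.0)
q.push(Request("long essay", p_long=0.9, enqueued_at=0.0))
q.push(Request("quick fact", p_long=0.1, enqueued_at=5.0))
first = q.pop(now=5.5)   # nothing has starved yet, so the lowest p_long wins
q.push(Request("another quick one", p_long=0.1, enqueued_at=6.0))
second = q.pop(now=7.0)  # "long essay" has waited 7.0 s > tau, so it is promoted
print(first.prompt, "|", second.prompt)
```

The linear scan keeps the sketch short; a heap keyed on p_long plus a separate arrival-ordered list would avoid the O(n) pop for long queues.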

Benchmark Results

Evaluated on an RTX 4090 with Ollama (n=250 per cell, Poisson arrivals).

Steady-state (Poisson ρ = 0.74)

Metric	Value
P50 reduction (short requests)	17%
Practical ρ range	0.55 ≲ ρ ≲ 0.80
Predictor overhead	0.029 ms / request

The 17% figure is the representative real-world result at moderate load. Benefit scales with load imbalance — heavier burst workloads show larger gains.

Burst workload (mixed SHORT/LONG, high load)

Model	Class	FCFS P50	SJF P50	Reduction
Gemma3:4b	SHORT	229.5 s	69.1 s	70%
Llama3.1:8b	SHORT	158.8 s	38.0 s	76%

These represent the upper-bound case: synthetic burst workloads with heavy long-request interference.
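
The Reduction column follows directly from the two P50 columns. A short check, using only the values in the table above (the function name is illustrative):

```python
def p50_reduction(fcfs_s: float, sjf_s: float) -> float:
    """Percentage reduction in P50 latency when moving from FCFS to SJF."""
    return 100.0 * (1.0 - sjf_s / fcfs_s)


# (FCFS P50, SJF P50) in seconds for SHORT requests, copied from the table.
burst = {"Gemma3:4b": (229.5, 69.1), "Llama3.1:8b": (158.8, 38.0)}
reductions = {model: round(p50_reduction(f, s)) for model, (f, s) in burst.items()}
print(reductions)  # {'Gemma3:4b': 70, 'Llama3.1:8b': 76}
```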

Installation

pip install onnxruntime numpy

Requires Python ≥ 3.9. No GPU needed — inference runs on CPU.

To verify the model works on your machine (runs against Model A):

python test_inference.py

Note: test_inference.py validates Model A (predictor.onnx) only. Expected probability ranges in examples.json are Model A-specific — Model B will produce different (generally better) scores for the same prompts.

Limitations

English-only: the lexical features assume English prompts; non-English prompts typically match none of the listed verbs and fall into the `other` verb bucket.
Cross-distribution degradation: ranking accuracy drops to 52–66% on prompts from a different distribution. That is still above random (50%), the point at which SJF behaves like FCFS.
τ calibration required: the starvation timeout must be measured on your specific hardware; config.json provides defaults.
Serial backends only: there is no benefit with native concurrency or continuous batching.
Token approximation: len // 4 is a BPE proxy; its accuracy degrades for non-English or heavily formatted text.

Citation

If you use Clairvoyant in your work, please cite:

@article{sundaresan2026clairvoyant,
  title={Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends},
  author={Sundaresan, Aravind},
  journal={arXiv preprint arXiv:2606.07248},
  year={2026}
}

Licence

Models and code: MIT
Paper: CC BY 4.0
