Clairvoyant: LLM Response Length Predictor (ONNX)

XGBoost classifier that predicts whether an LLM will produce a short, medium, or long response from the prompt alone, in 0.029 ms, with no tokeniser required. Designed to power Shortest-Job-First (SJF) scheduling in serial LLM backends to reduce head-of-line blocking.

Paper: Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends
Code: github.com/Aravind0403/clairvoyant-scheduler


What This Is

Serial LLM backends (Ollama, llama.cpp in default mode) process one request at a time. A single long request blocks all shorter ones behind it: classic head-of-line blocking. Clairvoyant predicts response length at admission time and dispatches in SJF order, with a starvation timeout to protect long requests.

This repo contains the ONNX prediction models only. The Go sidecar proxy that uses them lives in the code repo above.

Scope:
✅ Serial/low-concurrency backends: Ollama, llama.cpp defaults, local enterprise servers
✅ Edge devices and privacy-sensitive deployments (CPU-only, no network calls)
❌ Not for vLLM, TGI, Orca, or any backend with native continuous batching (SJF adds no benefit there)


Models

Three models are provided, each trained on a different dataset. Model B is recommended for general use.

File | Training Data | Ranking Acc (in-dist) | Class Acc (in-dist) | Recommended
--- | --- | --- | --- | ---
predictor.onnx (Model A) | ShareGPT | 76.3% | 47.6% | General fallback
predictor_model_b.onnx (Model B) | LMSYS-Chat-1M | 95.6% | 66.8% | ✅ Default
predictor_oasst1.onnx (Model C) | OASST1 | 62.2% | 41.0% | Not recommended

Cross-distribution ranking accuracy for all models: 52–66% (graceful degradation toward FCFS at ~50%).

Ranking accuracy (the operationally relevant metric) measures whether the model correctly orders two requests by length, which is what matters for SJF dispatch. Class accuracy measures exact 3-class prediction.
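
As a rough illustration of the pairwise idea behind ranking accuracy, the sketch below counts how many request pairs are ordered correctly by a predicted score such as P(Long). The function name, the skipping of pairs with equal true length, and the half-credit for tied scores are assumptions made for this example; the paper's exact evaluation protocol may differ.

```python
from itertools import combinations


def ranking_accuracy(true_lengths, predicted_scores):
    """Fraction of request pairs whose predicted order matches the true order.

    Assumptions (not taken from the paper): pairs with equal true length are
    skipped, and pairs with equal predicted score count as half correct.
    """
    correct = 0.0
    total = 0
    for i, j in combinations(range(len(true_lengths)), 2):
        if true_lengths[i] == true_lengths[j]:
            continue  # no meaningful order to recover
        total += 1
        true_order = true_lengths[i] < true_lengths[j]
        if predicted_scores[i] == predicted_scores[j]:
            correct += 0.5  # a tie is no better than a coin flip
        elif (predicted_scores[i] < predicted_scores[j]) == true_order:
            correct += 1.0
    return correct / total if total else float("nan")


# Three requests with true response lengths and predicted P(Long) scores
print(ranking_accuracy([100, 500, 900], [0.1, 0.4, 0.8]))
```

Under this definition a predictor that carries no information scores about 0.5, which is why accuracy near 50% corresponds to FCFS-like behaviour.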

Model C is included for reproducibility. Its lower accuracy means it degrades toward FCFS behaviour rather than causing harm, but Model B is strictly better for all tested distributions.


Output Classes

Class | Token range | Scheduling action
--- | --- | ---
Short | < 200 tokens | Dispatch first
Medium | 200–799 tokens | Dispatch in order
Long | ≥ 800 tokens | Dispatch last (or defer until τ)

Token count is approximated as len(response) // 4 (BPE proxy, ±10% for English prose).
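
The class boundaries and the len // 4 approximation can be combined into a small labelling helper. This is a minimal sketch based only on the thresholds stated above; the function names are hypothetical and are not taken from the code repo.

```python
def approx_token_count(text: str) -> int:
    """Approximate the token count as one token per four characters."""
    return len(text) // 4


def length_class(response: str) -> str:
    """Map a response to Short / Medium / Long using the documented thresholds."""
    tokens = approx_token_count(response)
    if tokens < 200:
        return "Short"
    if tokens < 800:
        return "Medium"
    return "Long"


# A 1,000-character response is roughly 250 tokens, so it falls in Medium
print(length_class("x" * 1000))
```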


Features

19 lexical features extracted from the prompt, with no tokeniser and no model call:

# | Feature | Description
--- | --- | ---
1 | prompt_token_len | Approx token count (len // 4)
2 | has_code_keyword | Binary: code/function/sql/regex/etc. present
3 | has_length_constraint | Binary: "in N words", "briefly", "tl;dr", etc.
4 | ends_with_question | Binary: prompt ends with ?
5 | has_format_keyword | Binary: json/csv/yaml/table/etc. present
6 | clause_count | Count of clauses split by punctuation
7–19 | verb_* | One-hot over 13 instruction verbs: what, write, explain, summarize, how, list, implement, compare, describe, generate, why, define, other

Feature extraction is in feature_extractor.py.
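
To give a feel for what lexical features of this kind look like, here is an illustrative sketch that builds a 19-element vector in the order listed above. It is not the repo's feature_extractor.py: the keyword sets, the length-constraint regex, the clause-splitting rule, and the "first matching verb" rule are all assumptions, so vectors from this sketch should not be fed to the published models.

```python
import re

# Verb order follows the feature list; "other" is the fallback bucket.
VERBS = ["what", "write", "explain", "summarize", "how", "list", "implement",
         "compare", "describe", "generate", "why", "define", "other"]
CODE_WORDS = {"code", "function", "sql", "regex"}   # assumed subset
FORMAT_WORDS = {"json", "csv", "yaml", "table"}     # assumed subset
LENGTH_PATTERN = re.compile(r"\bin \d+ words\b|\bbriefly\b|\btl;dr", re.IGNORECASE)


def sketch_features(prompt: str) -> list[float]:
    """Return 6 scalar features followed by a 13-way one-hot verb encoding."""
    words = re.findall(r"[a-z']+", prompt.lower())
    word_set = set(words)
    # Assumed rule: the first known instruction verb in the prompt wins.
    first_verb = next((w for w in words if w in VERBS[:-1]), "other")
    clauses = [c for c in re.split(r"[.,;:!?]+", prompt) if c.strip()]
    base = [
        float(len(prompt) // 4),                     # prompt_token_len
        float(bool(word_set & CODE_WORDS)),          # has_code_keyword
        float(bool(LENGTH_PATTERN.search(prompt))),  # has_length_constraint
        float(prompt.rstrip().endswith("?")),        # ends_with_question
        float(bool(word_set & FORMAT_WORDS)),        # has_format_keyword
        float(len(clauses)),                         # clause_count
    ]
    one_hot = [1.0 if v == first_verb else 0.0 for v in VERBS]
    return base + one_hot


features = sketch_features("Explain briefly what JSON is?")
print(len(features), features[:6])
```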


Quick Start

import onnxruntime as ort
import numpy as np
from feature_extractor import extract_features

# Load Model B (recommended)
session = ort.InferenceSession("predictor_model_b.onnx")
input_name = session.get_inputs()[0].name

prompt = "Write a React tic-tac-toe app with Redux and full unit tests."
features = extract_features(prompt)
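# Output index 1 is read as the class probabilities; [0] selects the single row in the batch
outputs = session.run(None, {input_name: np.array([features], dtype=np.float32)})
probs = outputs[1][0]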

p_short, p_medium, p_long = probs
print(f"P(Short)={p_short:.3f}  P(Medium)={p_medium:.3f}  P(Long)={p_long:.3f}")

# SJF dispatch key: a smaller key is dispatched earlier, a larger key is dispatched later
priority = p_long  # queue requests by ascending p_long
print(f"Dispatch priority key: {priority:.3f}  →  {'defer' if priority > 0.5 else 'dispatch now'}")

Starvation Timeout

To prevent long requests from waiting indefinitely, set a timeout:

tau = 3 * mean_short_latency  # measure mean_short_latency on your hardware
# When a request has waited longer than tau, promote it to front of queue
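
The dispatch key and the timeout can be combined in a single queue. The following is a minimal sketch, assuming a heap ordered by p_long and a linear scan for starved requests on each pop; the class name SJFQueue and its methods are hypothetical and do not describe how the Go sidecar is implemented. Timestamps are passed in explicitly so the behaviour is easy to test.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class _Entry:
    p_long: float                            # SJF key: smaller is dispatched earlier
    seq: int                                 # tie-breaker that preserves arrival order
    arrival: float = field(compare=False)
    request: object = field(compare=False)


class SJFQueue:
    """Dispatch by ascending p_long, promoting requests that waited longer than tau."""

    def __init__(self, tau: float):
        self.tau = tau
        self._heap = []
        self._seq = itertools.count()

    def push(self, request, p_long: float, now: float) -> None:
        heapq.heappush(self._heap, _Entry(p_long, next(self._seq), now, request))

    def pop(self, now: float):
        if not self._heap:
            raise IndexError("pop from empty queue")
        starved = [e for e in self._heap if now - e.arrival > self.tau]
        if starved:
            chosen = min(starved, key=lambda e: e.arrival)  # oldest starved request first
            self._heap.remove(chosen)
            heapq.heapify(self._heap)                       # restore the heap invariant
        else:
            chosen = heapq.heappop(self._heap)
        return chosen.request


q = SJFQueue(tau=10.0)
q.push("long", 0.9, now=0.0)
q.push("short", 0.1, now=1.0)
print(q.pop(now=2.0))   # the short request goes first
print(q.pop(now=3.0))   # then the long one
```

The scan over the heap is O(n) per pop, which should be acceptable for the short queues typical of a serial backend; a second heap keyed on arrival time would avoid it for larger queues.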

Benchmark Results

Evaluated on an RTX 4090 with Ollama (n=250 per cell, Poisson arrivals).

Steady-state (Poisson ρ = 0.74)

Metric | Value
--- | ---
P50 reduction (short requests) | 17%
Practical ρ range | 0.55 ≲ ρ ≲ 0.80
Predictor overhead | 0.029 ms / request

The 17% figure is the representative real-world result at moderate load. The benefit grows with load imbalance: heavier burst workloads show larger gains.

Burst workload (mixed SHORT/LONG, high load)

Model | Class | FCFS P50 | SJF P50 | Reduction
--- | --- | --- | --- | ---
Gemma3:4b | SHORT | 229.5 s | 69.1 s | 70%
Llama3.1:8b | SHORT | 158.8 s | 38.0 s | 76%

These represent the upper-bound case: synthetic burst workloads with heavy long-request interference.
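
The Reduction column follows directly from the two P50 values. A short check, assuming reduction is defined as (FCFS − SJF) / FCFS (the helper name is hypothetical):

```python
def p50_reduction(fcfs_p50: float, sjf_p50: float) -> float:
    """Percentage drop in median latency when moving from FCFS to SJF."""
    return 100.0 * (fcfs_p50 - sjf_p50) / fcfs_p50


# P50 latencies in seconds, taken from the burst workload table
for model, fcfs, sjf in [("Gemma3:4b", 229.5, 69.1), ("Llama3.1:8b", 158.8, 38.0)]:
    print(f"{model}: {p50_reduction(fcfs, sjf):.1f}% reduction")
```

Rounded to whole percentages, this gives the 70% and 76% figures reported in the table.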


Installation

pip install onnxruntime numpy

Requires Python ≥ 3.9. No GPU needed; inference runs on CPU.

To verify the model works on your machine (runs against Model A):

python test_inference.py

Note: test_inference.py validates Model A (predictor.onnx) only. Expected probability ranges in examples.json are Model A-specific; Model B will produce different (generally better) scores for the same prompts.


Limitations

  • English-only: lexical features assume English prompts; non-English prompts fall back to the "other" verb bucket
  • Cross-distribution degradation: ranking accuracy drops to 52–66% on out-of-distribution prompts. This is still above random (50%), and as accuracy approaches 50% scheduling degrades toward FCFS
  • τ calibration required: the starvation timeout must be measured on your specific hardware; config.json provides defaults
  • Serial backends only: no benefit with native concurrency or continuous batching
  • Token approximation: len // 4 is a BPE proxy; accuracy degrades for non-English or heavily formatted text

Citation

If you use Clairvoyant in your work, please cite:

@article{sundaresan2026clairvoyant,
  title={Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends},
  author={Sundaresan, Aravind},
  journal={arXiv preprint arXiv:2606.07248},
  year={2026}
}

Licence

Models and code: MIT
Paper: CC BY 4.0
