LLM Query Complexity Classifier

Fine-tuned ModernBERT-base (149M parameters) for three-class query complexity classification: LOW, MEDIUM, or HIGH.

Built for the STREAM project (Smart Tiered Routing Engine for AI Models) to route queries automatically to the most cost-effective inference tier (local CPU, HPC GPU, or cloud API) at ~15ms per query with no API dependency.

What It Does

Given a user query, the model predicts how much reasoning depth is required to answer it:

| Label | Definition | Example |
|---|---|---|
| LOW | Single retrievable fact. Answer statable in one sentence, no reasoning chain. | "What is the capital of France?" |
| MEDIUM | Apply an established procedure or assemble 2–4 concepts. Textbook-level reasoning. | "Explain quicksort and analyze its time complexity." |
| HIGH | Construct a novel reasoning path or expert judgment. No standard procedure. | "Is P equal to NP? Present the current state of evidence." |

Key design principle: complexity is defined by reasoning depth, not question format. "What is X?" can be LOW, MEDIUM, or HIGH depending on what reasoning is required to answer.

Usage

from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="anasnassar/llm-query-complexity-classifier",
    device=-1,      # CPU
    top_k=None,     # return all class scores
)

result = clf("Explain the difference between TCP and UDP")
# [{'label': 'MEDIUM', 'score': 0.82}, {'label': 'LOW', 'score': 0.11}, {'label': 'HIGH', 'score': 0.07}]

complexity = max(result[0], key=lambda x: x["score"])["label"]
# 'MEDIUM'
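
As a rough illustration of how the predicted label could drive tier selection, the sketch below maps each class to a tier and escalates when the top score is low. The tier names, the label-to-tier mapping, and the `min_confidence` threshold are all assumptions made for this example; they are not taken from the STREAM codebase. The `scores` argument is the per-query score list (`result[0]` in the usage snippet above).

```python
# Hypothetical tier names and label-to-tier mapping (assumptions, not STREAM's actual config).
TIER_FOR_LABEL = {
    "LOW": "local-cpu",
    "MEDIUM": "hpc-gpu",
    "HIGH": "cloud-api",
}


def top_label(scores):
    """Return the label with the highest score from a list of {'label', 'score'} dicts."""
    return max(scores, key=lambda s: s["score"])["label"]


def route(scores, min_confidence=0.5):
    """Pick a tier for one query.

    If the best score falls below min_confidence (an assumed threshold),
    fall back to the most capable tier rather than risk under-serving the query.
    """
    best = max(scores, key=lambda s: s["score"])
    if best["score"] < min_confidence:
        return TIER_FOR_LABEL["HIGH"]
    return TIER_FOR_LABEL[best["label"]]


# Same shape as the classifier output shown above (scores are illustrative).
example = [
    {"label": "MEDIUM", "score": 0.82},
    {"label": "LOW", "score": 0.11},
    {"label": "HIGH", "score": 0.07},
]
print(route(example))  # hpc-gpu
```

Escalating on low confidence trades some cost for safety; a deployment that prefers the cheaper tier under uncertainty could invert that fallback.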

Training

Knowledge distillation approach: Claude Sonnet 4.6 (with extended thinking) labeled 6,912 queries across 6 domains and 3 complexity classes. ModernBERT-base was then fine-tuned on those labels. This is LLM-supervised fine-tuning: Claude generates hard labels; ModernBERT learns from them. The result runs at ~15ms per query with no API dependency.

Training dataset: anasnassar/llm-query-complexity-benchmark (6,912 queries, 6 domains, balanced across complexity classes).

Hyperparameters:

| Parameter | Value |
|---|---|
| Base model | answerdotai/ModernBERT-base |
| Epochs | 5 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Max sequence length | 128 tokens |
| Optimizer | AdamW, weight_decay=0.01 |
| Warmup | 10% of steps |
| Best model metric | macro-F1 |
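
To make "10% of steps" concrete, the sketch below derives the step counts implied by these hyperparameters. It is an illustration only: `FineTuneConfig` and `schedule` are hypothetical names, and the calculation assumes a single device, no gradient accumulation, and (purely for the arithmetic) that all 6,912 queries are used for training. The actual training split size is not stated here, so real step counts would be smaller.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters from the table above (hypothetical container class)."""
    base_model: str = "answerdotai/ModernBERT-base"
    epochs: int = 5
    batch_size: int = 32
    learning_rate: float = 2e-5
    max_seq_length: int = 128
    weight_decay: float = 0.01
    warmup_percent: int = 10  # warmup as a percentage of total optimizer steps


def schedule(cfg, n_train_examples):
    """Return (steps_per_epoch, total_steps, warmup_steps).

    Assumes one optimizer step per batch (no gradient accumulation, one device).
    """
    steps_per_epoch = math.ceil(n_train_examples / cfg.batch_size)
    total_steps = steps_per_epoch * cfg.epochs
    warmup_steps = total_steps * cfg.warmup_percent // 100
    return steps_per_epoch, total_steps, warmup_steps


cfg = FineTuneConfig()
# Illustrative upper bound: pretend every query in the dataset is a training example.
print(schedule(cfg, 6912))  # (216, 1080, 108)
```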

Evaluation

Three evaluation strategies are used to address data leakage from LLM-generated near-duplicates:

| Strategy | Description |
|---|---|
| Domain-held-out 6-fold CV | Train on 5 domains, test on 6th. Primary reported metric. |
| Similarity-aware split | Near-duplicate queries (cosine sim > 0.90) kept on same side of split. |
| Real-world (LMSYS Arena) | Evaluated on real user prompts from Chatbot Arena; fully out-of-distribution. |
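
The domain-held-out strategy can be sketched as a leave-one-domain-out fold generator. This is a minimal illustration, not the code in scripts/eval/; the function name and the toy domain names are placeholders.

```python
from collections import defaultdict


def domain_held_out_folds(domains):
    """Yield (held_out_domain, train_indices, test_indices), one fold per domain.

    `domains` is a list giving the domain of each example, in dataset order.
    """
    by_domain = defaultdict(list)
    for idx, domain in enumerate(domains):
        by_domain[domain].append(idx)

    for held_out in sorted(by_domain):
        test_idx = by_domain[held_out]
        train_idx = sorted(
            i for d, idxs in by_domain.items() if d != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Toy example with three placeholder domains; the real benchmark has six.
domains = ["math", "code", "math", "law", "code", "law"]
for held_out, train_idx, test_idx in domain_held_out_folds(domains):
    print(held_out, train_idx, test_idx)
```

With six domains this yields six folds, and no query from the test domain is ever seen during training for that fold.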

Note: A random train/test split on LLM-generated data yields inflated accuracy (~99%) because near-duplicate phrasings end up on both sides of the split. Domain-held-out and real-world numbers are the rigorous metrics.
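
One way to keep near-duplicates on the same side of a split is to group queries whose embeddings exceed the 0.90 cosine threshold and then split by group. The sketch below uses a small union-find over pairwise similarities. It is an assumption about how such a split could be built, not the project's actual implementation, and the embedding model is left unspecified (toy 2-D vectors stand in for real embeddings).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def near_duplicate_groups(embeddings, threshold=0.90):
    """Return one group id per item; items linked by similarity > threshold share an id.

    Linking is transitive (single linkage): if A~B and B~C, all three share a group.
    Pairwise comparison is O(n^2), which is fine for a sketch but slow at scale.
    """
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) > threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(embeddings))]


def split_by_group(group_ids, test_groups):
    """Assign whole groups to train or test so duplicates never straddle the split."""
    train = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train, test


# Toy vectors: items 0 and 1 are near-duplicates, as are items 2 and 3.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]]
groups = near_duplicate_groups(vectors)
train_idx, test_idx = split_by_group(groups, test_groups={groups[2]})
print(train_idx, test_idx)  # [0, 1] [2, 3]
```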

Full evaluation code: scripts/eval/

Performance

| Judge | Latency (p50) | Notes |
|---|---|---|
| ModernBERT (this model) | ~15ms | CPU, no API dependency |
| Llama 3.2 3B (LLM judge) | ~390ms | Requires Ollama |

26× latency reduction vs. the LLM judge baseline.
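
For readers who want to reproduce a p50 figure on their own hardware, a median-latency measurement can be sketched as follows. The helper names are hypothetical, the benchmark methodology behind the numbers above is not described here, and `fn` stands for any single-query callable (for example, the `clf` pipeline from the Usage section).

```python
import statistics
import time


def p50_latency_ms(fn, inputs, warmup=3):
    """Median wall-clock latency of fn(x) in milliseconds over `inputs`.

    The first `warmup` inputs are run untimed so one-off costs
    (lazy initialization, caches) do not skew the median.
    """
    for x in inputs[:warmup]:
        fn(x)

    timings = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# The ratio of the two p50 values reported in the table above.
print(speedup(390, 15))  # 26.0
```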

Integration in STREAM

from stream.middleware.core.complexity_judge import judge_complexity

result = judge_complexity("Explain quantum entanglement", strategy="modernbert")
# JudgmentResult(complexity='medium', method='classifier', strategy_used='modernbert',
#                scores={'low': 0.08, 'medium': 0.79, 'high': 0.13})

Citation

@inproceedings{nassar2026stream,
  title     = {{STREAM}: Multi-Tier {LLM} Inference Middleware with Dual-Channel {HPC} Token Streaming},
  author    = {Nassar, Anas and Mohr, Steve and Apanasevich, Leonard and Sharma, Himanshu},
  booktitle = {Practice and Experience in Advanced Research Computing (PEARC '26)},
  year      = {2026}
}

License

Apache 2.0
