LLM Query Complexity Classifier
Fine-tuned ModernBERT-base (149M parameters) for three-class query complexity classification: LOW, MEDIUM, or HIGH.
Built for the STREAM project (Smart Tiered Routing Engine for AI Models) to route queries automatically to the most cost-effective inference tier (local CPU, HPC GPU, or cloud API) at ~15ms per query with no API dependency.
What It Does
Given a user query, the model predicts how much reasoning depth is required to answer it:
| Label | Definition | Example |
|---|---|---|
| LOW | Single retrievable fact. Answer statable in one sentence, no reasoning chain. | "What is the capital of France?" |
| MEDIUM | Apply an established procedure or assemble 2 to 4 concepts. Textbook-level reasoning. | "Explain quicksort and analyze its time complexity." |
| HIGH | Construct a novel reasoning path or expert judgment. No standard procedure. | "Is P equal to NP? Present the current state of evidence." |
Key design principle: complexity is defined by reasoning depth, not question format. "What is X?" can be LOW, MEDIUM, or HIGH depending on what reasoning is required to answer.
Usage
from transformers import pipeline
clf = pipeline(
"text-classification",
model="anasnassar/llm-query-complexity-classifier",
device=-1, # CPU
top_k=None, # return all class scores
)
result = clf("Explain the difference between TCP and UDP")
# [[{'label': 'MEDIUM', 'score': 0.82}, {'label': 'LOW', 'score': 0.11}, {'label': 'HIGH', 'score': 0.07}]]
complexity = max(result[0], key=lambda x: x["score"])["label"]
# 'MEDIUM'
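To connect the prediction to the routing goal stated at the top of this card, here is a minimal sketch of turning class scores into a tier choice. The label-to-tier mapping, the `TIER_BY_LABEL` table, the `route` function, and the `min_confidence` escalation rule are all illustrative assumptions; the card says only that queries are routed among local CPU, HPC GPU, and cloud API tiers.

```python
# Hypothetical routing sketch. The mapping below is an assumption,
# not STREAM's documented policy.
TIER_BY_LABEL = {
    "LOW": "local-cpu",
    "MEDIUM": "hpc-gpu",
    "HIGH": "cloud-api",
}
ESCALATION_ORDER = ["LOW", "MEDIUM", "HIGH"]


def route(scores, min_confidence=0.5):
    """Pick a tier from a list of {'label', 'score'} dicts.

    If the top score is below min_confidence, escalate one level so that
    uncertain queries go to a more capable tier (an assumed policy).
    """
    best = max(scores, key=lambda s: s["score"])
    label = best["label"]
    if best["score"] < min_confidence:
        idx = ESCALATION_ORDER.index(label)
        label = ESCALATION_ORDER[min(idx + 1, len(ESCALATION_ORDER) - 1)]
    return TIER_BY_LABEL[label]


scores = [
    {"label": "MEDIUM", "score": 0.82},
    {"label": "LOW", "score": 0.11},
    {"label": "HIGH", "score": 0.07},
]
print(route(scores))  # hpc-gpu
```

Escalating on low confidence trades some cost for safety: a misrouted HIGH query on a small local model is usually worse than a LOW query sent to a larger tier.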
Training
Knowledge distillation approach: Claude Sonnet 4.6 (with extended thinking) labeled 6,912 queries across 6 domains and 3 complexity classes. ModernBERT-base was then fine-tuned on those labels. This is LLM-supervised fine-tuning: Claude generates hard labels, and ModernBERT learns from them. The resulting classifier runs at ~15ms per query with no API dependency.
Training dataset: anasnassar/llm-query-complexity-benchmark (6,912 queries, 6 domains, balanced across complexity classes).
Hyperparameters:
| Parameter | Value |
|---|---|
| Base model | answerdotai/ModernBERT-base |
| Epochs | 5 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Max sequence length | 128 tokens |
| Optimizer | AdamW, weight_decay=0.01 |
| Warmup | 10% of steps |
| Best model metric | macro-F1 |
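Since checkpoints are selected by macro-F1, it may help to see what that metric computes. The sketch below is a plain-Python illustration of the standard definition (the unweighted mean of per-class F1), not the project's evaluation code; the `macro_f1` helper and the toy labels are made up for the example.

```python
def macro_f1(y_true, y_pred, labels=("LOW", "MEDIUM", "HIGH")):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treat an empty class as 0.0
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels, invented for illustration only.
y_true = ["LOW", "LOW", "MEDIUM", "MEDIUM", "HIGH", "HIGH"]
y_pred = ["LOW", "MEDIUM", "MEDIUM", "MEDIUM", "HIGH", "LOW"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.656
```

Because every class contributes equally, a model that does well on LOW and MEDIUM but poorly on HIGH is penalized more than plain accuracy would suggest.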
Evaluation
Three evaluation strategies are used to address data leakage from LLM-generated near-duplicates:
| Strategy | Description |
|---|---|
| Domain-held-out 6-fold CV | Train on 5 domains, test on the 6th. Primary reported metric. |
| Similarity-aware split | Near-duplicate queries (cosine sim > 0.90) kept on the same side of the split. |
| Real-world (LMSYS Arena) | Evaluated on real user prompts from Chatbot Arena, which are fully out-of-distribution. |
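The domain-held-out strategy in the table can be sketched as a leave-one-domain-out generator. This is a hedged illustration only: the `domain_held_out_folds` helper, the record layout, and the placeholder domain names are assumptions, and the actual implementation lives in the project's evaluation scripts.

```python
from collections import defaultdict


def domain_held_out_folds(examples):
    """Yield (held_out_domain, train, test), holding out one domain per fold."""
    by_domain = defaultdict(list)
    for ex in examples:
        by_domain[ex["domain"]].append(ex)
    for held_out in sorted(by_domain):
        test = by_domain[held_out]
        train = [
            ex
            for domain, items in by_domain.items()
            if domain != held_out
            for ex in items
        ]
        yield held_out, train, test


# Toy data with placeholder domain names (the six real domains are not listed in this card).
examples = [
    {"query": f"q{i}", "domain": f"domain_{i % 6}", "label": "LOW"}
    for i in range(12)
]
folds = list(domain_held_out_folds(examples))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

With six domains this yields six folds, and no fold's test domain ever appears in its training data.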
Note: Random train/test split on LLM-generated data yields inflated accuracy (~99%) due to near-duplicate phrasings. Domain-held-out and real-world numbers are the rigorous metrics.
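One way to keep near-duplicates on the same side of a split is to cluster queries by cosine similarity and then assign whole clusters to train or test. The sketch below assumes query embeddings are already available as plain vectors; the embedding model, the `near_duplicate_groups` and `group_split` helpers, and the toy 2-D vectors are hypothetical. Only the 0.90 threshold comes from this card.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def near_duplicate_groups(vectors, threshold=0.90):
    """Union-find clustering: items with cosine similarity > threshold share a group."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Pairwise comparison is O(n^2); fine for a sketch, slow for large sets.
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(vectors))]


def group_split(groups, test_fraction=0.2):
    """Move whole groups to the test side until it holds about test_fraction of items."""
    target = test_fraction * len(groups)
    test_groups, test_size = set(), 0
    for g in sorted(set(groups)):
        if test_size >= target:
            break
        test_groups.add(g)
        test_size += groups.count(g)
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Toy 2-D "embeddings": items 0/1 and 2/3 are near-duplicates, item 4 stands alone.
vectors = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0), (0.05, 0.99), (0.7, 0.7)]
groups = near_duplicate_groups(vectors)
train_idx, test_idx = group_split(groups, test_fraction=0.4)
print(train_idx, test_idx)
```

Splitting by group rather than by row is what prevents a paraphrase of a training query from leaking into the test set.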
Full evaluation code: scripts/eval/
Performance
| Judge | Latency (p50) | Notes |
|---|---|---|
| ModernBERT (this model) | ~15ms | CPU, no API dependency |
| Llama 3.2 3B (LLM judge) | ~390ms | Requires Ollama |
26× latency reduction vs. the LLM judge baseline.
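The p50 figures in the table are medians, and the 26× figure is the ratio of the two (390 / 15). A minimal sketch of how such numbers could be measured follows; the `p50_latency_ms` and `speedup` helpers and the stand-in workload are assumptions, not the benchmark harness used for this card.

```python
import statistics
import time


def p50_latency_ms(fn, inputs, warmup=3):
    """Median wall-clock latency of fn over inputs, in milliseconds."""
    for x in inputs[:warmup]:
        fn(x)  # warm caches before timing
    timings = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Stand-in workload; swap in the classifier call to time the real model.
def dummy(text):
    return len(text.split())


queries = ["What is the capital of France?"] * 20
latency = p50_latency_ms(dummy, queries)
print(speedup(390, 15))  # 26.0
```

The median is less sensitive than the mean to occasional slow calls, which is why p50 is a common headline latency figure.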
Integration in STREAM
from stream.middleware.core.complexity_judge import judge_complexity
result = judge_complexity("Explain quantum entanglement", strategy="modernbert")
# JudgmentResult(complexity='medium', method='classifier', strategy_used='modernbert',
# scores={'low': 0.08, 'medium': 0.79, 'high': 0.13})
Citation
@inproceedings{nassar2026stream,
title = {{STREAM}: Multi-Tier {LLM} Inference Middleware with Dual-Channel {HPC} Token Streaming},
author = {Nassar, Anas and Mohr, Steve and Apanasevich, Leonard and Sharma, Himanshu},
booktitle = {Practice and Experience in Advanced Research Computing (PEARC '26)},
year = {2026}
}
License
Apache 2.0