| --- |
| license: apache-2.0 |
| base_model: answerdotai/ModernBERT-base |
| language: |
| - en |
| tags: |
| - text-classification |
| - llm-routing |
| - query-complexity |
| - knowledge-distillation |
| - research-computing |
| - hpc |
| pipeline_tag: text-classification |
| --- |
| |
| # LLM Query Complexity Classifier |
|
|
| Fine-tuned [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M parameters) for three-class query complexity classification: **LOW**, **MEDIUM**, or **HIGH**. |
|
|
| Built for the [STREAM](https://github.com/uicacer/STREAM) project (Smart Tiered Routing Engine for AI Models) to route queries automatically to the most cost-effective inference tier — local CPU, HPC GPU, or cloud API — at ~32 ms per query (CPU p50) with no API dependency. |
|
|
| Covers **10 domains** representing the full breadth of a research university population: hpc, mathematics, statistics_ml, physics_chemistry, engineering, life_sciences, cs_software, philosophy_ethics, social_sciences, and history_culture. |
| |
| ## What It Does |
| |
| Given a user query, the model predicts how much reasoning depth is required to answer it: |
| |
| | Label | Definition | Example | |
| |-------|------------|---------| |
| | `LOW` | Single retrievable fact. Answer statable in one sentence, no reasoning chain. | "What is the capital of France?" | |
| | `MEDIUM` | Apply an established procedure or assemble 2–4 concepts. Textbook-level reasoning. | "Explain quicksort and analyze its time complexity." | |
| | `HIGH` | Construct a novel reasoning path or expert judgment. No standard procedure. | "Is P equal to NP? Present the current state of evidence." | |
| |
| **Key design principle**: complexity is defined by *reasoning depth*, not question format. "What is X?" can be LOW, MEDIUM, or HIGH depending on what reasoning is required to answer. |
| |
| ## Usage |
| |
| ```python |
| from transformers import pipeline |
| |
| clf = pipeline( |
| "text-classification", |
| model="anasnassar/llm-query-complexity-classifier", |
| device=-1, # CPU |
| top_k=None, # return all class scores |
| ) |
|
|
| result = clf("Explain the difference between TCP and UDP") |
| # [{'label': 'MEDIUM', 'score': 0.82}, {'label': 'LOW', 'score': 0.11}, {'label': 'HIGH', 'score': 0.07}] |
|
|
| complexity = max(result[0], key=lambda x: x["score"])["label"] |
| # 'MEDIUM' |
| ``` |
| |
| ## Training |
| |
| **Knowledge distillation approach**: Claude Sonnet 4.6 labeled 6,000 queries using a reasoning-depth rubric. ModernBERT-base was fine-tuned on those labels. The result runs at ~32 ms per query (CPU p50) with no API dependency — a 5× latency reduction vs. the LLM judge baseline. |
| |
| **Training dataset**: [anasnassar/llm-query-complexity-benchmark](https://huggingface.co/datasets/anasnassar/llm-query-complexity-benchmark) — 6,000 doubly balanced queries across 10 domains × 3 complexity classes (200/domain/class hard cap; 4,800 train / 1,200 test, 80/20 stratified split, seed=42). |
| |
| **Sources**: Derived from [sentence-transformers/stackexchange-duplicates](https://huggingface.co/datasets/sentence-transformers/stackexchange-duplicates) (Apache 2.0), [cais/mmlu](https://huggingface.co/datasets/cais/mmlu) (MIT), [TIGER-Lab/MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) (MIT), and [qiaojin/PubMedQA](https://huggingface.co/datasets/qiaojin/PubMedQA) (MIT). |
| |
| **Hyperparameters**: |
| |
| | Parameter | Value | |
| |-----------|-------| |
| | Base model | answerdotai/ModernBERT-base | |
| | Epochs | 5 | |
| | Batch size | 32 | |
| | Learning rate | 2e-5 | |
| | Max sequence length | 128 tokens | |
| | Optimizer | AdamW, weight_decay=0.01 | |
| | Warmup | 10% of steps | |
| | Best model metric | macro-F1 | |
| |
| ## Evaluation |
| |
| Evaluated on a fixed 750-query held-out test set (250/class), stratified split, seed=42. |
| |
| | Metric | Value | |
| |--------|-------| |
| | Accuracy | 64.2% | |
| | Macro-F1 | 0.640 | |
| | FREE-tier retention | 85.4% | |
| | Latency p50 (CPU) | 32 ms | |
| |
| **Per-class recall (Wilson 95% CI):** |
| |
| | Class | Recall | 95% CI | |
| |-------|--------|--------| |
| | LOW | 70.8% | [66.1%, 75.0%] | |
| | MEDIUM | 49.3% | [44.4%, 54.1%] | |
| | HIGH | 72.5% | [67.9%, 76.7%] | |
| |
| ## Judge Comparison |
| |
| | Judge | Latency p50 | Accuracy | Macro-F1 | API dependency | |
| |-------|-------------|----------|----------|----------------| |
| | ModernBERT (this model) | 32 ms | 64.2% | 0.640 | None | |
| | Llama 3.2 3B (LLM judge) | 164 ms | 49.0% | 0.436 | Ollama | |
| |
| ## Threshold-Tunable Routing |
| |
| Rather than a fixed argmax decision, STREAM exposes a tunable threshold θ ∈ [0,1]. A query is routed to cloud when `P(HIGH) ≥ θ`; otherwise to HPC or local. As θ increases, cloud spend drops but HIGH recall decreases — a continuous precision-recall-cost tradeoff. |
| |
| **Budget-aware adaptive routing** automatically raises θ as cloud spend approaches the monthly budget cap: |
| |
| ``` |
| θ_eff(t) = max(θ_base, S(t)/B) |
| ``` |
| |
| where S(t) is cumulative spend and B is the monthly budget. |
| |
| ## Integration in STREAM |
| |
| ```python |
| from stream.middleware.core.complexity_judge import judge_complexity |
|
|
| result = judge_complexity("Explain quantum entanglement", strategy="modernbert") |
| # JudgmentResult(complexity='medium', method='classifier', strategy_used='modernbert', |
| # scores={'low': 0.08, 'medium': 0.79, 'high': 0.13}) |
| ``` |
| |
| ## Citation |
| |
| ```bibtex |
| @inproceedings{nassar2026stream, |
| title = {{STREAM}: Multi-Tier {LLM} Inference Middleware with Dual-Channel {HPC} Token Streaming}, |
| author = {Nassar, Anas and Mohr, Steve and Apanasevich, Leonard and Sharma, Himanshu}, |
| booktitle = {Practice and Experience in Advanced Research Computing (PEARC '26)}, |
| year = {2026}, |
| doi = {10.1145/3785462.3815847} |
| } |
|
|
| @misc{nassar2026benchmark, |
| author = {Nassar, Anas}, |
| title = {{LLM} Query Complexity Benchmark}, |
| year = {2026}, |
| publisher = {Hugging Face}, |
| url = {https://huggingface.co/datasets/anasnassar/llm-query-complexity-benchmark} |
| } |
|
|
| % Original source datasets |
| @article{hendrycks2021mmlu, |
| author = {Dan Hendrycks and others}, |
| title = {Measuring Massive Multitask Language Understanding}, |
| journal = {ICLR}, |
| year = {2021}, |
| url = {https://huggingface.co/datasets/cais/mmlu} |
| } |
|
|
| @article{wang2024mmlupro, |
| author = {Yubo Wang and others}, |
| title = {{MMLU-Pro}: A More Robust and Challenging Multi-Task Language Understanding Benchmark}, |
| journal = {arXiv:2406.01574}, |
| year = {2024}, |
| url = {https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro} |
| } |
|
|
| @inproceedings{jin2019pubmedqa, |
| author = {Qiao Jin and others}, |
| title = {{PubMedQA}: A Biomedical Research Question Answering Dataset}, |
| booktitle = {EMNLP}, |
| year = {2019}, |
| url = {https://huggingface.co/datasets/qiaojin/PubMedQA} |
| } |
|
|
| @misc{stackexchange_dataset, |
| author = {Reimers, Nils and Gurevych, Iryna}, |
| title = {{StackExchange} Duplicate Questions Dataset}, |
| year = {2019}, |
| url = {https://huggingface.co/datasets/sentence-transformers/stackexchange-duplicates} |
| } |
| ``` |
| |
| ## License |
| |
| Apache 2.0 |
| |