Update model card

01bce42 verified 3 days ago

6.87 kB

	---
	license: apache-2.0
	base_model: answerdotai/ModernBERT-base
	language:
	- en
	tags:
	- text-classification
	- llm-routing
	- query-complexity
	- knowledge-distillation
	- research-computing
	- hpc
	pipeline_tag: text-classification
	---

	# LLM Query Complexity Classifier

	Fine-tuned [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M parameters) for three-class query complexity classification: LOW, MEDIUM, or HIGH.

	Built for the [STREAM](https://github.com/uicacer/STREAM) project (Smart Tiered Routing Engine for AI Models) to route queries automatically to the most cost-effective inference tier — local CPU, HPC GPU, or cloud API — at ~32 ms per query (CPU p50) with no API dependency.

	Covers 10 domains representing the full breadth of a research university population: hpc, mathematics, statistics_ml, physics_chemistry, engineering, life_sciences, cs_software, philosophy_ethics, social_sciences, and history_culture.

	## What It Does

	Given a user query, the model predicts how much reasoning depth is required to answer it:

	\| Label \| Definition \| Example \|
	\|-------\|------------\|---------\|
	\| `LOW` \| Single retrievable fact. Answer statable in one sentence, no reasoning chain. \| "What is the capital of France?" \|
	\| `MEDIUM` \| Apply an established procedure or assemble 2–4 concepts. Textbook-level reasoning. \| "Explain quicksort and analyze its time complexity." \|
	\| `HIGH` \| Construct a novel reasoning path or expert judgment. No standard procedure. \| "Is P equal to NP? Present the current state of evidence." \|

	Key design principle: complexity is defined by reasoning depth, not question format. "What is X?" can be LOW, MEDIUM, or HIGH depending on what reasoning is required to answer.

	## Usage

	```python
	from transformers import pipeline

	clf = pipeline(
	"text-classification",
	model="anasnassar/llm-query-complexity-classifier",
	device=-1, # CPU
	top_k=None, # return all class scores
	)

	result = clf("Explain the difference between TCP and UDP")
	# [{'label': 'MEDIUM', 'score': 0.82}, {'label': 'LOW', 'score': 0.11}, {'label': 'HIGH', 'score': 0.07}]

	complexity = max(result[0], key=lambda x: x["score"])["label"]
	# 'MEDIUM'
	```

	## Training

	Knowledge distillation approach: Claude Sonnet 4.6 labeled 6,000 queries using a reasoning-depth rubric. ModernBERT-base was fine-tuned on those labels. The result runs at ~32 ms per query (CPU p50) with no API dependency — a 5× latency reduction vs. the LLM judge baseline.

	Training dataset: [anasnassar/llm-query-complexity-benchmark](https://huggingface.co/datasets/anasnassar/llm-query-complexity-benchmark) — 6,000 doubly balanced queries across 10 domains × 3 complexity classes (200/domain/class hard cap; 4,800 train / 1,200 test, 80/20 stratified split, seed=42).

	Sources: Derived from [sentence-transformers/stackexchange-duplicates](https://huggingface.co/datasets/sentence-transformers/stackexchange-duplicates) (Apache 2.0), [cais/mmlu](https://huggingface.co/datasets/cais/mmlu) (MIT), [TIGER-Lab/MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) (MIT), and [qiaojin/PubMedQA](https://huggingface.co/datasets/qiaojin/PubMedQA) (MIT).

	Hyperparameters:

	\| Parameter \| Value \|
	\|-----------\|-------\|
	\| Base model \| answerdotai/ModernBERT-base \|
	\| Epochs \| 5 \|
	\| Batch size \| 32 \|
	\| Learning rate \| 2e-5 \|
	\| Max sequence length \| 128 tokens \|
	\| Optimizer \| AdamW, weight_decay=0.01 \|
	\| Warmup \| 10% of steps \|
	\| Best model metric \| macro-F1 \|

	## Evaluation

	Evaluated on a fixed 750-query held-out test set (250/class), stratified split, seed=42.

	\| Metric \| Value \|
	\|--------\|-------\|
	\| Accuracy \| 64.2% \|
	\| Macro-F1 \| 0.640 \|
	\| FREE-tier retention \| 85.4% \|
	\| Latency p50 (CPU) \| 32 ms \|

	Per-class recall (Wilson 95% CI):

	\| Class \| Recall \| 95% CI \|
	\|-------\|--------\|--------\|
	\| LOW \| 70.8% \| [66.1%, 75.0%] \|
	\| MEDIUM \| 49.3% \| [44.4%, 54.1%] \|
	\| HIGH \| 72.5% \| [67.9%, 76.7%] \|

	## Judge Comparison

	\| Judge \| Latency p50 \| Accuracy \| Macro-F1 \| API dependency \|
	\|-------\|-------------\|----------\|----------\|----------------\|
	\| ModernBERT (this model) \| 32 ms \| 64.2% \| 0.640 \| None \|
	\| Llama 3.2 3B (LLM judge) \| 164 ms \| 49.0% \| 0.436 \| Ollama \|

	## Threshold-Tunable Routing

	Rather than a fixed argmax decision, STREAM exposes a tunable threshold θ ∈ [0,1]. A query is routed to cloud when `P(HIGH) ≥ θ`; otherwise to HPC or local. As θ increases, cloud spend drops but HIGH recall decreases — a continuous precision-recall-cost tradeoff.

	Budget-aware adaptive routing automatically raises θ as cloud spend approaches the monthly budget cap:

	```
	θ_eff(t) = max(θ_base, S(t)/B)
	```

	where S(t) is cumulative spend and B is the monthly budget.

	## Integration in STREAM

	```python
	from stream.middleware.core.complexity_judge import judge_complexity

	result = judge_complexity("Explain quantum entanglement", strategy="modernbert")
	# JudgmentResult(complexity='medium', method='classifier', strategy_used='modernbert',
	# scores={'low': 0.08, 'medium': 0.79, 'high': 0.13})
	```

	## Citation

	```bibtex
	@inproceedings{nassar2026stream,
	title = {{STREAM}: Multi-Tier {LLM} Inference Middleware with Dual-Channel {HPC} Token Streaming},
	author = {Nassar, Anas and Mohr, Steve and Apanasevich, Leonard and Sharma, Himanshu},
	booktitle = {Practice and Experience in Advanced Research Computing (PEARC '26)},
	year = {2026},
	doi = {10.1145/3785462.3815847}
	}

	@misc{nassar2026benchmark,
	author = {Nassar, Anas},
	title = {{LLM} Query Complexity Benchmark},
	year = {2026},
	publisher = {Hugging Face},
	url = {https://huggingface.co/datasets/anasnassar/llm-query-complexity-benchmark}
	}

	% Original source datasets
	@article{hendrycks2021mmlu,
	author = {Dan Hendrycks and others},
	title = {Measuring Massive Multitask Language Understanding},
	journal = {ICLR},
	year = {2021},
	url = {https://huggingface.co/datasets/cais/mmlu}
	}

	@article{wang2024mmlupro,
	author = {Yubo Wang and others},
	title = {{MMLU-Pro}: A More Robust and Challenging Multi-Task Language Understanding Benchmark},
	journal = {arXiv:2406.01574},
	year = {2024},
	url = {https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro}
	}

	@inproceedings{jin2019pubmedqa,
	author = {Qiao Jin and others},
	title = {{PubMedQA}: A Biomedical Research Question Answering Dataset},
	booktitle = {EMNLP},
	year = {2019},
	url = {https://huggingface.co/datasets/qiaojin/PubMedQA}
	}

	@misc{stackexchange_dataset,
	author = {Reimers, Nils and Gurevych, Iryna},
	title = {{StackExchange} Duplicate Questions Dataset},
	year = {2019},
	url = {https://huggingface.co/datasets/sentence-transformers/stackexchange-duplicates}
	}
	```

	## License

	Apache 2.0