anasnassar committed
Commit 9936970 · verified · 1 Parent(s): 39cd7d9

Add model card

  1. README.md +114 -0
README.md CHANGED
@@ -1,3 +1,117 @@
  ---
  license: apache-2.0
+ base_model: answerdotai/ModernBERT-base
+ language:
+ - en
+ tags:
+ - text-classification
+ - llm-routing
+ - query-complexity
+ - knowledge-distillation
+ - research-computing
+ pipeline_tag: text-classification
  ---
+
+ # LLM Query Complexity Classifier
+
+ Fine-tuned [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M parameters) for three-class query complexity classification: **LOW**, **MEDIUM**, or **HIGH**.
+
+ Built for the [STREAM](https://github.com/uicacer/STREAM) project (Smart Tiered Routing Engine for AI Models) to route each query automatically to the most cost-effective inference tier (local CPU, HPC GPU, or cloud API) at ~15ms per query with no API dependency.
+
+ ## What It Does
+
+ Given a user query, the model predicts how much reasoning depth is required to answer it:
+
+ | Label | Definition | Example |
+ |-------|------------|---------|
+ | `LOW` | Single retrievable fact. Answer statable in one sentence, no reasoning chain. | "What is the capital of France?" |
+ | `MEDIUM` | Apply an established procedure or assemble 2–4 concepts. Textbook-level reasoning. | "Explain quicksort and analyze its time complexity." |
+ | `HIGH` | Construct a novel reasoning path or exercise expert judgment. No standard procedure. | "Is P equal to NP? Present the current state of evidence." |
+
+ **Key design principle**: complexity is defined by *reasoning depth*, not question format. "What is X?" can be LOW, MEDIUM, or HIGH depending on the reasoning required to answer it.
+
+ ## Usage
+
+ ```python
+ from transformers import pipeline
+
+ clf = pipeline(
+     "text-classification",
+     model="anasnassar/llm-query-complexity-classifier",
+     device=-1,   # CPU
+     top_k=None,  # return all class scores
+ )
+
+ result = clf("Explain the difference between TCP and UDP")
+ # Depending on the transformers version, a single input may come back
+ # wrapped in an outer list; unwrap it so both shapes are handled.
+ scores = result[0] if isinstance(result[0], list) else result
+ # e.g. [{'label': 'MEDIUM', 'score': 0.82}, {'label': 'LOW', 'score': 0.11}, {'label': 'HIGH', 'score': 0.07}]
+
+ complexity = max(scores, key=lambda x: x["score"])["label"]
+ # 'MEDIUM'
+ ```
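A minimal sketch of how the predicted label might drive tier selection. The `TIER_BY_LABEL` mapping and the `pick_tier` helper are hypothetical and are not STREAM's actual routing code; the assignment of LOW to local CPU, MEDIUM to HPC GPU, and HIGH to cloud API is an assumption based on the three tiers named in this card.

```python
from typing import Dict, List

# Hypothetical label-to-tier mapping (an assumption, not taken from STREAM).
TIER_BY_LABEL: Dict[str, str] = {
    "LOW": "local-cpu",
    "MEDIUM": "hpc-gpu",
    "HIGH": "cloud-api",
}


def pick_tier(scores: List[dict]) -> str:
    """Return the tier for the highest-scoring label.

    `scores` is a list of {'label': ..., 'score': ...} dicts, i.e. the
    per-class scores for a single query.
    """
    best = max(scores, key=lambda s: s["score"])
    return TIER_BY_LABEL[best["label"].upper()]


example = [
    {"label": "MEDIUM", "score": 0.82},
    {"label": "LOW", "score": 0.11},
    {"label": "HIGH", "score": 0.07},
]
print(pick_tier(example))  # hpc-gpu
```

Keeping the mapping in a plain dictionary makes it easy to change which tier serves each class without touching the classifier.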
+
+ ## Training
+
+ **Knowledge distillation approach**: Claude Sonnet 4.6 (with extended thinking) labeled 6,912 queries across 6 domains and 3 complexity classes, and ModernBERT-base was then fine-tuned on those labels. This is LLM-supervised fine-tuning: Claude generates hard labels (one class per query) and ModernBERT learns from them. The result runs at ~15ms per query with no API dependency.
+
+ **Training dataset**: [anasnassar/llm-query-complexity-benchmark](https://huggingface.co/datasets/anasnassar/llm-query-complexity-benchmark), with 6,912 queries across 6 domains, balanced across complexity classes.
+
+ **Hyperparameters**:
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Base model | answerdotai/ModernBERT-base |
+ | Epochs | 5 |
+ | Batch size | 32 |
+ | Learning rate | 2e-5 |
+ | Max sequence length | 128 tokens |
+ | Optimizer | AdamW, weight_decay=0.01 |
+ | Warmup | 10% of steps |
+ | Best model metric | macro-F1 |
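For a rough sense of the schedule these settings imply, the sketch below computes optimizer steps and warmup steps. It assumes all 6,912 queries are used for training in a single run with no gradient accumulation, which this card does not state (the evaluation setups hold data out), so the numbers are best read as an upper bound. `schedule_summary` is an illustrative helper, not part of the training code.

```python
import math


def schedule_summary(num_examples: int, batch_size: int, epochs: int, warmup_percent: int) -> dict:
    """Compute step counts for a fixed-size dataset and batch size."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    # Integer arithmetic avoids floating-point rounding in the warmup count.
    warmup_steps = total_steps * warmup_percent // 100
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


summary = schedule_summary(num_examples=6912, batch_size=32, epochs=5, warmup_percent=10)
print(summary)
# {'steps_per_epoch': 216, 'total_steps': 1080, 'warmup_steps': 108}
```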
70
+
71
+ ## Evaluation
72
+
73
+ Three evaluation strategies are used to address data leakage from LLM-generated near-duplicates:
74
+
75
+ | Strategy | Description |
76
+ |----------|-------------|
77
+ | **Domain-held-out 6-fold CV** | Train on 5 domains, test on 6th. Primary reported metric. |
78
+ | **Similarity-aware split** | Near-duplicate queries (cosine sim > 0.90) kept on same side of split. |
79
+ | **Real-world (LMSYS Arena)** | Evaluated on real user prompts from Chatbot Arena — fully out-of-distribution. |
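The domain-held-out strategy can be sketched as a leave-one-domain-out generator. This is an illustration only, not the project's evaluation script: the `domain_held_out_folds` helper, the record layout, and the placeholder domain names are assumptions, and the toy data uses three domains rather than six to stay short.

```python
from typing import Dict, Iterator, List, Tuple

Example = Dict[str, str]  # keys: 'text', 'domain', 'label'


def domain_held_out_folds(
    examples: List[Example],
) -> Iterator[Tuple[str, List[Example], List[Example]]]:
    """Yield (held_out_domain, train, test), one fold per distinct domain."""
    domains = sorted({ex["domain"] for ex in examples})
    for held_out in domains:
        train = [ex for ex in examples if ex["domain"] != held_out]
        test = [ex for ex in examples if ex["domain"] == held_out]
        yield held_out, train, test


# Toy data; the domain names are placeholders, not the benchmark's real domains.
toy = [
    {"text": "q1", "domain": "domain_a", "label": "LOW"},
    {"text": "q2", "domain": "domain_a", "label": "HIGH"},
    {"text": "q3", "domain": "domain_b", "label": "MEDIUM"},
    {"text": "q4", "domain": "domain_c", "label": "LOW"},
]

folds = list(domain_held_out_folds(toy))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

With six domains in the data, the same generator yields the six folds described in the table.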
+
+ *Note: A random train/test split on LLM-generated data yields inflated accuracy (~99%) because near-duplicate phrasings end up on both sides of the split. The domain-held-out and real-world numbers are the rigorous metrics.*
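One way to keep near-duplicates together is to cluster queries whose embeddings exceed the 0.90 cosine threshold and then assign whole clusters to one side. The sketch below does this with a small union-find over pairwise similarities. It is an assumption about how such a split could be built, not the project's implementation; the embedding model is not specified in this card, so the vectors here are toy values, and `near_duplicate_groups` and `split_groups` are hypothetical helpers.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def near_duplicate_groups(vectors: List[Sequence[float]], threshold: float = 0.90) -> List[List[int]]:
    """Group indices connected by any chain of pairs with similarity > threshold."""
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(j)] = find(i)

    groups: dict = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def split_groups(groups: List[List[int]], test_fraction: float = 0.2) -> Tuple[List[int], List[int]]:
    """Send whole groups to test until it holds at least test_fraction of items."""
    total = sum(len(g) for g in groups)
    train: List[int] = []
    test: List[int] = []
    for g in groups:
        (test if len(test) < test_fraction * total else train).extend(g)
    return train, test


toy_vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [1.0, 1.0]]
groups = near_duplicate_groups(toy_vectors)
print(groups)  # [[0, 1], [2, 3], [4]]
train_idx, test_idx = split_groups(groups)
```

The pairwise loop is quadratic in the number of queries, which is fine for a sketch; a few thousand queries would likely call for a vectorized or approximate nearest-neighbor search.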
+
+ Full evaluation code: [scripts/eval/](https://github.com/uicacer/STREAM/tree/main/scripts/eval)
+
+ ## Performance
+
+ | Judge | Latency (p50) | Notes |
+ |-------|--------------|-------|
+ | ModernBERT (this model) | ~15ms | CPU, no API dependency |
+ | Llama 3.2 3B (LLM judge) | ~390ms | Requires Ollama |
+
+ This is a 26× reduction in p50 latency (390ms / 15ms) relative to the LLM judge baseline.
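The 26× figure is the ratio of the two p50 latencies in the table. The sketch below shows that arithmetic and one way to measure a p50 latency for any callable. The card does not describe the measurement procedure, so `measure_p50_ms`, the warm-up count, and the stand-in function are assumptions for illustration.

```python
import statistics
import time
from typing import Callable, Iterable


def measure_p50_ms(fn: Callable[[str], object], queries: Iterable[str], warmup: int = 3) -> float:
    """Return the median wall-clock latency of fn over queries, in milliseconds."""
    queries = list(queries)
    for q in queries[:warmup]:
        fn(q)  # warm-up calls are excluded from the timings
    timings = []
    for q in queries:
        start = time.perf_counter()
        fn(q)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Ratio of the reported p50 latencies: LLM judge vs. this model.
speedup = 390 / 15
print(speedup)  # 26.0

# Stand-in callable; in practice fn would wrap the classifier or the LLM judge.
p50 = measure_p50_ms(lambda q: len(q), ["a", "bb", "ccc", "dddd"])
```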
+
+ ## Integration in STREAM
+
+ ```python
+ from stream.middleware.core.complexity_judge import judge_complexity
+
+ result = judge_complexity("Explain quantum entanglement", strategy="modernbert")
+ # JudgmentResult(complexity='medium', method='classifier', strategy_used='modernbert',
+ #                scores={'low': 0.08, 'medium': 0.79, 'high': 0.13})
+ ```
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{nassar2026stream,
+   title = {{STREAM}: Multi-Tier {LLM} Inference Middleware with Dual-Channel {HPC} Token Streaming},
+   author = {Nassar, Anas and Mohr, Steve and Apanasevich, Leonard and Sharma, Himanshu},
+   booktitle = {Practice and Experience in Advanced Research Computing (PEARC '26)},
+   year = {2026}
+ }
+ ```
+
+ ## License
+
+ Apache 2.0