---
library_name: transformers
pipeline_tag: text-classification
tags:
  - regression
  - prompt
  - complexity-estimation
  - semantic-routing
  - llm-routing
base_model: microsoft/deberta-v3-base
license: apache-2.0
---

# PromptComplexityEstimator

A lightweight regressor that estimates the complexity of an LLM prompt on a scale between 0 and 1.

- **Input:** a string prompt  
- **Output:** a scalar score in [0, 1] (higher = more complex)

The model is designed primarily as a building block for semantic routing systems, especially LLM vs. SLM (Small Language Model) routers.
Any router that decides *which model should handle a request* needs a reliable signal for *how complex the request is*. This is the gap this model aims to close.

---

## Intended use

### Primary use case: LLM vs. SLM routing

This model is intended to be used as part of a semantic router, where:
- *Simple* prompts are handled by a **small / fast / cheap model**
- *Complex* prompts are routed to a **large / capable / expensive model**

The complexity score provides a learned signal for this decision.

### Additional use cases
- Prompt analytics and monitoring
- Dataset stratification by difficulty
- Adaptive compute allocation
- Cost-aware or latency-aware inference pipelines
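
For dataset stratification, the scalar scores can be binned into coarse difficulty buckets. A minimal sketch; the bucket edges below are illustrative choices, not part of the model:

```python
def difficulty_bucket(score: float) -> str:
    """Map a complexity score in [0, 1] to a coarse difficulty bucket.

    The edges (0.33 / 0.66) are illustrative; tune them per corpus.
    """
    if score < 0.33:
        return "easy"
    if score < 0.66:
        return "medium"
    return "hard"

# Example: stratify a corpus once scores have been computed with the model.
scores = [0.12, 0.48, 0.91]
buckets = [difficulty_bucket(s) for s in scores]
```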

### Not intended for
- Safety classification, toxicity detection, or policy enforcement
- Guaranteed difficulty estimation for a specific target model
- Multimodal inputs or tool-augmented workflows (RAG/tools)

---

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "ilya-kolchinsky/PromptComplexityEstimator"

tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

prompt = "Design a distributed consensus protocol with Byzantine fault tolerance..."
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    score = model(**inputs).logits.squeeze(-1).item()

print(score)  # `.item()` already returns a Python float
```


### Example: Simple LLM vs. SLM routing

```python
THRESHOLD = 0.45  # chosen empirically

def route_prompt(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        complexity = model(**inputs).logits.squeeze(-1).item()

    return "LLM" if complexity > THRESHOLD else "SLM"
```
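
The threshold above is a free parameter. Given a small labeled set of prompts marked as needing the LLM, one simple way to pick it is to sweep candidate thresholds and keep the one that maximizes routing accuracy. This calibration helper is an illustrative sketch, not part of the released model:

```python
def calibrate_threshold(scores, needs_llm, candidates=None):
    """Pick the threshold that best separates LLM-worthy prompts.

    scores: predicted complexity in [0, 1] for each prompt
    needs_llm: booleans, True if the prompt should go to the LLM
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    best_t, best_acc = 0.5, -1.0
    for t in candidates:
        correct = sum((s > t) == y for s, y in zip(scores, needs_llm))
        acc = correct / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Toy example: complex prompts score high, simple ones low.
t = calibrate_threshold([0.2, 0.3, 0.7, 0.9], [False, False, True, True])
```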

---


## Model and Training Details

### Datasets
- [Cross-Difficulty](https://huggingface.co/datasets/BatsResearch/Cross-Difficulty)
- [Easy2Hard-Bench](https://huggingface.co/datasets/furonghuang-lab/Easy2Hard-Bench)
- [MATH](https://huggingface.co/datasets/EleutherAI/hendrycks_math)
- [ARC](https://huggingface.co/datasets/allenai/ai2_arc)
- [RACE](https://huggingface.co/datasets/ehovy/race)
- [ANLI (R1/R2/R3)](https://huggingface.co/datasets/facebook/anli)

### Training Configuration
- **Epochs:** 3
- **Batch Size:** 16
- **Loss:** Huber
- **Regressor Learning Rate:** 7.5e-5
- **Encoder Learning Rate:** 1.0e-5
- **Encoder Weight Decay:** 0.01
- **Optimizer:** AdamW
- **Schedule:** Cosine (warmup_ratio=0.06)
- **Dropout:** 0.1
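
The two learning rates above correspond to separate AdamW parameter groups for the encoder and the regression head. A sketch of how such a split is typically set up; the stand-in modules below are illustrative, not the actual model:

```python
import torch
import torch.nn as nn

# Stand-in modules; in training, `encoder` would be the DeBERTa backbone
# and `head` the regression head.
encoder = nn.Linear(768, 768)
head = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

optimizer = torch.optim.AdamW(
    [
        # Backbone: small LR plus weight decay, as in the config above.
        {"params": encoder.parameters(), "lr": 1.0e-5, "weight_decay": 0.01},
        # Regression head: larger LR.
        {"params": head.parameters(), "lr": 7.5e-5},
    ]
)
```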

### Model
- **Backbone encoder:** microsoft/deberta-v3-base
- Mask-aware **mean pooling** over token embeddings + **LayerNorm**
- **Regression head:** Linear → ReLU → Linear → Sigmoid
- **Max input length:** 512 tokens
- The model outputs a bounded score in [0, 1]. In the usage examples above, the score is read from `outputs.logits` (shape `[batch, 1]`).
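
The pooling and head described above can be sketched as follows. This is an illustrative reimplementation, not the released code; in particular, the intermediate head width is an assumption:

```python
import torch
import torch.nn as nn

class MeanPoolRegressor(nn.Module):
    """Sketch of the described architecture: mask-aware mean pooling over
    token embeddings, LayerNorm, then Linear -> ReLU -> Linear -> Sigmoid
    to bound the score in [0, 1]. Hidden sizes are assumptions."""

    def __init__(self, hidden: int = 768, mid: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(hidden)
        self.head = nn.Sequential(
            nn.Linear(hidden, mid), nn.ReLU(), nn.Linear(mid, 1), nn.Sigmoid()
        )

    def forward(self, token_embeddings, attention_mask):
        # Zero out padding positions, then average over real tokens only.
        mask = attention_mask.unsqueeze(-1).float()
        summed = (token_embeddings * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)
        pooled = self.norm(summed / counts)
        return self.head(pooled)  # shape [batch, 1], values in (0, 1)

# Smoke test with random tensors in place of real encoder output.
x = torch.randn(2, 5, 768)
m = torch.ones(2, 5, dtype=torch.long)
out = MeanPoolRegressor()(x, m)
```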


Full training code and configuration are available at https://github.com/ilya-kolchinsky/ComplexityEstimator.

---

## Performance

On the held-out evaluation set used during development, the released checkpoint achieved:

- **MAE:** **0.0855**
- **Spearman correlation:** **0.735**

---

## Citation

```bibtex
@misc{kolchinsky_promptcomplexityestimator_2026,
  title        = {PromptComplexityEstimator},
  author       = {Ilya Kolchinsky},
  year         = {2026},
  howpublished = {Hugging Face Hub model: ilya-kolchinsky/PromptComplexityEstimator}
}
```