root committed
Commit: 60a6bce
Parent(s): e6d73e1

Add GGUF quantized models (BF16, Q8_0, Q4_K_M) and update README with GGUF usage section

Changed files:
- .gitattributes +1 -0
- README.md +123 -20
- brick-complexity-extractor-BF16.gguf +3 -0
- brick-complexity-extractor-Q4_K_M.gguf +3 -0
- brick-complexity-extractor-Q8_0.gguf +3 -0
.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+*.gguf filter=lfs diff=lfs merge=lfs -text
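The hunk above makes Git LFS track any `*.gguf` file alongside the existing patterns. As a rough illustration of which files those patterns capture, here is a minimal sketch using Python's `fnmatch` (which approximates, but does not fully replicate, gitattributes pattern semantics — the function name is ours, not git's):

```python
from fnmatch import fnmatch

# LFS-tracked patterns from .gitattributes after this commit
lfs_patterns = ["*.zst", "*tfevents*", "tokenizer.json", "*.gguf"]

def tracked_by_lfs(filename: str) -> bool:
    """Rough check: does any LFS pattern match this file name?
    Real gitattributes matching has extra rules; this is only a sketch."""
    return any(fnmatch(filename, p) for p in lfs_patterns)

print(tracked_by_lfs("brick-complexity-extractor-Q8_0.gguf"))  # True
print(tracked_by_lfs("README.md"))                             # False
```

This is why the three `.gguf` entries later in this commit appear as small pointer files rather than as the binaries themselves.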
README.md CHANGED
@@ -7,6 +7,7 @@ tags:
 - peft
 - safetensors
 - lora
+- gguf
 - complexity-classification
 - llm-routing
 - query-difficulty
@@ -41,11 +42,11 @@ model-index:
 
 <div align="center">
 
-#
+# Brick Complexity Extractor
 
 ### A lightweight LoRA adapter for real-time query complexity classification
 
-**[Regolo.ai](https://regolo.ai)
+**[Regolo.ai](https://regolo.ai) | [Dataset](https://huggingface.co/datasets/regolo/brick-complexity-extractor) | [Brick SR1 on GitHub](https://github.com/regolo-ai/brick-SR1) | [API Docs](https://docs.regolo.ai)**
 
 [](https://creativecommons.org/licenses/by-nc/4.0/)
 [](https://huggingface.co/Qwen/Qwen3.5-0.8B)
@@ -64,6 +65,7 @@ model-index:
 - [Label Definitions](#label-definitions)
 - [Performance](#performance)
 - [Quick Start](#quick-start)
+- [GGUF Quantized Models](#gguf-quantized-models)
 - [Integration with Brick Semantic Router](#integration-with-brick-semantic-router)
 - [Intended Uses](#intended-uses)
 - [Limitations](#limitations)
@@ -82,7 +84,7 @@ The adapter adds only **~2M trainable parameters** on top of the 0.8B base model
 
 ## The Problem: Why LLM Routing Needs Complexity Classification
 
-Not all prompts are equal. A factual recall question ("What is the capital of France?") and a multi-step reasoning task ("Derive the optimal portfolio allocation given these constraints
+Not all prompts are equal. A factual recall question ("What is the capital of France?") and a multi-step reasoning task ("Derive the optimal portfolio allocation given these constraints...") require fundamentally different compute budgets. Sending every query to a frontier reasoning model wastes resources; sending hard queries to a lightweight model degrades quality.
 
 **Brick** solves this by routing each query to the right model tier in real time. Complexity classification is one of several routing signals (alongside keyword matching, domain detection, and reasoning-depth estimation) that Brick uses to make sub-50ms routing decisions.
 
@@ -110,26 +112,26 @@ The adapter applies LoRA to the query and value projection matrices (`q_proj`, `v_proj`)
 
 ```
 Qwen3.5-0.8B (frozen)
-
-
-
-
-
++-- Attention Layers x 24
+|-- q_proj <- LoRA(r=16, alpha=32)
++-- v_proj <- LoRA(r=16, alpha=32)
++-- Last Hidden State
++-- Classification Head (3 classes)
 ```
 
 ## Label Definitions
 
 | Label | Reasoning Steps | Description | Example |
 |---|---|---|---|
-| **easy** | 1
-| **medium** | 3
+| **easy** | 1-2 | Surface knowledge, factual recall, simple lookups | "What is the capital of Italy?" |
+| **medium** | 3-5 | Domain familiarity, multi-step reasoning, comparison | "Compare REST and GraphQL for a mobile app backend" |
 | **hard** | 6+ | Deep expertise, multi-constraint optimization, creative synthesis | "Design a distributed cache eviction policy that minimizes tail latency under bursty traffic" |
 
 Labels were generated by **Qwen3.5-122B** acting as an LLM judge on 76,831 diverse user prompts. See the [dataset card](https://huggingface.co/datasets/regolo/brick-complexity-extractor) for full labeling methodology.
 
 ## Performance
 
-### Classification Metrics (Test Set
+### Classification Metrics (Test Set -- 3,841 samples)
 
 | Metric | Value |
 |---|---|
@@ -197,6 +199,107 @@ print(f"Complexity: {predicted}")
 # https://github.com/regolo-ai/brick-SR1
 ```
 
+---
+
+## GGUF Quantized Models
+
+Pre-built GGUF files are available for inference with [llama.cpp](https://github.com/ggml-org/llama.cpp), [Ollama](https://ollama.com), [LM Studio](https://lmstudio.ai), [vLLM](https://github.com/vllm-project/vllm), and other GGUF-compatible runtimes.
+
+These files contain the **full merged model** (base Qwen3.5-0.8B + LoRA adapter merged), so no separate adapter loading is needed.
+
+### Available Quantizations
+
+| File | Quant | Size | BPW | Notes |
+|---|---|---|---|---|
+| `brick-complexity-extractor-BF16.gguf` | BF16 | 1.5 GB | 16.0 | Full precision, no quality loss |
+| `brick-complexity-extractor-Q8_0.gguf` | Q8_0 | 775 MB | 8.0 | Near-lossless, recommended for accuracy |
+| `brick-complexity-extractor-Q4_K_M.gguf` | Q4_K_M | 494 MB | 5.5 | Best quality/size ratio |
+
+### Usage with llama.cpp
+
+```bash
+# Download a quantized model
+huggingface-cli download regolo/brick-complexity-extractor \
+  brick-complexity-extractor-Q8_0.gguf \
+  --local-dir ./models
+
+# Run inference
+./llama-cli -m ./models/brick-complexity-extractor-Q8_0.gguf \
+  -p "<|im_start|>system
+You are a query difficulty classifier for an LLM routing system.
+Classify each query as easy, medium, or hard based on the cognitive depth and domain expertise required to answer correctly.
+Respond with ONLY one word: easy, medium, or hard.<|im_end|>
+<|im_start|>user
+Classify: What is the capital of France?<|im_end|>
+<|im_start|>assistant
+" \
+  -n 5 --temp 0
+```
+
+### Usage with Ollama
+
+```bash
+# Create a Modelfile
+cat > Modelfile <<EOF
+FROM ./brick-complexity-extractor-Q8_0.gguf
+
+SYSTEM """You are a query difficulty classifier for an LLM routing system.
+Classify each query as easy, medium, or hard based on the cognitive depth and domain expertise required to answer correctly.
+Respond with ONLY one word: easy, medium, or hard."""
+
+TEMPLATE """<|im_start|>system
+{{ .System }}<|im_end|>
+<|im_start|>user
+Classify: {{ .Prompt }}<|im_end|>
+<|im_start|>assistant
+"""
+
+PARAMETER temperature 0
+PARAMETER num_predict 5
+EOF
+
+ollama create brick-complexity -f Modelfile
+ollama run brick-complexity "Design a distributed consensus algorithm"
+# Output: hard
+```
+
+### Usage with vLLM
+
+```python
+from vllm import LLM, SamplingParams
+
+llm = LLM(
+    model="regolo/brick-complexity-extractor",
+    quantization="gguf",
+    # Point to a specific GGUF file:
+    # model="./brick-complexity-extractor-Q8_0.gguf"
+)
+
+sampling_params = SamplingParams(temperature=0, max_tokens=5)
+
+prompt = """<|im_start|>system
+You are a query difficulty classifier for an LLM routing system.
+Classify each query as easy, medium, or hard.
+Respond with ONLY one word: easy, medium, or hard.<|im_end|>
+<|im_start|>user
+Classify: Explain the rendering equation from radiometric first principles<|im_end|>
+<|im_start|>assistant
+"""
+
+output = llm.generate([prompt], sampling_params)
+print(output[0].outputs[0].text.strip())
+# Output: hard
+```
+
+### Important Note on GGUF Inference
+
+The GGUF models use **generative text output** (the model generates the word "easy", "medium", or "hard") rather than the logit-based classification used by the LoRA adapter. This means:
+
+- **LoRA adapter (recommended for production)**: Uses logit extraction at the last token position for the three label tokens. Faster and more reliable.
+- **GGUF (recommended for local/edge deployment)**: Generates the classification label as text. Slightly lower accuracy but works with any GGUF runtime without Python dependencies.
+
+---
+
 ## Integration with Brick Semantic Router
 
 Brick Complexity Extractor is designed to work as a signal within the **Brick Semantic Router** pipeline. In a typical deployment:
@@ -236,14 +339,14 @@ model_pools:
 
 ## Intended Uses
 
-###
-- **LLM routing**: Classify query complexity to route to the optimal model tier, reducing inference cost by 30
+### Primary Use Cases
+- **LLM routing**: Classify query complexity to route to the optimal model tier, reducing inference cost by 30-60% compared to always-frontier routing
 - **Reasoning budget allocation**: Decide how many reasoning tokens to allocate before inference begins
 - **Traffic shaping**: Balance GPU load across model pools based on real-time complexity distribution
 - **Cost monitoring**: Track complexity distribution over time to optimize fleet sizing
 
-###
-- **Content moderation or safety filtering**
+### Out-of-Scope Uses
+- **Content moderation or safety filtering** -- this model classifies cognitive difficulty, not content safety
 - **Non-English queries** trained on English data only; accuracy degrades significantly on other languages
 - **Direct use as a chatbot or generative model** this is a classification adapter, not a generative model
 
@@ -261,7 +364,7 @@ model_pools:
 |---|---|
 | **Base model** | Qwen/Qwen3.5-0.8B |
 | **LoRA rank (r)** | 16 |
-| **LoRA alpha
+| **LoRA alpha** | 32 |
 | **LoRA dropout** | 0.05 |
 | **Target modules** | q_proj, v_proj |
 | **Learning rate** | 2e-4 |
@@ -273,7 +376,7 @@ model_pools:
 | **Training samples** | 65,307 |
 | **Validation samples** | 7,683 |
 | **Test samples** | 3,841 |
-| **Training hardware** |
+| **Training hardware** | 1x NVIDIA A100 80GB |
 | **Training time** | ~2 hours |
 | **Framework** | PyTorch + HuggingFace PEFT |
 
@@ -283,9 +386,9 @@ Regolo.ai is committed to sustainable AI. This model was trained on GPU infrastructure
 
 | Metric | Value |
 |---|---|
-| **Hardware** |
+| **Hardware** | 1x NVIDIA A100 80GB |
 | **Training duration** | ~2 hours |
-| **Estimated
+| **Estimated CO2** | < 0.5 kg CO2eq |
 | **Energy source** | Renewable (certified) |
 | **Location** | Italy (EU) |
 
@@ -308,6 +411,6 @@ Regolo.ai is committed to sustainable AI. This model was trained on GPU infrastructure
 
 <div align="center">
 
-**[Website](https://regolo.ai)
+**[Website](https://regolo.ai) | [Docs](https://docs.regolo.ai) | [Discord](https://discord.gg/myuuVFcfJw) | [GitHub](https://github.com/regolo-ai) | [LinkedIn](https://www.linkedin.com/company/regolo-ai/)**
 
 </div>
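The README's "Important Note on GGUF Inference" distinguishes the LoRA path (argmax over the logits of the three label tokens at the last position) from the GGUF path (the label generated as text); either way the router consumes a one-word label. The two steps can be sketched as below — the token ids and logit values are mocked and the tier names are invented for illustration, not Brick's actual configuration:

```python
# Step 1 (mocked): pick the label by comparing logits of only the three
# label tokens at the last position, as the LoRA adapter path does.
# Token ids and logit values here are illustrative, not the real vocab.
LABEL_TOKEN_IDS = {"easy": 4135, "medium": 11051, "hard": 4812}
mock_logits = {4135: -1.2, 11051: 0.3, 4812: 2.7}

def classify(logits: dict) -> str:
    """Argmax restricted to the three label-token logits."""
    return max(LABEL_TOKEN_IDS, key=lambda label: logits[LABEL_TOKEN_IDS[label]])

# Step 2 (hypothetical): map the label to a model tier. Tier names invented.
TIER_BY_LABEL = {
    "easy": "small-pool",
    "medium": "mid-pool",
    "hard": "frontier-pool",
}

def route(label: str) -> str:
    """Normalize the label (generative GGUF output may be noisy);
    unknown labels fall back to the most capable tier."""
    return TIER_BY_LABEL.get(label.strip().lower(), "frontier-pool")

label = classify(mock_logits)
print(label, "->", route(label))  # hard -> frontier-pool
```

Normalizing and falling back on unknown output matters mainly for the GGUF path, where the model emits free text rather than a constrained logit choice.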
brick-complexity-extractor-BF16.gguf ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6fc8392a811ff1b3dbdb7348110893bac25f912540a58ae7ff4e1cb96ceced92
+size 1516736384

brick-complexity-extractor-Q4_K_M.gguf ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8bb38e63a7eeabddd729f2cdadfc7bd04b82aea413778e77bd4dee2b03a5489e
+size 529289088

brick-complexity-extractor-Q8_0.gguf ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1f74b88a1b7149dd9074eed60cadfc7555fca227ddbc1c71ec30a635f7cd3913
+size 811835264