root committed on
Commit 60a6bce · 1 Parent(s): e6d73e1

Add GGUF quantized models (BF16, Q8_0, Q4_K_M) and update README with GGUF usage section

.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  tokenizer.json filter=lfs diff=lfs merge=lfs -text
37
+ *.gguf filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -7,6 +7,7 @@ tags:
7
  - peft
8
  - safetensors
9
  - lora
 
10
  - complexity-classification
11
  - llm-routing
12
  - query-difficulty
@@ -41,11 +42,11 @@ model-index:
41
 
42
  <div align="center">
43
 
44
- # 🧱 Brick Complexity Extractor
45
 
46
  ### A lightweight LoRA adapter for real-time query complexity classification
47
 
48
- **[Regolo.ai](https://regolo.ai) · [Dataset](https://huggingface.co/datasets/regolo/brick-complexity-extractor) · [Brick SR1 on GitHub](https://github.com/regolo-ai/brick-SR1) · [API Docs](https://docs.regolo.ai)**
49
 
50
  [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
51
  [![Base Model](https://img.shields.io/badge/Base-Qwen3.5--0.8B-blue)](https://huggingface.co/Qwen/Qwen3.5-0.8B)
@@ -64,6 +65,7 @@ model-index:
64
  - [Label Definitions](#label-definitions)
65
  - [Performance](#performance)
66
  - [Quick Start](#quick-start)
 
67
  - [Integration with Brick Semantic Router](#integration-with-brick-semantic-router)
68
  - [Intended Uses](#intended-uses)
69
  - [Limitations](#limitations)
@@ -82,7 +84,7 @@ The adapter adds only **~2M trainable parameters** on top of the 0.8B base model
82
 
83
  ## The Problem: Why LLM Routing Needs Complexity Classification
84
 
85
- Not all prompts are equal. A factual recall question ("What is the capital of France?") and a multi-step reasoning task ("Derive the optimal portfolio allocation given these constraints") require fundamentally different compute budgets. Sending every query to a frontier reasoning model wastes resources; sending hard queries to a lightweight model degrades quality.
86
 
87
  **Brick** solves this by routing each query to the right model tier in real time. Complexity classification is one of several routing signals (alongside keyword matching, domain detection, and reasoning-depth estimation) that Brick uses to make sub-50ms routing decisions.
88
 
@@ -110,26 +112,26 @@ The adapter applies LoRA to the query and value projection matrices (`q_proj`, `
110
 
111
  ```
112
  Qwen3.5-0.8B (frozen)
113
- └── Attention Layers × 24
114
- ├── q_proj LoRA(r=16, α=32)
115
- └── v_proj LoRA(r=16, α=32)
116
- └── Last Hidden State
117
- └── Classification Head (3 classes)
118
  ```
119
 
120
  ## Label Definitions
121
 
122
  | Label | Reasoning Steps | Description | Example |
123
  |---|---|---|---|
124
- | **easy** | 12 | Surface knowledge, factual recall, simple lookups | "What is the capital of Italy?" |
125
- | **medium** | 35 | Domain familiarity, multi-step reasoning, comparison | "Compare REST and GraphQL for a mobile app backend" |
126
  | **hard** | 6+ | Deep expertise, multi-constraint optimization, creative synthesis | "Design a distributed cache eviction policy that minimizes tail latency under bursty traffic" |
127
 
128
  Labels were generated by **Qwen3.5-122B** acting as an LLM judge on 76,831 diverse user prompts. See the [dataset card](https://huggingface.co/datasets/regolo/brick-complexity-extractor) for full labeling methodology.
129
 
130
  ## Performance
131
 
132
- ### Classification Metrics (Test Set 3,841 samples)
133
 
134
  | Metric | Value |
135
  |---|---|
@@ -197,6 +199,107 @@ print(f"Complexity: {predicted}")
197
  # https://github.com/regolo-ai/brick-SR1
198
  ```
199
200
  ## Integration with Brick Semantic Router
201
 
202
  Brick Complexity Extractor is designed to work as a signal within the **Brick Semantic Router** pipeline. In a typical deployment:
@@ -236,14 +339,14 @@ model_pools:
236
 
237
  ## Intended Uses
238
 
239
- ### Primary Use Cases
240
- - **LLM routing**: Classify query complexity to route to the optimal model tier, reducing inference cost by 3060% compared to always-frontier routing
241
  - **Reasoning budget allocation**: Decide how many reasoning tokens to allocate before inference begins
242
  - **Traffic shaping**: Balance GPU load across model pools based on real-time complexity distribution
243
  - **Cost monitoring**: Track complexity distribution over time to optimize fleet sizing
244
 
245
- ### ⚠️ Out-of-Scope Uses
246
- - **Content moderation or safety filtering** this model classifies cognitive difficulty, not content safety
247
  - **Non-English queries** -- trained on English data only; accuracy degrades significantly on other languages
248
  - **Direct use as a chatbot or generative model** -- this is a classification adapter, not a generative model
249
 
@@ -261,7 +364,7 @@ model_pools:
261
  |---|---|
262
  | **Base model** | Qwen/Qwen3.5-0.8B |
263
  | **LoRA rank (r)** | 16 |
264
- | **LoRA alpha (α)** | 32 |
265
  | **LoRA dropout** | 0.05 |
266
  | **Target modules** | q_proj, v_proj |
267
  | **Learning rate** | 2e-4 |
@@ -273,7 +376,7 @@ model_pools:
273
  | **Training samples** | 65,307 |
274
  | **Validation samples** | 7,683 |
275
  | **Test samples** | 3,841 |
276
- | **Training hardware** | NVIDIA A100 80GB |
277
  | **Training time** | ~2 hours |
278
  | **Framework** | PyTorch + HuggingFace PEFT |
279
 
@@ -283,9 +386,9 @@ Regolo.ai is committed to sustainable AI. This model was trained on GPU infrastr
283
 
284
  | Metric | Value |
285
  |---|---|
286
- | **Hardware** | NVIDIA A100 80GB |
287
  | **Training duration** | ~2 hours |
288
- | **Estimated CO₂** | < 0.5 kg CO₂eq |
289
  | **Energy source** | Renewable (certified) |
290
  | **Location** | Italy (EU) |
291
 
@@ -308,6 +411,6 @@ Regolo.ai is committed to sustainable AI. This model was trained on GPU infrastr
308
 
309
  <div align="center">
310
 
311
- **[Website](https://regolo.ai) · [Docs](https://docs.regolo.ai) · [Discord](https://discord.gg/myuuVFcfJw) · [GitHub](https://github.com/regolo-ai) · [LinkedIn](https://www.linkedin.com/company/regolo-ai/)**
312
 
313
  </div>
 
7
  - peft
8
  - safetensors
9
  - lora
10
+ - gguf
11
  - complexity-classification
12
  - llm-routing
13
  - query-difficulty
 
42
 
43
  <div align="center">
44
 
45
+ # Brick Complexity Extractor
46
 
47
  ### A lightweight LoRA adapter for real-time query complexity classification
48
 
49
+ **[Regolo.ai](https://regolo.ai) | [Dataset](https://huggingface.co/datasets/regolo/brick-complexity-extractor) | [Brick SR1 on GitHub](https://github.com/regolo-ai/brick-SR1) | [API Docs](https://docs.regolo.ai)**
50
 
51
  [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
52
  [![Base Model](https://img.shields.io/badge/Base-Qwen3.5--0.8B-blue)](https://huggingface.co/Qwen/Qwen3.5-0.8B)
 
65
  - [Label Definitions](#label-definitions)
66
  - [Performance](#performance)
67
  - [Quick Start](#quick-start)
68
+ - [GGUF Quantized Models](#gguf-quantized-models)
69
  - [Integration with Brick Semantic Router](#integration-with-brick-semantic-router)
70
  - [Intended Uses](#intended-uses)
71
  - [Limitations](#limitations)
 
84
 
85
  ## The Problem: Why LLM Routing Needs Complexity Classification
86
 
87
+ Not all prompts are equal. A factual recall question ("What is the capital of France?") and a multi-step reasoning task ("Derive the optimal portfolio allocation given these constraints...") require fundamentally different compute budgets. Sending every query to a frontier reasoning model wastes resources; sending hard queries to a lightweight model degrades quality.
88
 
89
  **Brick** solves this by routing each query to the right model tier in real time. Complexity classification is one of several routing signals (alongside keyword matching, domain detection, and reasoning-depth estimation) that Brick uses to make sub-50ms routing decisions.
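As a minimal illustration of the tier routing described above (our own sketch with invented tier names, not Brick's actual routing code), the complexity label simply selects a model pool:

```python
# Hypothetical sketch: map the classifier's output label to a model tier.
# Tier names are placeholders; in Brick this signal is combined with keyword
# matching, domain detection, and reasoning-depth estimation.
COMPLEXITY_TIERS = {
    "easy": "lightweight-model",
    "medium": "mid-tier-model",
    "hard": "frontier-reasoning-model",
}

def route_by_complexity(label: str) -> str:
    """Return the model tier for a predicted complexity label."""
    return COMPLEXITY_TIERS[label]

print(route_by_complexity("hard"))  # frontier-reasoning-model
```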
90
 
 
112
 
113
  ```
114
  Qwen3.5-0.8B (frozen)
115
+ +-- Attention Layers x 24
116
+ |-- q_proj <- LoRA(r=16, alpha=32)
117
+ +-- v_proj <- LoRA(r=16, alpha=32)
118
+ +-- Last Hidden State
119
+ +-- Classification Head (3 classes)
120
  ```
121
 
122
  ## Label Definitions
123
 
124
  | Label | Reasoning Steps | Description | Example |
125
  |---|---|---|---|
126
+ | **easy** | 1-2 | Surface knowledge, factual recall, simple lookups | "What is the capital of Italy?" |
127
+ | **medium** | 3-5 | Domain familiarity, multi-step reasoning, comparison | "Compare REST and GraphQL for a mobile app backend" |
128
  | **hard** | 6+ | Deep expertise, multi-constraint optimization, creative synthesis | "Design a distributed cache eviction policy that minimizes tail latency under bursty traffic" |
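The reasoning-step thresholds in the table can be expressed as a small helper (a sketch of the table's boundaries, not code shipped with the model):

```python
def label_from_steps(steps: int) -> str:
    # Thresholds from the Label Definitions table: 1-2 easy, 3-5 medium, 6+ hard
    if steps <= 2:
        return "easy"
    if steps <= 5:
        return "medium"
    return "hard"

print(label_from_steps(1), label_from_steps(4), label_from_steps(8))
# easy medium hard
```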
129
 
130
  Labels were generated by **Qwen3.5-122B** acting as an LLM judge on 76,831 diverse user prompts. See the [dataset card](https://huggingface.co/datasets/regolo/brick-complexity-extractor) for full labeling methodology.
131
 
132
  ## Performance
133
 
134
+ ### Classification Metrics (Test Set -- 3,841 samples)
135
 
136
  | Metric | Value |
137
  |---|---|
 
199
  # https://github.com/regolo-ai/brick-SR1
200
  ```
201
 
202
+ ---
203
+
204
+ ## GGUF Quantized Models
205
+
206
+ Pre-built GGUF files are available for inference with [llama.cpp](https://github.com/ggml-org/llama.cpp), [Ollama](https://ollama.com), [LM Studio](https://lmstudio.ai), [vLLM](https://github.com/vllm-project/vllm), and other GGUF-compatible runtimes.
207
+
208
+ These files contain the **full merged model** (base Qwen3.5-0.8B + LoRA adapter merged), so no separate adapter loading is needed.
209
+
210
+ ### Available Quantizations
211
+
212
+ | File | Quant | Size | BPW | Notes |
213
+ |---|---|---|---|---|
214
+ | `brick-complexity-extractor-BF16.gguf` | BF16 | 1.5 GB | 16.0 | Full precision, no quality loss |
215
+ | `brick-complexity-extractor-Q8_0.gguf` | Q8_0 | 775 MB | 8.0 | Near-lossless, recommended for accuracy |
216
+ | `brick-complexity-extractor-Q4_K_M.gguf` | Q4_K_M | 494 MB | 5.5 | Best quality/size ratio |
217
+
218
+ ### Usage with llama.cpp
219
+
220
+ ```bash
221
+ # Download a quantized model
222
+ huggingface-cli download regolo/brick-complexity-extractor \
223
+ brick-complexity-extractor-Q8_0.gguf \
224
+ --local-dir ./models
225
+
226
+ # Run inference
227
+ ./llama-cli -m ./models/brick-complexity-extractor-Q8_0.gguf \
228
+ -p "<|im_start|>system
229
+ You are a query difficulty classifier for an LLM routing system.
230
+ Classify each query as easy, medium, or hard based on the cognitive depth and domain expertise required to answer correctly.
231
+ Respond with ONLY one word: easy, medium, or hard.<|im_end|>
232
+ <|im_start|>user
233
+ Classify: What is the capital of France?<|im_end|>
234
+ <|im_start|>assistant
235
+ " \
236
+ -n 5 --temp 0
237
+ ```
238
+
239
+ ### Usage with Ollama
240
+
241
+ ```bash
242
+ # Create a Modelfile
243
+ cat > Modelfile <<EOF
244
+ FROM ./brick-complexity-extractor-Q8_0.gguf
245
+
246
+ SYSTEM """You are a query difficulty classifier for an LLM routing system.
247
+ Classify each query as easy, medium, or hard based on the cognitive depth and domain expertise required to answer correctly.
248
+ Respond with ONLY one word: easy, medium, or hard."""
249
+
250
+ TEMPLATE """<|im_start|>system
251
+ {{ .System }}<|im_end|>
252
+ <|im_start|>user
253
+ Classify: {{ .Prompt }}<|im_end|>
254
+ <|im_start|>assistant
255
+ """
256
+
257
+ PARAMETER temperature 0
258
+ PARAMETER num_predict 5
259
+ EOF
260
+
261
+ ollama create brick-complexity -f Modelfile
262
+ ollama run brick-complexity "Design a distributed consensus algorithm"
263
+ # Output: hard
264
+ ```
265
+
266
+ ### Usage with vLLM
267
+
268
+ ```python
269
+ from vllm import LLM, SamplingParams
270
+
271
+ llm = LLM(
272
+ model="regolo/brick-complexity-extractor",
273
+ quantization="gguf",
274
+ # Point to a specific GGUF file:
275
+ # model="./brick-complexity-extractor-Q8_0.gguf"
276
+ )
277
+
278
+ sampling_params = SamplingParams(temperature=0, max_tokens=5)
279
+
280
+ prompt = """<|im_start|>system
281
+ You are a query difficulty classifier for an LLM routing system.
282
+ Classify each query as easy, medium, or hard.
283
+ Respond with ONLY one word: easy, medium, or hard.<|im_end|>
284
+ <|im_start|>user
285
+ Classify: Explain the rendering equation from radiometric first principles<|im_end|>
286
+ <|im_start|>assistant
287
+ """
288
+
289
+ output = llm.generate([prompt], sampling_params)
290
+ print(output[0].outputs[0].text.strip())
291
+ # Output: hard
292
+ ```
293
+
294
+ ### Important Note on GGUF Inference
295
+
296
+ The GGUF models use **generative text output** (the model generates the word "easy", "medium", or "hard") rather than the logit-based classification used by the LoRA adapter. This means:
297
+
298
+ - **LoRA adapter (recommended for production)**: Uses logit extraction at the last token position for the three label tokens. Faster and more reliable.
299
+ - **GGUF (recommended for local/edge deployment)**: Generates the classification label as text. Slightly lower accuracy but works with any GGUF runtime without Python dependencies.
300
+
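The logit-extraction approach can be sketched as below. The comparison logic is shown with synthetic logits and invented token ids; in practice `last_token_logits` would come from the adapter-loaded model's output at the final token position:

```python
def classify_from_logits(last_token_logits, label_token_ids, labels):
    # Compare only the logits of the three label tokens at the last position
    scores = [last_token_logits[i] for i in label_token_ids]
    return labels[scores.index(max(scores))]

# Synthetic stand-in for the model's last-position logits over the vocabulary;
# the token ids (101, 202, 303) are hypothetical placeholders.
vocab_logits = [-5.0] * 32000
vocab_logits[101], vocab_logits[202], vocab_logits[303] = 2.1, 0.3, -1.0

print(classify_from_logits(vocab_logits, [101, 202, 303], ["easy", "medium", "hard"]))
# easy
```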
301
+ ---
302
+
303
  ## Integration with Brick Semantic Router
304
 
305
  Brick Complexity Extractor is designed to work as a signal within the **Brick Semantic Router** pipeline. In a typical deployment:
 
339
 
340
  ## Intended Uses
341
 
342
+ ### Primary Use Cases
343
+ - **LLM routing**: Classify query complexity to route to the optimal model tier, reducing inference cost by 30-60% compared to always-frontier routing
344
  - **Reasoning budget allocation**: Decide how many reasoning tokens to allocate before inference begins
345
  - **Traffic shaping**: Balance GPU load across model pools based on real-time complexity distribution
346
  - **Cost monitoring**: Track complexity distribution over time to optimize fleet sizing
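To make the cost-reduction claim concrete, here is a back-of-envelope sketch with an assumed traffic mix and relative per-query costs (both invented for illustration; actual savings depend on your traffic and pricing):

```python
# Assumed traffic mix and per-query costs relative to the frontier tier (= 1.0);
# these are illustrative numbers, not measured Brick data.
mix = {"easy": 0.50, "medium": 0.35, "hard": 0.15}
rel_cost = {"easy": 0.2, "medium": 0.5, "hard": 1.0}

routed_cost = sum(mix[k] * rel_cost[k] for k in mix)  # expected cost per query
savings = 1.0 - routed_cost                           # vs. always-frontier routing
print(f"routed={routed_cost:.3f}, savings={savings:.1%}")
```

With this mix the expected per-query cost falls to roughly 0.43 of the frontier baseline, i.e. savings in the range the bullet above describes.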
347
 
348
+ ### Out-of-Scope Uses
349
+ - **Content moderation or safety filtering** -- this model classifies cognitive difficulty, not content safety
350
  - **Non-English queries** -- trained on English data only; accuracy degrades significantly on other languages
351
  - **Direct use as a chatbot or generative model** -- this is a classification adapter, not a generative model
352
 
 
364
  |---|---|
365
  | **Base model** | Qwen/Qwen3.5-0.8B |
366
  | **LoRA rank (r)** | 16 |
367
+ | **LoRA alpha** | 32 |
368
  | **LoRA dropout** | 0.05 |
369
  | **Target modules** | q_proj, v_proj |
370
  | **Learning rate** | 2e-4 |
 
376
  | **Training samples** | 65,307 |
377
  | **Validation samples** | 7,683 |
378
  | **Test samples** | 3,841 |
379
+ | **Training hardware** | 1x NVIDIA A100 80GB |
380
  | **Training time** | ~2 hours |
381
  | **Framework** | PyTorch + HuggingFace PEFT |
382
 
 
386
 
387
  | Metric | Value |
388
  |---|---|
389
+ | **Hardware** | 1x NVIDIA A100 80GB |
390
  | **Training duration** | ~2 hours |
391
+ | **Estimated CO2** | < 0.5 kg CO2eq |
392
  | **Energy source** | Renewable (certified) |
393
  | **Location** | Italy (EU) |
394
 
 
411
 
412
  <div align="center">
413
 
414
+ **[Website](https://regolo.ai) | [Docs](https://docs.regolo.ai) | [Discord](https://discord.gg/myuuVFcfJw) | [GitHub](https://github.com/regolo-ai) | [LinkedIn](https://www.linkedin.com/company/regolo-ai/)**
415
 
416
  </div>
brick-complexity-extractor-BF16.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6fc8392a811ff1b3dbdb7348110893bac25f912540a58ae7ff4e1cb96ceced92
3
+ size 1516736384
brick-complexity-extractor-Q4_K_M.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8bb38e63a7eeabddd729f2cdadfc7bd04b82aea413778e77bd4dee2b03a5489e
3
+ size 529289088
brick-complexity-extractor-Q8_0.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1f74b88a1b7149dd9074eed60cadfc7555fca227ddbc1c71ec30a635f7cd3913
3
+ size 811835264