root committed on
Commit 60a6bce · 1 Parent(s): e6d73e1

Add GGUF quantized models (BF16, Q8_0, Q4_K_M) and update README with GGUF usage section

.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  tokenizer.json filter=lfs diff=lfs merge=lfs -text
37
+ *.gguf filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -7,6 +7,7 @@ tags:
7
  - peft
8
  - safetensors
9
  - lora
 
10
  - complexity-classification
11
  - llm-routing
12
  - query-difficulty
@@ -41,11 +42,11 @@ model-index:
41
 
42
  <div align="center">
43
 
44
- # 🧱 Brick Complexity Extractor
45
 
46
  ### A lightweight LoRA adapter for real-time query complexity classification
47
 
48
- **[Regolo.ai](https://regolo.ai) · [Dataset](https://huggingface.co/datasets/regolo/brick-complexity-extractor) · [Brick SR1 on GitHub](https://github.com/regolo-ai/brick-SR1) · [API Docs](https://docs.regolo.ai)**
49
 
50
  [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
51
  [![Base Model](https://img.shields.io/badge/Base-Qwen3.5--0.8B-blue)](https://huggingface.co/Qwen/Qwen3.5-0.8B)
@@ -64,6 +65,7 @@ model-index:
64
  - [Label Definitions](#label-definitions)
65
  - [Performance](#performance)
66
  - [Quick Start](#quick-start)
 
67
  - [Integration with Brick Semantic Router](#integration-with-brick-semantic-router)
68
  - [Intended Uses](#intended-uses)
69
  - [Limitations](#limitations)
@@ -82,7 +84,7 @@ The adapter adds only **~2M trainable parameters** on top of the 0.8B base model
82
 
83
  ## The Problem: Why LLM Routing Needs Complexity Classification
84
 
85
- Not all prompts are equal. A factual recall question ("What is the capital of France?") and a multi-step reasoning task ("Derive the optimal portfolio allocation given these constraints") require fundamentally different compute budgets. Sending every query to a frontier reasoning model wastes resources; sending hard queries to a lightweight model degrades quality.
86
 
87
  **Brick** solves this by routing each query to the right model tier in real time. Complexity classification is one of several routing signals (alongside keyword matching, domain detection, and reasoning-depth estimation) that Brick uses to make sub-50ms routing decisions.
88
 
@@ -110,26 +112,26 @@ The adapter applies LoRA to the query and value projection matrices (`q_proj`, `
110
 
111
  ```
112
  Qwen3.5-0.8B (frozen)
113
- └── Attention Layers × 24
114
- ├── q_proj LoRA(r=16, α=32)
115
- └── v_proj LoRA(r=16, α=32)
116
- └── Last Hidden State
117
- └── Classification Head (3 classes)
118
  ```
119
 
120
  ## Label Definitions
121
 
122
  | Label | Reasoning Steps | Description | Example |
123
  |---|---|---|---|
124
- | **easy** | 12 | Surface knowledge, factual recall, simple lookups | "What is the capital of Italy?" |
125
- | **medium** | 35 | Domain familiarity, multi-step reasoning, comparison | "Compare REST and GraphQL for a mobile app backend" |
126
  | **hard** | 6+ | Deep expertise, multi-constraint optimization, creative synthesis | "Design a distributed cache eviction policy that minimizes tail latency under bursty traffic" |
127
 
128
  Labels were generated by **Qwen3.5-122B** acting as an LLM judge on 76,831 diverse user prompts. See the [dataset card](https://huggingface.co/datasets/regolo/brick-complexity-extractor) for full labeling methodology.
129
 
130
  ## Performance
131
 
132
- ### Classification Metrics (Test Set 3,841 samples)
133
 
134
  | Metric | Value |
135
  |---|---|
@@ -197,6 +199,107 @@ print(f"Complexity: {predicted}")
197
  # https://github.com/regolo-ai/brick-SR1
198
  ```
199
200
  ## Integration with Brick Semantic Router
201
 
202
  Brick Complexity Extractor is designed to work as a signal within the **Brick Semantic Router** pipeline. In a typical deployment:
@@ -236,14 +339,14 @@ model_pools:
236
 
237
  ## Intended Uses
238
 
239
- ### Primary Use Cases
240
- - **LLM routing**: Classify query complexity to route to the optimal model tier, reducing inference cost by 3060% compared to always-frontier routing
241
  - **Reasoning budget allocation**: Decide how many reasoning tokens to allocate before inference begins
242
  - **Traffic shaping**: Balance GPU load across model pools based on real-time complexity distribution
243
  - **Cost monitoring**: Track complexity distribution over time to optimize fleet sizing
244
 
245
- ### ⚠️ Out-of-Scope Uses
246
- - **Content moderation or safety filtering** this model classifies cognitive difficulty, not content safety
247
  - **Non-English queries** -- trained on English data only; accuracy degrades significantly on other languages
248
  - **Direct use as a chatbot or generative model** -- this is a classification adapter, not a generative model
249
 
@@ -261,7 +364,7 @@ model_pools:
261
  |---|---|
262
  | **Base model** | Qwen/Qwen3.5-0.8B |
263
  | **LoRA rank (r)** | 16 |
264
- | **LoRA alpha (α)** | 32 |
265
  | **LoRA dropout** | 0.05 |
266
  | **Target modules** | q_proj, v_proj |
267
  | **Learning rate** | 2e-4 |
@@ -273,7 +376,7 @@ model_pools:
273
  | **Training samples** | 65,307 |
274
  | **Validation samples** | 7,683 |
275
  | **Test samples** | 3,841 |
276
- | **Training hardware** | NVIDIA A100 80GB |
277
  | **Training time** | ~2 hours |
278
  | **Framework** | PyTorch + HuggingFace PEFT |
279
 
@@ -283,9 +386,9 @@ Regolo.ai is committed to sustainable AI. This model was trained on GPU infrastr
283
 
284
  | Metric | Value |
285
  |---|---|
286
- | **Hardware** | NVIDIA A100 80GB |
287
  | **Training duration** | ~2 hours |
288
- | **Estimated CO₂** | < 0.5 kg CO₂eq |
289
  | **Energy source** | Renewable (certified) |
290
  | **Location** | Italy (EU) |
291
 
@@ -308,6 +411,6 @@ Regolo.ai is committed to sustainable AI. This model was trained on GPU infrastr
308
 
309
  <div align="center">
310
 
311
- **[Website](https://regolo.ai) · [Docs](https://docs.regolo.ai) · [Discord](https://discord.gg/myuuVFcfJw) · [GitHub](https://github.com/regolo-ai) · [LinkedIn](https://www.linkedin.com/company/regolo-ai/)**
312
 
313
  </div>
 
7
  - peft
8
  - safetensors
9
  - lora
10
+ - gguf
11
  - complexity-classification
12
  - llm-routing
13
  - query-difficulty
 
42
 
43
  <div align="center">
44
 
45
+ # Brick Complexity Extractor
46
 
47
  ### A lightweight LoRA adapter for real-time query complexity classification
48
 
49
+ **[Regolo.ai](https://regolo.ai) | [Dataset](https://huggingface.co/datasets/regolo/brick-complexity-extractor) | [Brick SR1 on GitHub](https://github.com/regolo-ai/brick-SR1) | [API Docs](https://docs.regolo.ai)**
50
 
51
  [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
52
  [![Base Model](https://img.shields.io/badge/Base-Qwen3.5--0.8B-blue)](https://huggingface.co/Qwen/Qwen3.5-0.8B)
 
65
  - [Label Definitions](#label-definitions)
66
  - [Performance](#performance)
67
  - [Quick Start](#quick-start)
68
+ - [GGUF Quantized Models](#gguf-quantized-models)
69
  - [Integration with Brick Semantic Router](#integration-with-brick-semantic-router)
70
  - [Intended Uses](#intended-uses)
71
  - [Limitations](#limitations)
 
84
 
85
  ## The Problem: Why LLM Routing Needs Complexity Classification
86
 
87
+ Not all prompts are equal. A factual recall question ("What is the capital of France?") and a multi-step reasoning task ("Derive the optimal portfolio allocation given these constraints...") require fundamentally different compute budgets. Sending every query to a frontier reasoning model wastes resources; sending hard queries to a lightweight model degrades quality.
88
 
89
  **Brick** solves this by routing each query to the right model tier in real time. Complexity classification is one of several routing signals (alongside keyword matching, domain detection, and reasoning-depth estimation) that Brick uses to make sub-50ms routing decisions.
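As a minimal illustration of the tier routing described above (our own sketch with invented tier names, not Brick's actual routing code), the complexity label simply selects a model pool:

```python
# Hypothetical sketch: map the classifier's output label to a model tier.
# Tier names are placeholders; in Brick this signal is combined with keyword
# matching, domain detection, and reasoning-depth estimation.
COMPLEXITY_TIERS = {
    "easy": "lightweight-model",
    "medium": "mid-tier-model",
    "hard": "frontier-reasoning-model",
}

def route_by_complexity(label: str) -> str:
    """Return the model tier for a predicted complexity label."""
    return COMPLEXITY_TIERS[label]

print(route_by_complexity("hard"))  # frontier-reasoning-model
```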
90
 
 
112
 
113
  ```
114
  Qwen3.5-0.8B (frozen)
115
+ +-- Attention Layers x 24
116
+ |-- q_proj <- LoRA(r=16, alpha=32)
117
+ +-- v_proj <- LoRA(r=16, alpha=32)
118
+ +-- Last Hidden State
119
+ +-- Classification Head (3 classes)
120
  ```
121
 
122
  ## Label Definitions
123
 
124
  | Label | Reasoning Steps | Description | Example |
125
  |---|---|---|---|
126
+ | **easy** | 1-2 | Surface knowledge, factual recall, simple lookups | "What is the capital of Italy?" |
127
+ | **medium** | 3-5 | Domain familiarity, multi-step reasoning, comparison | "Compare REST and GraphQL for a mobile app backend" |
128
  | **hard** | 6+ | Deep expertise, multi-constraint optimization, creative synthesis | "Design a distributed cache eviction policy that minimizes tail latency under bursty traffic" |
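The reasoning-step thresholds in the table can be expressed as a small helper (a sketch of the table's boundaries, not code shipped with the model):

```python
def label_from_steps(steps: int) -> str:
    # Thresholds from the Label Definitions table: 1-2 easy, 3-5 medium, 6+ hard
    if steps <= 2:
        return "easy"
    if steps <= 5:
        return "medium"
    return "hard"

print(label_from_steps(1), label_from_steps(4), label_from_steps(8))
# easy medium hard
```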
129
 
130
  Labels were generated by **Qwen3.5-122B** acting as an LLM judge on 76,831 diverse user prompts. See the [dataset card](https://huggingface.co/datasets/regolo/brick-complexity-extractor) for full labeling methodology.
131
 
132
  ## Performance
133
 
134
+ ### Classification Metrics (Test Set -- 3,841 samples)
135
 
136
  | Metric | Value |
137
  |---|---|
 
199
  # https://github.com/regolo-ai/brick-SR1
200
  ```
201
 
202
+ ---
203
+
204
+ ## GGUF Quantized Models
205
+
206
+ Pre-built GGUF files are available for inference with [llama.cpp](https://github.com/ggml-org/llama.cpp), [Ollama](https://ollama.com), [LM Studio](https://lmstudio.ai), [vLLM](https://github.com/vllm-project/vllm), and other GGUF-compatible runtimes.
207
+
208
+ These files contain the **full merged model** (base Qwen3.5-0.8B + LoRA adapter merged), so no separate adapter loading is needed.
209
+
210
+ ### Available Quantizations
211
+
212
+ | File | Quant | Size | BPW | Notes |
213
+ |---|---|---|---|---|
214
+ | `brick-complexity-extractor-BF16.gguf` | BF16 | 1.5 GB | 16.0 | Full precision, no quality loss |
215
+ | `brick-complexity-extractor-Q8_0.gguf` | Q8_0 | 775 MB | 8.0 | Near-lossless, recommended for accuracy |
216
+ | `brick-complexity-extractor-Q4_K_M.gguf` | Q4_K_M | 494 MB | 5.5 | Best quality/size ratio |
217
+
218
+ ### Usage with llama.cpp
219
+
220
+ ```bash
221
+ # Download a quantized model
222
+ huggingface-cli download regolo/brick-complexity-extractor \
223
+ brick-complexity-extractor-Q8_0.gguf \
224
+ --local-dir ./models
225
+
226
+ # Run inference
227
+ ./llama-cli -m ./models/brick-complexity-extractor-Q8_0.gguf \
228
+ -p "<|im_start|>system
229
+ You are a query difficulty classifier for an LLM routing system.
230
+ Classify each query as easy, medium, or hard based on the cognitive depth and domain expertise required to answer correctly.
231
+ Respond with ONLY one word: easy, medium, or hard.<|im_end|>
232
+ <|im_start|>user
233
+ Classify: What is the capital of France?<|im_end|>
234
+ <|im_start|>assistant
235
+ " \
236
+ -n 5 --temp 0
237
+ ```
238
+
239
+ ### Usage with Ollama
240
+
241
+ ```bash
242
+ # Create a Modelfile
243
+ cat > Modelfile <<EOF
244
+ FROM ./brick-complexity-extractor-Q8_0.gguf
245
+
246
+ SYSTEM """You are a query difficulty classifier for an LLM routing system.
247
+ Classify each query as easy, medium, or hard based on the cognitive depth and domain expertise required to answer correctly.
248
+ Respond with ONLY one word: easy, medium, or hard."""
249
+
250
+ TEMPLATE """<|im_start|>system
251
+ {{ .System }}<|im_end|>
252
+ <|im_start|>user
253
+ Classify: {{ .Prompt }}<|im_end|>
254
+ <|im_start|>assistant
255
+ """
256
+
257
+ PARAMETER temperature 0
258
+ PARAMETER num_predict 5
259
+ EOF
260
+
261
+ ollama create brick-complexity -f Modelfile
262
+ ollama run brick-complexity "Design a distributed consensus algorithm"
263
+ # Output: hard
264
+ ```
265
+
266
+ ### Usage with vLLM
267
+
268
+ ```python
269
+ from vllm import LLM, SamplingParams
270
+
271
+ llm = LLM(
272
+ model="regolo/brick-complexity-extractor",
273
+ quantization="gguf",
274
+ # Point to a specific GGUF file:
275
+ # model="./brick-complexity-extractor-Q8_0.gguf"
276
+ )
277
+
278
+ sampling_params = SamplingParams(temperature=0, max_tokens=5)
279
+
280
+ prompt = """<|im_start|>system
281
+ You are a query difficulty classifier for an LLM routing system.
282
+ Classify each query as easy, medium, or hard.
283
+ Respond with ONLY one word: easy, medium, or hard.<|im_end|>
284
+ <|im_start|>user
285
+ Classify: Explain the rendering equation from radiometric first principles<|im_end|>
286
+ <|im_start|>assistant
287
+ """
288
+
289
+ output = llm.generate([prompt], sampling_params)
290
+ print(output[0].outputs[0].text.strip())
291
+ # Output: hard
292
+ ```
293
+
294
+ ### Important Note on GGUF Inference
295
+
296
+ The GGUF models use **generative text output** (the model generates the word "easy", "medium", or "hard") rather than the logit-based classification used by the LoRA adapter. This means:
297
+
298
+ - **LoRA adapter (recommended for production)**: Uses logit extraction at the last token position for the three label tokens. Faster and more reliable.
299
+ - **GGUF (recommended for local/edge deployment)**: Generates the classification label as text. Slightly lower accuracy but works with any GGUF runtime without Python dependencies.
300
+
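The logit-extraction approach can be sketched as below. The comparison logic is shown with synthetic logits and invented token ids; in practice `last_token_logits` would come from the adapter-loaded model's output at the final token position:

```python
def classify_from_logits(last_token_logits, label_token_ids, labels):
    # Compare only the logits of the three label tokens at the last position
    scores = [last_token_logits[i] for i in label_token_ids]
    return labels[scores.index(max(scores))]

# Synthetic stand-in for the model's last-position logits over the vocabulary;
# the token ids (101, 202, 303) are hypothetical placeholders.
vocab_logits = [-5.0] * 32000
vocab_logits[101], vocab_logits[202], vocab_logits[303] = 2.1, 0.3, -1.0

print(classify_from_logits(vocab_logits, [101, 202, 303], ["easy", "medium", "hard"]))
# easy
```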
301
+ ---
302
+
303
  ## Integration with Brick Semantic Router
304
 
305
  Brick Complexity Extractor is designed to work as a signal within the **Brick Semantic Router** pipeline. In a typical deployment:
 
339
 
340
  ## Intended Uses
341
 
342
+ ### Primary Use Cases
343
+ - **LLM routing**: Classify query complexity to route to the optimal model tier, reducing inference cost by 30-60% compared to always-frontier routing
344
  - **Reasoning budget allocation**: Decide how many reasoning tokens to allocate before inference begins
345
  - **Traffic shaping**: Balance GPU load across model pools based on real-time complexity distribution
346
  - **Cost monitoring**: Track complexity distribution over time to optimize fleet sizing
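To make the cost-reduction claim concrete, here is a back-of-envelope sketch with an assumed traffic mix and relative per-query costs (both invented for illustration; actual savings depend on your traffic and pricing):

```python
# Assumed traffic mix and per-query costs relative to the frontier tier (= 1.0);
# these are illustrative numbers, not measured Brick data.
mix = {"easy": 0.50, "medium": 0.35, "hard": 0.15}
rel_cost = {"easy": 0.2, "medium": 0.5, "hard": 1.0}

routed_cost = sum(mix[k] * rel_cost[k] for k in mix)  # expected cost per query
savings = 1.0 - routed_cost                           # vs. always-frontier routing
print(f"routed={routed_cost:.3f}, savings={savings:.1%}")
```

With this mix the expected per-query cost falls to roughly 0.43 of the frontier baseline, i.e. savings in the range the bullet above describes.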
347
 
348
+ ### Out-of-Scope Uses
349
+ - **Content moderation or safety filtering** -- this model classifies cognitive difficulty, not content safety
350
  - **Non-English queries** -- trained on English data only; accuracy degrades significantly on other languages
351
  - **Direct use as a chatbot or generative model** -- this is a classification adapter, not a generative model
352
 
 
364
  |---|---|
365
  | **Base model** | Qwen/Qwen3.5-0.8B |
366
  | **LoRA rank (r)** | 16 |
367
+ | **LoRA alpha** | 32 |
368
  | **LoRA dropout** | 0.05 |
369
  | **Target modules** | q_proj, v_proj |
370
  | **Learning rate** | 2e-4 |
 
376
  | **Training samples** | 65,307 |
377
  | **Validation samples** | 7,683 |
378
  | **Test samples** | 3,841 |
379
+ | **Training hardware** | 1x NVIDIA A100 80GB |
380
  | **Training time** | ~2 hours |
381
  | **Framework** | PyTorch + HuggingFace PEFT |
382
 
 
386
 
387
  | Metric | Value |
388
  |---|---|
389
+ | **Hardware** | 1x NVIDIA A100 80GB |
390
  | **Training duration** | ~2 hours |
391
+ | **Estimated CO2** | < 0.5 kg CO2eq |
392
  | **Energy source** | Renewable (certified) |
393
  | **Location** | Italy (EU) |
394
 
 
411
 
412
  <div align="center">
413
 
414
+ **[Website](https://regolo.ai) | [Docs](https://docs.regolo.ai) | [Discord](https://discord.gg/myuuVFcfJw) | [GitHub](https://github.com/regolo-ai) | [LinkedIn](https://www.linkedin.com/company/regolo-ai/)**
415
 
416
  </div>
brick-complexity-extractor-BF16.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6fc8392a811ff1b3dbdb7348110893bac25f912540a58ae7ff4e1cb96ceced92
3
+ size 1516736384
brick-complexity-extractor-Q4_K_M.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8bb38e63a7eeabddd729f2cdadfc7bd04b82aea413778e77bd4dee2b03a5489e
3
+ size 529289088
brick-complexity-extractor-Q8_0.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1f74b88a1b7149dd9074eed60cadfc7555fca227ddbc1c71ec30a635f7cd3913
3
+ size 811835264