Upload README.md with huggingface_hub
README.md
CHANGED

@@ -41,229 +41,197 @@ model-index:
name: Avg Latency (ms, CUDA)
---

| Tier | Stock | Fine-Tuned | Delta |
|------|-------------------|---------------|-------------|
| Simple (33) | 87.9% | 81.8% | -6.1% |
| **Medium (14)** | **14.3%** | **85.7%** | **+71.4%** |
| Complex (7) | 100.0% | 85.7% | -14.3% |
| **Overall (54)** | **70.4%** | **83.3%** | **+13.0%** |
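The overall figures are the count-weighted averages of the per-tier accuracies; a quick stdlib check, with tier counts and percentages taken from the table above:

```python
# Per-tier eval counts and accuracies from the table: (n, stock, fine-tuned)
tiers = {
    "simple":  (33, 0.879, 0.818),
    "medium":  (14, 0.143, 0.857),
    "complex": (7, 1.000, 0.857),
}

total = sum(n for n, _, _ in tiers.values())             # 54 eval prompts
stock = sum(n * s for n, s, _ in tiers.values()) / total
tuned = sum(n * f for n, _, f in tiers.values()) / total

print(f"stock {stock:.1%}, fine-tuned {tuned:.1%}")  # stock 70.4%, fine-tuned 83.3%
```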
### Latency (GGUF Q8_0 vs Transformers FP16)

| Backend | Size | Avg Latency | vs FP16 |
|---------|-----------|-------------|---------------------|
| Transformers FP16 | ~3.0 GB | 196ms | baseline |
| GGUF Q8_0 (CPU) | 1.6 GB | 768ms | 3.9x slower |
| **GGUF Q8_0 (CUDA)** | **1.6 GB** | **62ms** | **3.2x faster** |

### Accuracy after quantization

| Backend | Stock | Fine-Tuned |
|---------|---------------|-------------------|
| Transformers FP16 | 72.2% | 81.5% |
| GGUF Q8_0 | 70.4% | 83.3% |

### No-CoT vs CoT

| Model | Accuracy | Avg Latency | Notes |
|---------|----------|-------------|---------|
| Stock | 72.2% | 196ms | Baseline |
| **ModelGate-Router (No-CoT)** | **81.5%** | **198ms** | **Best tradeoff** |
| ModelGate-Router (CoT) | 61.1% | 1,787ms | Overfit, 9x slower |
```
ModelGate-Router.Q8_0.gguf (1.6 GB, production-ready)
```

| File | Size | Description |
|------|------|-------------|
| `ModelGate-Router.Q8_0.gguf` | 1.6 GB | ModelGate-Router — GGUF Q8_0, deploy with llama.cpp |
| `stock_arch_router.Q8_0.gguf` | 1.6 GB | Stock Arch-Router in GGUF Q8_0, for comparison |
| `ModelGate-Router-LoRA/` | 157 MB | ModelGate-Router LoRA adapter (best model) |
| `modelgate_arch_router_lora/` | 157 MB | CoT LoRA adapter (for reference only) |
### Data

| File | Description |
|------|-------------|
| `grpo_finetune_arch_router.ipynb` | CoT training notebook (Colab/local) |
| `grpo_run_nocot.py` | No-CoT training script (the one that produced the best model) |
| `export_gguf.py` | Merges LoRA + converts to GGUF Q8_0 |
| `bench_gguf.py` | Benchmarks GGUF models via llama.cpp (accuracy + latency) |
| `bench_stock_vs_finetune.py` | Benchmarks via Transformers (FP16, 3-way comparison) |
| Domain | Count |
|------|-------------|
| insurance_claims | 46 |
| device_protection | 37 |
| general | 38 |

| Tier | Count | Example |
|------|------|-------------|
| medium | 51 | "Compare the protection plans available for my new laptop..." |
| complex | 26 | "Analyze the multi-party liability exposure across claims..." |
## Training

| Setting | Value |
|---------|-------|
| Method | GRPO via Unsloth + TRL |
| LoRA rank | 32 |
| Trainable params | 36.9M / 1.58B (2.3%) |
| Training steps | 150 |
| Training time | **2.5 minutes** |
| Hardware | RTX 3080 Laptop 8GB |
| VRAM usage | ~6 GB (4-bit quantized during training) |
| Generations per prompt | 4 |
| Learning rate | 5e-6 |
| Max completion length | 64 tokens |
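GRPO needs only a scalar reward per sampled completion. The actual reward in `grpo_run_nocot.py` is not shown here; as an illustration, a route-classification reward could score a completion like this (the partial-credit value is a made-up assumption):

```python
import json

VALID_ROUTES = {"simple", "medium", "complex"}

def route_reward(completion: str, expected: str) -> float:
    """Illustrative GRPO reward: 1.0 for the correct route in valid JSON,
    0.2 for valid JSON with a wrong-but-legal route, 0.0 otherwise."""
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    route = data.get("route") if isinstance(data, dict) else None
    if route == expected:
        return 1.0
    if route in VALID_ROUTES:
        return 0.2
    return 0.0

print(route_reward('{"route": "medium"}', "medium"))  # 1.0
```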
```bash
# Train
python finetuning/grpo_run_nocot.py
# Output: ModelGate-Router-LoRA/
```

```bash
# Export to GGUF
python finetuning/export_gguf.py nocot
# Output: finetuning/ModelGate-Router.Q8_0.gguf
```

```bash
# GGUF benchmark (requires llama-cpp-python with CUDA)
python finetuning/bench_gguf.py

# Transformers FP16 benchmark (3-way: stock vs no-CoT vs CoT)
python finetuning/bench_stock_vs_finetune.py
```
## Deployment

The recommended deployment uses `ModelGate-Router.Q8_0.gguf` with llama.cpp:

```python
import json

from llama_cpp import Llama

model = Llama(
    model_path="finetuning/ModelGate-Router.Q8_0.gguf",
    n_ctx=512,
    n_gpu_layers=-1,  # All layers on GPU
)

# Classify a query
response = model.create_chat_completion(
    messages=[{"role": "user", "content": routing_prompt}],  # your routing prompt
    max_tokens=30,
    temperature=0,
)
route = json.loads(response["choices"][0]["message"]["content"])["route"]
# route is "simple", "medium", or "complex"
```
## References

- [TRL GRPOTrainer](https://huggingface.co/docs/trl/main/en/grpo_trainer) — GRPO implementation
- [llama.cpp](https://github.com/ggerganov/llama.cpp) — GGUF inference engine
<p align="center">
  <img src="banner.svg" alt="ModelGate Banner" width="100%"/>
</p>

# ModelGate

**Intelligent AI Routing - Built from Your Contracts**

One line of code changed. Millions of premium calls rerouted.

ModelGate is a contract-aware AI control plane that ingests customer contracts, extracts SLA/privacy/routing constraints, and generates an OpenAI-compatible endpoint that automatically routes every request to the optimal model. Simple queries go to cheap models. Complex queries go to premium ones. Contract compliance is enforced per request, automatically.

**3rd Place** at the KSU Social Good Hackathon 2026 - Assurant Track.

### Team Agents Assemble

| Member | Role |
|---|---|
| **[Aaryan Kapoor](https://www.linkedin.com/in/theaaryankapoor/)** | Lead Architect & AI Engineer |
| **[Pradyumna Kumar](https://www.linkedin.com/in/pradyum-kumar/)** | Platform Architect & Frontend |
| **[Danny Tran](https://www.linkedin.com/in/nam-tr%E1%BA%A7n-02973b2b6/)** | Design & Presentation Lead |

## Why

Over 30 new LLMs launched in the past month alone. No team has time to evaluate them all - so they pick one premium model and send everything to it. The result: 50-90% of enterprise AI spend is wasted on over-provisioned models, and premium models consume 180x more energy per query than small ones.

ModelGate fixes this. You change one line of code - your `base_url` - and we handle model selection, contract compliance, and cost optimization automatically.
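Concretely, the "one line" is the base URL your OpenAI-compatible client points at. A minimal stdlib sketch of the request shape; the customer id `acme-corp` is a placeholder, and the route shape follows the proxy endpoint `POST /v1/{customer_id}/chat/completions`:

```python
import json

# The one line you change: point your client's base_url at ModelGate.
# "acme-corp" is a placeholder customer id.
BASE_URL = "http://localhost:8000/v1/acme-corp"

endpoint = f"{BASE_URL}/chat/completions"
payload = {
    "model": "auto",  # let the router pick the model
    "messages": [{"role": "user", "content": "Summarize my claim status."}],
}
body = json.dumps(payload).encode()  # POST this body to `endpoint`

print(endpoint)
```

Any OpenAI-compatible client works the same way: keep your existing code and swap only the base URL.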
## Results

### MMLU Routing Benchmark (60 questions, 6 subjects)

We benchmarked ModelGate against always routing to GPT-5.4 (default reasoning):

| Metric | GPT-5.4 Direct | ModelGate Router | Delta |
|---|---|---|---|
| **Overall Accuracy** | 90% | 85% | -5pp |
| **Hard Accuracy** | 80% | 80% | 0 |
| **Cost** | $0.023 | $0.0095 | **-59%** |

The router sent 68% of queries to Gemini Flash Lite, 17% to GPT-4o-mini, and only 15% to GPT-5.4. Hard questions were routed correctly - the cost savings come from not overpaying on easy ones.

Projected at 10k requests/month: **$1.58 vs $3.83** (59% savings).
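The projection is the benchmark cost scaled linearly from 60 questions to 10,000 requests:

```python
# Scale the 60-question benchmark cost to a 10k-requests/month workload.
bench_questions = 60
monthly_requests = 10_000

direct = 0.023 / bench_questions * monthly_requests    # always GPT-5.4
routed = 0.0095 / bench_questions * monthly_requests   # ModelGate router
savings = 1 - routed / direct

print(f"${routed:.2f} vs ${direct:.2f} ({savings:.0%} savings)")  # $1.58 vs $3.83 (59% savings)
```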
### Fine-Tuned Classification Model (GRPO Reinforcement Learning)

We fine-tuned ModelGate-Router (based on Arch-Router-1.5B) using GRPO to fix a critical blind spot: the stock model misclassified 86% of medium-complexity queries as complex.

| Tier | Stock | Fine-Tuned | Delta |
|---|---|---|---|
| Simple | 87.9% | 81.8% | -6.1pp |
| **Medium** | **14.3%** | **85.7%** | **+71.4pp** |
| Complex | 100% | 85.7% | -14.3pp |
| **Overall** | **70.4%** | **83.3%** | **+13.0pp** |

- **Training:** 2.5 minutes, 150 steps, 172 labeled prompts, LoRA rank 32 (2.3% of params)
- **Hardware:** RTX 3080 Laptop, 8GB VRAM
- **Inference:** GGUF Q8_0 quantized to 1.6 GB, runs at **62ms** per classification (3.2x faster than FP16)
- **Eval:** 54 held-out prompts, zero overlap with training data
- **Download:** [ModelGate-Router on HuggingFace](https://huggingface.co/AaryanK/ModelGate)
## How It Works

```
Contract (PDF/text) → LLM extracts constraints → Customer AI Profile → OpenAI-compatible endpoint
                                                                                ↓
                                                                         Prompt received
                                                                                ↓
                                                                    ModelGate-Router classifies
                                                                    (simple / medium / complex)
                                                                                ↓
                                                                       Route to optimal model
                                                                      per contract constraints
```

1. **Upload** a customer contract (SLA, privacy docs, compliance requirements)
2. **Extract** - an LLM analyzes the contract and produces a structured customer profile (region restrictions, allowed providers, latency targets, cost sensitivity)
3. **Route** - each request is classified by the fine-tuned 1.5B router (~62ms) and sent to the cheapest model that satisfies all contract constraints
4. **Monitor** - dashboard shows routing decisions, model distribution, cost savings, and per-request traces
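The routing step amounts to a filter-then-minimize over the model catalog; the model names, prices, and fields below are illustrative, not the actual `provider_registry.py` schema:

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    provider: str
    cost_per_1k: float  # USD per 1k tokens (illustrative)
    tiers: frozenset    # complexity tiers the model is trusted with

# Illustrative catalog; real pricing/capabilities live in the provider registry.
CATALOG = [
    Model("gemini-flash-lite", "google", 0.0002, frozenset({"simple"})),
    Model("gpt-4o-mini", "openai", 0.0006, frozenset({"simple", "medium"})),
    Model("gpt-5.4", "openai", 0.0040, frozenset({"simple", "medium", "complex"})),
]

def route(tier: str, allowed_providers: set) -> Model:
    """Cheapest model that handles the tier and satisfies contract constraints."""
    candidates = [
        m for m in CATALOG
        if tier in m.tiers and m.provider in allowed_providers
    ]
    return min(candidates, key=lambda m: m.cost_per_1k)

print(route("medium", {"google", "openai"}).name)  # gpt-4o-mini
```

Contract constraints (here just `allowed_providers`) shrink the candidate set before cost is considered, which is why hard queries still reach premium models.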
## Architecture

```
[Next.js Dashboard :3000] → [FastAPI :8000] → [OpenRouter / Direct APIs]
                                  ↓
                      [ModelGate-Router GGUF]
                      (llama.cpp, CUDA, ~62ms)
```
| Component | Stack |
|---|---|
| Backend | Python, FastAPI, SQLite |
| Frontend | Next.js 16, TypeScript, Tailwind CSS, shadcn/ui, Recharts |
| Classification | ModelGate-Router (fine-tuned), GGUF Q8_0, llama-cpp-python |
| LLM Inference | OpenRouter (multi-provider: OpenAI, Google, Anthropic, etc.) |
| Contract Extraction | LLM-powered (GPT-5.4) |

## Quick Start

### Prerequisites

- Python 3.12 with PyTorch + CUDA
- Node.js 18+
- NVIDIA GPU (for classification model)
- OpenRouter API key

### Setup
```bash
git clone https://github.com/Aaryan-Kapoor/ModelGate-Hackathon
cd ModelGate-Hackathon

# Add your API key
cp .env.example .env
# Edit .env with your OPENROUTER_API_KEY

# Run everything
chmod +x scripts/start.sh
./scripts/start.sh
```

Or manually:

```bash
# Backend
python3.12 -m venv backend/venv --system-site-packages
source backend/venv/bin/activate
pip install -r backend/requirements.txt
python scripts/seed_data.py
uvicorn backend.main:app --port 8000

# Frontend (separate terminal)
cd frontend && npm install && npm run dev
```
### Access

- Dashboard: http://localhost:3000
- API Docs: http://localhost:8000/docs
- Proxy endpoint: `POST http://localhost:8000/v1/{customer_id}/chat/completions`
## Benchmarking

```bash
# Run MMLU benchmark against any OpenAI-compatible endpoint
python scripts/bench_mmlu.py run \
  --base-url http://localhost:8000/v1 \
  --api-key dummy \
  --model auto \
  --label router

# Compare two runs
python scripts/bench_mmlu.py compare results/run_a.json results/run_b.json

# Benchmark the classification model (GGUF)
python finetuning/bench_gguf.py
```

## Fine-Tuning

The fine-tuning pipeline lives in `finetuning/`. See [`finetuning/README.md`](finetuning/README.md) for full details.

```bash
# Train (2.5 min on RTX 3080 8GB)
python finetuning/grpo_run_nocot.py

# Export to GGUF
python finetuning/export_gguf.py nocot

# Benchmark stock vs fine-tuned
python finetuning/bench_gguf.py
```
## Project Structure

```
backend/
  main.py                     # FastAPI app
  services/
    classifier.py             # ModelGate-Router inference (llama.cpp)
    extractor.py              # Contract → Customer AI Profile (LLM)
    router_engine.py          # Model scoring and selection
    provider_registry.py      # Model catalog with pricing/capabilities
frontend/                     # Next.js dashboard
finetuning/
  grpo_run_nocot.py           # GRPO training script
  grpo_training_data.json     # 172 labeled training prompts
  grpo_eval_data.json         # 54 held-out eval prompts
  export_gguf.py              # LoRA merge + GGUF conversion
  bench_gguf.py               # GGUF benchmark (accuracy + latency)
  ModelGate-Router.Q8_0.gguf  # Production model (1.6 GB)
scripts/
  bench_mmlu.py               # MMLU benchmark runner
  mmlu_questions.json         # 60 real MMLU questions from HuggingFace
  start.sh                    # One-command startup
```