AaryanK committed on
Commit 7ad0908 · verified · 1 Parent(s): 230cc00

Upload README.md with huggingface_hub

Files changed (1): README.md (+141 -173)
README.md CHANGED
@@ -41,229 +41,197 @@ model-index:
  name: Avg Latency (ms, CUDA)
  ---
 
- # ModelGate-Router
-
- GRPO (Group Relative Policy Optimization) fine-tuned routing model for ModelGate's contract-aware query routing. Based on Arch-Router-1.5B. Classifies incoming queries as **simple**, **medium**, or **complex** to route them to the right model tier.
-
- ## Results
-
- ### Accuracy: Held-Out Eval (54 unseen prompts, zero training overlap)
-
- | Tier | Stock Arch-Router | ModelGate-Router | Improvement |
- |------|-------------------|------------------|-------------|
- | Simple (33) | 87.9% | 81.8% | -6.1% |
- | **Medium (14)** | **14.3%** | **85.7%** | **+71.4%** |
- | Complex (7) | 100.0% | 85.7% | -14.3% |
- | **Overall (54)** | **70.4%** | **83.3%** | **+13.0%** |
-
- The stock model misclassifies **86% of medium queries** as complex, routing them to expensive premium models when a mid-tier model would suffice. ModelGate-Router fixes this.
-
- ### Latency: GGUF Q8_0 + CUDA (RTX 3080 Laptop)
-
- | Metric | Stock | ModelGate-Router | Delta |
- |--------|-------|------------------|-------|
- | Avg | 62.6ms | 61.7ms | -0.9ms |
- | P50 | 61.3ms | 60.5ms | -0.8ms |
- | P95 | 67.8ms | 67.3ms | -0.5ms |
-
- **Zero latency overhead.** ModelGate-Router is actually marginally faster.
-
- ### Latency by Inference Backend
-
- | Backend | Model Size | Avg Latency | vs Transformers FP16 |
- |---------|------------|-------------|----------------------|
- | Transformers FP16 | ~3.0 GB | 196ms | baseline |
- | GGUF Q8_0 (CPU) | 1.6 GB | 768ms | 3.9x slower |
- | **GGUF Q8_0 (CUDA)** | **1.6 GB** | **62ms** | **3.2x faster** |
-
- ### Quantization Impact on Accuracy
-
- | Backend | Stock Accuracy | ModelGate-Router Accuracy |
- |---------|----------------|---------------------------|
- | Transformers FP16 | 72.2% | 81.5% |
- | GGUF Q8_0 | 70.4% | 83.3% |
-
- Q8_0 quantization causes **no meaningful accuracy degradation**.
-
- ### Chain-of-Thought vs Direct Output
-
- We trained two variants: one with chain-of-thought reasoning (`<reasoning>` tags before the answer) and one with direct JSON output. Results on held-out data:
-
- | Variant | Accuracy | Avg Latency | Verdict |
- |---------|----------|-------------|---------|
- | Stock | 72.2% | 196ms | Baseline |
- | **ModelGate-Router (No-CoT)** | **81.5%** | **198ms** | **Best tradeoff** |
- | ModelGate-Router (CoT) | 61.1% | 1,787ms | Overfit, 9x slower |
-
- The CoT variant actually hurt accuracy on unseen data (it overfit to the training format) and added ~1.6s of latency per classification. The No-CoT variant is the clear winner.
-
- ## Why Fine-Tune?
-
- The stock `katanemo/Arch-Router-1.5B` was trained for general-purpose intent routing. Our use case is specific: classify queries across customer support, insurance claims, and device protection into three complexity tiers. The stock model has a critical blind spot — it routes nearly all medium-complexity queries to the complex tier, wasting money on premium models.
-
- ## Architecture
-
  ```
- Qwen/Qwen2.5-1.5B-Instruct (base LLM, 1.5B params)
-             |
-             v  Katanemo fine-tune
- katanemo/Arch-Router-1.5B (general intent routing)
-             |
-             v  GRPO fine-tune (2.3% of params, LoRA rank 32)
- ModelGate-Router (domain-specific complexity routing)
-             |
-             v  GGUF Q8_0 quantization
- ModelGate-Router.Q8_0.gguf (1.6 GB, production-ready)
  ```
-
- ## Files
-
- ### Model Weights
-
- | File | Size | Description |
- |------|------|-------------|
- | `ModelGate-Router.Q8_0.gguf` | 1.6 GB | ModelGate-Router — GGUF Q8_0, deploy with llama.cpp |
- | `stock_arch_router.Q8_0.gguf` | 1.6 GB | Stock Arch-Router in GGUF Q8_0, for comparison |
- | `ModelGate-Router-LoRA/` | 157 MB | ModelGate-Router LoRA adapter (best model) |
- | `modelgate_arch_router_lora/` | 157 MB | CoT LoRA adapter (for reference only) |
-
- ### Data
-
- | File | Description |
- |------|-------------|
- | `grpo_training_data.json` | 172 labeled training prompts across 4 domains |
- | `grpo_eval_data.json` | 54 held-out eval prompts (zero overlap with training) |
-
- ### Scripts
-
- | File | Description |
- |------|-------------|
- | `grpo_finetune_arch_router.ipynb` | CoT training notebook (Colab/local) |
- | `grpo_run_nocot.py` | No-CoT training script (the one that produced the best model) |
- | `export_gguf.py` | Merges LoRA + converts to GGUF Q8_0 |
- | `bench_gguf.py` | Benchmarks GGUF models via llama.cpp (accuracy + latency) |
- | `bench_stock_vs_finetune.py` | Benchmarks via Transformers (FP16, 3-way comparison) |
-
- ## Training Data
-
- **172 training examples** across 4 domains and 3 tiers:
-
- | Domain | Count |
- |--------|-------|
- | customer_support | 51 |
- | insurance_claims | 46 |
- | device_protection | 37 |
- | general | 38 |
-
- | Tier | Count | Examples |
- |------|-------|----------|
- | simple | 95 | "What is your return policy?", "Is my claim approved?" |
- | medium | 51 | "Compare the protection plans available for my new laptop..." |
- | complex | 26 | "Analyze the multi-party liability exposure across claims..." |
-
- **54 eval examples** — completely separate prompts, same domain/tier distribution, zero overlap with training data.
 
 
 
-
- ## How GRPO Training Works
-
- Unlike supervised fine-tuning, where you provide input-output pairs, GRPO:
-
- 1. **Generates** multiple candidate completions per prompt
- 2. **Scores** each with reward functions
- 3. **Reinforces** the best completions relative to the group
-
- ### No-CoT Reward Functions
-
- | Function | Max Score | Purpose |
- |----------|-----------|---------|
- | `correctness_reward_func` | 2.0 | Route matches ground truth |
- | `valid_route_reward_func` | 0.5 | Output is a valid tier name |
- | `json_format_reward_func` | 1.0 | Output is clean JSON with "route" key |
- | `brevity_reward_func` | 0.5 | Rewards short outputs (just the JSON) |
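To make the reward table concrete, here is a minimal sketch of what these four functions could look like. The function names come from the table above, but the bodies and the `budget` threshold are illustrative assumptions, not the code in `grpo_run_nocot.py` (TRL's actual reward functions also operate on batches of completions, not single strings):

```python
import json

def correctness_reward_func(completion: str, ground_truth: str) -> float:
    """Max 2.0: the predicted route matches the labeled tier (illustrative)."""
    try:
        route = json.loads(completion.strip()).get("route")
    except json.JSONDecodeError:
        return 0.0
    return 2.0 if route == ground_truth else 0.0

def valid_route_reward_func(completion: str) -> float:
    """Max 0.5: the output names a valid tier at all."""
    return 0.5 if any(t in completion for t in ("simple", "medium", "complex")) else 0.0

def json_format_reward_func(completion: str) -> float:
    """Max 1.0: the output is clean JSON with a "route" key."""
    try:
        return 1.0 if "route" in json.loads(completion.strip()) else 0.0
    except json.JSONDecodeError:
        return 0.0

def brevity_reward_func(completion: str, budget: int = 24) -> float:
    """Max 0.5: reward short outputs (just the JSON, no preamble).
    The 24-char budget is a made-up threshold for illustration."""
    return 0.5 if len(completion.strip()) <= budget else 0.0

# A perfect completion earns the full 4.0 across all four functions.
total = sum([
    correctness_reward_func('{"route": "medium"}', "medium"),
    valid_route_reward_func('{"route": "medium"}'),
    json_format_reward_func('{"route": "medium"}'),
    brevity_reward_func('{"route": "medium"}'),
])
```

Summing independent, cheap-to-compute rewards like this lets GRPO shape both the answer and the output format at once.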
 
-
- ## Training Details
-
- | Parameter | Value |
- |-----------|-------|
- | Base model | `katanemo/Arch-Router-1.5B` |
- | Method | GRPO via Unsloth + TRL |
- | LoRA rank | 32 |
- | Trainable params | 36.9M / 1.58B (2.3%) |
- | Training steps | 150 |
- | Training time | **2.5 minutes** |
- | Hardware | RTX 3080 Laptop 8GB |
- | VRAM usage | ~6 GB (4-bit quantized during training) |
- | Generations per prompt | 4 |
- | Learning rate | 5e-6 |
- | Max completion length | 64 tokens |
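The hyperparameters above can be collected in one place like this. The field names here are illustrative; in the actual `grpo_run_nocot.py` they map onto Unsloth's LoRA setup and TRL's `GRPOConfig`:

```python
# Training hyperparameters from the table above (illustrative grouping;
# the real script passes these to Unsloth + TRL, not a plain dict).
grpo_config = {
    "base_model": "katanemo/Arch-Router-1.5B",
    "lora_rank": 32,
    "max_steps": 150,
    "num_generations": 4,         # candidate completions scored per prompt
    "learning_rate": 5e-6,
    "max_completion_length": 64,  # tokens; the output is just a short JSON object
    "load_in_4bit": True,         # keeps training under ~6 GB VRAM
}
```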
-
- ## How to Reproduce
-
- ### Train the No-CoT Model
-
  ```bash
- # Requires: pip install unsloth vllm trl
  python finetuning/grpo_run_nocot.py
- # Output: ModelGate-Router-LoRA/
- ```
-
- ### Export to GGUF
-
- ```bash
  python finetuning/export_gguf.py nocot
- # Output: finetuning/ModelGate-Router.Q8_0.gguf
- ```
-
- ### Benchmark
-
- ```bash
- # GGUF benchmark (requires llama-cpp-python with CUDA)
  python finetuning/bench_gguf.py
-
- # Transformers FP16 benchmark (3-way: stock vs no-CoT vs CoT)
- python finetuning/bench_stock_vs_finetune.py
  ```
-
- ## Production Deployment
-
- The recommended deployment uses `ModelGate-Router.Q8_0.gguf` with llama.cpp:
-
- ```python
- import json
-
- from llama_cpp import Llama
-
- model = Llama(
-     model_path="finetuning/ModelGate-Router.Q8_0.gguf",
-     n_ctx=512,
-     n_gpu_layers=-1,  # All layers on GPU
- )
-
- # Classify a query
- response = model.create_chat_completion(
-     messages=[{"role": "user", "content": routing_prompt}],
-     max_tokens=30,
-     temperature=0,
- )
- route = json.loads(response["choices"][0]["message"]["content"])["route"]
- # route is "simple", "medium", or "complex"
- ```
-
- **Expected performance**: ~62ms per classification, 83%+ accuracy, 1.6 GB VRAM.
-
- ## Route Policies
-
- The three tiers the model classifies into (defined in `backend/services/classifier.py`):
-
- | Tier | Description | Model Tier | Cost |
- |------|-------------|------------|------|
- | simple | FAQs, status checks, basic lookups | gpt-4o-mini, gemini-flash | $0.10-0.60/M tokens |
- | medium | Multi-step reasoning, comparisons, troubleshooting | gpt-4o, claude-sonnet | $2.50-15.00/M tokens |
- | complex | Multi-document analysis, legal/financial reasoning | gemini-2.5-pro, claude-sonnet | $2.50-15.00/M tokens |
-
- Correctly routing simple and medium queries to cheaper models instead of premium ones is the core value proposition. The stock model's 14% medium accuracy means it wastes money routing mid-tier queries to expensive models; ModelGate-Router's 86% medium accuracy captures those savings.
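A back-of-envelope sketch of that value proposition, using the training-set tier mix and illustrative mid-range prices from the table above (assumed points, not measured figures):

```python
# Illustrative per-token cost comparison: send everything to a premium model
# vs. route each tier to its own price bracket.
PRICE = {"simple": 0.35, "medium": 8.0, "complex": 8.0}  # $/M tokens, assumed midpoints

# Traffic share per tier, taken from the 172-example training distribution
MIX = {"simple": 95 / 172, "medium": 51 / 172, "complex": 26 / 172}

all_premium = sum(MIX[t] * PRICE["complex"] for t in MIX)  # everything goes premium
tier_correct = sum(MIX[t] * PRICE[t] for t in MIX)         # each tier to its bracket

savings = 1 - tier_correct / all_premium  # roughly half the spend, under these assumptions
```

Under these assumed prices and this query mix, tier-correct routing cuts blended token cost by roughly 50%; the real figure depends entirely on production traffic and pricing.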
-
- ## References
-
- - [Arch-Router-1.5B](https://huggingface.co/katanemo/Arch-Router-1.5B) — base model
- - [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) — foundation model
- - [Unsloth](https://github.com/unslothai/unsloth) — training framework
- - [TRL GRPOTrainer](https://huggingface.co/docs/trl/main/en/grpo_trainer) — GRPO implementation
- - [llama.cpp](https://github.com/ggerganov/llama.cpp) — GGUF inference engine
+ <p align="center">
+   <img src="banner.svg" alt="ModelGate Banner" width="100%"/>
+ </p>
+
+ # ModelGate
+
+ **Intelligent AI Routing - Built from Your Contracts**
+
+ One line of code changed. Millions of premium calls rerouted.
+
+ ModelGate is a contract-aware AI control plane that ingests customer contracts, extracts SLA/privacy/routing constraints, and generates an OpenAI-compatible endpoint that automatically routes every request to the optimal model. Simple queries go to cheap models. Complex queries go to premium ones. Contract compliance is enforced per request, automatically.
 
 
 
 
 
+
+ **3rd Place** at the KSU Social Good Hackathon 2026 - Assurant Track.
+
+ ### Team Agents Assemble
+
+ | | Role |
+ |---|------|
+ | **[Aaryan Kapoor](https://www.linkedin.com/in/theaaryankapoor/)** | Lead Architect & AI Engineer |
+ | **[Pradyumna Kumar](https://www.linkedin.com/in/pradyum-kumar/)** | Platform Architect & Frontend |
+ | **[Danny Tran](https://www.linkedin.com/in/nam-tr%E1%BA%A7n-02973b2b6/)** | Design & Presentation Lead |
+
+ ## Why
+
+ Over 30 new LLMs launched in the past month alone. No team has time to evaluate them all - so they pick one premium model and send everything to it. The result: 50-90% of enterprise AI spend is wasted on over-provisioned models, and premium models consume 180x more energy per query than small ones.
+
+ ModelGate fixes this. You change one line of code - your `base_url` - and we handle model selection, contract compliance, and cost optimization automatically.
 
 
 
 
+
+ ## Results
+
+ ### MMLU Routing Benchmark (60 questions, 6 subjects)
+
+ We benchmarked ModelGate against always routing to GPT-5.4 (default reasoning):
+
+ | | GPT-5.4 Direct | ModelGate Router | Delta |
+ |---|---|---|---|
+ | **Overall Accuracy** | 90% | 85% | -5pp |
+ | **Hard Accuracy** | 80% | 80% | 0 |
+ | **Cost** | $0.023 | $0.0095 | **-59%** |
+
+ The router sent 68% of queries to Gemini Flash Lite, 17% to GPT-4o-mini, and only 15% to GPT-5.4. Hard questions were routed correctly - the cost savings come from not overpaying on easy ones.
+
+ Projected at 10k requests/month: **$1.58 vs $3.83** (59% savings).
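The projection is straightforward arithmetic over the measured cost of one 60-question run; a quick check of the numbers:

```python
# Reproduce the 10k-requests/month projection from the measured benchmark cost
# (costs per 60-question MMLU run, from the table above).
QUESTIONS = 60
direct_cost = 0.023    # GPT-5.4 on all 60 questions
router_cost = 0.0095   # ModelGate routing on the same 60

monthly = 10_000
direct_monthly = direct_cost / QUESTIONS * monthly   # ≈ $3.83
router_monthly = router_cost / QUESTIONS * monthly   # ≈ $1.58
savings = 1 - router_monthly / direct_monthly        # ≈ 59%
```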
 
 
 
 
+
+ ### Fine-Tuned Classification Model (GRPO Reinforcement Learning)
+
+ We fine-tuned ModelGate-Router (based on Arch-Router-1.5B) using GRPO to fix a critical blind spot: the stock model misclassified 86% of medium-complexity queries as complex.
+
+ | Tier | Stock | Fine-Tuned | Delta |
+ |------|-------|------------|-------|
+ | Simple | 87.9% | 81.8% | -6.1pp |
+ | **Medium** | **14.3%** | **85.7%** | **+71.4pp** |
+ | Complex | 100% | 85.7% | -14.3pp |
+ | **Overall** | **70.4%** | **83.3%** | **+13.0pp** |
+
+ - **Training:** 2.5 minutes, 150 steps, 172 labeled prompts, LoRA rank 32 (2.3% of params)
+ - **Hardware:** RTX 3080 Laptop, 8GB VRAM
+ - **Inference:** GGUF Q8_0 quantized to 1.6 GB, runs at **62ms** per classification (3.2x faster than FP16)
+ - **Eval:** 54 held-out prompts, zero overlap with training data
+ - **Download:** [ModelGate-Router on HuggingFace](https://huggingface.co/AaryanK/ModelGate)
+
+ ## How It Works
+
  ```
+ Contract (PDF/text) → LLM extracts constraints → Customer AI Profile → OpenAI-compatible endpoint
+
+ Prompt received
+        ↓
+ ModelGate-Router classifies
+ (simple / medium / complex)
+        ↓
+ Route to optimal model
+ per contract constraints
  ```
+
+ 1. **Upload** a customer contract (SLA, privacy docs, compliance requirements)
+ 2. **Extract** - an LLM analyzes the contract and produces a structured customer profile (region restrictions, allowed providers, latency targets, cost sensitivity)
+ 3. **Route** - each request is classified by the fine-tuned 1.5B router (~62ms) and sent to the cheapest model that satisfies all contract constraints
+ 4. **Monitor** - dashboard shows routing decisions, model distribution, cost savings, and per-request traces
+
+ ## Architecture
+
+ ```
+ [Next.js Dashboard :3000] → [FastAPI :8000] → [OpenRouter / Direct APIs]
+                                   ↓
+                     [ModelGate-Router GGUF]
+                     (llama.cpp, CUDA, ~62ms)
+ ```
+
+ | Component | Stack |
+ |-----------|-------|
+ | Backend | Python, FastAPI, SQLite |
+ | Frontend | Next.js 16, TypeScript, Tailwind CSS, shadcn/ui, Recharts |
+ | Classification | ModelGate-Router (fine-tuned), GGUF Q8_0, llama-cpp-python |
+ | LLM Inference | OpenRouter (multi-provider: OpenAI, Google, Anthropic, etc.) |
+ | Contract Extraction | LLM-powered (GPT-5.4) |
+
+ ## Quick Start
+
+ ### Prerequisites
+
+ - Python 3.12 with PyTorch + CUDA
+ - Node.js 18+
+ - NVIDIA GPU (for classification model)
+ - OpenRouter API key
+
+ ### Setup
+
+ ```bash
+ git clone https://github.com/Aaryan-Kapoor/ModelGate-Hackathon
+ cd ModelGate-Hackathon
+
+ # Add your API key
+ cp .env.example .env
+ # Edit .env with your OPENROUTER_API_KEY
+
+ # Run everything
+ chmod +x scripts/start.sh
+ ./scripts/start.sh
+ ```
+
+ Or manually:
+
+ ```bash
+ # Backend
+ python3.12 -m venv backend/venv --system-site-packages
+ source backend/venv/bin/activate
+ pip install -r backend/requirements.txt
+ python scripts/seed_data.py
+ uvicorn backend.main:app --port 8000
+
+ # Frontend (separate terminal)
+ cd frontend && npm install && npm run dev
+ ```
+
+ ### Access
+
+ - Dashboard: http://localhost:3000
+ - API Docs: http://localhost:8000/docs
+ - Proxy endpoint: `POST http://localhost:8000/v1/{customer_id}/chat/completions`
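A minimal sketch of calling the proxy endpoint, assuming it accepts standard chat-completions JSON. `"acme"` is a placeholder customer_id, `"model": "auto"` mirrors the benchmark's `--model auto` flag, and the dummy bearer token mirrors its `--api-key dummy`:

```python
import json
import urllib.request

# Per-customer proxy URL; "acme" is a hypothetical customer_id.
customer_id = "acme"
url = f"http://localhost:8000/v1/{customer_id}/chat/completions"

# Standard OpenAI-style chat-completions payload; ModelGate picks the tier.
payload = {
    "model": "auto",
    "messages": [{"role": "user", "content": "What is your return policy?"}],
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json", "Authorization": "Bearer dummy"},
)
# Requires the backend from Quick Start to be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible SDK should work the same way by pointing its `base_url` at `http://localhost:8000/v1/{customer_id}`.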
+
+ ## Benchmarking
+
+ ```bash
+ # Run MMLU benchmark against any OpenAI-compatible endpoint
+ python scripts/bench_mmlu.py run \
+     --base-url http://localhost:8000/v1 \
+     --api-key dummy \
+     --model auto \
+     --label router
+
+ # Compare two runs
+ python scripts/bench_mmlu.py compare results/run_a.json results/run_b.json
+
+ # Benchmark the classification model (GGUF)
+ python finetuning/bench_gguf.py
+ ```
 
 
 
 
 
 
 
 
 
 
+
+ ## Fine-Tuning
+
+ The fine-tuning pipeline lives in `finetuning/`. See [`finetuning/README.md`](finetuning/README.md) for full details.
+
  ```bash
+ # Train (2.5 min on RTX 3080 8GB)
  python finetuning/grpo_run_nocot.py
+
+ # Export to GGUF
  python finetuning/export_gguf.py nocot
+
+ # Benchmark stock vs fine-tuned
  python finetuning/bench_gguf.py
  ```
+
+ ## Project Structure
+
  ```
+ backend/
+   main.py                      # FastAPI app
+   services/
+     classifier.py              # ModelGate-Router inference (llama.cpp)
+     extractor.py               # Contract → Customer AI Profile (LLM)
+     router_engine.py           # Model scoring and selection
+     provider_registry.py       # Model catalog with pricing/capabilities
+ frontend/                      # Next.js dashboard
+ finetuning/
+   grpo_run_nocot.py            # GRPO training script
+   grpo_training_data.json      # 172 labeled training prompts
+   grpo_eval_data.json          # 54 held-out eval prompts
+   export_gguf.py               # LoRA merge + GGUF conversion
+   bench_gguf.py                # GGUF benchmark (accuracy + latency)
+   ModelGate-Router.Q8_0.gguf   # Production model (1.6 GB)
+ scripts/
+   bench_mmlu.py                # MMLU benchmark runner
+   mmlu_questions.json          # 60 real MMLU questions from HuggingFace
+   start.sh                     # One-command startup
+ ```