--- language: - en license: mit library_name: llama-cpp-python base_model: katanemo/Arch-Router-1.5B tags: - routing - grpo - reinforcement-learning - gguf - lora - unsloth - trl - qwen2 - llama-cpp - contract-aware - cost-optimization - query-classification model_type: qwen2 pipeline_tag: text-classification datasets: - custom metrics: - accuracy model-index: - name: ModelGate-Router results: - task: type: text-classification name: Query Complexity Classification metrics: - type: accuracy value: 83.3 name: Overall Accuracy (held-out, GGUF Q8_0) - type: accuracy value: 85.7 name: Medium Tier Accuracy - type: latency value: 62 name: Avg Latency (ms, CUDA) ---

ModelGate Banner

# ModelGate **Intelligent AI Routing - Built from Your Contracts** One line of code changed. Millions of premium calls rerouted. ModelGate is a contract-aware AI control plane that ingests customer contracts, extracts SLA/privacy/routing constraints, and generates an OpenAI-compatible endpoint that automatically routes every request to the optimal model. Simple queries go to cheap models. Complex queries go to premium ones. Contract compliance is enforced per request, automatically. **3rd Place** at the KSU Social Good Hackathon 2026 - Assurant Track. ### Team Agents Assemble | | Role | |---|---| | **[Aaryan Kapoor](https://www.linkedin.com/in/theaaryankapoor/)** | Lead Architect & AI Engineer | | **[Pradyumna Kumar](https://www.linkedin.com/in/pradyum-kumar/)** | Platform Architect & Frontend | | **[Danny Tran](https://www.linkedin.com/in/nam-tr%E1%BA%A7n-02973b2b6/)** | Design & Presentation Lead | ## Why Over 30 new LLMs launched in the past month alone. No team has time to evaluate them all - so they pick one premium model and send everything to it. The result: 50-90% of enterprise AI spend is wasted on over-provisioned models, and premium models consume 180x more energy per query than small ones. ModelGate fixes this. You change one line of code - your `base_url` - and we handle model selection, contract compliance, and cost optimization automatically. ## Results ### MMLU Routing Benchmark (60 questions, 6 subjects) We benchmarked ModelGate against always routing to GPT-5.4 (default reasoning): | | GPT-5.4 Direct | ModelGate Router | Delta | |---|---|---|---| | **Overall Accuracy** | 90% | 85% | -5pp | | **Hard Accuracy** | 80% | 80% | 0 | | **Cost** | $0.023 | $0.0095 | **-59%** | The router sent 68% of queries to Gemini Flash Lite, 17% to GPT-4o-mini, and only 15% to GPT-5.4. Hard questions were routed correctly - the cost savings come from not overpaying on easy ones. Projected at 10k requests/month: **$1.58 vs $3.83** (59% savings). ### Fine-Tuned Classification Model (GRPO Reinforcement Learning) We fine-tuned ModelGate-Router (based on Arch-Router-1.5B) using GRPO to fix a critical blind spot: the stock model misclassified 86% of medium-complexity queries as complex. | Tier | Stock | Fine-Tuned | Delta | |---|---|---|---| | Simple | 87.9% | 81.8% | -6.1pp | | **Medium** | **14.3%** | **85.7%** | **+71.4pp** | | Complex | 100% | 85.7% | -14.3pp | | **Overall** | **70.4%** | **83.3%** | **+13.0pp** | - **Training:** 2.5 minutes, 150 steps, 172 labeled prompts, LoRA rank 32 (2.3% of params) - **Hardware:** RTX 3080 Laptop, 8GB VRAM - **Inference:** GGUF Q8_0 quantized to 1.6 GB, runs at **62ms** per classification (3.2x faster than FP16) - **Eval:** 54 held-out prompts, zero overlap with training data - **Download:** [ModelGate-Router on HuggingFace](https://huggingface.co/AaryanK/ModelGate) ## Screenshots ### Platform Dashboard Real-time monitoring of routing decisions, cost savings, model distribution, and request volume.

ModelGate Dashboard

### Model Registry Browse the OpenRouter catalog and toggle models on/off with one click. Configure which models power your routing.

Model Registry

### Customer Onboarding Upload a contract, review the AI-extracted profile, and start routing - all in under 30 seconds.

Customer Onboarding

## How It Works ``` Contract (PDF/text) → LLM extracts constraints → Customer AI Profile → OpenAI-compatible endpoint ↓ Prompt received ↓ ModelGate-Router classifies (simple / medium / complex) ↓ Route to optimal model per contract constraints ``` 1. **Upload** a customer contract (SLA, privacy docs, compliance requirements) 2. **Extract** - an LLM analyzes the contract and produces a structured customer profile (region restrictions, allowed providers, latency targets, cost sensitivity) 3. **Route** - each request is classified by the fine-tuned 1.5B router (~62ms) and sent to the cheapest model that satisfies all contract constraints 4. **Monitor** - dashboard shows routing decisions, model distribution, cost savings, and per-request traces ## Architecture ``` [Next.js Dashboard :3000] → [FastAPI :8000] → [OpenRouter / Direct APIs] ↓ [ModelGate-Router GGUF] (llama.cpp, CUDA, ~62ms) ``` | Component | Stack | |---|---| | Backend | Python, FastAPI, SQLite | | Frontend | Next.js 16, TypeScript, Tailwind CSS, shadcn/ui, Recharts | | Classification | ModelGate-Router (fine-tuned), GGUF Q8_0, llama-cpp-python | | LLM Inference | OpenRouter (multi-provider: OpenAI, Google, Anthropic, etc.) | | Contract Extraction | LLM-powered (GPT-5.4) | ## Quick Start ### Prerequisites - Python 3.12 with PyTorch + CUDA - Node.js 18+ - NVIDIA GPU (for classification model) - OpenRouter API key ### Setup ```bash git clone https://github.com/Aaryan-Kapoor/ModelGate-Hackathon cd ModelGate-Hackathon # Add your API key cp .env.example .env # Edit .env with your OPENROUTER_API_KEY # Run everything chmod +x scripts/start.sh ./scripts/start.sh ``` Or manually: ```bash # Backend python3.12 -m venv backend/venv --system-site-packages source backend/venv/bin/activate pip install -r backend/requirements.txt python scripts/seed_data.py uvicorn backend.main:app --port 8000 # Frontend (separate terminal) cd frontend && npm install && npm run dev ``` ### Access - Dashboard: http://localhost:3000 - API Docs: http://localhost:8000/docs - Proxy endpoint: `POST http://localhost:8000/v1/{customer_id}/chat/completions` ## Benchmarking ```bash # Run MMLU benchmark against any OpenAI-compatible endpoint python scripts/bench_mmlu.py run \ --base-url http://localhost:8000/v1 \ --api-key dummy \ --model auto \ --label router # Compare two runs python scripts/bench_mmlu.py compare results/run_a.json results/run_b.json # Benchmark the classification model (GGUF) python finetuning/bench_gguf.py ``` ## Fine-Tuning The fine-tuning pipeline lives in `finetuning/`. See [`finetuning/README.md`](finetuning/README.md) for full details. ```bash # Train (2.5 min on RTX 3080 8GB) python finetuning/grpo_run_nocot.py # Export to GGUF python finetuning/export_gguf.py nocot # Benchmark stock vs fine-tuned python finetuning/bench_gguf.py ``` ## Project Structure ``` backend/ main.py # FastAPI app services/ classifier.py # ModelGate-Router inference (llama.cpp) extractor.py # Contract → Customer AI Profile (LLM) router_engine.py # Model scoring and selection provider_registry.py # Model catalog with pricing/capabilities frontend/ # Next.js dashboard finetuning/ grpo_run_nocot.py # GRPO training script grpo_training_data.json # 172 labeled training prompts grpo_eval_data.json # 54 held-out eval prompts export_gguf.py # LoRA merge + GGUF conversion bench_gguf.py # GGUF benchmark (accuracy + latency) ModelGate-Router.Q8_0.gguf # Production model (1.6 GB) scripts/ bench_mmlu.py # MMLU benchmark runner mmlu_questions.json # 60 real MMLU questions from HuggingFace start.sh # One-command startup ```