Spaces: llm-semantic-router/README

Xunzhuo committed on Oct 17

Commit 032b7e8 (verified) · 1 Parent(s): 3d2cb9d

Update README.md

Files changed (1)

README.md +36 -53

README.md CHANGED

@@ -4,71 +4,54 @@ emoji: 📊
 colorFrom: blue
 colorTo: blue
 sdk: static
-pinned: false
 ---
-# vLLM Semantic Router
-[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
-A **Mixture-of-Models (MoM)** router that intelligently directs OpenAI API requests to the most suitable models from a defined pool based on **Semantic Understanding** of the request's intent.
-This is achieved using BERT classification. Conceptually similar to Mixture-of-Experts (MoE) which lives *within* a model, this system selects the best *entire model* for the nature of the task.
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/AM-CIx1BGnRJkifzOcCHe.png)
-## 🚀 Key Features
-### 🎯 **Auto-selection of Models**
-Intelligently routes requests to specialized models based on semantic understanding:
-- **Math queries** → Math-specialized models
-- **Creative writing** → Creative-specialized models
-- **Code generation** → Code-specialized models
-- **General queries** → Balanced general-purpose models
-### 🛡️ **Security & Privacy**
-- **PII Detection**: Automatically detects and handles personally identifiable information
-- **Prompt Guard**: Identifies and blocks jailbreak attempts
-- **Safe Routing**: Ensures sensitive prompts are handled appropriately
-### ⚡ **Performance Optimization**
-- **Semantic Cache**: Caches semantic representations to reduce latency
-- **Tool Selection**: Auto-selects relevant tools to reduce token usage and improve tool selection accuracy
-### 🏗️ **Architecture**
-- **Envoy ExtProc Integration**: Seamlessly integrates with Envoy proxy
-- **Dual Implementation**: Available in both Go (with Rust FFI) and Python
-- **Scalable Design**: Production-ready with comprehensive monitoring
-## 📊 Performance Benefits
-Our testing shows significant improvements in model accuracy through specialized routing.
-![image/webp](https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/efbREtUgJWTsU3iu5Xhu9.webp)
-## 🛠️ Architecture Overview
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/jBZuH9Uy-lsVfGel5p5FT.png)
-## 🎯 Use Cases
-- **Enterprise API Gateways**: Route different types of queries to cost-optimized models
-- **Multi-tenant Platforms**: Provide specialized routing for different customer needs
-- **Development Environments**: Balance cost and performance for different workloads
-- **Production Services**: Ensure optimal model selection with built-in safety measures
-## 📈 Monitoring & Observability
-The router provides comprehensive monitoring through:
-- **Grafana Dashboard**: Real-time metrics and performance tracking
-- **Prometheus Metrics**: Detailed routing statistics and performance data
-- **Request Tracing**: Full visibility into routing decisions and performance
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/ZfofBg68tHlXaHEz2arCh.png)
-## 📖 Documentation
-For comprehensive documentation including detailed setup instructions, architecture guides, and API references, visit:
-**👉 [Complete Documentation at Read the Docs](https://vllm-semantic-router.com/)**

 colorFrom: blue
 colorTo: blue
 sdk: static
+pinned: true
+license: apache-2.0
+thumbnail: >-
+  https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/Nwp5bcZfu_D51MUNCN3oO.png
+short_description: 'MoM: Specialized Models for Intelligent Routing'
 ---
+![mom-family](https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/M9vyenphR9xlPPfSOJyOh.png)
+**One fabric. Many minds.** We're introducing **MoM** (Mixture of Models)—a family of specialized routing models that power vLLM-SR's intelligent decision-making.
++ vLLM Semantic Router 👉: [project link](https://github.com/vllm-project/semantic-router)
+<!-- truncate -->
+## Why MoM?
+vLLM-SR solves a critical problem: **how to route LLM requests to the right model at the right time**. Not every query needs the same resources—"What's the weather?" shouldn't cost as much as "Analyze this legal contract."
+![a6d3ff18-c3e8-4e7e-9545-a95a5d525b89](https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/k7yjJaG6CqXCMnuiQOXyZ.png)
+## MoM System Card
+A quick overview of all MoM models:
+<div align="center">
+| Category | Model | Size | Architecture | Base Model | Purpose |
+|----------|-------|------|--------------|------------|---------|
+| **🧠 Intelligent Routing** | mom-brain-flash | Flash | Encoder | ModernBERT | Ultra-fast intent classification |
+| | mom-brain-pro | Pro | Decoder | Qwen3 0.6B | Balanced routing with reasoning |
+| | mom-brain-max | Max | Decoder | Qwen3 1.7B | Maximum accuracy for complex decisions |
+| **🔍 Similarity Search** | mom-similarity-flash | Flash | Encoder | ModernBERT | Semantic similarity matching |
+| **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | Encoder | ModernBERT | Jailbreak/attack detection |
+| | mom-pii-flash | Flash | Encoder | ModernBERT | PII detection & privacy protection |
+| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Decoder | Qwen3 0.6B | Backend math problem solver |
+| | mom-expert-science-flash | Flash | Decoder | Qwen3 0.6B | Backend science problem solver |
+| | mom-expert-social-flash | Flash | Decoder | Qwen3 0.6B | Backend social sciences solver |
+| | mom-expert-humanities-flash | Flash | Decoder | Qwen3 0.6B | Backend humanities solver |
+| | mom-expert-law-flash | Flash | Decoder | Qwen3 0.6B | Backend law problem solver |
+| | mom-expert-generalist-flash | Flash | Decoder | Qwen3 0.6B | Backend generalist solver |
+</div>
+**Key Insights:**
+- **4 Categories**: 3 for routing (Intelligent Routing, Similarity Search, Prompt Guardian) + 1 for backend problem solving (SLM Experts)
+- **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput routing
+- **Qwen3** (decoder-only) → Explainable routing decisions + domain-specific problem solving
+- **Flash** models achieve 10,000+ QPS on commodity hardware
+- **SLM Experts** are not routers—they are specialized backend models that solve domain-specific problems
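
The system card separates the routing models from the SLM Experts that sit behind them. That split can be sketched as a simple lookup. This is a minimal illustration under stated assumptions, not vLLM-SR's implementation: the model names, architectures, and base models are copied from the table, while the `MomModel` class, the `pick_expert` function, and the domain label strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MomModel:
    """One row of the MoM system card (hypothetical container type)."""
    name: str
    architecture: str
    base_model: str


# Rows copied from the "SLM Experts" category of the system card.
# The dictionary keys are assumed domain labels, not names from the source.
EXPERTS = {
    "math": MomModel("mom-expert-math-flash", "Decoder", "Qwen3 0.6B"),
    "science": MomModel("mom-expert-science-flash", "Decoder", "Qwen3 0.6B"),
    "social": MomModel("mom-expert-social-flash", "Decoder", "Qwen3 0.6B"),
    "humanities": MomModel("mom-expert-humanities-flash", "Decoder", "Qwen3 0.6B"),
    "law": MomModel("mom-expert-law-flash", "Decoder", "Qwen3 0.6B"),
}

# Fallback used when the classified domain has no dedicated expert.
GENERALIST = MomModel("mom-expert-generalist-flash", "Decoder", "Qwen3 0.6B")


def pick_expert(domain: str) -> MomModel:
    """Return the expert for a classified domain, else the generalist."""
    return EXPERTS.get(domain.strip().lower(), GENERALIST)


if __name__ == "__main__":
    for label in ("Math", "law", "cooking"):
        print(label, "->", pick_expert(label).name)
```

In this sketch the domain label would come from one of the routing models (for example `mom-brain-flash`); how that classification is produced and passed along is not shown in the README and is left out here.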