Update README.md
README.md
CHANGED
@@ -4,71 +4,54 @@
 colorFrom: blue
 colorTo: blue
 sdk: static
-pinned:
 ---
 
-
 
-
 
-
 
-
 
-
 
-
 
-
-Intelligently routes requests to specialized models based on semantic understanding:
-- **Math queries** → Math-specialized models
-- **Creative writing** → Creative-specialized models
-- **Code generation** → Code-specialized models
-- **General queries** → Balanced general-purpose models
 
-
-- **PII Detection**: Automatically detects and handles personally identifiable information
-- **Prompt Guard**: Identifies and blocks jailbreak attempts
-- **Safe Routing**: Ensures sensitive prompts are handled appropriately
 
-
-- **Semantic Cache**: Caches semantic representations to reduce latency
-- **Tool Selection**: Auto-selects relevant tools to reduce token usage and improve tool selection accuracy
 
-
-- **Envoy ExtProc Integration**: Seamlessly integrates with Envoy proxy
-- **Dual Implementation**: Available in both Go (with Rust FFI) and Python
-- **Scalable Design**: Production-ready with comprehensive monitoring
 
-
 
-
 
-
 
-
-
-
-
-
-## Use Cases
-
-- **Enterprise API Gateways**: Route different types of queries to cost-optimized models
-- **Multi-tenant Platforms**: Provide specialized routing for different customer needs
-- **Development Environments**: Balance cost and performance for different workloads
-- **Production Services**: Ensure optimal model selection with built-in safety measures
-
-## Monitoring & Observability
-
-The router provides comprehensive monitoring through:
-- **Grafana Dashboard**: Real-time metrics and performance tracking
-- **Prometheus Metrics**: Detailed routing statistics and performance data
-- **Request Tracing**: Full visibility into routing decisions and performance
-
-
-
-## Documentation
-
-For comprehensive documentation including detailed setup instructions, architecture guides, and API references, visit:
-
-**[Complete Documentation at Read the Docs](https://vllm-semantic-router.com/)**
 colorFrom: blue
 colorTo: blue
 sdk: static
+pinned: true
+license: apache-2.0
+thumbnail: >-
+  https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/Nwp5bcZfu_D51MUNCN3oO.png
+short_description: 'MoM: Specialized Models for Intelligent Routing'
 ---
 
+
 
+**One fabric. Many minds.** We're introducing **MoM** (Mixture of Models), a family of specialized routing models that power vLLM-SR's intelligent decision-making.
 
++ vLLM Semantic Router: [project link](https://github.com/vllm-project/semantic-router)
 
+<!-- truncate -->
 
+## Why MoM?
 
+vLLM-SR solves a critical problem: **how to route LLM requests to the right model at the right time**. Not every query needs the same resources: "What's the weather?" shouldn't cost as much as "Analyze this legal contract."
 
+
 
+## MoM System Card
 
+A quick overview of all MoM models:
 
+<div align="center">
 
+| Category | Model | Size | Architecture | Base Model | Purpose |
+|----------|-------|------|--------------|------------|---------|
+| **Intelligent Routing** | mom-brain-flash | Flash | Encoder | ModernBERT | Ultra-fast intent classification |
+| | mom-brain-pro | Pro | Decoder | Qwen3 0.6B | Balanced routing with reasoning |
+| | mom-brain-max | Max | Decoder | Qwen3 1.7B | Maximum accuracy for complex decisions |
+| **Similarity Search** | mom-similarity-flash | Flash | Encoder | ModernBERT | Semantic similarity matching |
+| **Prompt Guardian** | mom-jailbreak-flash | Flash | Encoder | ModernBERT | Jailbreak/attack detection |
+| | mom-pii-flash | Flash | Encoder | ModernBERT | PII detection & privacy protection |
+| **SLM Experts** | mom-expert-math-flash | Flash | Decoder | Qwen3 0.6B | Backend math problem solver |
+| | mom-expert-science-flash | Flash | Decoder | Qwen3 0.6B | Backend science problem solver |
+| | mom-expert-social-flash | Flash | Decoder | Qwen3 0.6B | Backend social sciences solver |
+| | mom-expert-humanities-flash | Flash | Decoder | Qwen3 0.6B | Backend humanities solver |
+| | mom-expert-law-flash | Flash | Decoder | Qwen3 0.6B | Backend law problem solver |
+| | mom-expert-generalist-flash | Flash | Decoder | Qwen3 0.6B | Backend generalist solver |
 
+</div>
 
+**Key Insights:**
 
+- **4 Categories**: 3 for routing (Intelligent Routing, Similarity Search, Prompt Guardian) + 1 for backend problem solving (SLM Experts)
+- **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput routing
+- **Qwen3** (decoder-only) → Explainable routing decisions + domain-specific problem solving
+- **Flash** models achieve 10,000+ QPS on commodity hardware
+- **SLM Experts** are not routers; they are specialized backend models that solve domain-specific problems
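
The system card added in this change lends itself to a small lookup structure. What follows is only a rough sketch of how a client might encode the table and choose a model from it. The model names, sizes, architectures, and base models are copied from the table; the `MomModel` class, the `pick_brain` and `pick_expert` helpers, and the selection policy itself are hypothetical illustrations, not part of vLLM-SR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MomModel:
    name: str
    category: str
    size: str
    architecture: str
    base_model: str


# Rows transcribed from the system card table in the README.
SYSTEM_CARD = [
    MomModel("mom-brain-flash", "Intelligent Routing", "Flash", "Encoder", "ModernBERT"),
    MomModel("mom-brain-pro", "Intelligent Routing", "Pro", "Decoder", "Qwen3 0.6B"),
    MomModel("mom-brain-max", "Intelligent Routing", "Max", "Decoder", "Qwen3 1.7B"),
    MomModel("mom-similarity-flash", "Similarity Search", "Flash", "Encoder", "ModernBERT"),
    MomModel("mom-jailbreak-flash", "Prompt Guardian", "Flash", "Encoder", "ModernBERT"),
    MomModel("mom-pii-flash", "Prompt Guardian", "Flash", "Encoder", "ModernBERT"),
] + [
    MomModel(f"mom-expert-{domain}-flash", "SLM Experts", "Flash", "Decoder", "Qwen3 0.6B")
    for domain in ("math", "science", "social", "humanities", "law", "generalist")
]

BY_NAME = {model.name: model for model in SYSTEM_CARD}


def pick_brain(needs_explanation: bool = False, complex_decision: bool = False) -> MomModel:
    """Hypothetical policy: use the smallest routing model that meets the need."""
    if complex_decision:
        return BY_NAME["mom-brain-max"]
    if needs_explanation:
        return BY_NAME["mom-brain-pro"]
    return BY_NAME["mom-brain-flash"]


def pick_expert(domain: str) -> MomModel:
    """Hypothetical mapping from a domain label to a backend SLM expert.

    Unknown domains fall back to the generalist expert.
    """
    fallback = BY_NAME["mom-expert-generalist-flash"]
    return BY_NAME.get(f"mom-expert-{domain}-flash", fallback)


if __name__ == "__main__":
    print(pick_brain().name)
    print(pick_expert("law").name)
```

Keeping the routing models and the SLM experts in one table but selecting them through separate helpers mirrors the README's distinction: the first three categories decide where a request goes, while the experts are the backends that answer it.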