Xunzhuo committed
Commit 032b7e8 · verified · 1 Parent(s): 3d2cb9d

Update README.md

Files changed (1)
  1. README.md +36 -53
README.md CHANGED
@@ -4,71 +4,54 @@ emoji: 📊
  colorFrom: blue
  colorTo: blue
  sdk: static
- pinned: false
  ---

- # vLLM Semantic Router

- [![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)

- A **Mixture-of-Models (MoM)** router that intelligently directs OpenAI API requests to the most suitable models from a defined pool based on **Semantic Understanding** of the request's intent.

- This is achieved using BERT classification. Conceptually similar to Mixture-of-Experts (MoE), which lives *within* a model, this system selects the best *entire model* for the nature of the task.

- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/AM-CIx1BGnRJkifzOcCHe.png)

- ## 🚀 Key Features

- ### 🎯 **Auto-selection of Models**
- Intelligently routes requests to specialized models based on semantic understanding:
- - **Math queries** → Math-specialized models
- - **Creative writing** → Creative-specialized models
- - **Code generation** → Code-specialized models
- - **General queries** → Balanced general-purpose models

- ### 🛡️ **Security & Privacy**
- - **PII Detection**: Automatically detects and handles personally identifiable information
- - **Prompt Guard**: Identifies and blocks jailbreak attempts
- - **Safe Routing**: Ensures sensitive prompts are handled appropriately

- ### ⚡ **Performance Optimization**
- - **Semantic Cache**: Caches semantic representations to reduce latency
- - **Tool Selection**: Auto-selects relevant tools to reduce token usage and improve tool selection accuracy

- ### 🏗️ **Architecture**
- - **Envoy ExtProc Integration**: Seamlessly integrates with Envoy proxy
- - **Dual Implementation**: Available in both Go (with Rust FFI) and Python
- - **Scalable Design**: Production-ready with comprehensive monitoring

- ## 📊 Performance Benefits

- Our testing shows significant improvements in model accuracy through specialized routing.

- ![image/webp](https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/efbREtUgJWTsU3iu5Xhu9.webp)

- ## 🛠️ Architecture Overview
-
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/jBZuH9Uy-lsVfGel5p5FT.png)
-
- ## 🎯 Use Cases
-
- - **Enterprise API Gateways**: Route different types of queries to cost-optimized models
- - **Multi-tenant Platforms**: Provide specialized routing for different customer needs
- - **Development Environments**: Balance cost and performance for different workloads
- - **Production Services**: Ensure optimal model selection with built-in safety measures
-
- ## 📈 Monitoring & Observability
-
- The router provides comprehensive monitoring through:
- - **Grafana Dashboard**: Real-time metrics and performance tracking
- - **Prometheus Metrics**: Detailed routing statistics and performance data
- - **Request Tracing**: Full visibility into routing decisions and performance
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/ZfofBg68tHlXaHEz2arCh.png)
-
- ## 📖 Documentation
-
- For comprehensive documentation including detailed setup instructions, architecture guides, and API references, visit:
-
- **👉 [Complete Documentation at Read the Docs](https://vllm-semantic-router.com/)**
 
  colorFrom: blue
  colorTo: blue
  sdk: static
+ pinned: true
+ license: apache-2.0
+ thumbnail: >-
+   https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/Nwp5bcZfu_D51MUNCN3oO.png
+ short_description: 'MoM: Specialized Models for Intelligent Routing'
  ---

+ ![mom-family](https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/M9vyenphR9xlPPfSOJyOh.png)

+ **One fabric. Many minds.** We're introducing **MoM** (Mixture of Models), a family of specialized routing models that power vLLM-SR's intelligent decision-making.

+ + vLLM Semantic Router 👉: [project link](https://github.com/vllm-project/semantic-router)

+ <!-- truncate -->

+ ## Why MoM?

+ vLLM-SR solves a critical problem: **how to route LLM requests to the right model at the right time**. Not every query needs the same resources: "What's the weather?" shouldn't cost as much as "Analyze this legal contract."

+ ![a6d3ff18-c3e8-4e7e-9545-a95a5d525b89](https://cdn-uploads.huggingface.co/production/uploads/66f8caead3186746f4524419/k7yjJaG6CqXCMnuiQOXyZ.png)

+ ## MoM System Card

+ A quick overview of all MoM models:

+ <div align="center">

+ | Category | Model | Size | Architecture | Base Model | Purpose |
+ |----------|-------|------|--------------|------------|---------|
+ | **🧠 Intelligent Routing** | mom-brain-flash | Flash | Encoder | ModernBERT | Ultra-fast intent classification |
+ | | mom-brain-pro | Pro | Decoder | Qwen3 0.6B | Balanced routing with reasoning |
+ | | mom-brain-max | Max | Decoder | Qwen3 1.7B | Maximum accuracy for complex decisions |
+ | **🔍 Similarity Search** | mom-similarity-flash | Flash | Encoder | ModernBERT | Semantic similarity matching |
+ | **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | Encoder | ModernBERT | Jailbreak/attack detection |
+ | | mom-pii-flash | Flash | Encoder | ModernBERT | PII detection & privacy protection |
+ | **🎯 SLM Experts** | mom-expert-math-flash | Flash | Decoder | Qwen3 0.6B | Backend math problem solver |
+ | | mom-expert-science-flash | Flash | Decoder | Qwen3 0.6B | Backend science problem solver |
+ | | mom-expert-social-flash | Flash | Decoder | Qwen3 0.6B | Backend social sciences solver |
+ | | mom-expert-humanities-flash | Flash | Decoder | Qwen3 0.6B | Backend humanities solver |
+ | | mom-expert-law-flash | Flash | Decoder | Qwen3 0.6B | Backend law problem solver |
+ | | mom-expert-generalist-flash | Flash | Decoder | Qwen3 0.6B | Backend generalist solver |

+ </div>

+ **Key Insights:**

+ - **4 Categories**: 3 for routing (Intelligent Routing, Similarity Search, Prompt Guardian) + 1 for backend problem solving (SLM Experts)
+ - **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput routing
+ - **Qwen3** (decoder-only) → Explainable routing decisions + domain-specific problem solving
+ - **Flash** models achieve 10,000+ QPS on commodity hardware
+ - **SLM Experts** are not routers; they are specialized backend models that solve domain-specific problems
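
The system card added in this commit can also be read as data. The following is a minimal sketch, not part of the commit or of vLLM-SR: it encodes the table rows as plain records and derives the category and architecture groupings listed under Key Insights. The `MoMModel` class, the `select_expert` and `names_by_architecture` helpers, and the fallback to the generalist expert for unknown domains are all hypothetical and introduced only for illustration; only the model names, sizes, architectures, and base models come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoMModel:
    """One row of the MoM system card (hypothetical record type)."""
    category: str
    name: str
    size: str
    architecture: str
    base_model: str


# Domains covered by the SLM Experts rows of the system card.
EXPERT_DOMAINS = ("math", "science", "social", "humanities", "law", "generalist")

# Routing-side rows copied from the table, followed by the six expert rows.
SYSTEM_CARD = [
    MoMModel("Intelligent Routing", "mom-brain-flash", "Flash", "Encoder", "ModernBERT"),
    MoMModel("Intelligent Routing", "mom-brain-pro", "Pro", "Decoder", "Qwen3 0.6B"),
    MoMModel("Intelligent Routing", "mom-brain-max", "Max", "Decoder", "Qwen3 1.7B"),
    MoMModel("Similarity Search", "mom-similarity-flash", "Flash", "Encoder", "ModernBERT"),
    MoMModel("Prompt Guardian", "mom-jailbreak-flash", "Flash", "Encoder", "ModernBERT"),
    MoMModel("Prompt Guardian", "mom-pii-flash", "Flash", "Encoder", "ModernBERT"),
] + [
    MoMModel("SLM Experts", f"mom-expert-{domain}-flash", "Flash", "Decoder", "Qwen3 0.6B")
    for domain in EXPERT_DOMAINS
]


def select_expert(domain: str) -> str:
    """Map a domain label to an expert model name from the system card.

    Assumption: unknown domains fall back to the generalist expert.
    The source does not describe the real fallback behavior.
    """
    key = domain.strip().lower()
    if key not in EXPERT_DOMAINS:
        key = "generalist"
    return f"mom-expert-{key}-flash"


def names_by_architecture(architecture: str) -> list[str]:
    """List model names with the given architecture, in table order."""
    return [m.name for m in SYSTEM_CARD if m.architecture == architecture]


if __name__ == "__main__":
    print(sorted({m.category for m in SYSTEM_CARD}))
    print(names_by_architecture("Encoder"))
    print(select_expert("Law"))
```

Keeping the card as structured records makes the groupings checkable: every encoder row uses ModernBERT and every decoder row uses a Qwen3 base, which matches the split described in the Key Insights list.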