README / README.md
Xunzhuo's picture
Update README.md
7686c24 verified
|
raw
history blame
3.93 kB
metadata
title: README
emoji: πŸ“Š
colorFrom: blue
colorTo: blue
sdk: static
pinned: false

vLLM Semantic Router

License

An Mixture-of-Models (MoM) router that intelligently directs OpenAI API requests to the most suitable models from a defined pool based on Semantic Understanding of the request's intent.

This is achieved using BERT classification. Conceptually similar to Mixture-of-Experts (MoE) which lives within a model, this system selects the best entire model for the nature of the task.

image/png

πŸš€ Key Features

🎯 Auto-selection of Models

Intelligently routes requests to specialized models based on semantic understanding:

  • Math queries β†’ Math-specialized models
  • Creative writing β†’ Creative-specialized models
  • Code generation β†’ Code-specialized models
  • General queries β†’ Balanced general-purpose models

πŸ›‘οΈ Security & Privacy

  • PII Detection: Automatically detects and handles personally identifiable information
  • Prompt Guard: Identifies and blocks jailbreak attempts
  • Safe Routing: Ensures sensitive prompts are handled appropriately

⚑ Performance Optimization

  • Semantic Cache: Caches semantic representations to reduce latency
  • Tool Selection: Auto-selects relevant tools to reduce token usage and improve tool selection accuracy

πŸ—οΈ Architecture

  • Envoy ExtProc Integration: Seamlessly integrates with Envoy proxy
  • Dual Implementation: Available in both Go (with Rust FFI) and Python
  • Scalable Design: Production-ready with comprehensive monitoring

πŸ“Š Performance Benefits

Our testing shows significant improvements in model accuracy through specialized routing.

image/webp

πŸ› οΈ Architecture Overview

image/png

🎯 Use Cases

  • Enterprise API Gateways: Route different types of queries to cost-optimized models
  • Multi-tenant Platforms: Provide specialized routing for different customer needs
  • Development Environments: Balance cost and performance for different workloads
  • Production Services: Ensure optimal model selection with built-in safety measures

πŸ“ˆ Monitoring & Observability

The router provides comprehensive monitoring through:

  • Grafana Dashboard: Real-time metrics and performance tracking
  • Prometheus Metrics: Detailed routing statistics and performance data
  • Request Tracing: Full visibility into routing decisions and performance

image/png

πŸ“– Documentation

For comprehensive documentation including detailed setup instructions, architecture guides, and API references, visit:

πŸ‘‰ Complete Documentation at Read the Docs

The documentation includes: