---
title: Cascade - Intelligent LLM Router
emoji: 🚀
colorFrom: purple
colorTo: blue
sdk: streamlit
sdk_version: "1.31.0"
app_file: app.py
pinned: false
---

# Cascade 🚀

**Intelligent LLM Request Router** - Reduce API costs by 60%+ through smart routing and semantic caching.

## What is Cascade?

Cascade is an intelligent proxy that automatically routes LLM requests to the most cost-effective model based on query complexity:

- **Simple queries** → free local models (Ollama) or GPT-3.5
- **Medium queries** → GPT-4o-mini ($0.15/1M tokens)
- **Complex queries** → GPT-4o ($2.50/1M tokens)

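The tiering above can be sketched in a few lines. This is a toy illustration, not Cascade's actual classifier or configuration: the model names, word-count thresholds, and tier labels here are stand-ins (Cascade uses an ML model for classification, described below).

```python
# Toy complexity-based router. A real deployment would use the ML
# classifier, but the routing table itself can look like this.

ROUTES = {
    "simple": "ollama/llama3",   # free local model (name is illustrative)
    "medium": "gpt-4o-mini",     # $0.15 / 1M tokens
    "complex": "gpt-4o",         # $2.50 / 1M tokens
}

def classify(prompt: str) -> str:
    """Stand-in heuristic for the ML classifier: longer prompts
    are treated as more complex."""
    words = len(prompt.split())
    if words < 15:
        return "simple"
    if words < 60:
        return "medium"
    return "complex"

def route(prompt: str) -> str:
    """Pick the model tier matching the query's complexity."""
    return ROUTES[classify(prompt)]
```

The point of the indirection is that `classify` can be swapped for a learned model without touching the routing table.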
## Features

- 🎯 **ML-Powered Routing** - Predicts query complexity in under 20 ms
- 💰 **60%+ Cost Savings** - Routes simple queries to cheaper models
- ⚡ **Semantic Caching** - Vector similarity search for cached responses
- 📊 **Real-Time Analytics** - Dashboard showing savings and usage metrics
- 🔌 **OpenAI Compatible** - Drop-in replacement for the OpenAI API

## Using This Space

This Space provides a demo UI where you can:

1. **Test Routing** - See how different queries get routed to different models
2. **View Analytics** - Track cost savings and cache hit rates
3. **Interactive Chat** - Try example queries and see real-time responses

## Configuration

To use this with real LLM APIs, set these environment variables:

- `OPENAI_API_KEY` - Your OpenAI API key
- `OLLAMA_BASE_URL` - Ollama server URL (optional)
- `REDIS_URL` - Redis connection for caching (optional)
- `QDRANT_URL` - Qdrant server for the semantic cache (optional)

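A minimal sketch of how such a loader might read these variables. The function name and the defaults are assumptions for illustration (the Ollama fallback shown is Ollama's conventional local port), not Cascade's documented behavior:

```python
import os

def load_config() -> dict:
    """Illustrative config loader. Variable names match the README;
    the defaults are assumptions, not Cascade's documented values."""
    return {
        "openai_api_key": os.environ.get("OPENAI_API_KEY"),  # required for OpenAI models
        "ollama_base_url": os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        "redis_url": os.environ.get("REDIS_URL"),            # optional: exact-match cache
        "qdrant_url": os.environ.get("QDRANT_URL"),          # optional: semantic cache
    }
```

With the optional variables unset, the proxy would simply run without those caching layers.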
## Learn More

- [GitHub Repository](https://github.com/ayushm98/cascade)
- [Documentation](https://github.com/ayushm98/cascade#readme)
- [Contributing Guide](https://github.com/ayushm98/cascade/blob/main/CONTRIBUTING.md)

## Architecture

```
Request → Cache Check → ML Classifier → Route to Model
               ↓              ↓                ↓
           Semantic      Simple/Med/    Ollama/GPT-4o-mini/
             Cache         Complex           GPT-4o
```

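The flow in the diagram (cache check first, classify only on a miss, then dispatch) can be sketched as a single function. The cache, classifier, and backends are injected stand-ins here, so this shows the control flow rather than Cascade's actual internals:

```python
def handle_request(prompt, cache, classifier, backends):
    """Sketch of the request flow: cache -> classify -> route.

    cache:      dict-like store of prompt -> response
    classifier: callable, prompt -> tier name
    backends:   dict of tier name -> callable that answers the prompt
    """
    cached = cache.get(prompt)
    if cached is not None:
        return {"response": cached, "source": "cache"}  # skip the LLM entirely
    tier = classifier(prompt)           # "simple" / "medium" / "complex"
    response = backends[tier](prompt)   # e.g. Ollama, GPT-4o-mini, GPT-4o
    cache[prompt] = response            # populate cache for next time
    return {"response": response, "source": tier}
```

Note that the cache check happens before classification, which is what lets repeated queries skip both the classifier and the model call.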
## Tech Stack

- **Backend**: FastAPI, Python 3.11+
- **ML**: DistilBERT (ONNX), Sentence Transformers
- **Caching**: Redis (exact match), Qdrant (semantic)
- **UI**: Streamlit, Plotly
- **Deployment**: Docker, Hugging Face Spaces

---

Built with ❤️ by [ayushm98](https://github.com/ayushm98)