---
title: Cascade - Intelligent LLM Router
emoji: 🌊
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---
|
|
|
|
|
# Cascade 🌊
|
|
|
|
|
**Intelligent LLM Request Router** - Reduce API costs by 60%+ through smart routing and semantic caching. |
|
|
|
|
|
[![CI](https://github.com/ayushm98/cascade/actions/workflows/ci.yml/badge.svg)](https://github.com/ayushm98/cascade/actions/workflows/ci.yml)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](CONTRIBUTING.md)
|
|
|
|
|
## 🌊 Try It Live
|
|
|
|
|
[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-md.svg)](https://huggingface.co/spaces/ayushm98/cascade)
|
|
|
|
|
Experience Cascade's intelligent routing and cost optimization in action! |
|
|
|
|
|
## Overview |
|
|
|
|
|
Cascade is an intelligent LLM proxy that automatically routes requests to the most cost-effective model based on query complexity. Simple queries go to free local models (Ollama), while complex queries are routed to powerful cloud models (GPT-4o). |
|
|
|
|
|
### Key Features |
|
|
|
|
|
- **ML-Powered Routing**: Fine-tuned DistilBERT classifier predicts query complexity in <20ms |
|
|
- **Semantic Caching**: Vector similarity search finds cached responses for similar queries (see the sketch after this list)
|
|
- **OpenAI Compatible**: Drop-in replacement for OpenAI API |
|
|
- **Cost Analytics**: Real-time dashboard showing savings and usage metrics |
|
|
- **60%+ Cost Reduction**: Typical savings by routing simple queries to free models |
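
For intuition, the caching step looks roughly like the sketch below: embed the incoming query, compare it against cached query embeddings, and return the stored response when similarity clears the threshold. This is a minimal sketch only; the embedding model and in-memory store are illustrative assumptions, not Cascade's internals (the real cache is backed by Redis and Qdrant).

```python
# Minimal sketch of semantic-cache lookup -- illustrative, not Cascade's code.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding model

model = SentenceTransformer("all-MiniLM-L6-v2")
cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached response)

SIMILARITY_THRESHOLD = 0.92  # default from the Configuration table below

def lookup(query: str) -> str | None:
    """Return a cached response if a semantically similar query was seen."""
    q = model.encode(query, normalize_embeddings=True)
    for emb, response in cache:
        # Dot product of unit vectors == cosine similarity.
        if float(np.dot(q, emb)) >= SIMILARITY_THRESHOLD:
            return response
    return None

def store(query: str, response: str) -> None:
    cache.append((model.encode(query, normalize_embeddings=True), response))
```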
|
|
|
|
|
## Architecture |
|
|
|
|
|
```
┌───────────────────────────────────────────────────────────────┐
│                            Cascade                            │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Request ──▶ Semantic Cache ──▶ Cache Hit? ──▶ Return         │
│                                     │                         │
│                                     ▼ (miss)                  │
│                               ML Classifier                   │
│                                     │                         │
│             ┌───────────────────────┼─────────────────┐       │
│             ▼                       ▼                 ▼       │
│          Simple                  Medium            Complex    │
│             │                       │                 │       │
│             ▼                       ▼                 ▼       │
│         Llama3.2             GPT-4o-mini          GPT-4o      │
│          (free)              ($0.15/1M)        ($2.50/1M)     │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```
|
|
|
|
|
## Quick Start |
|
|
|
|
|
### Prerequisites |
|
|
|
|
|
- Python 3.11+ |
|
|
- Docker & Docker Compose (optional) |
|
|
- Ollama (for local models) |
|
|
- OpenAI API key |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash
# Clone the repository
git clone https://github.com/ayushm98/cascade.git
cd cascade

# Install dependencies
pip install poetry
poetry install

# Set up environment
cp .env.example .env
# Edit .env with your API keys
```
|
|
|
|
|
### Running with Docker |
|
|
|
|
|
```bash
# Start all services
docker-compose up -d

# API available at http://localhost:8000
# UI available at http://localhost:8501
```
|
|
|
|
|
### Running Locally |
|
|
|
|
|
```bash
# Start the API server
poetry run uvicorn cascade.api.main:app --reload

# Start the Streamlit UI (in another terminal)
poetry run streamlit run src/cascade/ui/app.py
```
|
|
|
|
|
## Usage |
|
|
|
|
|
### API Usage |
|
|
|
|
|
Cascade is OpenAI-compatible. Just change your base URL: |
|
|
|
|
|
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Uses the key configured on the server
)

# Automatic routing based on complexity
response = client.chat.completions.create(
    model="auto",  # Let Cascade choose the best model
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
```
|
|
|
|
|
### Forcing a Specific Model |
|
|
|
|
|
```python
# Force GPT-4o for complex tasks
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a compiler..."}]
)
```
|
|
|
|
|
### Checking Stats |
|
|
|
|
|
```bash
curl http://localhost:8000/v1/stats
```
|
|
|
|
|
```json
{
  "total_requests": 1247,
  "cost": {
    "actual": 2.34,
    "baseline": 7.89,
    "saved_dollars": 5.55,
    "saved_percentage": 70.3
  },
  "cache": {
    "hit_rate": 42.6
  }
}
```
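
In this example, `saved_dollars` is `baseline - actual` (7.89 - 2.34 = 5.55), and `saved_percentage` is `saved_dollars / baseline` (5.55 / 7.89 ≈ 70.3%). The baseline is presumably what the same traffic would have cost if every request had gone to the cloud model.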
|
|
|
|
|
## Configuration |
|
|
|
|
|
| Environment Variable | Default | Description |
|----------------------|---------|-------------|
| `OPENAI_API_KEY` | - | OpenAI API key |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `REDIS_HOST` | `localhost` | Redis host |
| `QDRANT_URL` | `http://localhost:6333` | Qdrant server URL |
| `SIMILARITY_THRESHOLD` | `0.92` | Minimum similarity for a semantic cache hit |
| `CACHE_TTL` | `3600` | Cache TTL in seconds |
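
A `.env` for local development maps directly onto this table (the API key below is a placeholder):

```bash
# .env -- example values; defaults taken from the table above
OPENAI_API_KEY=sk-your-key-here
OLLAMA_BASE_URL=http://localhost:11434
REDIS_HOST=localhost
QDRANT_URL=http://localhost:6333
SIMILARITY_THRESHOLD=0.92
CACHE_TTL=3600
```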
|
|
|
|
|
## Project Structure |
|
|
|
|
|
```
cascade/
├── src/cascade/
│   ├── api/            # FastAPI application
│   ├── cache/          # Redis + Qdrant caching
│   ├── cost/           # Cost tracking & analytics
│   ├── providers/      # LLM provider adapters
│   ├── router/         # ML classifier & routing
│   └── ui/             # Streamlit dashboard
├── ml/                 # ML training pipeline
│   ├── data/           # Dataset loading
│   ├── training/       # Model training
│   └── export/         # ONNX conversion
├── tests/              # Test suite
└── docker-compose.yml
```
|
|
|
|
|
## How It Works |
|
|
|
|
|
1. **Request Arrives**: User sends a chat completion request
2. **Cache Check**: Look up the semantic cache for similar previous queries
3. **Complexity Classification**: ML model predicts query complexity (0-1)
4. **Routing Decision** (see the sketch after this list):
   - Score < 0.35 → Ollama (free)
   - Score 0.35-0.70 → GPT-4o-mini ($0.15/1M tokens)
   - Score > 0.70 → GPT-4o ($2.50/1M tokens)
5. **Response**: Forward to the selected model, cache the result, return
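
The threshold logic in step 4 boils down to a few comparisons. The sketch below is illustrative only; the names are assumptions, and the actual implementation lives in `src/cascade/router/`:

```python
# Illustrative sketch of the routing thresholds -- not Cascade's actual code.
def route(complexity: float) -> str:
    """Map a classifier score in [0, 1] to a target model."""
    if complexity < 0.35:
        return "llama3.2"     # local Ollama, free
    if complexity <= 0.70:
        return "gpt-4o-mini"  # $0.15/1M tokens
    return "gpt-4o"           # $2.50/1M tokens

assert route(0.10) == "llama3.2"
assert route(0.50) == "gpt-4o-mini"
assert route(0.90) == "gpt-4o"
```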
|
|
|
|
|
## Development |
|
|
|
|
|
```bash
# Run tests
make test

# Run linting
make lint

# Format code
make format

# Train the classifier
make train

# Export to ONNX
make export-onnx
```
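
To get a feel for what `make export-onnx` enables, here is a hedged sketch of querying an exported classifier with `onnxruntime`. The model path, tokenizer checkpoint, input names, and output head are all assumptions and may not match the artifact the pipeline actually produces:

```python
# Hypothetical inference against an exported classifier.
# The path, checkpoint, and output layout below are assumptions.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
session = ort.InferenceSession("models/classifier.onnx")  # assumed path

def complexity(query: str) -> float:
    """Return a complexity score in [0, 1] for a query."""
    enc = tokenizer(query, return_tensors="np", truncation=True)
    (logits,) = session.run(None, {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    })
    # Softmax over the class logits; treat P(most complex class) as the score.
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return float(probs[0, -1])
```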
|
|
|
|
|
## Deployment |
|
|
|
|
|
### Railway (Recommended) |
|
|
|
|
|
Railway offers the easiest deployment with automatic builds: |
|
|
|
|
|
```bash
# Install the Railway CLI
npm install -g @railway/cli

# Login and deploy
railway login
railway init
railway up

# Set environment variables in the Railway dashboard:
# - OPENAI_API_KEY
# - REDIS_URL (add the Redis plugin)
```
|
|
|
|
|
[![Deploy on Railway](https://railway.app/button.svg)](https://railway.app/template/cascade)
|
|
|
|
|
### Fly.io |
|
|
|
|
|
```bash
# Install the Fly CLI
curl -L https://fly.io/install.sh | sh

# Login and deploy
fly auth login
fly launch
fly secrets set OPENAI_API_KEY=sk-your-key
fly deploy
```
|
|
|
|
|
### Render |
|
|
|
|
|
1. Fork this repository |
|
|
2. Connect to Render |
|
|
3. Use the `render.yaml` blueprint |
|
|
4. Set `OPENAI_API_KEY` in environment variables |
|
|
|
|
|
### Docker (Self-hosted) |
|
|
|
|
|
```bash
# Build and run with docker-compose
docker-compose up -d

# Or build manually
docker build -t cascade .
docker run -p 8000:8000 -e OPENAI_API_KEY=sk-xxx cascade
```
|
|
|
|
|
### Environment Variables for Production |
|
|
|
|
|
| Variable | Required | Description |
|----------|----------|-------------|
| `OPENAI_API_KEY` | Yes | Your OpenAI API key |
| `REDIS_URL` | No | Redis connection URL (for caching) |
| `QDRANT_URL` | No | Qdrant URL (for the semantic cache) |
| `PORT` | No | Server port (default: `8000`) |
|
|
|
|
|
## API Endpoints |
|
|
|
|
|
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Service info |
| `/health` | GET | Health check |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
| `/v1/models` | GET | List available models |
| `/v1/stats` | GET | Usage statistics |
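
Because the chat endpoint follows the OpenAI schema, you can hit it directly with `curl` as well as through the SDK:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }'
```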
|
|
|
|
|
## Contributing |
|
|
|
|
|
Contributions are welcome! Please read the [contributing guidelines](CONTRIBUTING.md) first.
|
|
|
|
|
## License |
|
|
|
|
|
MIT License - see [LICENSE](LICENSE) for details. |
|
|
|