---
title: Cascade - Intelligent LLM Router
emoji: π
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---
# Cascade
**Intelligent LLM Request Router** - Reduce API costs by 60%+ through smart routing and semantic caching.
[CI](https://github.com/ayushm98/cascade/actions/workflows/ci.yml)
[Python 3.11+](https://www.python.org/downloads/)
[License: MIT](https://opensource.org/licenses/MIT)
[Code style: Black](https://github.com/psf/black)
[Linting: Ruff](https://github.com/astral-sh/ruff)
[Contributing](CONTRIBUTING.md)
## Try It Live
[Try Cascade on Hugging Face Spaces](https://huggingface.co/spaces/ayushm98/cascade)
Experience Cascade's intelligent routing and cost optimization in action!
## Overview
Cascade is an intelligent LLM proxy that automatically routes requests to the most cost-effective model based on query complexity. Simple queries go to free local models (Ollama), while complex queries are routed to powerful cloud models (GPT-4o).
### Key Features
- **ML-Powered Routing**: Fine-tuned DistilBERT classifier predicts query complexity in <20ms
- **Semantic Caching**: Vector similarity search finds cached responses for similar queries
- **OpenAI Compatible**: Drop-in replacement for OpenAI API
- **Cost Analytics**: Real-time dashboard showing savings and usage metrics
- **60%+ Cost Reduction**: Typical savings from routing simple queries to free local models
## Architecture
```
┌───────────────────────────────────────────────────────────────┐
│                            Cascade                            │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Request ──► Semantic Cache ──► Cache Hit? ──► Return         │
│                                     │                         │
│                                     ▼ (miss)                  │
│                               ML Classifier                   │
│                                     │                         │
│          ┌──────────────────────────┼─────────────────┐       │
│          ▼                          ▼                 ▼       │
│        Simple                    Medium            Complex    │
│          │                          │                 │       │
│          ▼                          ▼                 ▼       │
│       Llama3.2                GPT-4o-mini           GPT-4o    │
│        (free)                 ($0.15/1M)          ($2.50/1M)  │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```
## Quick Start
### Prerequisites
- Python 3.11+
- Docker & Docker Compose (optional)
- Ollama (for local models)
- OpenAI API key
### Installation
```bash
# Clone the repository
git clone https://github.com/ayushm98/cascade.git
cd cascade
# Install dependencies
pip install poetry
poetry install
# Set up environment
cp .env.example .env
# Edit .env with your API keys
```
### Running with Docker
```bash
# Start all services
docker-compose up -d
# API available at http://localhost:8000
# UI available at http://localhost:8501
```
### Running Locally
```bash
# Start the API server
poetry run uvicorn cascade.api.main:app --reload
# Start the Streamlit UI (in another terminal)
poetry run streamlit run src/cascade/ui/app.py
```
## Usage
### API Usage
Cascade is OpenAI-compatible. Just change your base URL:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Uses your configured key
)

# Automatic routing based on complexity
response = client.chat.completions.create(
    model="auto",  # Let Cascade choose the best model
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
```
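The response follows the standard OpenAI schema. Assuming Cascade fills the usual `model` field with the backend it selected (a reasonable expectation for an OpenAI-compatible proxy, though not documented here), you can check which model actually served the request:

```python
# Which backend did Cascade route to? (standard field in the OpenAI schema;
# whether Cascade sets it to the routed model is an assumption)
print(response.model)                       # e.g. "llama3.2" for a simple query
print(response.choices[0].message.content)  # the answer itself
```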
### Forcing a Specific Model
```python
# Force GPT-4o for complex tasks
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a compiler..."}],
)
```
### Checking Stats
```bash
curl http://localhost:8000/v1/stats
```
```json
{
  "total_requests": 1247,
  "cost": {
    "actual": 2.34,
    "baseline": 7.89,
    "saved_dollars": 5.55,
    "saved_percentage": 70.3
  },
  "cache": {
    "hit_rate": 42.6
  }
}
```
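For programmatic monitoring, the same endpoint can be polled from Python. Here is a minimal sketch using `requests` and the response shape shown above (`saved_dollars` is simply `baseline - actual`, and `saved_percentage` is `saved_dollars / baseline * 100`, i.e. 5.55 / 7.89 ≈ 70.3%):

```python
import requests

# Fetch usage statistics from a locally running Cascade instance
stats = requests.get("http://localhost:8000/v1/stats").json()

cost = stats["cost"]
print(f"Requests: {stats['total_requests']}")
print(f"Saved ${cost['saved_dollars']:.2f} ({cost['saved_percentage']:.1f}%)")
print(f"Cache hit rate: {stats['cache']['hit_rate']:.1f}%")
```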
## Configuration
| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `OPENAI_API_KEY` | - | OpenAI API key |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `REDIS_HOST` | `localhost` | Redis host |
| `QDRANT_URL` | `http://localhost:6333` | Qdrant server URL |
| `SIMILARITY_THRESHOLD` | `0.92` | Semantic cache threshold |
| `CACHE_TTL` | `3600` | Cache TTL in seconds |
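To illustrate what `SIMILARITY_THRESHOLD` controls: the semantic cache only returns a stored response when the nearest cached query embedding scores above the threshold. The sketch below is illustrative, not Cascade's actual internals; it assumes a `qdrant-client` collection named `cache` with cosine similarity, responses stored in point payloads, and an arbitrary sentence-transformers embedding model:

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

SIMILARITY_THRESHOLD = 0.92  # mirrors the default above

client = QdrantClient(url="http://localhost:6333")
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def cache_lookup(query: str):
    """Return a cached response for a semantically similar query, or None."""
    vector = embedder.encode(query).tolist()
    hits = client.search(collection_name="cache", query_vector=vector, limit=1)
    if hits and hits[0].score >= SIMILARITY_THRESHOLD:
        return hits[0].payload["response"]  # assumed payload layout
    return None
```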
## Project Structure
```
cascade/
├── src/cascade/
│   ├── api/            # FastAPI application
│   ├── cache/          # Redis + Qdrant caching
│   ├── cost/           # Cost tracking & analytics
│   ├── providers/      # LLM provider adapters
│   ├── router/         # ML classifier & routing
│   └── ui/             # Streamlit dashboard
├── ml/                 # ML training pipeline
│   ├── data/           # Dataset loading
│   ├── training/       # Model training
│   └── export/         # ONNX conversion
├── tests/              # Test suite
└── docker-compose.yml
```
## How It Works
1. **Request Arrives**: User sends a chat completion request
2. **Cache Check**: Check semantic cache for similar previous queries
3. **Complexity Classification**: ML model predicts query complexity (0-1)
4. **Routing Decision** (see the sketch after this list):
   - Score < 0.35 → Ollama (free)
   - Score 0.35-0.70 → GPT-4o-mini ($0.15/1M tokens)
   - Score > 0.70 → GPT-4o ($2.50/1M tokens)
5. **Response**: Forward to selected model, cache result, return
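The thresholds above boil down to a very small decision function. This is a sketch of the idea, not Cascade's exact code; the classifier itself is the fine-tuned DistilBERT model mentioned earlier, abstracted here as a complexity score already computed in [0, 1]:

```python
def route(score: float) -> str:
    """Map a complexity score in [0, 1] to a backend model."""
    if score < 0.35:
        return "llama3.2"     # local Ollama, free
    elif score <= 0.70:
        return "gpt-4o-mini"  # $0.15 / 1M tokens
    else:
        return "gpt-4o"       # $2.50 / 1M tokens
```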
## Development
```bash
# Run tests
make test
# Run linting
make lint
# Format code
make format
# Train the classifier
make train
# Export to ONNX
make export-onnx
```
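The exported ONNX model is what makes sub-20ms CPU classification feasible. Below is a hedged sketch of loading and running it with `onnxruntime`; the model path, input names, output shape, and tokenizer checkpoint are assumptions for illustration, not the pipeline's actual artifacts (check `ml/export/` for those):

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Paths and names are illustrative, not the real export artifacts
session = ort.InferenceSession("classifier.onnx")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def complexity(query: str) -> float:
    """Return a complexity score in [0, 1] for a query."""
    enc = tokenizer(query, return_tensors="np", truncation=True, padding=True)
    # Assumes the exported graph has a single output and the standard
    # "input_ids" / "attention_mask" input names
    (logits,) = session.run(
        None,
        {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]},
    )
    # Assumes a single-logit head squashed through a sigmoid
    return float(1 / (1 + np.exp(-logits[0][0])))
```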
## Deployment
### Railway (Recommended)
Railway offers the easiest deployment with automatic builds:
```bash
# Install Railway CLI
npm install -g @railway/cli
# Login and deploy
railway login
railway init
railway up
# Set environment variables in Railway dashboard:
# - OPENAI_API_KEY
# - REDIS_URL (add Redis plugin)
```
[Deploy on Railway](https://railway.app/template/cascade)
### Fly.io
```bash
# Install Fly CLI
curl -L https://fly.io/install.sh | sh
# Login and deploy
fly auth login
fly launch
fly secrets set OPENAI_API_KEY=sk-your-key
fly deploy
```
### Render
1. Fork this repository
2. Connect to Render
3. Use the `render.yaml` blueprint
4. Set `OPENAI_API_KEY` in environment variables
### Docker (Self-hosted)
```bash
# Build and run with docker-compose
docker-compose up -d
# Or build manually
docker build -t cascade .
docker run -p 8000:8000 -e OPENAI_API_KEY=sk-xxx cascade
```
### Environment Variables for Production
| Variable | Required | Description |
|----------|----------|-------------|
| `OPENAI_API_KEY` | Yes | Your OpenAI API key |
| `REDIS_URL` | No | Redis connection URL (for caching) |
| `QDRANT_URL` | No | Qdrant URL (for semantic cache) |
| `PORT` | No | Server port (default: 8000) |
## API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Service info |
| `/health` | GET | Health check |
| `/v1/chat/completions` | POST | OpenAI-compatible chat |
| `/v1/models` | GET | List available models |
| `/v1/stats` | GET | Usage statistics |
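A quick way to smoke-test a running instance is to hit the health and model-listing endpoints from the table above:

```bash
curl http://localhost:8000/health
curl http://localhost:8000/v1/models
```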
## Contributing
Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) first.
## License
MIT License - see [LICENSE](LICENSE) for details.