---
title: Cascade - Intelligent LLM Router
emoji: 🌊
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---

# Cascade 🌊

**Intelligent LLM Request Router** - Reduce API costs by 60%+ through smart routing and semantic caching.

[![CI](https://github.com/ayushm98/cascade/actions/workflows/ci.yml/badge.svg)](https://github.com/ayushm98/cascade/actions/workflows/ci.yml)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](CONTRIBUTING.md)

## 🚀 Try It Live

[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm-dark.svg)](https://huggingface.co/spaces/ayushm98/cascade)

Experience Cascade's intelligent routing and cost optimization in action!

## Overview

Cascade is an intelligent LLM proxy that automatically routes requests to the most cost-effective model based on query complexity. Simple queries go to free local models (Ollama), while complex queries are routed to powerful cloud models (GPT-4o).

### Key Features

- **ML-Powered Routing**: Fine-tuned DistilBERT classifier predicts query complexity in <20ms
- **Semantic Caching**: Vector similarity search finds cached responses for similar queries
- **OpenAI Compatible**: Drop-in replacement for OpenAI API
- **Cost Analytics**: Real-time dashboard showing savings and usage metrics
- **60%+ Cost Reduction**: Typical savings by routing simple queries to free models
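To illustrate the semantic-caching idea, a cache hit can be modeled as the cosine similarity between two query embeddings exceeding a threshold. This is a minimal sketch, not Cascade's actual implementation (Cascade uses Qdrant for vector search); the function names are illustrative, and `0.92` is the `SIMILARITY_THRESHOLD` default from the configuration table:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_cache_hit(query_vec: list[float], cached_vec: list[float],
                 threshold: float = 0.92) -> bool:
    # A cached response is reused when the new query's embedding is
    # sufficiently similar to a previously seen query's embedding.
    return cosine_similarity(query_vec, cached_vec) >= threshold
```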

## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                           Cascade                            │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│   Request ──► Semantic Cache ──► Cache Hit? ──► Return       │
│                     │                                        │
│                     ▼ (miss)                                 │
│               ML Classifier                                  │
│                     │                                        │
│          ┌──────────┼──────────┐                             │
│          ▼          ▼          ▼                             │
│       Simple     Medium     Complex                          │
│          │          │          │                             │
│          ▼          ▼          ▼                             │
│      Llama3.2  GPT-4o-mini   GPT-4o                          │
│       (free)   ($0.15/1M)  ($2.50/1M)                        │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

## Quick Start

### Prerequisites

- Python 3.11+
- Docker & Docker Compose (optional)
- Ollama (for local models)
- OpenAI API key

### Installation

```bash
# Clone the repository
git clone https://github.com/ayushm98/cascade.git
cd cascade

# Install dependencies
pip install poetry
poetry install

# Set up environment
cp .env.example .env
# Edit .env with your API keys
```

### Running with Docker

```bash
# Start all services
docker-compose up -d

# API available at http://localhost:8000
# UI available at http://localhost:8501
```

### Running Locally

```bash
# Start the API server
poetry run uvicorn cascade.api.main:app --reload

# Start the Streamlit UI (in another terminal)
poetry run streamlit run src/cascade/ui/app.py
```

## Usage

### API Usage

Cascade is OpenAI-compatible. Just change your base URL:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Uses your configured key
)

# Automatic routing based on complexity
response = client.chat.completions.create(
    model="auto",  # Let Cascade choose the best model
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
```

### Forcing a Specific Model

```python
# Force GPT-4o for complex tasks
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a compiler..."}]
)
```

### Checking Stats

```bash
curl http://localhost:8000/v1/stats
```

```json
{
  "total_requests": 1247,
  "cost": {
    "actual": 2.34,
    "baseline": 7.89,
    "saved_dollars": 5.55,
    "saved_percentage": 70.3
  },
  "cache": {
    "hit_rate": 42.6
  }
}
```
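The cost figures in the stats response are internally consistent: `saved_dollars` is the baseline cost minus the actual cost, and `saved_percentage` is the saving as a fraction of the baseline. A quick check using the example values above:

```python
baseline = 7.89   # estimated cost if every request had gone to the default model
actual = 2.34     # what was actually spent after routing and caching
saved = round(baseline - actual, 2)
saved_pct = round(saved / baseline * 100, 1)
print(saved, saved_pct)  # → 5.55 70.3
```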

## Configuration

| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `OPENAI_API_KEY` | - | OpenAI API key |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `REDIS_HOST` | `localhost` | Redis host |
| `QDRANT_URL` | `http://localhost:6333` | Qdrant server URL |
| `SIMILARITY_THRESHOLD` | `0.92` | Semantic cache threshold |
| `CACHE_TTL` | `3600` | Cache TTL in seconds |
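Putting the table together, a minimal `.env` for local development might look like this (values are the defaults from the table; the API key is a placeholder):

```bash
OPENAI_API_KEY=sk-your-key-here
OLLAMA_BASE_URL=http://localhost:11434
REDIS_HOST=localhost
QDRANT_URL=http://localhost:6333
SIMILARITY_THRESHOLD=0.92
CACHE_TTL=3600
```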

## Project Structure

```
cascade/
├── src/cascade/
│   ├── api/           # FastAPI application
│   ├── cache/         # Redis + Qdrant caching
│   ├── cost/          # Cost tracking & analytics
│   ├── providers/     # LLM provider adapters
│   ├── router/        # ML classifier & routing
│   └── ui/            # Streamlit dashboard
├── ml/                # ML training pipeline
│   ├── data/          # Dataset loading
│   ├── training/      # Model training
│   └── export/        # ONNX conversion
├── tests/             # Test suite
└── docker-compose.yml
```

## How It Works

1. **Request Arrives**: User sends a chat completion request
2. **Cache Check**: Check semantic cache for similar previous queries
3. **Complexity Classification**: ML model predicts query complexity (0-1)
4. **Routing Decision**:
   - Score < 0.35 → Ollama (free)
   - Score 0.35-0.70 → GPT-4o-mini ($0.15/1M tokens)
   - Score > 0.70 → GPT-4o ($2.50/1M tokens)
5. **Response**: Forward to selected model, cache result, return
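The routing decision in step 4 can be sketched as a small function. The thresholds and model names come from the list above, but the function name and exact boundary handling are illustrative, not Cascade's actual API:

```python
def pick_model(complexity: float) -> str:
    """Map a complexity score in [0, 1] to a target model."""
    if complexity < 0.35:
        return "llama3.2"      # local Ollama, free
    elif complexity <= 0.70:
        return "gpt-4o-mini"   # $0.15/1M tokens
    else:
        return "gpt-4o"        # $2.50/1M tokens
```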

## Development

```bash
# Run tests
make test

# Run linting
make lint

# Format code
make format

# Train the classifier
make train

# Export to ONNX
make export-onnx
```

## Deployment

### Railway (Recommended)

Railway offers the easiest deployment with automatic builds:

```bash
# Install Railway CLI
npm install -g @railway/cli

# Login and deploy
railway login
railway init
railway up

# Set environment variables in Railway dashboard:
# - OPENAI_API_KEY
# - REDIS_URL (add Redis plugin)
```

[![Deploy on Railway](https://railway.app/button.svg)](https://railway.app/template/cascade)

### Fly.io

```bash
# Install Fly CLI
curl -L https://fly.io/install.sh | sh

# Login and deploy
fly auth login
fly launch
fly secrets set OPENAI_API_KEY=sk-your-key
fly deploy
```

### Render

1. Fork this repository
2. Connect to Render
3. Use the `render.yaml` blueprint
4. Set `OPENAI_API_KEY` in environment variables

### Docker (Self-hosted)

```bash
# Build and run with docker-compose
docker-compose up -d

# Or build manually
docker build -t cascade .
docker run -p 8000:8000 -e OPENAI_API_KEY=sk-xxx cascade
```

### Environment Variables for Production

| Variable | Required | Description |
|----------|----------|-------------|
| `OPENAI_API_KEY` | Yes | Your OpenAI API key |
| `REDIS_URL` | No | Redis connection URL (for caching) |
| `QDRANT_URL` | No | Qdrant URL (for semantic cache) |
| `PORT` | No | Server port (default: 8000) |

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Service info |
| `/health` | GET | Health check |
| `/v1/chat/completions` | POST | OpenAI-compatible chat |
| `/v1/models` | GET | List available models |
| `/v1/stats` | GET | Usage statistics |

## Contributing

Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) before opening a pull request.

## License

MIT License - see [LICENSE](LICENSE) for details.