---
title: Cascade - Intelligent LLM Router
emoji: 🌊
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---
# Cascade 🌊
**Intelligent LLM Request Router** - Reduce API costs by 60%+ through smart routing and semantic caching.
[![CI](https://github.com/ayushm98/cascade/actions/workflows/ci.yml/badge.svg)](https://github.com/ayushm98/cascade/actions/workflows/ci.yml)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](CONTRIBUTING.md)
## 🚀 Try It Live
[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm-dark.svg)](https://huggingface.co/spaces/ayushm98/cascade)
Experience Cascade's intelligent routing and cost optimization in action!
## Overview
Cascade is an intelligent LLM proxy that automatically routes requests to the most cost-effective model based on query complexity. Simple queries go to free local models (Ollama), while complex queries are routed to powerful cloud models (GPT-4o).
### Key Features
- **ML-Powered Routing**: Fine-tuned DistilBERT classifier predicts query complexity in <20ms
- **Semantic Caching**: Vector similarity search finds cached responses for similar queries (see the sketch below)
- **OpenAI Compatible**: Drop-in replacement for OpenAI API
- **Cost Analytics**: Real-time dashboard showing savings and usage metrics
- **60%+ Cost Reduction**: Typical savings from routing simple queries to free local models
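
To illustrate the semantic-caching idea, here is a minimal sketch, not Cascade's actual implementation (the real system uses Qdrant as the vector store). The `lookup` helper and the in-memory cache list are hypothetical: embed the incoming query, find the most similar cached query, and reuse its response when cosine similarity clears the configured threshold (0.92 by default):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(query_vec: np.ndarray,
           cache: list[tuple[np.ndarray, str]],
           threshold: float = 0.92) -> str | None:
    """Return the response cached for the most similar query, or None on a miss."""
    best_score, best_response = -1.0, None
    for cached_vec, cached_response in cache:
        score = cosine_similarity(query_vec, cached_vec)
        if score > best_score:
            best_score, best_response = score, cached_response
    return best_response if best_score >= threshold else None
```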
## Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                           Cascade                            │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Request ──► Semantic Cache ──► Cache Hit? ──► Return        │
│                                     │                        │
│                                     ▼ (miss)                 │
│                               ML Classifier                  │
│                                     │                        │
│                          ┌──────────┼──────────┐             │
│                          ▼          ▼          ▼             │
│                       Simple     Medium     Complex          │
│                          │          │          │             │
│                          ▼          ▼          ▼             │
│                      Llama3.2  GPT-4o-mini  GPT-4o           │
│                       (free)   ($0.15/1M) ($2.50/1M)         │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
## Quick Start
### Prerequisites
- Python 3.11+
- Docker & Docker Compose (optional)
- Ollama (for local models)
- OpenAI API key
### Installation
```bash
# Clone the repository
git clone https://github.com/ayushm98/cascade.git
cd cascade
# Install dependencies
pip install poetry
poetry install
# Set up environment
cp .env.example .env
# Edit .env with your API keys
```
### Running with Docker
```bash
# Start all services
docker-compose up -d
# API available at http://localhost:8000
# UI available at http://localhost:8501
```
### Running Locally
```bash
# Start the API server
poetry run uvicorn cascade.api.main:app --reload
# Start the Streamlit UI (in another terminal)
poetry run streamlit run src/cascade/ui/app.py
```
## Usage
### API Usage
Cascade is OpenAI-compatible. Just change your base URL:
```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Uses your configured key
)

# Automatic routing based on complexity
response = client.chat.completions.create(
    model="auto",  # Let Cascade choose the best model
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
```
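The response follows the standard OpenAI schema, so you read the reply as usual. Assuming Cascade populates the `model` field with the model it actually routed to (an assumption about the proxy's behavior, not stated above), you can also log which tier served the request:

```python
print(response.choices[0].message.content)
print(response.model)  # which model Cascade selected, if the proxy reports it
```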
### Forcing a Specific Model
```python
# Force GPT-4o for complex tasks
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a compiler..."}],
)
```
### Checking Stats
```bash
curl http://localhost:8000/v1/stats
```
```json
{
  "total_requests": 1247,
  "cost": {
    "actual": 2.34,
    "baseline": 7.89,
    "saved_dollars": 5.55,
    "saved_percentage": 70.3
  },
  "cache": {
    "hit_rate": 42.6
  }
}
```
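The savings fields are consistent with each other: `saved_dollars = baseline - actual = 7.89 - 2.34 = 5.55`, and `saved_percentage = 5.55 / 7.89 ≈ 70.3%`, where `baseline` is presumably what the same traffic would have cost without routing.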
## Configuration
| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `OPENAI_API_KEY` | - | OpenAI API key |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `REDIS_HOST` | `localhost` | Redis host |
| `QDRANT_URL` | `http://localhost:6333` | Qdrant server URL |
| `SIMILARITY_THRESHOLD` | `0.92` | Semantic cache threshold |
| `CACHE_TTL` | `3600` | Cache TTL in seconds |
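
As a concrete example, a `.env` for a fully local setup might look like this (all values are the defaults from the table above, plus a placeholder API key):

```bash
OPENAI_API_KEY=sk-your-key
OLLAMA_BASE_URL=http://localhost:11434
REDIS_HOST=localhost
QDRANT_URL=http://localhost:6333
SIMILARITY_THRESHOLD=0.92
CACHE_TTL=3600
```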
## Project Structure
```
cascade/
├── src/cascade/
│   ├── api/        # FastAPI application
│   ├── cache/      # Redis + Qdrant caching
│   ├── cost/       # Cost tracking & analytics
│   ├── providers/  # LLM provider adapters
│   ├── router/     # ML classifier & routing
│   └── ui/         # Streamlit dashboard
├── ml/             # ML training pipeline
│   ├── data/       # Dataset loading
│   ├── training/   # Model training
│   └── export/     # ONNX conversion
├── tests/          # Test suite
└── docker-compose.yml
```
## How It Works
1. **Request Arrives**: User sends a chat completion request
2. **Cache Check**: Check semantic cache for similar previous queries
3. **Complexity Classification**: ML model predicts query complexity (0-1)
4. **Routing Decision** (a minimal sketch of this logic follows the list):
   - Score < 0.35 → Ollama (free)
   - Score 0.35-0.70 → GPT-4o-mini ($0.15/1M tokens)
   - Score > 0.70 → GPT-4o ($2.50/1M tokens)
5. **Response**: Forward to selected model, cache result, return
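
The threshold logic in step 4 boils down to a couple of comparisons. The sketch below is illustrative only; the function name and returned model identifiers are placeholders, and the real router lives in `src/cascade/router/`:

```python
def pick_model(complexity: float) -> str:
    """Map a complexity score in [0, 1] to a model tier (thresholds from step 4)."""
    if complexity < 0.35:
        return "llama3.2"     # free, served locally via Ollama
    if complexity <= 0.70:
        return "gpt-4o-mini"  # $0.15 per 1M tokens
    return "gpt-4o"           # $2.50 per 1M tokens
```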
## Development
```bash
# Run tests
make test
# Run linting
make lint
# Format code
make format
# Train the classifier
make train
# Export to ONNX
make export-onnx
```
## Deployment
### Railway (Recommended)
Railway offers the easiest deployment with automatic builds:
```bash
# Install Railway CLI
npm install -g @railway/cli
# Login and deploy
railway login
railway init
railway up
# Set environment variables in Railway dashboard:
# - OPENAI_API_KEY
# - REDIS_URL (add Redis plugin)
```
[![Deploy on Railway](https://railway.app/button.svg)](https://railway.app/template/cascade)
### Fly.io
```bash
# Install Fly CLI
curl -L https://fly.io/install.sh | sh
# Login and deploy
fly auth login
fly launch
fly secrets set OPENAI_API_KEY=sk-your-key
fly deploy
```
### Render
1. Fork this repository
2. Connect to Render
3. Use the `render.yaml` blueprint
4. Set `OPENAI_API_KEY` in environment variables
### Docker (Self-hosted)
```bash
# Build and run with docker-compose
docker-compose up -d
# Or build manually
docker build -t cascade .
docker run -p 8000:8000 -e OPENAI_API_KEY=sk-xxx cascade
```
### Environment Variables for Production
| Variable | Required | Description |
|----------|----------|-------------|
| `OPENAI_API_KEY` | Yes | Your OpenAI API key |
| `REDIS_URL` | No | Redis connection URL (for caching) |
| `QDRANT_URL` | No | Qdrant URL (for semantic cache) |
| `PORT` | No | Server port (default: 8000) |
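
For example, a self-hosted container with caching enabled might be started like this (the `REDIS_URL` and `QDRANT_URL` values are illustrative; point them at your own services):

```bash
docker run -p 8000:8000 \
  -e OPENAI_API_KEY=sk-xxx \
  -e REDIS_URL=redis://redis:6379 \
  -e QDRANT_URL=http://qdrant:6333 \
  cascade
```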
## API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Service info |
| `/health` | GET | Health check |
| `/v1/chat/completions` | POST | OpenAI-compatible chat |
| `/v1/models` | GET | List available models |
| `/v1/stats` | GET | Usage statistics |
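
A quick smoke test of a running instance, using the endpoints listed above:

```bash
curl http://localhost:8000/health
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/stats
```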
## Contributing
Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) first.
## License
MIT License - see [LICENSE](LICENSE) for details.