---
title: Cascade - Intelligent LLM Router
emoji: 🌊
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---

# Cascade 🌊

**Intelligent LLM Request Router** - Reduce API costs by 60%+ through smart routing and semantic caching.

[![CI](https://github.com/ayushm98/cascade/actions/workflows/ci.yml/badge.svg)](https://github.com/ayushm98/cascade/actions/workflows/ci.yml)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](CONTRIBUTING.md)

## 🚀 Try It Live

[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm-dark.svg)](https://huggingface.co/spaces/ayushm98/cascade)

Experience Cascade's intelligent routing and cost optimization in action!

## Overview

Cascade is an intelligent LLM proxy that automatically routes requests to the most cost-effective model based on query complexity. Simple queries go to free local models (Ollama), while complex queries are routed to powerful cloud models (GPT-4o).

### Key Features

- **ML-Powered Routing**: Fine-tuned DistilBERT classifier predicts query complexity in <20ms
- **Semantic Caching**: Vector similarity search finds cached responses for similar queries
- **OpenAI Compatible**: Drop-in replacement for the OpenAI API
- **Cost Analytics**: Real-time dashboard showing savings and usage metrics
- **60%+ Cost Reduction**: Typical savings from routing simple queries to free local models

## Architecture

```
┌───────────────────────────────────────────────────────┐
│                        Cascade                        │
├───────────────────────────────────────────────────────┤
│                                                       │
│  Request ──► Semantic Cache ──► Cache Hit? ──► Return │
│                    │                                  │
│                    ▼ (miss)                           │
│              ML Classifier                            │
│                    │                                  │
│         ┌──────────┼──────────┐                       │
│         ▼          ▼          ▼                       │
│      Simple      Medium     Complex                   │
│         │          │          │                       │
│         ▼          ▼          ▼                       │
│     Llama3.2  GPT-4o-mini   GPT-4o                    │
│      (free)   ($0.15/1M)  ($2.50/1M)                  │
│                                                       │
└───────────────────────────────────────────────────────┘
```

## Quick Start

### Prerequisites

- Python 3.11+
- Docker & Docker Compose (optional)
- Ollama (for local models)
- OpenAI API key

### Installation

```bash
# Clone the repository
git clone https://github.com/ayushm98/cascade.git
cd cascade

# Install dependencies
pip install poetry
poetry install

# Set up environment
cp .env.example .env
# Edit .env with your API keys
```

### Running with Docker

```bash
# Start all services
docker-compose up -d

# API available at http://localhost:8000
# UI available at http://localhost:8501
```

### Running Locally

```bash
# Start the API server
poetry run uvicorn cascade.api.main:app --reload

# Start the Streamlit UI (in another terminal)
poetry run streamlit run src/cascade/ui/app.py
```
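Once the server is up, a quick smoke test confirms the API is reachable — a minimal sketch, assuming the default port and the `requests` package:

```python
import requests

# Hit the health-check endpoint (see "API Endpoints" below)
resp = requests.get("http://localhost:8000/health", timeout=5)
resp.raise_for_status()

# List the models Cascade exposes through its OpenAI-compatible API
print(requests.get("http://localhost:8000/v1/models", timeout=5).json())
```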
## Usage

### API Usage

Cascade is OpenAI-compatible. Just change your base URL:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Uses your configured key
)

# Automatic routing based on complexity
response = client.chat.completions.create(
    model="auto",  # Let Cascade choose the best model
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
```

### Forcing a Specific Model

```python
# Force GPT-4o for complex tasks
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a compiler..."}]
)
```

### Checking Stats

```bash
curl http://localhost:8000/v1/stats
```

```json
{
  "total_requests": 1247,
  "cost": {
    "actual": 2.34,
    "baseline": 7.89,
    "saved_dollars": 5.55,
    "saved_percentage": 70.3
  },
  "cache": {
    "hit_rate": 42.6
  }
}
```

## Configuration

| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `OPENAI_API_KEY` | - | OpenAI API key |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `REDIS_HOST` | `localhost` | Redis host |
| `QDRANT_URL` | `http://localhost:6333` | Qdrant server URL |
| `SIMILARITY_THRESHOLD` | `0.92` | Semantic cache similarity threshold |
| `CACHE_TTL` | `3600` | Cache TTL in seconds |

## Project Structure

```
cascade/
├── src/cascade/
│   ├── api/         # FastAPI application
│   ├── cache/       # Redis + Qdrant caching
│   ├── cost/        # Cost tracking & analytics
│   ├── providers/   # LLM provider adapters
│   ├── router/      # ML classifier & routing
│   └── ui/          # Streamlit dashboard
├── ml/              # ML training pipeline
│   ├── data/        # Dataset loading
│   ├── training/    # Model training
│   └── export/      # ONNX conversion
├── tests/           # Test suite
└── docker-compose.yml
```

## How It Works

1. **Request Arrives**: User sends a chat completion request
2. **Cache Check**: Check the semantic cache for similar previous queries (see the first sketch below)
3. **Complexity Classification**: ML model predicts query complexity (0-1)
4. **Routing Decision** (see the second sketch below):
   - Score < 0.35 → Ollama (free)
   - Score 0.35-0.70 → GPT-4o-mini ($0.15/1M tokens)
   - Score > 0.70 → GPT-4o ($2.50/1M tokens)
5. **Response**: Forward to the selected model, cache the result, and return it
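Step 2 boils down to a nearest-neighbor search over embeddings of past prompts. A minimal in-memory sketch of the idea — the real implementation uses Qdrant; the cosine similarity loop and in-memory list here are illustrative stand-ins — using the `SIMILARITY_THRESHOLD` default from the configuration table:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.92  # default from the configuration table

# (embedding, cached response) pairs; Qdrant plays this role in Cascade
cache: list[tuple[np.ndarray, str]] = []

def lookup(query_vec: np.ndarray) -> str | None:
    """Return a cached response if a stored query is similar enough."""
    for vec, response in cache:
        cosine = float(vec @ query_vec / (np.linalg.norm(vec) * np.linalg.norm(query_vec)))
        if cosine >= SIMILARITY_THRESHOLD:
            return response  # cache hit: skip the LLM entirely
    return None              # miss: fall through to classification and routing
```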
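The routing decision in step 4 is just a pair of threshold comparisons on the classifier's score. A minimal sketch — the function name and exact model identifiers are illustrative, not Cascade's internals:

```python
def pick_model(complexity: float) -> str:
    """Map a 0-1 complexity score to a model tier using the thresholds above."""
    if complexity < 0.35:
        return "llama3.2"     # free, served locally by Ollama
    if complexity <= 0.70:
        return "gpt-4o-mini"  # $0.15 per 1M tokens
    return "gpt-4o"           # $2.50 per 1M tokens
```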
## Development

```bash
# Run tests
make test

# Run linting
make lint

# Format code
make format

# Train the classifier
make train

# Export to ONNX
make export-onnx
```

## Deployment

### Railway (Recommended)

Railway offers the easiest deployment with automatic builds:

```bash
# Install Railway CLI
npm install -g @railway/cli

# Login and deploy
railway login
railway init
railway up

# Set environment variables in the Railway dashboard:
# - OPENAI_API_KEY
# - REDIS_URL (add the Redis plugin)
```

[![Deploy on Railway](https://railway.app/button.svg)](https://railway.app/template/cascade)

### Fly.io

```bash
# Install Fly CLI
curl -L https://fly.io/install.sh | sh

# Login and deploy
fly auth login
fly launch
fly secrets set OPENAI_API_KEY=sk-your-key
fly deploy
```

### Render

1. Fork this repository
2. Connect it to Render
3. Use the `render.yaml` blueprint
4. Set `OPENAI_API_KEY` in environment variables

### Docker (Self-hosted)

```bash
# Build and run with docker-compose
docker-compose up -d

# Or build manually
docker build -t cascade .
docker run -p 8000:8000 -e OPENAI_API_KEY=sk-xxx cascade
```

### Environment Variables for Production

| Variable | Required | Description |
|----------|----------|-------------|
| `OPENAI_API_KEY` | Yes | Your OpenAI API key |
| `REDIS_URL` | No | Redis connection URL (for caching) |
| `QDRANT_URL` | No | Qdrant URL (for the semantic cache) |
| `PORT` | No | Server port (default: 8000) |

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Service info |
| `/health` | GET | Health check |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions |
| `/v1/models` | GET | List available models |
| `/v1/stats` | GET | Usage statistics |

## Contributing

Contributions are welcome! Please read our [contributing guidelines](CONTRIBUTING.md) first.

## License

MIT License - see [LICENSE](LICENSE) for details.