---
title: Cascade - Intelligent LLM Router
emoji: π
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---
# Cascade

**Intelligent LLM Request Router** - Reduce API costs by 60%+ through smart routing and semantic caching.

## Try It Live

Experience Cascade's intelligent routing and cost optimization in action!
## Overview
Cascade is an intelligent LLM proxy that automatically routes requests to the most cost-effective model based on query complexity. Simple queries go to free local models (Ollama), while complex queries are routed to powerful cloud models (GPT-4o).
## Key Features

- **ML-Powered Routing**: Fine-tuned DistilBERT classifier predicts query complexity in <20ms
- **Semantic Caching**: Vector similarity search finds cached responses for similar queries
- **OpenAI Compatible**: Drop-in replacement for the OpenAI API
- **Cost Analytics**: Real-time dashboard showing savings and usage metrics
- **60%+ Cost Reduction**: Typical savings from routing simple queries to free models
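The semantic cache is, at its core, a nearest-neighbour lookup over query embeddings. Here is a minimal sketch of that idea; the 0.92 threshold matches the default from the configuration table below, but the in-memory store and the `cache_lookup` function are illustrative stand-ins, not Cascade's actual API:

```python
import math

SIMILARITY_THRESHOLD = 0.92  # default from the configuration table below


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cache_lookup(query_vec, cache):
    """Return the cached response whose embedding is most similar to
    query_vec, if it clears the similarity threshold; otherwise None."""
    best_score, best_response = 0.0, None
    for vec, response in cache:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= SIMILARITY_THRESHOLD else None
```

In production this lookup is backed by Qdrant rather than a Python list, but the decision is the same: only a match above the threshold counts as a cache hit.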
## Architecture

```
┌────────────────────────────────────────────────────────────┐
│                          Cascade                           │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Request ──▶ Semantic Cache ──▶ Cache Hit? ──▶ Return      │
│                                     │                      │
│                                     ▼ (miss)               │
│                               ML Classifier                │
│                                     │                      │
│             ┌───────────────────────┼────────────────┐     │
│             ▼                       ▼                ▼     │
│          Simple                   Medium          Complex  │
│             │                       │                │     │
│             ▼                       ▼                ▼     │
│         Llama3.2              GPT-4o-mini         GPT-4o   │
│          (free)               ($0.15/1M)        ($2.50/1M) │
│                                                            │
└────────────────────────────────────────────────────────────┘
```
## Quick Start

### Prerequisites
- Python 3.11+
- Docker & Docker Compose (optional)
- Ollama (for local models)
- OpenAI API key
### Installation
```bash
# Clone the repository
git clone https://github.com/ayushm98/cascade.git
cd cascade

# Install dependencies
pip install poetry
poetry install

# Set up environment
cp .env.example .env
# Edit .env with your API keys
```
### Running with Docker

```bash
# Start all services
docker-compose up -d

# API available at http://localhost:8000
# UI available at http://localhost:8501
```
### Running Locally

```bash
# Start the API server
poetry run uvicorn cascade.api.main:app --reload

# Start the Streamlit UI (in another terminal)
poetry run streamlit run src/cascade/ui/app.py
```
## Usage

### API Usage
Cascade is OpenAI-compatible. Just change your base URL:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Uses your configured key
)

# Automatic routing based on complexity
response = client.chat.completions.create(
    model="auto",  # Let Cascade choose the best model
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
```
### Forcing a Specific Model

```python
# Force GPT-4o for complex tasks
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a compiler..."}]
)
```
### Checking Stats

```bash
curl http://localhost:8000/v1/stats
```

```json
{
  "total_requests": 1247,
  "cost": {
    "actual": 2.34,
    "baseline": 7.89,
    "saved_dollars": 5.55,
    "saved_percentage": 70.3
  },
  "cache": {
    "hit_rate": 42.6
  }
}
```
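The savings figures in the stats payload follow directly from the actual and baseline costs, where the baseline is what the same traffic would have cost without routing. A quick sanity check (the `savings` helper is illustrative, not part of Cascade's code):

```python
def savings(actual: float, baseline: float) -> tuple[float, float]:
    """Dollars saved and percentage saved relative to the baseline cost."""
    saved = baseline - actual
    return round(saved, 2), round(100 * saved / baseline, 1)


# Reproduces the example payload above
print(savings(2.34, 7.89))  # → (5.55, 70.3)
```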
## Configuration

| Environment Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | - | OpenAI API key |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `REDIS_HOST` | `localhost` | Redis host |
| `QDRANT_URL` | `http://localhost:6333` | Qdrant server URL |
| `SIMILARITY_THRESHOLD` | `0.92` | Semantic cache threshold |
| `CACHE_TTL` | `3600` | Cache TTL in seconds |
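Putting those together, a minimal `.env` for local development might look like the following (the values shown are the defaults from the table; only `OPENAI_API_KEY` needs your own secret):

```env
OPENAI_API_KEY=sk-your-key
OLLAMA_BASE_URL=http://localhost:11434
REDIS_HOST=localhost
QDRANT_URL=http://localhost:6333
SIMILARITY_THRESHOLD=0.92
CACHE_TTL=3600
```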
## Project Structure

```
cascade/
├── src/cascade/
│   ├── api/          # FastAPI application
│   ├── cache/        # Redis + Qdrant caching
│   ├── cost/         # Cost tracking & analytics
│   ├── providers/    # LLM provider adapters
│   ├── router/       # ML classifier & routing
│   └── ui/           # Streamlit dashboard
├── ml/               # ML training pipeline
│   ├── data/         # Dataset loading
│   ├── training/     # Model training
│   └── export/       # ONNX conversion
├── tests/            # Test suite
└── docker-compose.yml
```
## How It Works

1. **Request Arrives**: User sends a chat completion request
2. **Cache Check**: Check the semantic cache for similar previous queries
3. **Complexity Classification**: ML model predicts query complexity (0-1)
4. **Routing Decision**:
   - Score < 0.35 → Ollama (free)
   - Score 0.35-0.70 → GPT-4o-mini ($0.15/1M tokens)
   - Score > 0.70 → GPT-4o ($2.50/1M tokens)
5. **Response**: Forward to the selected model, cache the result, return
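The routing decision above reduces to a simple threshold function over the classifier's score. A sketch, assuming the thresholds just listed (the local model name `llama3.2` is an assumption based on the architecture diagram, not a confirmed identifier):

```python
def choose_model(score: float) -> str:
    """Map a complexity score in [0, 1] to a model tier,
    using the thresholds described above."""
    if score < 0.35:
        return "llama3.2"      # free local model via Ollama (assumed name)
    if score <= 0.70:
        return "gpt-4o-mini"   # $0.15 per 1M tokens
    return "gpt-4o"            # $2.50 per 1M tokens
```

Because most real-world traffic is simple, even this coarse three-way split shifts the bulk of requests to the free tier, which is where the 60%+ savings come from.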
## Development

```bash
# Run tests
make test

# Run linting
make lint

# Format code
make format

# Train the classifier
make train

# Export to ONNX
make export-onnx
```
## Deployment

### Railway (Recommended)
Railway offers the easiest deployment with automatic builds:
```bash
# Install Railway CLI
npm install -g @railway/cli

# Login and deploy
railway login
railway init
railway up

# Set environment variables in the Railway dashboard:
# - OPENAI_API_KEY
# - REDIS_URL (add Redis plugin)
```
### Fly.io

```bash
# Install Fly CLI
curl -L https://fly.io/install.sh | sh

# Login and deploy
fly auth login
fly launch
fly secrets set OPENAI_API_KEY=sk-your-key
fly deploy
```
### Render

1. Fork this repository
2. Connect it to Render
3. Use the `render.yaml` blueprint
4. Set `OPENAI_API_KEY` in the environment variables
### Docker (Self-hosted)

```bash
# Build and run with docker-compose
docker-compose up -d

# Or build manually
docker build -t cascade .
docker run -p 8000:8000 -e OPENAI_API_KEY=sk-xxx cascade
```
### Environment Variables for Production

| Variable | Required | Description |
|---|---|---|
| `OPENAI_API_KEY` | Yes | Your OpenAI API key |
| `REDIS_URL` | No | Redis connection URL (for caching) |
| `QDRANT_URL` | No | Qdrant URL (for semantic cache) |
| `PORT` | No | Server port (default: 8000) |
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Service info |
| `/health` | GET | Health check |
| `/v1/chat/completions` | POST | OpenAI-compatible chat |
| `/v1/models` | GET | List available models |
| `/v1/stats` | GET | Usage statistics |
## Contributing
Contributions are welcome! Please read our contributing guidelines first.
## License
MIT License - see LICENSE for details.