---
title: Cascade - Intelligent LLM Router
emoji: 🌊
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---

# Cascade 🌊

Intelligent LLM Request Router - Reduce API costs by 60%+ through smart routing and semantic caching.


## 🚀 Try It Live


Experience Cascade's intelligent routing and cost optimization in action!

## Overview

Cascade is an intelligent LLM proxy that automatically routes requests to the most cost-effective model based on query complexity. Simple queries go to free local models (Ollama), while complex queries are routed to powerful cloud models (GPT-4o).

## Key Features

- **ML-Powered Routing**: A fine-tuned DistilBERT classifier predicts query complexity in <20ms
- **Semantic Caching**: Vector similarity search finds cached responses for similar queries (see the sketch after this list)
- **OpenAI Compatible**: Drop-in replacement for the OpenAI API
- **Cost Analytics**: Real-time dashboard showing savings and usage metrics
- **60%+ Cost Reduction**: Typical savings from routing simple queries to free models
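
For intuition, a semantic cache lookup can be pictured as a nearest-neighbor check over query embeddings. Below is a minimal in-memory sketch; `embed`, `cache_lookup`, and `cache_store` are illustrative names only, and the real implementation embeds with a proper model and stores vectors in Qdrant rather than a Python list:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.92  # default from the Configuration section below

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding stub: a deterministic random unit vector per text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

_cache: list[tuple[np.ndarray, str]] = []  # (query vector, cached response)

def cache_lookup(query: str) -> str | None:
    """Return a cached response whose query vector clears the similarity threshold."""
    q = embed(query)
    for vec, response in _cache:
        # Dot product of unit vectors == cosine similarity
        if float(np.dot(q, vec)) >= SIMILARITY_THRESHOLD:
            return response
    return None

def cache_store(query: str, response: str) -> None:
    _cache.append((embed(query), response))
```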

## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                           Cascade                            │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│   Request ──► Semantic Cache ──► Cache Hit? ──► Return       │
│                     │                                        │
│                     ▼ (miss)                                 │
│              ML Classifier                                   │
│                     │                                        │
│          ┌──────────┼──────────┐                             │
│          ▼          ▼          ▼                             │
│       Simple     Medium     Complex                          │
│          │          │          │                             │
│          ▼          ▼          ▼                             │
│      Llama3.2  GPT-4o-mini  GPT-4o                           │
│       (free)   ($0.15/1M)  ($2.50/1M)                        │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

## Quick Start

### Prerequisites

- Python 3.11+
- Docker & Docker Compose (optional)
- Ollama (for local models)
- OpenAI API key

### Installation

```bash
# Clone the repository
git clone https://github.com/ayushm98/cascade.git
cd cascade

# Install dependencies
pip install poetry
poetry install

# Set up environment
cp .env.example .env
# Edit .env with your API keys
```

### Running with Docker

```bash
# Start all services
docker-compose up -d

# API available at http://localhost:8000
# UI available at http://localhost:8501
```

### Running Locally

```bash
# Start the API server
poetry run uvicorn cascade.api.main:app --reload

# Start the Streamlit UI (in another terminal)
poetry run streamlit run src/cascade/ui/app.py
```

## Usage

### API Usage

Cascade is OpenAI-compatible. Just change your base URL:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Uses your configured key
)

# Automatic routing based on complexity
response = client.chat.completions.create(
    model="auto",  # Let Cascade choose the best model
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
```

### Forcing a Specific Model

```python
# Force GPT-4o for complex tasks
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a compiler..."}]
)
```

### Checking Stats

```bash
curl http://localhost:8000/v1/stats
```

```json
{
  "total_requests": 1247,
  "cost": {
    "actual": 2.34,
    "baseline": 7.89,
    "saved_dollars": 5.55,
    "saved_percentage": 70.3
  },
  "cache": {
    "hit_rate": 42.6
  }
}
```

## Configuration

| Environment Variable | Default | Description |
|----------------------|---------|-------------|
| `OPENAI_API_KEY` | - | OpenAI API key |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `REDIS_HOST` | `localhost` | Redis host |
| `QDRANT_URL` | `http://localhost:6333` | Qdrant server URL |
| `SIMILARITY_THRESHOLD` | `0.92` | Semantic cache threshold |
| `CACHE_TTL` | `3600` | Cache TTL in seconds |
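
In local development these map to a `.env` file along the following lines (values illustrative; `cp .env.example .env` gives you the authoritative template):

```bash
OPENAI_API_KEY=sk-your-key
OLLAMA_BASE_URL=http://localhost:11434
REDIS_HOST=localhost
QDRANT_URL=http://localhost:6333
SIMILARITY_THRESHOLD=0.92
CACHE_TTL=3600
```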

## Project Structure

```
cascade/
├── src/cascade/
│   ├── api/           # FastAPI application
│   ├── cache/         # Redis + Qdrant caching
│   ├── cost/          # Cost tracking & analytics
│   ├── providers/     # LLM provider adapters
│   ├── router/        # ML classifier & routing
│   └── ui/            # Streamlit dashboard
├── ml/                # ML training pipeline
│   ├── data/          # Dataset loading
│   ├── training/      # Model training
│   └── export/        # ONNX conversion
├── tests/             # Test suite
└── docker-compose.yml
```

## How It Works

1. **Request Arrives**: User sends a chat completion request
2. **Cache Check**: Check the semantic cache for similar previous queries
3. **Complexity Classification**: The ML model predicts query complexity (0-1)
4. **Routing Decision** (sketched after this list):
   - Score < 0.35 → Ollama (free)
   - Score 0.35-0.70 → GPT-4o-mini ($0.15/1M tokens)
   - Score > 0.70 → GPT-4o ($2.50/1M tokens)
5. **Response**: Forward to the selected model, cache the result, and return
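
The routing decision itself reduces to a threshold rule over the classifier's score. A minimal sketch, with model names taken from the architecture diagram (the actual routing interface may differ):

```python
def pick_model(score: float) -> str:
    """Map a complexity score in [0, 1] to a target model (illustrative mapping)."""
    if score < 0.35:
        return "llama3.2"     # local Ollama, free
    if score <= 0.70:
        return "gpt-4o-mini"  # $0.15 per 1M tokens
    return "gpt-4o"           # $2.50 per 1M tokens

assert pick_model(0.10) == "llama3.2"    # "What is 2+2?" should stay local
assert pick_model(0.50) == "gpt-4o-mini"
assert pick_model(0.85) == "gpt-4o"
```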

## Development

```bash
# Run tests
make test

# Run linting
make lint

# Format code
make format

# Train the classifier
make train

# Export to ONNX
make export-onnx
```

## Deployment

### Railway (Recommended)

Railway offers the easiest deployment with automatic builds:

```bash
# Install the Railway CLI
npm install -g @railway/cli

# Login and deploy
railway login
railway init
railway up

# Set environment variables in the Railway dashboard:
# - OPENAI_API_KEY
# - REDIS_URL (add the Redis plugin)
```


### Fly.io

```bash
# Install the Fly CLI
curl -L https://fly.io/install.sh | sh

# Login and deploy
fly auth login
fly launch
fly secrets set OPENAI_API_KEY=sk-your-key
fly deploy
```

### Render

1. Fork this repository
2. Connect it to Render
3. Use the `render.yaml` blueprint
4. Set `OPENAI_API_KEY` in the environment variables

### Docker (Self-hosted)

```bash
# Build and run with docker-compose
docker-compose up -d

# Or build manually
docker build -t cascade .
docker run -p 8000:8000 -e OPENAI_API_KEY=sk-xxx cascade
```

### Environment Variables for Production

| Variable | Required | Description |
|----------|----------|-------------|
| `OPENAI_API_KEY` | Yes | Your OpenAI API key |
| `REDIS_URL` | No | Redis connection URL (for caching) |
| `QDRANT_URL` | No | Qdrant URL (for the semantic cache) |
| `PORT` | No | Server port (default: 8000) |
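
For example, a self-hosted run wired to external Redis and Qdrant instances (hostnames illustrative) could look like:

```bash
docker run -p 8000:8000 \
  -e OPENAI_API_KEY=sk-your-key \
  -e REDIS_URL=redis://redis.internal:6379 \
  -e QDRANT_URL=http://qdrant.internal:6333 \
  cascade
```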

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Service info |
| `/health` | GET | Health check |
| `/v1/chat/completions` | POST | OpenAI-compatible chat |
| `/v1/models` | GET | List available models |
| `/v1/stats` | GET | Usage statistics |
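
A quick smoke test against a local instance:

```bash
curl http://localhost:8000/health     # should report the service as healthy
curl http://localhost:8000/v1/models  # lists the models Cascade can route to
```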

## Contributing

Contributions are welcome! Please read our contributing guidelines first.

## License

MIT License - see LICENSE for details.