---
title: Cascade - Intelligent LLM Router
emoji: 🌊
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---

# Cascade 🌊

Intelligent LLM Request Router - Reduce API costs by 60%+ through smart routing and semantic caching.


## 🚀 Try It Live


Experience Cascade's intelligent routing and cost optimization in action!

## Overview

Cascade is an intelligent LLM proxy that automatically routes requests to the most cost-effective model based on query complexity. Simple queries go to free local models (Ollama), while complex queries are routed to powerful cloud models (GPT-4o).

## Key Features

- **ML-Powered Routing**: A fine-tuned DistilBERT classifier predicts query complexity in <20ms
- **Semantic Caching**: Vector similarity search finds cached responses for similar queries (see the sketch after this list)
- **OpenAI Compatible**: Drop-in replacement for the OpenAI API
- **Cost Analytics**: Real-time dashboard showing savings and usage metrics
- **60%+ Cost Reduction**: Typical savings from routing simple queries to free models
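
For intuition, a semantic cache lookup can be pictured as a nearest-neighbor check over query embeddings. Below is a minimal in-memory sketch; `embed`, `cache_lookup`, and `cache_store` are illustrative names only, and the real implementation embeds with a proper model and stores vectors in Qdrant rather than a Python list:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.92  # default from the Configuration section below

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding stub: a deterministic random unit vector per text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

_cache: list[tuple[np.ndarray, str]] = []  # (query vector, cached response)

def cache_lookup(query: str) -> str | None:
    """Return a cached response whose query vector clears the similarity threshold."""
    q = embed(query)
    for vec, response in _cache:
        # Dot product of unit vectors == cosine similarity
        if float(np.dot(q, vec)) >= SIMILARITY_THRESHOLD:
            return response
    return None

def cache_store(query: str, response: str) -> None:
    _cache.append((embed(query), response))
```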

## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                           Cascade                            │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│   Request ──► Semantic Cache ──► Cache Hit? ──► Return       │
│                     │                                        │
│                     ▼ (miss)                                 │
│              ML Classifier                                   │
│                     │                                        │
│          ┌──────────┼──────────┐                             │
│          ▼          ▼          ▼                             │
│       Simple     Medium     Complex                          │
│          │          │          │                             │
│          ▼          ▼          ▼                             │
│      Llama3.2  GPT-4o-mini  GPT-4o                           │
│       (free)   ($0.15/1M)  ($2.50/1M)                        │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

## Quick Start

### Prerequisites

- Python 3.11+
- Docker & Docker Compose (optional)
- Ollama (for local models)
- OpenAI API key

### Installation

```bash
# Clone the repository
git clone https://github.com/ayushm98/cascade.git
cd cascade

# Install dependencies
pip install poetry
poetry install

# Set up environment
cp .env.example .env
# Edit .env with your API keys
```

### Running with Docker

```bash
# Start all services
docker-compose up -d

# API available at http://localhost:8000
# UI available at http://localhost:8501
```

### Running Locally

```bash
# Start the API server
poetry run uvicorn cascade.api.main:app --reload

# Start the Streamlit UI (in another terminal)
poetry run streamlit run src/cascade/ui/app.py
```

## Usage

### API Usage

Cascade is OpenAI-compatible. Just change your base URL:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Uses your configured key
)

# Automatic routing based on complexity
response = client.chat.completions.create(
    model="auto",  # Let Cascade choose the best model
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
```

### Forcing a Specific Model

```python
# Force GPT-4o for complex tasks
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a compiler..."}]
)
```

### Checking Stats

```bash
curl http://localhost:8000/v1/stats
```

```json
{
  "total_requests": 1247,
  "cost": {
    "actual": 2.34,
    "baseline": 7.89,
    "saved_dollars": 5.55,
    "saved_percentage": 70.3
  },
  "cache": {
    "hit_rate": 42.6
  }
}
```

## Configuration

| Environment Variable | Default | Description |
|----------------------|---------|-------------|
| `OPENAI_API_KEY` | - | OpenAI API key |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `REDIS_HOST` | `localhost` | Redis host |
| `QDRANT_URL` | `http://localhost:6333` | Qdrant server URL |
| `SIMILARITY_THRESHOLD` | `0.92` | Semantic cache threshold |
| `CACHE_TTL` | `3600` | Cache TTL in seconds |
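
In local development these map to a `.env` file along the following lines (values illustrative; `cp .env.example .env` gives you the authoritative template):

```bash
OPENAI_API_KEY=sk-your-key
OLLAMA_BASE_URL=http://localhost:11434
REDIS_HOST=localhost
QDRANT_URL=http://localhost:6333
SIMILARITY_THRESHOLD=0.92
CACHE_TTL=3600
```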

## Project Structure

```
cascade/
├── src/cascade/
│   ├── api/           # FastAPI application
│   ├── cache/         # Redis + Qdrant caching
│   ├── cost/          # Cost tracking & analytics
│   ├── providers/     # LLM provider adapters
│   ├── router/        # ML classifier & routing
│   └── ui/            # Streamlit dashboard
├── ml/                # ML training pipeline
│   ├── data/          # Dataset loading
│   ├── training/      # Model training
│   └── export/        # ONNX conversion
├── tests/             # Test suite
└── docker-compose.yml
```

## How It Works

1. **Request Arrives**: User sends a chat completion request
2. **Cache Check**: Check the semantic cache for similar previous queries
3. **Complexity Classification**: The ML model predicts query complexity (0-1)
4. **Routing Decision** (sketched after this list):
   - Score < 0.35 → Ollama (free)
   - Score 0.35-0.70 → GPT-4o-mini ($0.15/1M tokens)
   - Score > 0.70 → GPT-4o ($2.50/1M tokens)
5. **Response**: Forward to the selected model, cache the result, and return
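
The routing decision itself reduces to a threshold rule over the classifier's score. A minimal sketch, with model names taken from the architecture diagram (the actual routing interface may differ):

```python
def pick_model(score: float) -> str:
    """Map a complexity score in [0, 1] to a target model (illustrative mapping)."""
    if score < 0.35:
        return "llama3.2"     # local Ollama, free
    if score <= 0.70:
        return "gpt-4o-mini"  # $0.15 per 1M tokens
    return "gpt-4o"           # $2.50 per 1M tokens

assert pick_model(0.10) == "llama3.2"    # "What is 2+2?" should stay local
assert pick_model(0.50) == "gpt-4o-mini"
assert pick_model(0.85) == "gpt-4o"
```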

## Development

```bash
# Run tests
make test

# Run linting
make lint

# Format code
make format

# Train the classifier
make train

# Export to ONNX
make export-onnx
```

## Deployment

### Railway (Recommended)

Railway offers the easiest deployment with automatic builds:

```bash
# Install the Railway CLI
npm install -g @railway/cli

# Login and deploy
railway login
railway init
railway up

# Set environment variables in the Railway dashboard:
# - OPENAI_API_KEY
# - REDIS_URL (add the Redis plugin)
```


### Fly.io

```bash
# Install the Fly CLI
curl -L https://fly.io/install.sh | sh

# Login and deploy
fly auth login
fly launch
fly secrets set OPENAI_API_KEY=sk-your-key
fly deploy
```

### Render

1. Fork this repository
2. Connect it to Render
3. Use the `render.yaml` blueprint
4. Set `OPENAI_API_KEY` in the environment variables

### Docker (Self-hosted)

```bash
# Build and run with docker-compose
docker-compose up -d

# Or build manually
docker build -t cascade .
docker run -p 8000:8000 -e OPENAI_API_KEY=sk-xxx cascade
```

### Environment Variables for Production

| Variable | Required | Description |
|----------|----------|-------------|
| `OPENAI_API_KEY` | Yes | Your OpenAI API key |
| `REDIS_URL` | No | Redis connection URL (for caching) |
| `QDRANT_URL` | No | Qdrant URL (for the semantic cache) |
| `PORT` | No | Server port (default: 8000) |
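
For example, a self-hosted run wired to external Redis and Qdrant instances (hostnames illustrative) could look like:

```bash
docker run -p 8000:8000 \
  -e OPENAI_API_KEY=sk-your-key \
  -e REDIS_URL=redis://redis.internal:6379 \
  -e QDRANT_URL=http://qdrant.internal:6333 \
  cascade
```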

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Service info |
| `/health` | GET | Health check |
| `/v1/chat/completions` | POST | OpenAI-compatible chat |
| `/v1/models` | GET | List available models |
| `/v1/stats` | GET | Usage statistics |
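
A quick smoke test against a local instance:

```bash
curl http://localhost:8000/health     # should report the service as healthy
curl http://localhost:8000/v1/models  # lists the models Cascade can route to
```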

## Contributing

Contributions are welcome! Please read our contributing guidelines first.

## License

MIT License - see LICENSE for details.