---
title: Cascade - Intelligent LLM Router
emoji: 🌊
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
---
# Cascade 🌊
**Intelligent LLM Request Router** - Reduce API costs by 60%+ through smart routing and semantic caching.
[![CI](https://github.com/ayushm98/cascade/actions/workflows/ci.yml/badge.svg)](https://github.com/ayushm98/cascade/actions/workflows/ci.yml)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](CONTRIBUTING.md)
## 🚀 Try It Live
[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm-dark.svg)](https://huggingface.co/spaces/ayushm98/cascade)
Experience Cascade's intelligent routing and cost optimization in action!
## Overview
Cascade is an intelligent LLM proxy that automatically routes requests to the most cost-effective model based on query complexity. Simple queries go to free local models (Ollama), while complex queries are routed to powerful cloud models (GPT-4o).
### Key Features
- **ML-Powered Routing**: Fine-tuned DistilBERT classifier predicts query complexity in <20ms
- **Semantic Caching**: Vector similarity search finds cached responses for similar queries (see the sketch below)
- **OpenAI Compatible**: Drop-in replacement for OpenAI API
- **Cost Analytics**: Real-time dashboard showing savings and usage metrics
- **60%+ Cost Reduction**: Typical savings from routing simple queries to free local models
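
To illustrate the semantic-caching idea, here is a minimal sketch, not Cascade's actual implementation (the real system uses Qdrant as the vector store). The `lookup` helper and the in-memory cache list are hypothetical: embed the incoming query, find the most similar cached query, and reuse its response when cosine similarity clears the configured threshold (0.92 by default):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(query_vec: np.ndarray,
           cache: list[tuple[np.ndarray, str]],
           threshold: float = 0.92) -> str | None:
    """Return the response cached for the most similar query, or None on a miss."""
    best_score, best_response = -1.0, None
    for cached_vec, cached_response in cache:
        score = cosine_similarity(query_vec, cached_vec)
        if score > best_score:
            best_score, best_response = score, cached_response
    return best_response if best_score >= threshold else None
```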
## Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                           Cascade                            │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Request ──► Semantic Cache ──► Cache Hit? ──► Return        │
│                                     │                        │
│                                     ▼ (miss)                 │
│                               ML Classifier                  │
│                                     │                        │
│                          ┌──────────┼──────────┐             │
│                          ▼          ▼          ▼             │
│                       Simple     Medium     Complex          │
│                          │          │          │             │
│                          ▼          ▼          ▼             │
│                      Llama3.2  GPT-4o-mini  GPT-4o           │
│                       (free)   ($0.15/1M) ($2.50/1M)         │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
## Quick Start
### Prerequisites
- Python 3.11+
- Docker & Docker Compose (optional)
- Ollama (for local models)
- OpenAI API key
### Installation
```bash
# Clone the repository
git clone https://github.com/ayushm98/cascade.git
cd cascade
# Install dependencies
pip install poetry
poetry install
# Set up environment
cp .env.example .env
# Edit .env with your API keys
```
### Running with Docker
```bash
# Start all services
docker-compose up -d
# API available at http://localhost:8000
# UI available at http://localhost:8501
```
### Running Locally
```bash
# Start the API server
poetry run uvicorn cascade.api.main:app --reload
# Start the Streamlit UI (in another terminal)
poetry run streamlit run src/cascade/ui/app.py
```
## Usage
### API Usage
Cascade is OpenAI-compatible. Just change your base URL:
```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Uses your configured key
)

# Automatic routing based on complexity
response = client.chat.completions.create(
    model="auto",  # Let Cascade choose the best model
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
```
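The response follows the standard OpenAI schema, so you read the reply as usual. Assuming Cascade populates the `model` field with the model it actually routed to (an assumption about the proxy's behavior, not stated above), you can also log which tier served the request:

```python
print(response.choices[0].message.content)
print(response.model)  # which model Cascade selected, if the proxy reports it
```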
### Forcing a Specific Model
```python
# Force GPT-4o for complex tasks
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a compiler..."}],
)
```
### Checking Stats
```bash
curl http://localhost:8000/v1/stats
```
```json
{
  "total_requests": 1247,
  "cost": {
    "actual": 2.34,
    "baseline": 7.89,
    "saved_dollars": 5.55,
    "saved_percentage": 70.3
  },
  "cache": {
    "hit_rate": 42.6
  }
}
```
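The savings fields are consistent with each other: `saved_dollars = baseline - actual = 7.89 - 2.34 = 5.55`, and `saved_percentage = 5.55 / 7.89 ≈ 70.3%`, where `baseline` is presumably what the same traffic would have cost without routing.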
## Configuration
| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `OPENAI_API_KEY` | - | OpenAI API key |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `REDIS_HOST` | `localhost` | Redis host |
| `QDRANT_URL` | `http://localhost:6333` | Qdrant server URL |
| `SIMILARITY_THRESHOLD` | `0.92` | Semantic cache threshold |
| `CACHE_TTL` | `3600` | Cache TTL in seconds |
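
As a concrete example, a `.env` for a fully local setup might look like this (all values are the defaults from the table above, plus a placeholder API key):

```bash
OPENAI_API_KEY=sk-your-key
OLLAMA_BASE_URL=http://localhost:11434
REDIS_HOST=localhost
QDRANT_URL=http://localhost:6333
SIMILARITY_THRESHOLD=0.92
CACHE_TTL=3600
```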
## Project Structure
```
cascade/
├── src/cascade/
│   ├── api/        # FastAPI application
│   ├── cache/      # Redis + Qdrant caching
│   ├── cost/       # Cost tracking & analytics
│   ├── providers/  # LLM provider adapters
│   ├── router/     # ML classifier & routing
│   └── ui/         # Streamlit dashboard
├── ml/             # ML training pipeline
│   ├── data/       # Dataset loading
│   ├── training/   # Model training
│   └── export/     # ONNX conversion
├── tests/          # Test suite
└── docker-compose.yml
```
## How It Works
1. **Request Arrives**: User sends a chat completion request
2. **Cache Check**: Check semantic cache for similar previous queries
3. **Complexity Classification**: ML model predicts query complexity (0-1)
4. **Routing Decision** (a minimal sketch of this logic follows the list):
   - Score < 0.35 → Ollama (free)
   - Score 0.35-0.70 → GPT-4o-mini ($0.15/1M tokens)
   - Score > 0.70 → GPT-4o ($2.50/1M tokens)
5. **Response**: Forward to selected model, cache result, return
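
The threshold logic in step 4 boils down to a couple of comparisons. The sketch below is illustrative only; the function name and returned model identifiers are placeholders, and the real router lives in `src/cascade/router/`:

```python
def pick_model(complexity: float) -> str:
    """Map a complexity score in [0, 1] to a model tier (thresholds from step 4)."""
    if complexity < 0.35:
        return "llama3.2"     # free, served locally via Ollama
    if complexity <= 0.70:
        return "gpt-4o-mini"  # $0.15 per 1M tokens
    return "gpt-4o"           # $2.50 per 1M tokens
```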
## Development
```bash
# Run tests
make test
# Run linting
make lint
# Format code
make format
# Train the classifier
make train
# Export to ONNX
make export-onnx
```
## Deployment
### Railway (Recommended)
Railway offers the easiest deployment with automatic builds:
```bash
# Install Railway CLI
npm install -g @railway/cli
# Login and deploy
railway login
railway init
railway up
# Set environment variables in Railway dashboard:
# - OPENAI_API_KEY
# - REDIS_URL (add Redis plugin)
```
[![Deploy on Railway](https://railway.app/button.svg)](https://railway.app/template/cascade)
### Fly.io
```bash
# Install Fly CLI
curl -L https://fly.io/install.sh | sh
# Login and deploy
fly auth login
fly launch
fly secrets set OPENAI_API_KEY=sk-your-key
fly deploy
```
### Render
1. Fork this repository
2. Connect to Render
3. Use the `render.yaml` blueprint
4. Set `OPENAI_API_KEY` in environment variables
### Docker (Self-hosted)
```bash
# Build and run with docker-compose
docker-compose up -d
# Or build manually
docker build -t cascade .
docker run -p 8000:8000 -e OPENAI_API_KEY=sk-xxx cascade
```
### Environment Variables for Production
| Variable | Required | Description |
|----------|----------|-------------|
| `OPENAI_API_KEY` | Yes | Your OpenAI API key |
| `REDIS_URL` | No | Redis connection URL (for caching) |
| `QDRANT_URL` | No | Qdrant URL (for semantic cache) |
| `PORT` | No | Server port (default: 8000) |
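
For example, a self-hosted container with caching enabled might be started like this (the `REDIS_URL` and `QDRANT_URL` values are illustrative; point them at your own services):

```bash
docker run -p 8000:8000 \
  -e OPENAI_API_KEY=sk-xxx \
  -e REDIS_URL=redis://redis:6379 \
  -e QDRANT_URL=http://qdrant:6333 \
  cascade
```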
## API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Service info |
| `/health` | GET | Health check |
| `/v1/chat/completions` | POST | OpenAI-compatible chat |
| `/v1/models` | GET | List available models |
| `/v1/stats` | GET | Usage statistics |
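
A quick smoke test of a running instance, using the endpoints listed above:

```bash
curl http://localhost:8000/health
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/stats
```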
## Contributing
Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) first.
## License
MIT License - see [LICENSE](LICENSE) for details.