ayushm98 committed
Commit dd76f80 · 1 Parent(s): 3ef5fd1

Add comprehensive README with usage docs and architecture

README.md ADDED
# Cascade 🌊

**Intelligent LLM Request Router** - Reduce API costs by 60%+ through smart routing and semantic caching.

[![CI](https://github.com/ayushm98/cascade/actions/workflows/ci.yml/badge.svg)](https://github.com/ayushm98/cascade/actions/workflows/ci.yml)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Overview

Cascade is an intelligent LLM proxy that automatically routes requests to the most cost-effective model based on query complexity. Simple queries go to free local models (Ollama), while complex queries are routed to powerful cloud models (GPT-4o).

### Key Features

- **ML-Powered Routing**: A fine-tuned DistilBERT classifier predicts query complexity in under 20 ms
- **Semantic Caching**: Vector similarity search finds cached responses for similar queries
- **OpenAI Compatible**: Drop-in replacement for the OpenAI API
- **Cost Analytics**: Real-time dashboard showing savings and usage metrics
- **60%+ Cost Reduction**: Typical savings from routing simple queries to free models

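The semantic-caching idea can be illustrated with a minimal sketch: embed the incoming query, compare it against cached embeddings by cosine similarity, and return the cached response when similarity clears the threshold (`0.92` by default, per the configuration table). The in-memory store below is a hypothetical stand-in for the real Redis + Qdrant cache, not its actual API:

```python
import math

SIMILARITY_THRESHOLD = 0.92  # mirrors the SIMILARITY_THRESHOLD default


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


class SemanticCache:
    """Toy in-memory stand-in for the Redis + Qdrant cache."""

    def __init__(self, threshold=SIMILARITY_THRESHOLD):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, embedding):
        # Return the best-matching cached response, or None on a miss.
        best, best_sim = None, 0.0
        for vec, response in self.entries:
            sim = cosine(embedding, vec)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None

    def put(self, embedding, response):
        self.entries.append((embedding, response))
```

A near-duplicate query embedding hits the cache; an unrelated one falls through to the classifier.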
## Architecture

```
┌──────────────────────────────────────────────────────────┐
│                         Cascade                          │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  Request ──► Semantic Cache ──► Cache Hit? ──► Return    │
│                    │                                     │
│                    ▼ (miss)                              │
│              ML Classifier                               │
│                    │                                     │
│         ┌──────────┼──────────┐                          │
│         ▼          ▼          ▼                          │
│      Simple      Medium     Complex                      │
│         │          │          │                          │
│         ▼          ▼          ▼                          │
│     Llama3.2  GPT-4o-mini   GPT-4o                       │
│      (free)   ($0.15/1M)  ($2.50/1M)                     │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

## Quick Start

### Prerequisites

- Python 3.11+
- Docker & Docker Compose (optional)
- Ollama (for local models)
- OpenAI API key

### Installation

```bash
# Clone the repository
git clone https://github.com/ayushm98/cascade.git
cd cascade

# Install dependencies
pip install poetry
poetry install

# Set up environment
cp .env.example .env
# Edit .env with your API keys
```

### Running with Docker

```bash
# Start all services
docker-compose up -d

# API available at http://localhost:8000
# UI available at http://localhost:8501
```

### Running Locally

```bash
# Start the API server
poetry run uvicorn cascade.api.main:app --reload

# Start the Streamlit UI (in another terminal)
poetry run streamlit run src/cascade/ui/app.py
```

## Usage

### API Usage

Cascade is OpenAI-compatible. Just change your base URL:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Uses your configured key
)

# Automatic routing based on complexity
response = client.chat.completions.create(
    model="auto",  # Let Cascade choose the best model
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
```

### Forcing a Specific Model

```python
# Force GPT-4o for complex tasks
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a compiler..."}]
)
```

### Checking Stats

```bash
curl http://localhost:8000/v1/stats
```

```json
{
  "total_requests": 1247,
  "cost": {
    "actual": 2.34,
    "baseline": 7.89,
    "saved_dollars": 5.55,
    "saved_percentage": 70.3
  },
  "cache": {
    "hit_rate": 42.6
  }
}
```

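The savings fields follow directly from `actual` and `baseline` (the cost if every request had gone to the most expensive model). A quick sketch of the arithmetic behind the sample response above:

```python
actual = 2.34    # dollars actually spent
baseline = 7.89  # dollars if every request had used the flagship model

# Savings in dollars and as a percentage of the baseline
saved_dollars = round(baseline - actual, 2)
saved_percentage = round((baseline - actual) / baseline * 100, 1)

print(saved_dollars, saved_percentage)  # 5.55 70.3
```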
## Configuration

| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `OPENAI_API_KEY` | - | OpenAI API key |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `REDIS_HOST` | `localhost` | Redis host |
| `QDRANT_URL` | `http://localhost:6333` | Qdrant server URL |
| `SIMILARITY_THRESHOLD` | `0.92` | Semantic cache similarity threshold (0-1) |
| `CACHE_TTL` | `3600` | Cache TTL in seconds |

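Put together, a `.env` based on this table might look like the following (values are illustrative; the API key is a placeholder, not a real key):

```bash
# .env — illustrative values only
OPENAI_API_KEY=sk-your-key-here
OLLAMA_BASE_URL=http://localhost:11434
REDIS_HOST=localhost
QDRANT_URL=http://localhost:6333
SIMILARITY_THRESHOLD=0.92
CACHE_TTL=3600
```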
## Project Structure

```
cascade/
├── src/cascade/
│   ├── api/            # FastAPI application
│   ├── cache/          # Redis + Qdrant caching
│   ├── cost/           # Cost tracking & analytics
│   ├── providers/      # LLM provider adapters
│   ├── router/         # ML classifier & routing
│   └── ui/             # Streamlit dashboard
├── ml/                 # ML training pipeline
│   ├── data/           # Dataset loading
│   ├── training/       # Model training
│   └── export/         # ONNX conversion
├── tests/              # Test suite
└── docker-compose.yml
```

## How It Works

1. **Request Arrives**: User sends a chat completion request
2. **Cache Check**: Check the semantic cache for similar previous queries
3. **Complexity Classification**: The ML model predicts query complexity (0-1)
4. **Routing Decision**:
   - Score < 0.35 → Ollama (free)
   - Score 0.35-0.70 → GPT-4o-mini ($0.15/1M tokens)
   - Score > 0.70 → GPT-4o ($2.50/1M tokens)
5. **Response**: Forward to the selected model, cache the result, return

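The routing decision in step 4 can be sketched as a simple threshold function (the function name and return values are illustrative, not the actual `cascade.router` API):

```python
# Illustrative thresholds taken from the routing table above.
SIMPLE_MAX = 0.35
MEDIUM_MAX = 0.70


def pick_model(complexity: float) -> str:
    """Map a 0-1 complexity score to a model tier."""
    if complexity < SIMPLE_MAX:
        return "llama3.2"      # local Ollama, free
    if complexity <= MEDIUM_MAX:
        return "gpt-4o-mini"   # $0.15 / 1M tokens
    return "gpt-4o"            # $2.50 / 1M tokens
```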
## Development

```bash
# Run tests
poetry run pytest

# Run linting
poetry run ruff check src/
poetry run black src/

# Train the classifier
python -m ml.training.train --dataset easy2hard --epochs 5

# Export to ONNX
python -m ml.export.convert_to_onnx
```

## Contributing

Contributions are welcome! Please read our contributing guidelines first.

## License

MIT License - see [LICENSE](LICENSE) for details.