ayushm98 committed
Commit dd76f80 · 1 Parent(s): 3ef5fd1

Add comprehensive README with usage docs and architecture

README.md ADDED
# Cascade 🌊

**Intelligent LLM Request Router** - Reduce API costs by 60%+ through smart routing and semantic caching.

[![CI](https://github.com/ayushm98/cascade/actions/workflows/ci.yml/badge.svg)](https://github.com/ayushm98/cascade/actions/workflows/ci.yml)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Overview

Cascade is an intelligent LLM proxy that automatically routes requests to the most cost-effective model based on query complexity. Simple queries go to free local models (Ollama), while complex queries are routed to powerful cloud models (GPT-4o).

### Key Features

- **ML-Powered Routing**: A fine-tuned DistilBERT classifier predicts query complexity in under 20 ms
- **Semantic Caching**: Vector similarity search finds cached responses for similar queries
- **OpenAI Compatible**: Drop-in replacement for the OpenAI API
- **Cost Analytics**: Real-time dashboard showing savings and usage metrics
- **60%+ Cost Reduction**: Typical savings from routing simple queries to free models

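The semantic-caching idea can be illustrated with a minimal sketch: embed the incoming query, compare it against cached embeddings by cosine similarity, and return the cached response when similarity clears the threshold (`0.92` by default, per the configuration table). The in-memory store below is a hypothetical stand-in for the real Redis + Qdrant cache, not its actual API:

```python
import math

SIMILARITY_THRESHOLD = 0.92  # mirrors the SIMILARITY_THRESHOLD default


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


class SemanticCache:
    """Toy in-memory stand-in for the Redis + Qdrant cache."""

    def __init__(self, threshold=SIMILARITY_THRESHOLD):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, embedding):
        # Return the best-matching cached response, or None on a miss.
        best, best_sim = None, 0.0
        for vec, response in self.entries:
            sim = cosine(embedding, vec)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None

    def put(self, embedding, response):
        self.entries.append((embedding, response))
```

A near-duplicate query embedding hits the cache; an unrelated one falls through to the classifier.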
## Architecture

```
┌──────────────────────────────────────────────────────────┐
│                         Cascade                          │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  Request ──► Semantic Cache ──► Cache Hit? ──► Return    │
│                    │                                     │
│                    ▼ (miss)                              │
│              ML Classifier                               │
│                    │                                     │
│         ┌──────────┼──────────┐                          │
│         ▼          ▼          ▼                          │
│      Simple      Medium     Complex                      │
│         │          │          │                          │
│         ▼          ▼          ▼                          │
│     Llama3.2  GPT-4o-mini   GPT-4o                       │
│      (free)   ($0.15/1M)  ($2.50/1M)                     │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

## Quick Start

### Prerequisites

- Python 3.11+
- Docker & Docker Compose (optional)
- Ollama (for local models)
- OpenAI API key

### Installation

```bash
# Clone the repository
git clone https://github.com/ayushm98/cascade.git
cd cascade

# Install dependencies
pip install poetry
poetry install

# Set up environment
cp .env.example .env
# Edit .env with your API keys
```

### Running with Docker

```bash
# Start all services
docker-compose up -d

# API available at http://localhost:8000
# UI available at http://localhost:8501
```

### Running Locally

```bash
# Start the API server
poetry run uvicorn cascade.api.main:app --reload

# Start the Streamlit UI (in another terminal)
poetry run streamlit run src/cascade/ui/app.py
```

## Usage

### API Usage

Cascade is OpenAI-compatible. Just change your base URL:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Uses your configured key
)

# Automatic routing based on complexity
response = client.chat.completions.create(
    model="auto",  # Let Cascade choose the best model
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
```

### Forcing a Specific Model

```python
# Force GPT-4o for complex tasks
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a compiler..."}]
)
```

### Checking Stats

```bash
curl http://localhost:8000/v1/stats
```

```json
{
  "total_requests": 1247,
  "cost": {
    "actual": 2.34,
    "baseline": 7.89,
    "saved_dollars": 5.55,
    "saved_percentage": 70.3
  },
  "cache": {
    "hit_rate": 42.6
  }
}
```

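The savings fields follow directly from `actual` and `baseline` (the cost if every request had gone to the most expensive model). A quick sketch of the arithmetic behind the sample response above:

```python
actual = 2.34    # dollars actually spent
baseline = 7.89  # dollars if every request had used the flagship model

# Savings in dollars and as a percentage of the baseline
saved_dollars = round(baseline - actual, 2)
saved_percentage = round((baseline - actual) / baseline * 100, 1)

print(saved_dollars, saved_percentage)  # 5.55 70.3
```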
## Configuration

| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `OPENAI_API_KEY` | - | OpenAI API key |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `REDIS_HOST` | `localhost` | Redis host |
| `QDRANT_URL` | `http://localhost:6333` | Qdrant server URL |
| `SIMILARITY_THRESHOLD` | `0.92` | Semantic cache similarity threshold (0-1) |
| `CACHE_TTL` | `3600` | Cache TTL in seconds |

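Put together, a `.env` based on this table might look like the following (values are illustrative; the API key is a placeholder, not a real key):

```bash
# .env — illustrative values only
OPENAI_API_KEY=sk-your-key-here
OLLAMA_BASE_URL=http://localhost:11434
REDIS_HOST=localhost
QDRANT_URL=http://localhost:6333
SIMILARITY_THRESHOLD=0.92
CACHE_TTL=3600
```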
## Project Structure

```
cascade/
├── src/cascade/
│   ├── api/            # FastAPI application
│   ├── cache/          # Redis + Qdrant caching
│   ├── cost/           # Cost tracking & analytics
│   ├── providers/      # LLM provider adapters
│   ├── router/         # ML classifier & routing
│   └── ui/             # Streamlit dashboard
├── ml/                 # ML training pipeline
│   ├── data/           # Dataset loading
│   ├── training/       # Model training
│   └── export/         # ONNX conversion
├── tests/              # Test suite
└── docker-compose.yml
```

## How It Works

1. **Request Arrives**: User sends a chat completion request
2. **Cache Check**: Check the semantic cache for similar previous queries
3. **Complexity Classification**: The ML model predicts query complexity (0-1)
4. **Routing Decision**:
   - Score < 0.35 → Ollama (free)
   - Score 0.35-0.70 → GPT-4o-mini ($0.15/1M tokens)
   - Score > 0.70 → GPT-4o ($2.50/1M tokens)
5. **Response**: Forward to the selected model, cache the result, return

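The routing decision in step 4 can be sketched as a simple threshold function (the function name and return values are illustrative, not the actual `cascade.router` API):

```python
# Illustrative thresholds taken from the routing table above.
SIMPLE_MAX = 0.35
MEDIUM_MAX = 0.70


def pick_model(complexity: float) -> str:
    """Map a 0-1 complexity score to a model tier."""
    if complexity < SIMPLE_MAX:
        return "llama3.2"      # local Ollama, free
    if complexity <= MEDIUM_MAX:
        return "gpt-4o-mini"   # $0.15 / 1M tokens
    return "gpt-4o"            # $2.50 / 1M tokens
```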
## Development

```bash
# Run tests
poetry run pytest

# Run linting
poetry run ruff check src/
poetry run black src/

# Train the classifier
python -m ml.training.train --dataset easy2hard --epochs 5

# Export to ONNX
python -m ml.export.convert_to_onnx
```

## Contributing

Contributions are welcome! Please read our contributing guidelines first.

## License

MIT License - see [LICENSE](LICENSE) for details.