docs: add core features implementation plan
- Document planned improvements for streaming, retry logic, rate limiting
- Outline ML classifier training strategy
- Define success metrics and implementation order

CORE_FEATURES_PLAN.md (ADDED, +90 -0)

# Core Features Implementation Plan

This document outlines the core features being added to Cascade on Dec 29, 2025.

## Current State

✅ **Working:**
- Heuristic-based routing (keyword matching + query length)
- Exact-match caching (Redis)
- Semantic caching (Qdrant + embeddings)
- Cost tracking and analytics
- OpenAI & Ollama provider support
- Streamlit UI dashboard

❌ **Missing:**
- Streaming responses (SSE)
- Retry logic with fallbacks
- Rate limiting
- Trained ML classifier (ONNX model)

## Planned Improvements

### 1. Streaming Support
**Goal:** Add SSE (Server-Sent Events) streaming for real-time token generation

**Implementation** (see the sketch after this list):
- Add `stream_complete()` method to `LLMProvider` base class
- Implement streaming in OpenAI and Ollama providers
- Add streaming endpoint `/v1/chat/completions` with `stream=true` parameter
- Return SSE format: `data: {json}\n\n`
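
A minimal sketch of how these pieces could fit together, assuming the API is FastAPI-based (consistent with the `slowapi` plan below) and an OpenAI-style chunk format; `EchoProvider`, `LLMProvider`, and the request shape here are illustrative stand-ins, not the project's actual classes:

```python
# Sketch only: EchoProvider is a stand-in; real providers would wrap the OpenAI/Ollama streaming APIs.
import json
from typing import AsyncIterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse


class LLMProvider:
    def stream_complete(self, prompt: str) -> AsyncIterator[str]:
        """Yield response text one chunk at a time."""
        raise NotImplementedError


class EchoProvider(LLMProvider):
    async def stream_complete(self, prompt: str) -> AsyncIterator[str]:
        # Streams the prompt back word by word, standing in for a real model.
        for word in prompt.split():
            yield word + " "


app = FastAPI()
provider = EchoProvider()


@app.post("/v1/chat/completions")
async def chat_completions(body: dict):
    prompt = body["messages"][-1]["content"]

    async def sse() -> AsyncIterator[str]:
        # OpenAI-style SSE frames: one `data: {json}` line per chunk, then a DONE sentinel.
        async for chunk in provider.stream_complete(prompt):
            payload = {"choices": [{"delta": {"content": chunk}}]}
            yield f"data: {json.dumps(payload)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(sse(), media_type="text/event-stream")
```

The existing non-streaming path would remain the default whenever `stream=true` is not set on the request.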

**Benefits:**
- Better UX: users see tokens as they're generated
- Lower perceived latency
- Standard OpenAI API compatibility

### 2. Retry Logic with Fallbacks
**Goal:** Automatic retries and model fallbacks on failures

**Implementation** (see the sketch after this list):
- Add retry decorator with exponential backoff (tenacity library)
- Implement fallback chain: `gpt-4o` → `gpt-4o-mini` → `gpt-3.5-turbo`
- Handle rate limits, timeouts, and provider errors
- Track retry/fallback metrics
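
A rough shape for this, assuming tenacity for the per-model retries; `call_model` and `ProviderError` are placeholders for the real provider call and error types:

```python
# Sketch only: retries each model with backoff, then falls through the chain.
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

FALLBACK_CHAIN = ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]


class ProviderError(Exception):
    """Stand-in for rate-limit, timeout, and other provider errors."""


@retry(
    retry=retry_if_exception_type(ProviderError),
    wait=wait_exponential(multiplier=1, min=1, max=10),  # exponential backoff between attempts
    stop=stop_after_attempt(3),
    reraise=True,
)
def call_model(model: str, prompt: str) -> str:
    # Placeholder: the real implementation would call the selected provider here.
    raise ProviderError(f"{model} unavailable")


def complete_with_fallbacks(prompt: str) -> str:
    last_error: ProviderError | None = None
    for model in FALLBACK_CHAIN:
        try:
            return call_model(model, prompt)
        except ProviderError as exc:
            last_error = exc  # a retry/fallback metric could be recorded here
    raise RuntimeError("all models in the fallback chain failed") from last_error
```

Here tenacity owns the backoff for a single model, while the outer loop walks the fallback chain once retries are exhausted.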

**Benefits:**
- Higher reliability
- Graceful degradation
- Better error handling

### 3. Rate Limiting
**Goal:** Prevent abuse and manage costs

**Implementation** (see the sketch after this list):
- Add `slowapi` middleware for request rate limiting
- Implement token bucket algorithm
- Configurable limits per IP/API key
- Return `429 Too Many Requests` with `Retry-After` header
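
One way this could be wired up with `slowapi`; the `RATE_LIMIT` environment variable name and default value are illustrative, not settings the project defines yet:

```python
# Sketch only: per-IP limit read from an environment variable; slowapi returns 429 when exceeded.
import os

from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

RATE_LIMIT = os.getenv("RATE_LIMIT", "100/minute")

# headers_enabled asks slowapi to attach rate-limit / Retry-After response headers.
limiter = Limiter(key_func=get_remote_address, headers_enabled=True)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)


@app.post("/v1/chat/completions")
@limiter.limit(RATE_LIMIT)
async def chat_completions(request: Request):
    # slowapi needs the Request parameter to resolve the client IP.
    return {"status": "ok"}
```

Per-API-key limits would swap `get_remote_address` for a key function that reads the API key from the request.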

**Limits:**
- Default: 100 requests/minute per IP
- Configurable via environment variables

### 4. ML Classifier Training
**Goal:** Replace heuristics with trained DistilBERT model

**Implementation** (see the sketch after this list):
- Create training dataset from existing queries
- Fine-tune DistilBERT for 3-class classification (simple/medium/complex)
- Convert to ONNX for fast inference
- Add fallback to heuristics if model unavailable
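
A sketch of the inference side with the heuristic fallback; the model path, label set, length threshold, and ONNX input names are assumptions and would need to match the exported model:

```python
# Sketch only: ONNX inference for query complexity, falling back to the existing heuristics.
from pathlib import Path

import numpy as np

LABELS = ["simple", "medium", "complex"]
MODEL_PATH = Path("models/complexity_classifier.onnx")  # hypothetical location


def classify_heuristic(query: str) -> str:
    # Stand-in for the current keyword/length rules.
    return "complex" if len(query.split()) > 50 else "simple"


def classify(query: str) -> str:
    if not MODEL_PATH.exists():
        return classify_heuristic(query)  # model unavailable -> heuristics
    import onnxruntime as ort
    from transformers import AutoTokenizer

    # In real code the session and tokenizer would be loaded once and cached.
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    session = ort.InferenceSession(str(MODEL_PATH))
    encoded = tokenizer(query, return_tensors="np", truncation=True)
    logits = session.run(None, dict(encoded))[0]
    return LABELS[int(np.argmax(logits, axis=-1)[0])]
```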

**Benefits:**
- More accurate routing decisions
- Better cost optimization
- Learns from actual query patterns

## Implementation Order

1. **Streaming** - High impact, moderate complexity
2. **Retry logic** - Critical for reliability
3. **Rate limiting** - Quick win for production readiness
4. **ML classifier** - Lower priority, can use heuristics for now

## Success Metrics

- Streaming: perceived latency to first token < 500ms
- Retry: 99.9% request success rate
- Rate limiting: Zero abuse incidents
- ML classifier: >85% accuracy vs manual labels