ayushm98 commited on
Commit
7b1b1fd
·
1 Parent(s): c0e5180

docs: add core features implementation plan

- Document planned improvements for streaming, retry logic, rate limiting
- Outline ML classifier training strategy
- Define success metrics and implementation order

Files changed (1)
  1. CORE_FEATURES_PLAN.md +90 -0 (new file)
# Core Features Implementation Plan

This document outlines the core features being added to Cascade as of Dec 29, 2025.

## Current State

✅ **Working:**
- Heuristic-based routing (keyword matching + query length)
- Exact-match caching (Redis)
- Semantic caching (Qdrant + embeddings)
- Cost tracking and analytics
- OpenAI & Ollama provider support
- Streamlit UI dashboard

❌ **Missing:**
- Streaming responses (SSE)
- Retry logic with fallbacks
- Rate limiting
- Trained ML classifier (ONNX model)

## Planned Improvements

### 1. Streaming Support
**Goal:** Add Server-Sent Events (SSE) streaming for real-time token generation.

**Implementation:**
- Add a `stream_complete()` method to the `LLMProvider` base class
- Implement streaming in the OpenAI and Ollama providers
- Support the `stream=true` parameter on the `/v1/chat/completions` endpoint
- Return SSE-formatted chunks: `data: {json}\n\n`

**Benefits:**
- Better UX: users see tokens as they are generated
- Lower perceived latency
- Compatibility with the standard OpenAI API

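The SSE framing described above can be sketched as a provider-agnostic generator. This is a minimal illustration, not Cascade's actual code: `sse_stream` and the exact chunk payload fields are assumptions modeled on the OpenAI chunk format.

```python
import json
from typing import Iterator


def sse_stream(tokens: Iterator[str], model: str = "gpt-4o-mini") -> Iterator[str]:
    """Wrap raw token chunks in OpenAI-style SSE frames: `data: {json}\\n\\n`."""
    for token in tokens:
        payload = {
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [{"index": 0, "delta": {"content": token}}],
        }
        yield f"data: {json.dumps(payload)}\n\n"
    # OpenAI-compatible streams terminate with a literal [DONE] sentinel
    yield "data: [DONE]\n\n"
```

In a FastAPI app, a generator like this would typically be returned via `StreamingResponse(..., media_type="text/event-stream")`.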
### 2. Retry Logic with Fallbacks
**Goal:** Automatic retries and model fallbacks on failures.

**Implementation:**
- Add a retry decorator with exponential backoff (tenacity library)
- Implement the fallback chain: `gpt-4o` → `gpt-4o-mini` → `gpt-3.5-turbo`
- Handle rate limits, timeouts, and provider errors
- Track retry/fallback metrics

**Benefits:**
- Higher reliability
- Graceful degradation
- Better error handling

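The retry-plus-fallback flow can be sketched in plain Python; the real implementation would use tenacity's decorator as noted above, and `call_model` here is a hypothetical provider call, not an existing Cascade function.

```python
import time

FALLBACK_CHAIN = ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]


def complete_with_fallback(prompt, call_model, max_retries=3, base_delay=0.5):
    """Try each model in the chain; retry transient failures with exponential backoff.

    `call_model(model, prompt)` is assumed to raise on rate limits,
    timeouts, or provider errors, and return the completion text otherwise.
    """
    last_error = None
    for model in FALLBACK_CHAIN:
        for attempt in range(max_retries):
            try:
                return model, call_model(model, prompt)
            except Exception as exc:
                last_error = exc
                # exponential backoff: base_delay, 2x, 4x, ...
                time.sleep(base_delay * (2 ** attempt))
        # retries exhausted for this model -> fall back to the next one
    raise RuntimeError(f"all models in fallback chain failed: {last_error}")
```

Returning the model name alongside the result makes it easy to record the retry/fallback metrics the plan calls for.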
### 3. Rate Limiting
**Goal:** Prevent abuse and manage costs.

**Implementation:**
- Add `slowapi` middleware for request rate limiting
- Implement a token bucket algorithm
- Make limits configurable per IP / API key
- Return `429 Too Many Requests` with a `Retry-After` header

**Limits:**
- Default: 100 requests/minute per IP
- Configurable via environment variables

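The token bucket algorithm named above can be sketched standalone (the production path would go through `slowapi` middleware; this class is purely illustrative). The default limit of 100 requests/minute maps to `rate=100/60` tokens per second with a burst capacity of 100.

```python
import time


class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens/sec, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        # refill proportionally to elapsed time, capped at capacity
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        # caller would respond 429 Too Many Requests with a Retry-After header
        return False
```

One bucket per IP or API key (e.g. in a dict keyed by client identifier) gives the per-client limits described above.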
### 4. ML Classifier Training
**Goal:** Replace heuristics with a trained DistilBERT model.

**Implementation:**
- Create a training dataset from existing queries
- Fine-tune DistilBERT for 3-class classification (simple / medium / complex)
- Convert the model to ONNX for fast inference
- Fall back to heuristics if the model is unavailable

**Benefits:**
- More accurate routing decisions
- Better cost optimization
- Learns from actual query patterns

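The model-with-heuristic-fallback shape can be sketched as follows. Everything here is an assumption for illustration: `onnx_session.predict` stands in for a wrapper around an onnxruntime `InferenceSession`, and the keyword list and length thresholds are placeholders, not Cascade's actual heuristics.

```python
def classify_query(query: str, onnx_session=None) -> str:
    """Route a query to 'simple', 'medium', or 'complex'.

    Uses the ONNX model when available; otherwise falls back to
    keyword + query-length heuristics (illustrative values).
    """
    if onnx_session is not None:
        try:
            return onnx_session.predict(query)  # hypothetical model wrapper
        except Exception:
            pass  # model failure -> heuristic fallback
    words = query.split()
    complex_keywords = ("prove", "derive", "architect", "optimize")  # placeholder list
    if any(kw in query.lower() for kw in complex_keywords) or len(words) > 100:
        return "complex"
    if len(words) > 25:
        return "medium"
    return "simple"
```

Keeping the heuristic path as the fallback means routing keeps working even before the model ships, matching the implementation order below.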
## Implementation Order

1. **Streaming** - high impact, moderate complexity
2. **Retry logic** - critical for reliability
3. **Rate limiting** - quick win for production readiness
4. **ML classifier** - lower priority; heuristics suffice for now

## Success Metrics

- Streaming: perceived latency to first token < 500 ms
- Retry: 99.9% request success rate
- Rate limiting: zero abuse incidents
- ML classifier: >85% accuracy vs. manual labels