RAG API Analysis & Critique - Session 3 (Final)
This final session targets deep-level infrastructure bottlenecks, production resilience, and advanced UX patterns for a professional News Pipeline.
1. The Redundancy Bottleneck (Semantic Diversity)
- Critique: In news, a single event (e.g., "Market Crash") is covered by 50 sources. Semantic search will retrieve 10 chunks from 10 different sources that say the exact same thing.
- Reason: This fills the 3000-token context window with redundant info, preventing the LLM from seeing "The full picture" or diverse perspectives.
- Solution: Implement Diversity Filtering (Maximal Marginal Relevance - MMR). Instead of just "top K similarity", select chunks that are similar to the query but dissimilar to each other.
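The greedy MMR selection described above can be sketched in a few lines. This is a framework-free illustration (function names and the `lambda_mult` default are illustrative, not the actual `RerankerService` code): each step picks the candidate that maximizes `lambda * sim(doc, query) - (1 - lambda) * max_sim(doc, already_selected)`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.7):
    """Greedy Maximal Marginal Relevance: trade off query relevance
    against redundancy with documents already selected."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, float("-inf")
        for i in candidates:
            relevance = cosine(doc_vecs[i], query_vec)
            # Penalty: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected
```

With a low `lambda_mult`, a near-duplicate of an already-selected chunk loses to a less relevant but novel one, which is exactly the "10 sources saying the same thing" fix.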
2. Infrastructure Silos (ClickHouse-RAG Fusion)
- Critique: ClickHouse stores "Trends" and "Sentiment" for thousands of articles, but the RAG pipeline operates as an isolated silo.
- Reason: The LLM might answer a question about a person without knowing they are "Trending for Negative Sentiment" today.
- Solution: Inject Global Context Metadata. Before long-form generation, fetch a "Trend Snapshot" for the query's entities from ClickHouse and inject it into the prompt.
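A minimal sketch of the injection step, assuming a hypothetical trend-row schema (`entity`, `mentions`, `sentiment`) returned from a ClickHouse query; the formatting helper is illustrative, not the actual pipeline code:

```python
def build_trend_context(trends):
    """Format a ClickHouse trend snapshot into a prompt block.

    `trends` is assumed to be a list of rows like
    {"entity": "ACME Corp", "mentions": 120, "sentiment": -0.6}.
    Returns "" when no trends are active, so the base prompt is unchanged.
    """
    if not trends:
        return ""
    lines = ["Live trend context (last 24h):"]
    for row in trends:
        tone = "negative" if row["sentiment"] < 0 else "positive"
        lines.append(
            f"- {row['entity']}: {row['mentions']} mentions, "
            f"{tone} sentiment ({row['sentiment']:+.2f})"
        )
    return "\n".join(lines)
```

The resulting block is prepended to the system prompt before generation, so the LLM knows, for example, that the person it is summarizing is currently trending with negative sentiment.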
3. The "Wait-Time" UX Bottleneck (Streaming)
- Critique: Currently, the user waits for Retrieval -> Reranking -> Full Generation before seeing any text. This can take 3-5 seconds.
- Reason: Synchronous JSON responses are the standard for REST, but feel "slow" for chat.
- Solution: Implement Asynchronous Streaming (Server-Sent Events). Use FastAPI's `StreamingResponse` to stream tokens as GPT-4 generates them.
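The SSE framing itself is simple enough to show framework-free. In the app, this generator would be passed to FastAPI's `StreamingResponse(..., media_type="text/event-stream")`; the `[DONE]` sentinel is a common convention (used by the OpenAI streaming API), not a requirement of SSE itself:

```python
def sse_events(token_iter):
    """Wrap an iterator of LLM tokens into Server-Sent Events frames.

    Kept free of FastAPI imports so the framing logic is trivially
    testable; the endpoint just wraps this in a StreamingResponse.
    """
    for token in token_iter:
        # Each SSE event is a "data: <payload>" line followed by a blank line.
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"  # sentinel so the client knows the stream ended
```

Because the first tokens reach the browser as soon as generation starts, the perceived latency drops from the full 3-5 seconds to roughly the retrieval time alone.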
4. Production Resilience (Circuit Breakers)
- Critique: If Qdrant or the local Embedder fails, the `/chat` endpoint returns a generic error or hangs.
- Reason: Lack of fallback strategies for critical-path components.
- Solution: Implement Graceful Degradation. If Vector Search fails, fall back to a "Recent Headlines" keyword search in ClickHouse. If GPT-4 fails, return the raw retrieved sources with a "Summary Unavailable" message.
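The degradation path can be sketched as a thin wrapper. Here `vector_search` and `keyword_search` are injected callables standing in for the Qdrant and ClickHouse clients (hypothetical names, chosen so the failure path is testable without live services):

```python
def search_with_fallback(query, vector_search, keyword_search):
    """Graceful degradation: try the vector store first, fall back
    to a keyword search over recent headlines if it fails."""
    try:
        return {"source": "vector", "hits": vector_search(query)}
    except Exception:
        # Vector DB is down or timing out: serve recent headlines
        # instead of surfacing a 500 to the user.
        return {"source": "keyword_fallback", "hits": keyword_search(query)}
```

Tagging the response with its `source` lets the frontend show a "degraded results" notice, mirroring the "Summary Unavailable" message used when GPT-4 itself fails.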
5. Scaling: Index Quantization
- Critique: As the news corpus reaches millions of articles, Qdrant's RAM usage and search latency will spike due to BGE-M3's large 1024-dimensional vectors.
- Reason: Storing full-precision (float32) vectors is expensive.
- Solution: Enable Scalar Quantization (int8) or Binary Quantization in Qdrant. This reduces RAM usage by 4x-32x with minimal loss in precision.
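A configuration sketch using the `qdrant-client` Python API (collection name and URL are assumptions; this creates a new collection, whereas an existing one would be migrated via `update_collection`):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
)

client = QdrantClient(url="http://localhost:6333")  # assumed local deployment

client.create_collection(
    collection_name="news_articles",  # hypothetical collection name
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),  # BGE-M3 dims
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,  # ~4x RAM reduction vs float32
            quantile=0.99,         # clip outliers before quantizing
            always_ram=True,       # quantized vectors in RAM, originals on disk
        )
    ),
)
```

Qdrant keeps the original float32 vectors for optional rescoring, so recall loss from int8 quantization is typically small.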
Final Enhancement Roadmap
| Enhancement | Reason | Solution |
|---|---|---|
| Diversity Filter (MMR) | Context waste | Rerank for novelty, not just similarity. |
| Streaming Response | UX Latency | Use SSE to stream LLM tokens. |
| ClickHouse Insights | Hidden Metadata | Inject trend data into the prompt. |
| Circuit Breakers | Fault Tolerance | Fallback to keyword search on VDB failure. |
Implementation Details (Session 3)
As the final phase of this RAG evolution, I have implemented the following "State-of-the-Art" patterns:
1. Diversity Filtering (MMR)
- Status: Implemented
- Details: Added `apply_mmr` and `_get_simple_similarity` to `RerankerService`. After the initial Cross-Encoder rerank, the system now runs a Maximal Marginal Relevance pass to ensure that the top documents provide diverse information rather than repeated facts.
2. Streaming Responses (SSE)
- Status: Implemented
- Details: Added a new `/api/v1/rag/chat/stream` endpoint in `rag.py`. It uses FastAPI's `StreamingResponse` and LangChain's `.stream()` method to deliver answer tokens in real time to the frontend.
3. ClickHouse Trend Fusion
- Status: Implemented
- Details: The RAG pipeline now queries the `DataWarehouse` during the refinement stage. If active trends (entities and sentiment) are found in ClickHouse, they are injected into the LLM prompt, providing the assistant with "Live Context" beyond simple static retrieval.
4. Circuit Breaker Fallbacks
- Status: Implemented
- Details: Updated `VectorStore.search` to handle exceptions. In the event of a Qdrant service failure, the system automatically falls back to `fallback_keyword_search` in ClickHouse, ensuring the user gets some relevant headlines instead of an error.
5. Index Optimization
- Recommendation: As the collection grows, enable quantization (Scalar int8 or Binary, per Section 5) in the Qdrant configs. This has been noted in the analysis for future DevOps scaling.