
RAG API Analysis & Critique - Session 3 (Final)

This final session targets deep infrastructure bottlenecks, production resilience, and advanced UX patterns for a professional News Pipeline.

1. The Redundancy Bottleneck (Semantic Diversity)

  • Critique: In news, a single event (e.g., "Market Crash") is covered by 50 sources. Semantic search will retrieve 10 chunks from 10 different sources that say the exact same thing.
  • Reason: This fills the 3000-token context window with redundant info, preventing the LLM from seeing the full picture or diverse perspectives.
  • Solution: Implement Diversity Filtering (Maximal Marginal Relevance - MMR). Instead of just "top K similarity", select chunks that are similar to the query but dissimilar to each other.
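The selection logic above can be sketched as a minimal MMR pass. This is an illustrative sketch, not the project's actual code: `mmr_select`, `cosine`, and `lambda_param` are assumed names, and the vectors would in practice come from the embedder.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two raw (not necessarily unit) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(query_vec, doc_vecs, k=3, lambda_param=0.7):
    """Pick k document indices balancing query relevance against novelty.

    score(d) = lambda * sim(d, query) - (1 - lambda) * max sim(d, selected)
    """
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, float("-inf")
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: worst-case similarity to anything already chosen.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_param * relevance - (1 - lambda_param) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lambda_param=0.5` and a query spanning two topics, a duplicate of an already-selected chunk loses to a novel chunk of equal relevance, which is exactly the "similar to the query but dissimilar to each other" behavior described above.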

2. Infrastructure Silos (ClickHouse-RAG Fusion)

  • Critique: ClickHouse stores "Trends" and "Sentiment" for thousands of articles, but the RAG pipeline operates as an isolated silo.
  • Reason: The LLM might answer a question about a person without knowing they are "Trending for Negative Sentiment" today.
  • Solution: Inject Global Context Metadata. Before long-form generation, fetch a "Trend Snapshot" for the query's entities from ClickHouse and inject it into the prompt.
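A hedged sketch of what such a "Trend Snapshot" injection could look like. The table and column names (`news_trends`, `entity`, `sentiment`, `mentions`) and the `build_trend_context` helper are illustrative assumptions, not the project's actual schema:

```python
# Illustrative ClickHouse query (clickhouse-driver %(name)s parameter style);
# the schema is assumed, not taken from the project.
TREND_SQL = """
SELECT entity, avg(sentiment) AS sentiment, count() AS mentions
FROM news_trends
WHERE entity IN %(entities)s AND event_date = today()
GROUP BY entity
ORDER BY mentions DESC
LIMIT 5
"""

def build_trend_context(trend_rows):
    """Render trend rows into a prompt preamble for the LLM.

    Each row is expected to have 'entity', 'sentiment', and 'mentions' keys.
    Returns "" when no trends are active, so the prompt stays unchanged.
    """
    if not trend_rows:
        return ""
    lines = ["Live trend snapshot (today):"]
    for row in trend_rows:
        mood = "negative" if row["sentiment"] < 0 else "positive"
        lines.append(
            f"- {row['entity']}: {row['mentions']} mentions, {mood} sentiment"
        )
    return "\n".join(lines)
```

The resulting string would be prepended to the system prompt before generation, giving the LLM the "Trending for Negative Sentiment" signal the critique describes.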

3. The "Wait-Time" UX Bottleneck (Streaming)

  • Critique: Currently, the user waits for Retrieval -> Reranking -> Full Generation before seeing any text. This can take 3-5 seconds.
  • Reason: Synchronous JSON responses are the standard for REST, but feel "slow" for chat.
  • Solution: Implement Asynchronous Streaming (Server-Sent Events). Use FastAPI's StreamingResponse to stream tokens as GPT-4 generates them.
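The token-to-SSE framing can be sketched in a few lines. This is a minimal illustration: `sse_event` and `token_stream` are assumed names, and in the real endpoint the generator would wrap LangChain's `chain.stream(question)` and be handed to FastAPI's `StreamingResponse(..., media_type="text/event-stream")`.

```python
def sse_event(data: str) -> str:
    """Format one payload as a Server-Sent Events frame:
    a 'data:' field followed by a blank line."""
    return f"data: {data}\n\n"

def token_stream(tokens):
    """Yield each LLM token as an SSE frame, then a [DONE] sentinel.

    `tokens` stands in for the iterator returned by chain.stream(question);
    the client renders frames as they arrive instead of waiting 3-5 s.
    """
    for token in tokens:
        yield sse_event(token)
    yield sse_event("[DONE]")
```

The browser (or any `EventSource` client) appends each `data:` payload to the visible answer as it arrives, so the perceived latency drops to the time-to-first-token.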

4. Production Resilience (Circuit Breakers)

  • Critique: If Qdrant or the local Embedder fails, the /chat endpoint returns a generic error or hangs.
  • Reason: Lack of fallback strategies for critical path components.
  • Solution: Implement Graceful Degradation. If Vector Search fails, fall back to a "Recent Headlines" keyword search in ClickHouse. If GPT-4 fails, return the raw retrieved sources with a "Summary Unavailable" message.
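The degradation path can be sketched as a thin wrapper. `resilient_search`, `qdrant_search`, and `clickhouse_keyword_search` are placeholder names for the project's real clients, passed in here as callables so the sketch stays self-contained:

```python
def resilient_search(query, qdrant_search, clickhouse_keyword_search):
    """Try vector search first; on any failure, degrade to keyword search.

    Returns a dict tagging which path produced the hits, so the caller
    can, e.g., attach a "Summary Unavailable" notice on the fallback path.
    """
    try:
        return {"mode": "vector", "hits": qdrant_search(query)}
    except Exception:
        # Qdrant is down or timed out: fall back to recent headlines
        # matched by keyword in ClickHouse instead of surfacing an error.
        return {
            "mode": "keyword_fallback",
            "hits": clickhouse_keyword_search(query),
        }
```

A production version would catch narrower exception types and add a real circuit breaker (trip after N consecutive failures, probe periodically) rather than retrying Qdrant on every request.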

5. Scaling: Index Quantization

  • Critique: As the news corpus reaches millions of articles, Qdrant's RAM usage and search latency will spike due to BGE-M3's large vectors (1024 dim).
  • Reason: Storing full-precision (float32) vectors is expensive.
  • Solution: Enable Scalar Quantization (int8) or Binary Quantization in Qdrant. This reduces RAM usage by 4x-32x with minimal loss in precision.
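As a concrete illustration, a Qdrant collection payload enabling int8 scalar quantization might look like the following. The field names follow Qdrant's `quantization_config` API; the specific values are illustrative, not tuned for this corpus:

```json
{
  "vectors": { "size": 1024, "distance": "Cosine" },
  "quantization_config": {
    "scalar": {
      "type": "int8",
      "quantile": 0.99,
      "always_ram": true
    }
  }
}
```

Sent as the body of `PUT /collections/{name}`: `quantile: 0.99` clips outliers before quantizing, and `always_ram: true` keeps the compressed vectors in memory while the full-precision originals can live on disk.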

Final Enhancement Roadmap

| Enhancement | Reason | Solution |
| --- | --- | --- |
| Diversity Filter (MMR) | Context waste | Rerank for novelty, not just similarity. |
| Streaming Response | UX latency | Use SSE to stream LLM tokens. |
| ClickHouse Insights | Hidden metadata | Inject trend data into the prompt. |
| Circuit Breakers | Fault tolerance | Fall back to keyword search on VDB failure. |

Implementation Details (Session 3)

As the final phase of this RAG evolution, I have implemented the following "State-of-the-Art" patterns:

1. Diversity Filtering (MMR)

  • Status: Implemented
  • Details: Added apply_mmr and _get_simple_similarity to RerankerService. After the initial Cross-Encoder rerank, the system now runs a Maximal Marginal Relevance pass to ensure that the top documents provide diverse information rather than repeated facts.

2. Streaming Responses (SSE)

  • Status: Implemented
  • Details: Added a new /api/v1/rag/chat/stream endpoint in rag.py. It uses FastAPI's StreamingResponse and LangChain's .stream() method to deliver answer tokens in real-time to the frontend.

3. ClickHouse Trend Fusion

  • Status: Implemented
  • Details: The RAG pipeline now queries the DataWarehouse during the refinement stage. If active trends (entities and sentiment) are found in ClickHouse, they are injected into the LLM prompt, providing the assistant with "Live Context" beyond simple static retrieval.

4. Circuit Breaker Fallbacks

  • Status: Implemented
  • Details: Updated VectorStore.search to handle exceptions. In the event of a Qdrant service failure, the system automatically falls back to fallback_keyword_search in ClickHouse, ensuring the user gets some relevant headlines instead of an error.

5. Index Optimization

  • Recommendation: As the collection grows, enable quantization (Scalar int8, or Product Quantization for maximum compression) in the Qdrant collection config. This has been noted in the analysis for future DevOps scaling.