RAG API Analysis & Critique - Session 3 (Final)
This final session targets deep-level infrastructure bottlenecks, production resilience, and advanced UX patterns for a professional News Pipeline.
1. The Redundancy Bottleneck (Semantic Diversity)
- Critique: In news, a single event (e.g., "Market Crash") is covered by 50 sources. Semantic search will retrieve 10 chunks from 10 different sources that say the exact same thing.
- Reason: This fills the 3000-token context window with redundant info, preventing the LLM from seeing "The full picture" or diverse perspectives.
- Solution: Implement Diversity Filtering (Maximal Marginal Relevance - MMR). Instead of just "top K similarity", select chunks that are similar to the query but dissimilar to each other.
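The greedy MMR selection described above can be sketched in a few lines. This is a framework-free illustration (function names and the `lambda_mult` default are illustrative, not the actual `RerankerService` code): each step picks the candidate that maximizes `lambda * sim(doc, query) - (1 - lambda) * max_sim(doc, already_selected)`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.7):
    """Greedy Maximal Marginal Relevance: trade off query relevance
    against redundancy with documents already selected."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, float("-inf")
        for i in candidates:
            relevance = cosine(doc_vecs[i], query_vec)
            # Penalty: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected
```

With a low `lambda_mult`, a near-duplicate of an already-selected chunk loses to a less relevant but novel one, which is exactly the "10 sources saying the same thing" fix.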
2. Infrastructure Silos (ClickHouse-RAG Fusion)
- Critique: ClickHouse stores "Trends" and "Sentiment" for thousands of articles, but the RAG pipeline operates as an isolated silo.
- Reason: The LLM might answer a question about a person without knowing they are "Trending for Negative Sentiment" today.
- Solution: Inject Global Context Metadata. Before long-form generation, fetch a "Trend Snapshot" for the query's entities from ClickHouse and inject it into the prompt.
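A minimal sketch of the injection step, assuming a hypothetical trend-row schema (`entity`, `mentions`, `sentiment`) returned from a ClickHouse query; the formatting helper is illustrative, not the actual pipeline code:

```python
def build_trend_context(trends):
    """Format a ClickHouse trend snapshot into a prompt block.

    `trends` is assumed to be a list of rows like
    {"entity": "ACME Corp", "mentions": 120, "sentiment": -0.6}.
    Returns "" when no trends are active, so the base prompt is unchanged.
    """
    if not trends:
        return ""
    lines = ["Live trend context (last 24h):"]
    for row in trends:
        tone = "negative" if row["sentiment"] < 0 else "positive"
        lines.append(
            f"- {row['entity']}: {row['mentions']} mentions, "
            f"{tone} sentiment ({row['sentiment']:+.2f})"
        )
    return "\n".join(lines)
```

The resulting block is prepended to the system prompt before generation, so the LLM knows, for example, that the person it is summarizing is currently trending with negative sentiment.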
3. The "Wait-Time" UX Bottleneck (Streaming)
- Critique: Currently, the user waits for Retrieval -> Reranking -> Full Generation before seeing any text. This can take 3-5 seconds.
- Reason: Synchronous JSON responses are the standard for REST, but feel "slow" for chat.
- Solution: Implement Asynchronous Streaming (Server-Sent Events). Use FastAPI's `StreamingResponse` to stream tokens as GPT-4 generates them.
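The SSE framing itself is simple enough to show framework-free. In the app, this generator would be passed to FastAPI's `StreamingResponse(..., media_type="text/event-stream")`; the `[DONE]` sentinel is a common convention (used by the OpenAI streaming API), not a requirement of SSE itself:

```python
def sse_events(token_iter):
    """Wrap an iterator of LLM tokens into Server-Sent Events frames.

    Kept free of FastAPI imports so the framing logic is trivially
    testable; the endpoint just wraps this in a StreamingResponse.
    """
    for token in token_iter:
        # Each SSE event is a "data: <payload>" line followed by a blank line.
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"  # sentinel so the client knows the stream ended
```

Because the first tokens reach the browser as soon as generation starts, the perceived latency drops from the full 3-5 seconds to roughly the retrieval time alone.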
4. Production Resilience (Circuit Breakers)
- Critique: If Qdrant or the local Embedder fails, the `/chat` endpoint returns a generic error or hangs.
- Reason: Lack of fallback strategies for critical-path components.
- Solution: Implement Graceful Degradation. If Vector Search fails, fall back to a "Recent Headlines" keyword search in ClickHouse. If GPT-4 fails, return the raw retrieved sources with a "Summary Unavailable" message.
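The degradation path can be sketched as a thin wrapper. Here `vector_search` and `keyword_search` are injected callables standing in for the Qdrant and ClickHouse clients (hypothetical names, chosen so the failure path is testable without live services):

```python
def search_with_fallback(query, vector_search, keyword_search):
    """Graceful degradation: try the vector store first, fall back
    to a keyword search over recent headlines if it fails."""
    try:
        return {"source": "vector", "hits": vector_search(query)}
    except Exception:
        # Vector DB is down or timing out: serve recent headlines
        # instead of surfacing a 500 to the user.
        return {"source": "keyword_fallback", "hits": keyword_search(query)}
```

Tagging the response with its `source` lets the frontend show a "degraded results" notice, mirroring the "Summary Unavailable" message used when GPT-4 itself fails.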
5. Scaling: Index Quantization
- Critique: As the news corpus reaches millions of articles, Qdrant's RAM usage and search latency will spike due to BGE-M3's large 1024-dimensional vectors.
- Reason: Storing full-precision (float32) vectors is expensive.
- Solution: Enable Scalar Quantization (int8) or Binary Quantization in Qdrant. This reduces RAM usage by 4x-32x with minimal loss in precision.
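A configuration sketch using the `qdrant-client` Python API (collection name and URL are assumptions; this creates a new collection, whereas an existing one would be migrated via `update_collection`):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
)

client = QdrantClient(url="http://localhost:6333")  # assumed local deployment

client.create_collection(
    collection_name="news_articles",  # hypothetical collection name
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),  # BGE-M3 dims
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,  # ~4x RAM reduction vs float32
            quantile=0.99,         # clip outliers before quantizing
            always_ram=True,       # quantized vectors in RAM, originals on disk
        )
    ),
)
```

Qdrant keeps the original float32 vectors for optional rescoring, so recall loss from int8 quantization is typically small.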
Final Enhancement Roadmap
| Enhancement | Reason | Solution |
|---|---|---|
| Diversity Filter (MMR) | Context waste | Rerank for novelty, not just similarity. |
| Streaming Response | UX Latency | Use SSE to stream LLM tokens. |
| ClickHouse Insights | Hidden Metadata | Inject trend data into the prompt. |
| Circuit Breakers | Fault Tolerance | Fallback to keyword search on VDB failure. |
Implementation Details (Session 3)
As the final phase of this RAG evolution, I have implemented the following "State-of-the-Art" patterns:
1. Diversity Filtering (MMR)
- Status: Implemented
- Details: Added `apply_mmr` and `_get_simple_similarity` to `RerankerService`. After the initial Cross-Encoder rerank, the system now runs a Maximal Marginal Relevance pass to ensure that the top documents provide diverse information rather than repeated facts.
2. Streaming Responses (SSE)
- Status: Implemented
- Details: Added a new `/api/v1/rag/chat/stream` endpoint in `rag.py`. It uses FastAPI's `StreamingResponse` and LangChain's `.stream()` method to deliver answer tokens in real time to the frontend.
3. ClickHouse Trend Fusion
- Status: Implemented
- Details: The RAG pipeline now queries the `DataWarehouse` during the refinement stage. If active trends (entities and sentiment) are found in ClickHouse, they are injected into the LLM prompt, providing the assistant with "Live Context" beyond simple static retrieval.
4. Circuit Breaker Fallbacks
- Status: Implemented
- Details: Updated `VectorStore.search` to handle exceptions. In the event of a Qdrant service failure, the system automatically falls back to `fallback_keyword_search` in ClickHouse, ensuring the user gets some relevant headlines instead of an error.
5. Index Optimization
- Recommendation: As the collection grows, enable quantization (Scalar int8 or Binary, per Section 5) in the Qdrant configs. This has been noted in the analysis for future DevOps scaling.