rag-api-node-1 / docs /ANALYSIS_TWO.md

Peterase

feat(rag): implement hybrid search with live sources and production-grade intent classification

a63c61f 24 days ago

RAG API Analysis & Critique - Session 2

Following the initial improvements, this document explores deeper architectural gaps and "Phase 2" optimizations for the News Pipeline RAG system.

1. The Sparse-Vector Gap (Hybrid Search)

Critique: The embedding-service is already configured to produce both dense and sparse vectors (via BGE-M3 or SPLADE). However, the rag-api currently ignores the sparse vectors.
Reason: Sparse vectors excel at exact-match, keyword-heavy queries (e.g., specific names, dates, or product codes), where dense embeddings can score the matching document lower than loosely related ones.
Solution: Implement true hybrid search in the VectorStore. The API should request both vectors and fuse the dense and sparse result lists at the Qdrant level, for example with Reciprocal Rank Fusion (RRF).
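
To make the fusion step concrete, here is a minimal client-side sketch of Reciprocal Rank Fusion. It is an illustration only: the function name rrf_fuse, the doc ids, and the constant k=60 (a commonly used default) are assumptions, not code from rag-api, and the Solution above calls for doing the fusion inside Qdrant rather than in Python.

```python
from collections import defaultdict


def rrf_fuse(dense_ids, sparse_ids, k=60, top_n=5):
    """Fuse two ranked lists of doc ids with Reciprocal Rank Fusion.

    Each list is ordered best-first. A document earns 1 / (k + rank)
    from every list it appears in; the contributions are summed.
    """
    scores = defaultdict(float)
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked[:top_n]]


# Hypothetical result lists from a dense and a sparse search.
dense = ["a", "b", "c"]
sparse = ["c", "a", "d"]
fused = rrf_fuse(dense, sparse)
```

Because RRF only looks at ranks, it needs no score normalization between the dense and sparse searches, which is the main reason it is a common first choice for hybrid retrieval.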

2. Temporal Context (The "News" Recency Problem)

Critique: News is highly time-sensitive. A query about "The election" asked in 2026 should prioritize articles from 2026, not 2022. The current retrieval logic treats all vectors as time-agnostic.
Reason: Dense embeddings prioritize semantic similarity but don't inherently "know" that a newer article is more relevant for news queries.
Solution: Implement Temporal Filtering and Recency Boosting. Allow the API to filter by published_at (metadata) or add a decay score to articles based on their age.

3. Cold-Start Performance & Model Loading

Critique: The EmbedderService and RerankerService use lazy loading (if self.model is None: self._load_model()). This causes the very first request handled by each worker to hang for several seconds while multi-gigabyte models are loaded into RAM.
Reason: Synchronous loading blocks the first user's request.
Solution: Async Pre-warming. Trigger model loading from the FastAPI startup hook (on_event("startup")), ideally in a background thread, so the API starts accepting requests immediately.

4. Feedback Attribution Gap

Critique: While a Feedback table exists, there is no direct foreign key or mapping between a user's "Thumbs Up/Down" and the specific sources (doc_ids) that were retrieved for that answer.
Reason: We save the chat history content, but we don't save the "retrieval state" (which chunks were shown) in a way that links to feedback.
Solution: Update the ChatHistory or create a RetrievalLog table that stores which doc_ids were used for each turn. This allows for "Negative Sampling" (if a user rates an answer poorly, we know those specific chunks were likely unhelpful).

5. Dynamic Chunking & Small-to-Big Retrieval

Critique: Articles are chunked into fixed-size segments. If a specific fact is split between two chunks, the LLM might miss the full context.
Reason: Fixed chunking is simple but brittle.
Solution: Implement Parent Document Retrieval. Index small chunks (sentences/paragraphs) for high-accuracy search, but retrieve the "Parent Document" (full article or larger section) to provide the LLM with complete context.

Proposed Enhancement Plan

Phase 1: Robustness (Immediate)

Add tiktoken for context window management.
Implement query rewriting for better multi-turn retrieval.
Add explicit error handling for embedding model loading failures.

Phase 2: Retrieval Quality (Intermediate)

Configure Qdrant for a greater search depth.
Integrate a Cross-Encoder for Re-ranking retrieved articles.
True Hybrid Search: Implemented structure for Dense + Sparse vectors.
Temporal Recency: Implemented decay-based scoring for news relevance.

Phase 3: Developer Experience

Async Pre-warming: Implemented background model loading on startup.
Retrieval Traceability: Added retrieved_doc_ids to chat history.
Parent Doc Retrieval: Added full-context fetching for high-score chunks.

Conclusion

With these changes, the RAG system handles conversational context, prioritizes recent news, improves precision via re-ranking, and keeps a traceability loop from user feedback back to the retrieved documents. Hybrid search is the remaining gap: the dense and sparse structure is in place, but sparse vector merging is not yet part of VectorStore.search.

Implementation Details (Session 2)

The Session 2 enhancements were implemented as follows:

1. Hybrid Search (Dense + Sparse)

Status: Hybrid-Ready
Details: Updated EmbedderService to return a dictionary with both dense and sparse vector slots. VectorStore.search currently performs dense search only, but is structured so that sparse vector merging can be added later.

2. Temporal Context (Recency Bias)

Status: Implemented
Details: In rag.py, a score_multiplier is calculated for each document based on its published_at date. Articles published today get a multiplier of 1.0, which decays linearly over 60 days to a minimum of 0.5. This biases the ranking toward newer articles without discarding older ones.
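
The decay described above can be sketched as follows. This is not the code from rag.py: the function name recency_multiplier and the constants are illustrative, and it assumes published_at is a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone

DECAY_DAYS = 60        # age at which the multiplier reaches its floor
MIN_MULTIPLIER = 0.5   # floor applied to articles older than DECAY_DAYS


def recency_multiplier(published_at, now=None):
    """Linear decay from 1.0 (published now) to 0.5 (60 days old or more).

    Assumes published_at (and now, if given) are timezone-aware;
    mixing naive and aware datetimes raises TypeError.
    """
    now = now or datetime.now(timezone.utc)
    # Clamp future-dated articles to age 0 so the multiplier never exceeds 1.0.
    age_days = max((now - published_at).total_seconds() / 86400.0, 0.0)
    decay = (1.0 - MIN_MULTIPLIER) * min(age_days / DECAY_DAYS, 1.0)
    return 1.0 - decay


# Example: scale a similarity score for a 30-day-old article.
reference = datetime(2026, 1, 31, tzinfo=timezone.utc)
boosted = 0.9 * recency_multiplier(reference - timedelta(days=30), now=reference)
```

The floor of 0.5 means an old article with a very strong semantic match can still outrank a weak recent one, which is usually preferable to a hard date cutoff.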

3. Cold-Start Pre-warming

Status: Implemented
Details: Modified the main.py startup event to launch a background thread (threading.Thread) that triggers model loading for the embedder and reranker. The API starts accepting requests immediately, and the models are usually loaded before the first real query arrives.
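
A minimal sketch of the pre-warming pattern, using only the standard library. LazyModel is a hypothetical stand-in for EmbedderService and RerankerService, and prewarm is an illustrative helper, not the function in main.py. The lock is an added assumption: it keeps the warm-up thread and an early first request from loading the same model twice.

```python
import threading
import time


class LazyModel:
    """Hypothetical stand-in for a service that lazily loads a large model."""

    def __init__(self, name, load_seconds=0.05):
        self.name = name
        self.model = None
        self._load_seconds = load_seconds
        self._lock = threading.Lock()

    def _load_model(self):
        time.sleep(self._load_seconds)  # simulates reading weights from disk
        self.model = f"{self.name}-weights"

    def ensure_loaded(self):
        # Serialize loading so concurrent callers never load twice.
        with self._lock:
            if self.model is None:
                self._load_model()
        return self.model


def prewarm(services):
    """Load every service in a daemon thread and return the thread."""

    def _warm():
        for service in services:
            try:
                service.ensure_loaded()
            except Exception as exc:  # keep warming the remaining services
                print(f"pre-warm failed for {service.name}: {exc}")

    thread = threading.Thread(target=_warm, name="model-prewarm", daemon=True)
    thread.start()
    return thread


embedder = LazyModel("embedder")
reranker = LazyModel("reranker")
warm_thread = prewarm([embedder, reranker])  # returns immediately
```

In a FastAPI app, prewarm would be called from the startup hook. Recent FastAPI versions recommend a lifespan handler over on_event("startup"), so that may be worth checking against the version in use.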

4. Feedback Attribution

Status: Implemented
Details: Added a retrieved_doc_ids JSON column to the ChatHistory model. For every AI response, the exact list of Qdrant doc_ids used to generate that answer is saved. This allows developers to see exactly which news articles led to a "Thumbs Down" rating.
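
The attribution loop can be illustrated with an in-memory SQLite database. Only the retrieved_doc_ids column comes from the description above; the table layouts, the chat_history_id link on feedback, the rating convention (+1 / -1), and the helper names are assumptions made for this sketch.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE chat_history (
        id INTEGER PRIMARY KEY,
        role TEXT NOT NULL,
        content TEXT NOT NULL,
        retrieved_doc_ids TEXT NOT NULL DEFAULT '[]'  -- JSON list of doc ids
    );
    CREATE TABLE feedback (
        id INTEGER PRIMARY KEY,
        chat_history_id INTEGER NOT NULL REFERENCES chat_history(id),
        rating INTEGER NOT NULL  -- assumed convention: +1 up, -1 down
    );
    """
)


def save_answer(content, doc_ids):
    """Store an assistant turn together with the doc ids that produced it."""
    cur = conn.execute(
        "INSERT INTO chat_history (role, content, retrieved_doc_ids) VALUES (?, ?, ?)",
        ("assistant", content, json.dumps(doc_ids)),
    )
    return cur.lastrowid


def save_feedback(chat_history_id, rating):
    conn.execute(
        "INSERT INTO feedback (chat_history_id, rating) VALUES (?, ?)",
        (chat_history_id, rating),
    )


def docs_behind_negative_feedback():
    """Count how often each doc id appeared in a thumbs-down answer."""
    rows = conn.execute(
        "SELECT h.retrieved_doc_ids FROM feedback f "
        "JOIN chat_history h ON h.id = f.chat_history_id "
        "WHERE f.rating < 0 ORDER BY f.id"
    ).fetchall()
    counts = {}
    for (raw,) in rows:
        for doc_id in json.loads(raw):
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return counts


good = save_answer("answer A", ["doc-1", "doc-2"])
bad = save_answer("answer B", ["doc-2", "doc-3"])
save_feedback(good, 1)
save_feedback(bad, -1)
negative_counts = docs_behind_negative_feedback()
```

Documents that accumulate many negative counts across sessions are candidates for negative sampling or for a manual quality review.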

5. Parent Document Retrieval

Status: Implemented
Details: Added a "Small-to-Big" retrieval logic in rag.py. If a specific chunk achieves a rerank score > 0.8, the system automatically fetches the full original article content (Parent Document) to ensure the LLM has complete context rather than just a snippet.
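
The small-to-big rule above can be sketched like this. The chunk dictionary keys (article_id, rerank_score, text), the fetch_article callback, and the de-duplication of articles are assumptions for illustration; only the 0.8 threshold comes from the description. The sketch also assumes the chunks arrive sorted by rerank score, best first.

```python
PARENT_THRESHOLD = 0.8  # rerank score above which the full article is used


def build_context(reranked_chunks, fetch_article):
    """Return context passages, swapping high-scoring chunks for their parent.

    reranked_chunks: list of dicts sorted by rerank_score, best first.
    fetch_article: callable mapping an article id to its full text.
    """
    context = []
    expanded = set()  # articles already included in full
    for chunk in reranked_chunks:
        article_id = chunk["article_id"]
        if article_id in expanded:
            continue  # the full article already covers this chunk
        if chunk["rerank_score"] > PARENT_THRESHOLD:
            expanded.add(article_id)
            context.append(fetch_article(article_id))
        else:
            context.append(chunk["text"])
    return context


# Hypothetical data: two articles, four reranked chunks.
articles = {"a1": "full article one", "a2": "full article two"}
chunks = [
    {"article_id": "a1", "rerank_score": 0.93, "text": "snippet 1a"},
    {"article_id": "a1", "rerank_score": 0.85, "text": "snippet 1b"},
    {"article_id": "a2", "rerank_score": 0.42, "text": "snippet 2a"},
    {"article_id": "a1", "rerank_score": 0.30, "text": "snippet 1c"},
]
context = build_context(chunks, articles.__getitem__)
```

Skipping chunks whose parent is already included keeps the prompt from carrying the same text twice, which matters once full articles start consuming the context window.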