# State-of-the-Art (SOTA) RAG Retrieval Data Flow This document details the end-to-end data flow of the News Pipeline RAG API, incorporating advanced patterns for accuracy, diversity, and production resilience. ## 1. Pre-Processing & Infrastructure (The "Cold-Start" Layer) To ensure **zero-latency** during the initial user interaction, the system implements a preemptive resource loading strategy. ### A. Async Pre-warming (Hidden Latency Absorption) - **Challenge**: Large Transformer models (like BGE-M3 and Cross-Encoders) typically take 5–15 seconds to load from disk to RAM/VRAM. Lazy-loading these on the first request creates an unacceptable user experience. - **Process**: - In `main.py`, the `@app.on_event("startup")` hook triggers a non-blocking `threading.Thread`. - This background thread immediately initializes `EmbedderService` and `RerankerService`. - By the time the web server is live and the user types their first query, the models are fully resident in memory, resulting in sub-second response times for the very first request. ### B. Circuit Breaker: ClickHouse Fallback (Always-On Reliability) - **Challenge**: Vector databases like Qdrant can occasionally experience network partitions or downtime. In a naive RAG, this would crash the conversation. - **Process**: - The `VectorStore.search` method is wrapped in a robust `try-except` block. - If the Qdrant client connection fails or a timeout occurs, the **Circuit Breaker** trips. - The system automatically redirects the query to `fallback_keyword_search()` in ClickHouse. - **Mechanism**: It performs a rapid SQL-based keyword search on titles and content in the `sentiment_results` table. While less semantically accurate than vectors, it ensures the user receives actual relevant news articles instead of a "Service Unavailable" error. ## 2. Request Phase (Conversational Logic) ### Step A: Query Transformation (Contextual Synthesis) **Purpose**: Bridging the gap between human conversation and vector search requirements. - **The Problem**: Users often ask relative questions like *"What about their stock?"*. Vector databases cannot resolve "their" without context. - **Process**: - The API retrieves the last 6 messages from PostgreSQL. - A specialized prompt instructs `GPT-4` to synthesize the conversation history and the new user query into a single **Standalone Search Query**. - If history is empty, the original query is used. - **Example Trace**: - **History**: `User: Tell me about Nvidia's revenue last year.` - **New Query**: `User: Did Intel do better?` - **Synthesized Search Query**: *"Comparison of Intel and Nvidia's revenue for the last fiscal year"* ### Step B: Intent-Based Search (Hybrid & Recency) **Purpose**: Combining semantic depth with keyword precision and news freshness. #### 1. Hybrid Vector Synthesis - **Dense Layer**: Uses `BAAI/bge-m3` to produce a 1024-dimensional semantic embedding. This handles "vibe" and "concept" matching (e.g., matching "financial struggle" to "bankruptcy"). - **Sparse Layer**: Prepares slots for keyword-specific vectors (e.g., Splade or BGE-M3 Sparse). This handles exact entities, ticker symbols (e.g., "NVDA"), or specific dates that dense embeddings might blur. #### 2. Temporal Decay (Recency Boosting) - **Logic**: News is a deteriorating asset. The system applies a **Recency Multiplier** during the retrieval collection phase. - **Formula**: `Score = Base_Similarity * (1.0 - (days_old / 60))`. - **Constraint**: The multiplier never drops below `0.5`, ensuring that very relevant historical news is still retrievable but newer coverage is naturally prioritized. - **Example**: - Article A (Identical match, 60 days old): `Final Score = 0.9 * 0.5 = 0.45` - Article B (Close match, today): `Final Score = 0.8 * 1.0 = 0.8` - **Result**: Article B is ranked higher despite slightly lower semantic similarity. ## 3. Retrieval Refinement (The "Precision" Layer) ### Step C: Cross-Encoder Reranking (Relevance Grading) **Purpose**: Moving from "Bi-Encoder" (fast but broad) to "Cross-Encoder" (slow but highly accurate). - **The Problem**: Dense embeddings (Bi-Encoders) are great at finding "similar" text but often struggle with fine-grained nuances or contradictory statements. - **Process**: - The system takes the **Top 20** results from the broad search. - Each [Query, Chunk] pair is passed through the `CrossEncoder` model (`ms-marco-TinyBERT-L-2-v2`). - The model produces a raw relevance score. This is significantly more accurate than pure cosine similarity from the vector search. ### Step D: Diversity Filtering - MMR (Information Density) **Purpose**: Preventing "Echo Chambers" or redundant context windows. - **The Problem**: Five news articles starting with the same AP wire sentence will fill the LLM context with redundant text. - **Process**: - Implemented **Maximal Marginal Relevance (MMR)**. - Logic selects documents that have high relevance but **low similarity** to already selected documents. - **Example**: - *Selection 1*: A factual report of a merger. - *Selection 2 (Rejected)*: Another factual report of the same merger. - *Selection 2 (Accepted)*: A financial analyst's opinion on the same merger. ### Step E: Parent Document Retrieval (Context Expansion) **Purpose**: Providing the "Full Picture" when a snippet isn't enough. - **Process**: - Small chunks (~500 chars) are indexed for surgical search accuracy. - If a chunk's rerank score is **> 0.8**, its unique `doc_id` is used to fetch the full parent article body from ClickHouse/Qdrant. - This allows the LLM to see the surrounding context that might have been lost in the chunking process. --- ## 4. Generation & Enrichment ### Step F: ClickHouse Trend Fusion (External Intelligence) **Purpose**: Grounding the LLM in real-time metadata. - **Process**: - Parallel to the LLM call, the system queries the **ClickHouse Data Warehouse**. - It extracts trending entities and sentiment scores for the last 3 days relevant to the query. - This "Trend Knowledge" is injected into the system prompt. - **Benefit**: The LLM can say: *"Retrieval articles show X, but ClickHouse trends show that sentiment for this topic is currently shifting negative."* ### Step G: Streaming Generation - SSE (Real-Time UX) **Purpose**: Minimizing "Perceived Latency". - **Process**: - Uses FastAPI `StreamingResponse` and Server-Sent Events (SSE). - Instead of waiting 5 seconds for a full paragraph, the first token is displayed within **200-400ms**. - Tokens are pushed to the client in real-time as the LLM predicts them. --- ## 5. Traceability & Feedback Loop ### Step H: Interaction Logging (Audit Trail) - **Traceability**: Every AI response logs the exact list of `retrieved_doc_ids` (Source IDs) in PostgreSQL. - **Learning Loop**: When a user gives a "Thumbs Down", developers can query the database to see exactly which sources were used. This allows for **Negative Sampling** (identifying which articles cause hallucination or bad answers). --- ## Technical Stack Overview | Stage | Tool/Model | | :--- | :--- | | **Embeddings** | `BAAI/bge-m3` (BAAI) | | **Reranking** | `ms-marco-TinyBERT-L-2-v2` (CrossEncoder) | | **Diversity** | Custom MMR Implementation | | **Vector DB** | Qdrant | | **Data Warehouse**| ClickHouse | | **Token Control** | `tiktoken` (cl100k_base) | | **LLM** | OpenAI `gpt-4` | --- ## Full Data Flow Visual ```mermaid graph TD User((User)) -->|Query| API[RAG API] API -->|Prompt| LLM_Rewriter[LLM Rewriter] LLM_Rewriter -->|Standalone Query| API API -.->|Circuit Breaker Check| VDB{Qdrant Online?} VDB -->|No| CH_FB[ClickHouse Keyword Fallback] VDB -->|Yes| V_Search[Hybrid Vector Search] V_Search -->|Top 20| Rerank[Cross-Encoder Reranker] Rerank -->|Diversity Pass| MMR[MMR Filter] MMR -->|Top K| Parent_Fetch[Parent Doc Retrieval] Parent_Fetch -->|Context| Prompt_Build[Prompt Construction] Prompt_Build -->|Inject| CH_Trends[ClickHouse Trends] CH_Trends -->|Full Prompt| LLM_Stream[LLM Streaming] LLM_Stream -->|SSE Tokens| User LLM_Stream -->|Trace| Postgres[(Interaction DB)] ```