State-of-the-Art (SOTA) RAG Retrieval Data Flow
This document details the end-to-end data flow of the News Pipeline RAG API, incorporating advanced patterns for accuracy, diversity, and production resilience.
1. Pre-Processing & Infrastructure (The "Cold-Start" Layer)
To eliminate cold-start latency during the initial user interaction, the system implements a preemptive resource-loading strategy.
A. Async Pre-warming (Hidden Latency Absorption)
- Challenge: Large Transformer models (like BGE-M3 and Cross-Encoders) typically take 5–15 seconds to load from disk to RAM/VRAM. Lazy-loading these on the first request creates an unacceptable user experience.
- Process:
  - In `main.py`, the `@app.on_event("startup")` hook triggers a non-blocking `threading.Thread`.
  - This background thread immediately initializes `EmbedderService` and `RerankerService`.
  - By the time the web server is live and the user types their first query, the models are fully resident in memory, resulting in sub-second response times for the very first request.
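A minimal sketch of this pattern, assuming `EmbedderService` and `RerankerService` do their heavy model loading in their constructors (the `services` module path is hypothetical):

```python
# main.py (sketch) -- pre-warm heavy models in a background thread at startup
import threading

from fastapi import FastAPI

app = FastAPI()
models = {}  # populated by the warm-up thread

def _warm_up():
    # Assumed services: their __init__ loads BGE-M3 / the cross-encoder into memory.
    from services import EmbedderService, RerankerService  # hypothetical module
    models["embedder"] = EmbedderService()
    models["reranker"] = RerankerService()

@app.on_event("startup")
def prewarm_models():
    # daemon=True so the thread never blocks server startup or shutdown
    threading.Thread(target=_warm_up, daemon=True).start()
```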
B. Circuit Breaker: ClickHouse Fallback (Always-On Reliability)
- Challenge: Vector databases like Qdrant can occasionally experience network partitions or downtime. In a naive RAG system, this would bring the conversation down with a hard error.
- Process:
  - The `VectorStore.search` method is wrapped in a robust `try-except` block.
  - If the Qdrant client connection fails or a timeout occurs, the circuit breaker trips.
  - The system automatically redirects the query to `fallback_keyword_search()` in ClickHouse.
- Mechanism: It performs a rapid SQL-based keyword search on titles and content in the `sentiment_results` table. While less semantically accurate than vectors, this ensures the user receives actual relevant news articles instead of a "Service Unavailable" error.
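A minimal sketch of the fallback wiring; `VectorStore.search`, `fallback_keyword_search()`, and the `sentiment_results` table come from the description above, while the collection name, client setup, and query body are illustrative assumptions:

```python
import clickhouse_connect
from qdrant_client import QdrantClient

class VectorStore:
    def __init__(self):
        self.qdrant = QdrantClient(url="http://localhost:6333")
        self.ch = clickhouse_connect.get_client(host="localhost")

    def search(self, query: str, query_vector: list[float], limit: int = 20):
        try:
            # Primary path: semantic search in Qdrant.
            return self.qdrant.search(
                collection_name="news_chunks",  # assumed collection name
                query_vector=query_vector,
                limit=limit,
            )
        except Exception:
            # Circuit breaker trips: connection or timeout errors fall back to SQL.
            return self.fallback_keyword_search(query, limit)

    def fallback_keyword_search(self, query: str, limit: int):
        # Rapid keyword match on titles and content in ClickHouse.
        return self.ch.query(
            "SELECT title, content FROM sentiment_results "
            "WHERE positionCaseInsensitive(title, %(q)s) > 0 "
            "   OR positionCaseInsensitive(content, %(q)s) > 0 "
            "LIMIT %(lim)s",
            parameters={"q": query, "lim": limit},
        ).result_rows
```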
2. Request Phase (Conversational Logic)
Step A: Query Transformation (Contextual Synthesis)
Purpose: Bridging the gap between human conversation and vector search requirements.
- The Problem: Users often ask relative questions like "What about their stock?". Vector databases cannot resolve "their" without context.
- Process:
  - The API retrieves the last 6 messages from PostgreSQL.
  - A specialized prompt instructs GPT-4 to synthesize the conversation history and the new user query into a single Standalone Search Query.
  - If the history is empty, the original query is used as-is.
- Example Trace:
  - History: `User: Tell me about Nvidia's revenue last year.`
  - New Query: `User: Did Intel do better?`
  - Synthesized Search Query: "Comparison of Intel and Nvidia's revenue for the last fiscal year"
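A minimal sketch of the rewriting step, assuming the official `openai` Python client; the prompt wording and function name are illustrative:

```python
from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = (
    "Rewrite the user's latest question as a single standalone search query, "
    "resolving pronouns and references using the conversation history. "
    "Return only the query."
)

def to_standalone_query(history: list[dict], user_query: str) -> str:
    # No history -> nothing to resolve, use the query verbatim.
    if not history:
        return user_query
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": REWRITE_PROMPT}]
        + history[-6:]  # last 6 messages, as described above
        + [{"role": "user", "content": user_query}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```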
Step B: Intent-Based Search (Hybrid & Recency)
Purpose: Combining semantic depth with keyword precision and news freshness.
1. Hybrid Vector Synthesis
- Dense Layer: Uses `BAAI/bge-m3` to produce a 1024-dimensional semantic embedding. This handles "vibe" and "concept" matching (e.g., matching "financial struggle" to "bankruptcy").
- Sparse Layer: Prepares slots for keyword-specific vectors (e.g., SPLADE or BGE-M3 sparse). This handles exact entities, ticker symbols (e.g., "NVDA"), or specific dates that dense embeddings might blur.
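A minimal sketch of producing both layers with the `FlagEmbedding` library (the reference implementation for BGE-M3); how the pipeline wires these vectors into Qdrant is not shown:

```python
from FlagEmbedding import BGEM3FlagModel

# Loads BAAI/bge-m3; use_fp16 halves memory at negligible accuracy cost.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

output = model.encode(
    ["Did NVDA face financial struggle last quarter?"],
    return_dense=True,   # 1024-dim semantic vector
    return_sparse=True,  # lexical weights for exact keyword matching
)
dense_vec = output["dense_vecs"][0]        # shape (1024,)
sparse_vec = output["lexical_weights"][0]  # token-id -> weight map
```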
2. Temporal Decay (Recency Boosting)
- Logic: News is a depreciating asset; its relevance decays over time. The system applies a Recency Multiplier during the retrieval collection phase.
- Formula: `Score = Base_Similarity * (1.0 - (days_old / 60))`
- Constraint: The multiplier never drops below `0.5`, ensuring that highly relevant historical news is still retrievable while newer coverage is naturally prioritized.
- Example:
  - Article A (identical match, 60 days old): `Final Score = 0.9 * 0.5 = 0.45`
  - Article B (close match, today): `Final Score = 0.8 * 1.0 = 0.8`
- Result: Article B is ranked higher despite slightly lower semantic similarity.
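The decay plus its floor reduces to a one-line function. This sketch mirrors the formula and constraint above (the function name is illustrative):

```python
def apply_recency_boost(base_similarity: float, days_old: float) -> float:
    """Score = Base_Similarity * max(0.5, 1.0 - days_old / 60)."""
    multiplier = max(0.5, 1.0 - days_old / 60.0)
    return base_similarity * multiplier

assert apply_recency_boost(0.9, 60) == 0.45  # Article A
assert apply_recency_boost(0.8, 0) == 0.8    # Article B
```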
3. Retrieval Refinement (The "Precision" Layer)
Step C: Cross-Encoder Reranking (Relevance Grading)
Purpose: Moving from "Bi-Encoder" (fast but broad) to "Cross-Encoder" (slow but highly accurate).
- The Problem: Dense embeddings (Bi-Encoders) are great at finding "similar" text but often struggle with fine-grained nuances or contradictory statements.
- Process:
  - The system takes the Top 20 results from the broad search.
  - Each `[Query, Chunk]` pair is passed through the `CrossEncoder` model (`ms-marco-TinyBERT-L-2-v2`).
  - The model produces a raw relevance score, which is significantly more accurate than pure cosine similarity from the vector search.
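A minimal sketch using the `sentence-transformers` `CrossEncoder` class, which hosts this model under the `cross-encoder/` namespace; the query and candidates are illustrative:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")

query = "Intel vs Nvidia revenue comparison"
candidates = ["Nvidia reported record revenue...", "Intel's Q4 earnings fell..."]

# Score every [query, chunk] pair jointly: slower than a bi-encoder,
# but the model attends across both texts at once.
scores = reranker.predict([(query, chunk) for chunk in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
```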
Step D: Diversity Filtering - MMR (Information Density)
Purpose: Preventing "Echo Chambers" or redundant context windows.
- The Problem: Five news articles starting with the same AP wire sentence will fill the LLM context with redundant text.
- Process:
  - The system applies Maximal Marginal Relevance (MMR); a sketch follows the example below.
  - MMR selects documents that are highly relevant to the query but minimally similar to documents already selected.
- Example:
- Selection 1: A factual report of a merger.
- Selection 2 (Rejected): Another factual report of the same merger.
- Selection 2 (Accepted): A financial analyst's opinion on the same merger.
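A minimal sketch of MMR over cosine similarities, assuming precomputed document embeddings; `lambda_mult = 0.7` is an illustrative default, not the project's actual setting:

```python
import numpy as np

def mmr_select(query_vec: np.ndarray, doc_vecs: np.ndarray,
               top_k: int = 5, lambda_mult: float = 0.7) -> list[int]:
    """Return indices of top_k docs: relevant to the query, dissimilar to each other."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    relevance = np.array([cos(query_vec, d) for d in doc_vecs])
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < top_k:
        def mmr_score(i):
            # Redundancy = similarity to the closest already-selected document.
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```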
Step E: Parent Document Retrieval (Context Expansion)
Purpose: Providing the "Full Picture" when a snippet isn't enough.
- Process:
  - Small chunks (~500 chars) are indexed for surgical search accuracy.
  - If a chunk's rerank score is > 0.8, its unique `doc_id` is used to fetch the full parent article body from ClickHouse/Qdrant.
  - This allows the LLM to see surrounding context that might have been lost during chunking.
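A minimal sketch of the expansion step against ClickHouse; only `doc_id` and the `0.8` threshold come from the description above, while the table and column names are assumed:

```python
def expand_high_confidence_chunks(client, chunks: list[dict]) -> list[dict]:
    # `client` is a clickhouse_connect client; chunks carry rerank_score and doc_id.
    for chunk in chunks:
        if chunk["rerank_score"] > 0.8:
            rows = client.query(
                "SELECT content FROM articles WHERE doc_id = %(id)s",  # assumed schema
                parameters={"id": chunk["doc_id"]},
            ).result_rows
            if rows:
                chunk["context"] = rows[0][0]  # swap the snippet for the parent body
    return chunks
```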
4. Generation & Enrichment
Step F: ClickHouse Trend Fusion (External Intelligence)
Purpose: Grounding the LLM in real-time metadata.
- Process:
- Parallel to the LLM call, the system queries the ClickHouse Data Warehouse.
- It extracts trending entities and sentiment scores for the last 3 days relevant to the query.
- This "Trend Knowledge" is injected into the system prompt.
- Benefit: The LLM can say: "Retrieved articles show X, but ClickHouse trends show that sentiment for this topic is currently shifting negative."
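A minimal sketch of the trend query, assuming `entity`, `sentiment_score`, and `published_at` columns on the `sentiment_results` table (the schema is illustrative):

```python
def fetch_trend_knowledge(client, entity: str) -> str:
    # Aggregate mention counts and average sentiment over the last 3 days.
    rows = client.query(
        "SELECT entity, count() AS mentions, avg(sentiment_score) AS avg_sentiment "
        "FROM sentiment_results "
        "WHERE published_at >= now() - INTERVAL 3 DAY "
        "  AND entity = %(e)s "
        "GROUP BY entity",
        parameters={"e": entity},
    ).result_rows
    if not rows:
        return ""
    _, mentions, avg_sentiment = rows[0]
    # Injected verbatim into the system prompt as "Trend Knowledge".
    return f"Trend data: {entity} had {mentions} mentions, avg sentiment {avg_sentiment:.2f}."
```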
Step G: Streaming Generation - SSE (Real-Time UX)
Purpose: Minimizing "Perceived Latency".
- Process:
  - Uses FastAPI's `StreamingResponse` with Server-Sent Events (SSE).
  - Instead of waiting 5 seconds for a full paragraph, the first token is displayed within 200-400 ms.
  - Tokens are pushed to the client in real time as the LLM predicts them.
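A minimal sketch of the SSE endpoint, combining FastAPI's `StreamingResponse` with the OpenAI streaming API; the route and payload shape are illustrative:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.get("/chat/stream")
def chat_stream(prompt: str):
    def event_source():
        stream = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            stream=True,  # tokens arrive as they are predicted
        )
        for chunk in stream:
            token = chunk.choices[0].delta.content
            if token:
                yield f"data: {token}\n\n"  # SSE frame format
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_source(), media_type="text/event-stream")
```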
5. Traceability & Feedback Loop
Step H: Interaction Logging (Audit Trail)
- Traceability: Every AI response logs the exact list of `retrieved_doc_ids` (source IDs) in PostgreSQL.
- Learning Loop: When a user gives a "Thumbs Down", developers can query the database to see exactly which sources were used. This allows for Negative Sampling: identifying which articles cause hallucinations or bad answers.
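A minimal sketch of the negative-sampling query, assuming an `interactions` table storing `feedback` and a `retrieved_doc_ids` array (the schema is illustrative); it surfaces the documents most frequently cited in thumbs-down answers:

```python
import psycopg

NEGATIVE_SAMPLING_SQL = """
SELECT doc_id, count(*) AS bad_answers
FROM interactions, unnest(retrieved_doc_ids) AS doc_id
WHERE feedback = 'thumbs_down'
GROUP BY doc_id
ORDER BY bad_answers DESC
LIMIT 20;
"""

with psycopg.connect("dbname=rag") as conn:
    for doc_id, bad_answers in conn.execute(NEGATIVE_SAMPLING_SQL):
        print(doc_id, bad_answers)
```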
Technical Stack Overview
| Stage | Tool/Model |
|---|---|
| Embeddings | BAAI/bge-m3 |
| Reranking | ms-marco-TinyBERT-L-2-v2 (CrossEncoder) |
| Diversity | Custom MMR Implementation |
| Vector DB | Qdrant |
| Data Warehouse | ClickHouse |
| Token Control | tiktoken (cl100k_base) |
| LLM | OpenAI gpt-4 |
Full Data Flow Visual
```mermaid
graph TD
    User((User)) -->|Query| API[RAG API]
    API -->|Prompt| LLM_Rewriter[LLM Rewriter]
    LLM_Rewriter -->|Standalone Query| API
    API -.->|Circuit Breaker Check| VDB{Qdrant Online?}
    VDB -->|No| CH_FB[ClickHouse Keyword Fallback]
    VDB -->|Yes| V_Search[Hybrid Vector Search]
    V_Search -->|Top 20| Rerank[Cross-Encoder Reranker]
    Rerank -->|Diversity Pass| MMR[MMR Filter]
    MMR -->|Top K| Parent_Fetch[Parent Doc Retrieval]
    Parent_Fetch -->|Context| Prompt_Build[Prompt Construction]
    Prompt_Build -->|Inject| CH_Trends[ClickHouse Trends]
    CH_Trends -->|Full Prompt| LLM_Stream[LLM Streaming]
    LLM_Stream -->|SSE Tokens| User
    LLM_Stream -->|Trace| Postgres[(Interaction DB)]
```