
State-of-the-Art (SOTA) RAG Retrieval Data Flow

This document details the end-to-end data flow of the News Pipeline RAG API, incorporating advanced patterns for accuracy, diversity, and production resilience.

1. Pre-Processing & Infrastructure (The "Cold-Start" Layer)

To avoid model-loading latency on the first user interaction, the system implements a preemptive resource-loading strategy.

A. Async Pre-warming (Hidden Latency Absorption)

  • Challenge: Large Transformer models (like BGE-M3 and Cross-Encoders) typically take 5–15 seconds to load from disk to RAM/VRAM. Lazy-loading these on the first request creates an unacceptable user experience.
  • Process:
    • In main.py, the @app.on_event("startup") hook triggers a non-blocking threading.Thread.
    • This background thread immediately initializes EmbedderService and RerankerService.
    • By the time the web server is live and the user types their first query, the models are fully resident in memory, resulting in sub-second response times for the very first request.
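
A minimal sketch of this startup pattern, mirroring the hook described above (the actual main.py wiring may differ):

```python
import threading

from fastapi import FastAPI
from FlagEmbedding import BGEM3FlagModel
from sentence_transformers import CrossEncoder

app = FastAPI()

def _warm_models():
    # The slow part (5-15 s of disk -> RAM/VRAM) happens here, hidden
    # behind server startup instead of the first user request.
    app.state.embedder = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
    app.state.reranker = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")

@app.on_event("startup")
def prewarm():
    # Non-blocking: the server starts accepting traffic immediately while
    # the models load on a daemon thread.
    threading.Thread(target=_warm_models, daemon=True).start()
```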

B. Circuit Breaker: ClickHouse Fallback (Always-On Reliability)

  • Challenge: Vector databases like Qdrant can occasionally experience network partitions or downtime. In a naive RAG, this would crash the conversation.
  • Process:
    • The VectorStore.search method is wrapped in a robust try-except block.
    • If the Qdrant client connection fails or a timeout occurs, the Circuit Breaker trips.
    • The system automatically redirects the query to fallback_keyword_search() in ClickHouse.
    • Mechanism: It performs a rapid SQL-based keyword search on titles and content in the sentiment_results table. While less semantically accurate than vectors, it ensures the user receives actual relevant news articles instead of a "Service Unavailable" error.
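
A sketch of the fallback path, assuming the qdrant-client and clickhouse-connect libraries; the endpoints, collection name (news_chunks), and column names are illustrative:

```python
import clickhouse_connect
from qdrant_client import QdrantClient

qdrant = QdrantClient(host="localhost", port=6333)           # illustrative endpoints
clickhouse = clickhouse_connect.get_client(host="localhost")

def fallback_keyword_search(query_text: str, limit: int = 20):
    # Rapid SQL keyword match on titles and content in sentiment_results.
    sql = """
        SELECT doc_id, title, content
        FROM sentiment_results
        WHERE positionCaseInsensitive(title, %(q)s) > 0
           OR positionCaseInsensitive(content, %(q)s) > 0
        ORDER BY published_at DESC
        LIMIT %(limit)s
    """
    return clickhouse.query(sql, parameters={"q": query_text, "limit": limit}).result_rows

def search(query_vector, query_text: str, limit: int = 20):
    """Vector search that degrades to keyword search when Qdrant is down."""
    try:
        return qdrant.search(
            collection_name="news_chunks",
            query_vector=query_vector,
            limit=limit,
            timeout=5,  # fail fast on network partitions
        )
    except Exception:
        # Circuit breaker trips: keyword results instead of a 503.
        return fallback_keyword_search(query_text, limit)
```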

2. Request Phase (Conversational Logic)

Step A: Query Transformation (Contextual Synthesis)

Purpose: Bridging the gap between human conversation and vector search requirements.

  • The Problem: Users often ask context-dependent follow-up questions like "What about their stock?". A vector database cannot resolve "their" without the surrounding conversation.
  • Process:
    • The API retrieves the last 6 messages from PostgreSQL.
    • A specialized prompt instructs GPT-4 to synthesize the conversation history and the new user query into a single Standalone Search Query.
    • If history is empty, the original query is used.
  • Example Trace:
    • History: User: Tell me about Nvidia's revenue last year.
    • New Query: User: Did Intel do better?
    • Synthesized Search Query: "Comparison of Intel and Nvidia's revenue for the last fiscal year"
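
A sketch of the rewrite step with the OpenAI client; the prompt wording and helper name are illustrative, not the API's actual prompt:

```python
from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = (
    "Given the conversation history and a follow-up question, rewrite the "
    "question as a single standalone search query. Resolve all pronouns."
)

def to_standalone_query(history: list[dict], user_query: str) -> str:
    if not history:
        return user_query  # nothing to resolve: use the query verbatim
    messages = [{"role": "system", "content": REWRITE_PROMPT}]
    messages += history[-6:]  # the last 6 messages fetched from PostgreSQL
    messages.append({"role": "user", "content": user_query})
    resp = client.chat.completions.create(
        model="gpt-4", messages=messages, temperature=0
    )
    return resp.choices[0].message.content.strip()
```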

Step B: Intent-Based Search (Hybrid & Recency)

Purpose: Combining semantic depth with keyword precision and news freshness.

1. Hybrid Vector Synthesis

  • Dense Layer: Uses BAAI/bge-m3 to produce a 1024-dimensional semantic embedding. This handles "vibe" and "concept" matching (e.g., matching "financial struggle" to "bankruptcy").
  • Sparse Layer: Prepares slots for keyword-specific vectors (e.g., Splade or BGE-M3 Sparse). This handles exact entities, ticker symbols (e.g., "NVDA"), or specific dates that dense embeddings might blur.
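
A sketch of producing both layers with the FlagEmbedding package, assuming BGE-M3's built-in sparse output fills the sparse slot:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

out = model.encode(
    ["Comparison of Intel and Nvidia's revenue for the last fiscal year"],
    return_dense=True,   # 1024-dim semantic vector
    return_sparse=True,  # per-token lexical weights for exact-term matching
)
dense_vec = out["dense_vecs"][0]            # feeds Qdrant's dense index
sparse_weights = out["lexical_weights"][0]  # feeds the sparse slot
```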

2. Temporal Decay (Recency Boosting)

  • Logic: News is a depreciating asset. The system applies a Recency Multiplier when scoring retrieval candidates.
  • Formula: Score = Base_Similarity * max(0.5, 1.0 - (days_old / 60)).
  • Constraint: The multiplier is floored at 0.5, so highly relevant historical news remains retrievable while newer coverage is naturally prioritized.
  • Example:
    • Article A (strong match, 60 days old): Final Score = 0.9 * 0.5 = 0.45
    • Article B (close match, published today): Final Score = 0.8 * 1.0 = 0.8
    • Result: Article B is ranked higher despite slightly lower semantic similarity.
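
The decay is a direct transcription of the formula above:

```python
def recency_adjusted(base_similarity: float, days_old: float) -> float:
    # Linear decay over 60 days, floored at 0.5 so old-but-relevant
    # articles remain retrievable.
    multiplier = max(0.5, 1.0 - days_old / 60)
    return base_similarity * multiplier

assert round(recency_adjusted(0.9, 60), 2) == 0.45  # Article A
assert round(recency_adjusted(0.8, 0), 2) == 0.80   # Article B
```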

3. Retrieval Refinement (The "Precision" Layer)

Step C: Cross-Encoder Reranking (Relevance Grading)

Purpose: Moving from "Bi-Encoder" (fast but broad) to "Cross-Encoder" (slow but highly accurate).

  • The Problem: Dense embeddings (Bi-Encoders) are great at finding "similar" text but often struggle with fine-grained nuances or contradictory statements.
  • Process:
    • The system takes the Top 20 results from the broad search.
    • Each [Query, Chunk] pair is passed through the CrossEncoder model (ms-marco-TinyBERT-L-2-v2).
    • The model produces a raw relevance score. This is significantly more accurate than pure cosine similarity from the vector search.
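
A sketch of this step with the sentence-transformers CrossEncoder (a top_k of 5 is illustrative):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")

def rerank(query: str, chunks: list[str], top_k: int = 5):
    # Score each (query, chunk) pair jointly; far more precise than
    # comparing two independently produced embeddings.
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```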

Step D: Diversity Filtering - MMR (Information Density)

Purpose: Preventing "Echo Chambers" or redundant context windows.

  • The Problem: Five news articles starting with the same AP wire sentence will fill the LLM context with redundant text.
  • Process:
    • Implemented Maximal Marginal Relevance (MMR).
    • Logic selects documents that have high relevance but low similarity to already selected documents.
  • Example:
    • Selection 1: A factual report of a merger.
    • Candidate (Rejected): Another factual report of the same merger, a near-duplicate of Selection 1.
    • Selection 2 (Accepted): A financial analyst's opinion on the same merger.
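
One way to implement the greedy selection; lambda_mult balances relevance against redundancy (0.7 is an illustrative value), and embeddings are assumed unit-normalised:

```python
import numpy as np

def mmr(query_vec: np.ndarray, doc_vecs: np.ndarray,
        lambda_mult: float = 0.7, k: int = 5) -> list[int]:
    """Greedy Maximal Marginal Relevance over unit-normalised embeddings."""
    relevance = doc_vecs @ query_vec  # cosine similarity to the query
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            # Redundancy = similarity to the closest already-selected doc.
            redundancy = max(
                (float(doc_vecs[i] @ doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * float(relevance[i]) - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected  # document indices in selection order
```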

Step E: Parent Document Retrieval (Context Expansion)

Purpose: Providing the "Full Picture" when a snippet isn't enough.

  • Process:
    • Small chunks (~500 chars) are indexed for surgical search accuracy.
    • If a chunk's rerank score is > 0.8, its unique doc_id is used to fetch the full parent article body from ClickHouse/Qdrant.
    • This allows the LLM to see the surrounding context that might have been lost in the chunking process.
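
A sketch of the expansion step, assuming a clickhouse-connect client and illustrative table/column names (articles, body):

```python
PARENT_FETCH_THRESHOLD = 0.8  # rerank score above which context is expanded

def expand_to_parents(reranked_chunks: list[dict], clickhouse) -> list[dict]:
    """Swap high-confidence chunks for their full parent articles."""
    docs = []
    for chunk in reranked_chunks:
        if chunk["rerank_score"] > PARENT_FETCH_THRESHOLD:
            rows = clickhouse.query(
                "SELECT body FROM articles WHERE doc_id = %(id)s",
                parameters={"id": chunk["doc_id"]},
            ).result_rows
            if rows:
                docs.append({**chunk, "content": rows[0][0]})
                continue
        docs.append(chunk)  # below threshold: keep the small chunk
    return docs
```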

4. Generation & Enrichment

Step F: ClickHouse Trend Fusion (External Intelligence)

Purpose: Grounding the LLM in real-time metadata.

  • Process:
    • Parallel to the LLM call, the system queries the ClickHouse Data Warehouse.
    • It extracts trending entities and sentiment scores for the last 3 days relevant to the query.
    • This "Trend Knowledge" is injected into the system prompt.
  • Benefit: The LLM can say: "Retrieved articles show X, but ClickHouse trends show that sentiment for this topic is currently shifting negative."
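
A sketch of the trend query (table and column names are illustrative):

```python
TREND_SQL = """
    SELECT entity,
           avg(sentiment_score) AS avg_sentiment,
           count() AS mentions
    FROM sentiment_results
    WHERE published_at >= now() - INTERVAL 3 DAY
      AND positionCaseInsensitive(content, %(topic)s) > 0
    GROUP BY entity
    ORDER BY mentions DESC
    LIMIT 10
"""

def fetch_trends(clickhouse, topic: str) -> str:
    rows = clickhouse.query(TREND_SQL, parameters={"topic": topic}).result_rows
    # Rendered as plain text and injected into the system prompt.
    return "\n".join(
        f"{entity}: sentiment {score:+.2f} across {n} mentions"
        for entity, score, n in rows
    )
```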

Step G: Streaming Generation - SSE (Real-Time UX)

Purpose: Minimizing "Perceived Latency".

  • Process:
    • Uses FastAPI StreamingResponse and Server-Sent Events (SSE).
    • Instead of waiting 5 seconds for a full paragraph, the first token is displayed within 200–400 ms.
    • Tokens are pushed to the client in real-time as the LLM predicts them.
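
A minimal sketch of the SSE endpoint (the route path and payload framing are illustrative):

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.get("/chat/stream")
def chat_stream(q: str):
    def event_source():
        stream = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": q}],
            stream=True,  # tokens arrive as the LLM predicts them
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield f"data: {delta}\n\n"  # one SSE frame per token delta
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_source(), media_type="text/event-stream")
```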

5. Traceability & Feedback Loop

Step H: Interaction Logging (Audit Trail)

  • Traceability: Every AI response logs the exact list of retrieved_doc_ids (Source IDs) in PostgreSQL.
  • Learning Loop: When a user gives a "Thumbs Down", developers can query the database to see exactly which sources were used. This allows for Negative Sampling (identifying which articles cause hallucination or bad answers).
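
A sketch of the audit write, assuming a psycopg2 connection and an illustrative interactions table:

```python
import json

def log_interaction(pg_conn, conversation_id: str, answer: str,
                    retrieved_doc_ids: list[str]) -> None:
    """Persist the answer together with the exact sources that produced it."""
    with pg_conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO interactions (conversation_id, answer, retrieved_doc_ids)
            VALUES (%s, %s, %s)
            """,
            (conversation_id, answer, json.dumps(retrieved_doc_ids)),
        )
    pg_conn.commit()
```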

Technical Stack Overview

| Stage          | Tool/Model                               |
| -------------- | ---------------------------------------- |
| Embeddings     | BAAI/bge-m3 (BAAI)                       |
| Reranking      | ms-marco-TinyBERT-L-2-v2 (CrossEncoder)  |
| Diversity      | Custom MMR Implementation                |
| Vector DB      | Qdrant                                   |
| Data Warehouse | ClickHouse                               |
| Token Control  | tiktoken (cl100k_base)                   |
| LLM            | OpenAI gpt-4                             |

Full Data Flow Visual

```mermaid
graph TD
    User((User)) -->|Query| API[RAG API]
    API -->|Prompt| LLM_Rewriter[LLM Rewriter]
    LLM_Rewriter -->|Standalone Query| API

    API -.->|Circuit Breaker Check| VDB{Qdrant Online?}
    VDB -->|No| CH_FB[ClickHouse Keyword Fallback]
    VDB -->|Yes| V_Search[Hybrid Vector Search]

    V_Search -->|Top 20| Rerank[Cross-Encoder Reranker]
    Rerank -->|Diversity Pass| MMR[MMR Filter]
    MMR -->|Top K| Parent_Fetch[Parent Doc Retrieval]

    Parent_Fetch -->|Context| Prompt_Build[Prompt Construction]
    Prompt_Build -->|Inject| CH_Trends[ClickHouse Trends]

    CH_Trends -->|Full Prompt| LLM_Stream[LLM Streaming]
    LLM_Stream -->|SSE Tokens| User

    LLM_Stream -->|Trace| Postgres[(Interaction DB)]
```