rag-api-node-1 / docs /RAG_RETRIEVAL_FLOW.md
Peterase's picture
feat(rag): implement hybrid search with live sources and production-grade intent classification
a63c61f
# State-of-the-Art (SOTA) RAG Retrieval Data Flow
This document details the end-to-end data flow of the News Pipeline RAG API, incorporating advanced patterns for accuracy, diversity, and production resilience.
## 1. Pre-Processing & Infrastructure (The "Cold-Start" Layer)
To ensure **zero-latency** during the initial user interaction, the system implements a preemptive resource loading strategy.
### A. Async Pre-warming (Hidden Latency Absorption)
- **Challenge**: Large Transformer models (like BGE-M3 and Cross-Encoders) typically take 5–15 seconds to load from disk to RAM/VRAM. Lazy-loading these on the first request creates an unacceptable user experience.
- **Process**:
- In `main.py`, the `@app.on_event("startup")` hook triggers a non-blocking `threading.Thread`.
- This background thread immediately initializes `EmbedderService` and `RerankerService`.
- By the time the web server is live and the user types their first query, the models are fully resident in memory, resulting in sub-second response times for the very first request.
### B. Circuit Breaker: ClickHouse Fallback (Always-On Reliability)
- **Challenge**: Vector databases like Qdrant can occasionally experience network partitions or downtime. In a naive RAG, this would crash the conversation.
- **Process**:
- The `VectorStore.search` method is wrapped in a robust `try-except` block.
- If the Qdrant client connection fails or a timeout occurs, the **Circuit Breaker** trips.
- The system automatically redirects the query to `fallback_keyword_search()` in ClickHouse.
- **Mechanism**: It performs a rapid SQL-based keyword search on titles and content in the `sentiment_results` table. While less semantically accurate than vectors, it ensures the user receives actual relevant news articles instead of a "Service Unavailable" error.
## 2. Request Phase (Conversational Logic)
### Step A: Query Transformation (Contextual Synthesis)
**Purpose**: Bridging the gap between human conversation and vector search requirements.
- **The Problem**: Users often ask relative questions like *"What about their stock?"*. Vector databases cannot resolve "their" without context.
- **Process**:
- The API retrieves the last 6 messages from PostgreSQL.
- A specialized prompt instructs `GPT-4` to synthesize the conversation history and the new user query into a single **Standalone Search Query**.
- If history is empty, the original query is used.
- **Example Trace**:
- **History**: `User: Tell me about Nvidia's revenue last year.`
- **New Query**: `User: Did Intel do better?`
- **Synthesized Search Query**: *"Comparison of Intel and Nvidia's revenue for the last fiscal year"*
### Step B: Intent-Based Search (Hybrid & Recency)
**Purpose**: Combining semantic depth with keyword precision and news freshness.
#### 1. Hybrid Vector Synthesis
- **Dense Layer**: Uses `BAAI/bge-m3` to produce a 1024-dimensional semantic embedding. This handles "vibe" and "concept" matching (e.g., matching "financial struggle" to "bankruptcy").
- **Sparse Layer**: Prepares slots for keyword-specific vectors (e.g., Splade or BGE-M3 Sparse). This handles exact entities, ticker symbols (e.g., "NVDA"), or specific dates that dense embeddings might blur.
#### 2. Temporal Decay (Recency Boosting)
- **Logic**: News is a deteriorating asset. The system applies a **Recency Multiplier** during the retrieval collection phase.
- **Formula**: `Score = Base_Similarity * (1.0 - (days_old / 60))`.
- **Constraint**: The multiplier never drops below `0.5`, ensuring that very relevant historical news is still retrievable but newer coverage is naturally prioritized.
- **Example**:
- Article A (Identical match, 60 days old): `Final Score = 0.9 * 0.5 = 0.45`
- Article B (Close match, today): `Final Score = 0.8 * 1.0 = 0.8`
- **Result**: Article B is ranked higher despite slightly lower semantic similarity.
## 3. Retrieval Refinement (The "Precision" Layer)
### Step C: Cross-Encoder Reranking (Relevance Grading)
**Purpose**: Moving from "Bi-Encoder" (fast but broad) to "Cross-Encoder" (slow but highly accurate).
- **The Problem**: Dense embeddings (Bi-Encoders) are great at finding "similar" text but often struggle with fine-grained nuances or contradictory statements.
- **Process**:
- The system takes the **Top 20** results from the broad search.
- Each [Query, Chunk] pair is passed through the `CrossEncoder` model (`ms-marco-TinyBERT-L-2-v2`).
- The model produces a raw relevance score. This is significantly more accurate than pure cosine similarity from the vector search.
### Step D: Diversity Filtering - MMR (Information Density)
**Purpose**: Preventing "Echo Chambers" or redundant context windows.
- **The Problem**: Five news articles starting with the same AP wire sentence will fill the LLM context with redundant text.
- **Process**:
- Implemented **Maximal Marginal Relevance (MMR)**.
- Logic selects documents that have high relevance but **low similarity** to already selected documents.
- **Example**:
- *Selection 1*: A factual report of a merger.
- *Selection 2 (Rejected)*: Another factual report of the same merger.
- *Selection 2 (Accepted)*: A financial analyst's opinion on the same merger.
### Step E: Parent Document Retrieval (Context Expansion)
**Purpose**: Providing the "Full Picture" when a snippet isn't enough.
- **Process**:
- Small chunks (~500 chars) are indexed for surgical search accuracy.
- If a chunk's rerank score is **> 0.8**, its unique `doc_id` is used to fetch the full parent article body from ClickHouse/Qdrant.
- This allows the LLM to see the surrounding context that might have been lost in the chunking process.
---
## 4. Generation & Enrichment
### Step F: ClickHouse Trend Fusion (External Intelligence)
**Purpose**: Grounding the LLM in real-time metadata.
- **Process**:
- Parallel to the LLM call, the system queries the **ClickHouse Data Warehouse**.
- It extracts trending entities and sentiment scores for the last 3 days relevant to the query.
- This "Trend Knowledge" is injected into the system prompt.
- **Benefit**: The LLM can say: *"Retrieval articles show X, but ClickHouse trends show that sentiment for this topic is currently shifting negative."*
### Step G: Streaming Generation - SSE (Real-Time UX)
**Purpose**: Minimizing "Perceived Latency".
- **Process**:
- Uses FastAPI `StreamingResponse` and Server-Sent Events (SSE).
- Instead of waiting 5 seconds for a full paragraph, the first token is displayed within **200-400ms**.
- Tokens are pushed to the client in real-time as the LLM predicts them.
---
## 5. Traceability & Feedback Loop
### Step H: Interaction Logging (Audit Trail)
- **Traceability**: Every AI response logs the exact list of `retrieved_doc_ids` (Source IDs) in PostgreSQL.
- **Learning Loop**: When a user gives a "Thumbs Down", developers can query the database to see exactly which sources were used. This allows for **Negative Sampling** (identifying which articles cause hallucination or bad answers).
---
## Technical Stack Overview
| Stage | Tool/Model |
| :--- | :--- |
| **Embeddings** | `BAAI/bge-m3` (BAAI) |
| **Reranking** | `ms-marco-TinyBERT-L-2-v2` (CrossEncoder) |
| **Diversity** | Custom MMR Implementation |
| **Vector DB** | Qdrant |
| **Data Warehouse**| ClickHouse |
| **Token Control** | `tiktoken` (cl100k_base) |
| **LLM** | OpenAI `gpt-4` |
---
## Full Data Flow Visual
```mermaid
graph TD
User((User)) -->|Query| API[RAG API]
API -->|Prompt| LLM_Rewriter[LLM Rewriter]
LLM_Rewriter -->|Standalone Query| API
API -.->|Circuit Breaker Check| VDB{Qdrant Online?}
VDB -->|No| CH_FB[ClickHouse Keyword Fallback]
VDB -->|Yes| V_Search[Hybrid Vector Search]
V_Search -->|Top 20| Rerank[Cross-Encoder Reranker]
Rerank -->|Diversity Pass| MMR[MMR Filter]
MMR -->|Top K| Parent_Fetch[Parent Doc Retrieval]
Parent_Fetch -->|Context| Prompt_Build[Prompt Construction]
Prompt_Build -->|Inject| CH_Trends[ClickHouse Trends]
CH_Trends -->|Full Prompt| LLM_Stream[LLM Streaming]
LLM_Stream -->|SSE Tokens| User
LLM_Stream -->|Trace| Postgres[(Interaction DB)]
```