Spaces:

Peterase
/

rag-api-node-1

Running

File size: 8,290 Bytes

a63c61f

# State-of-the-Art (SOTA) RAG Retrieval Data Flow

This document details the end-to-end data flow of the News Pipeline RAG API, incorporating advanced patterns for accuracy, diversity, and production resilience.

## 1. Pre-Processing & Infrastructure (The "Cold-Start" Layer)
To ensure **zero-latency** during the initial user interaction, the system implements a preemptive resource loading strategy.

### A. Async Pre-warming (Hidden Latency Absorption)
- **Challenge**: Large Transformer models (like BGE-M3 and Cross-Encoders) typically take 5–15 seconds to load from disk to RAM/VRAM. Lazy-loading these on the first request creates an unacceptable user experience.
- **Process**: 
    - In `main.py`, the `@app.on_event("startup")` hook triggers a non-blocking `threading.Thread`.
    - This background thread immediately initializes `EmbedderService` and `RerankerService`.
    - By the time the web server is live and the user types their first query, the models are fully resident in memory, resulting in sub-second response times for the very first request.

### B. Circuit Breaker: ClickHouse Fallback (Always-On Reliability)
- **Challenge**: Vector databases like Qdrant can occasionally experience network partitions or downtime. In a naive RAG, this would crash the conversation.
- **Process**: 
    - The `VectorStore.search` method is wrapped in a robust `try-except` block.
    - If the Qdrant client connection fails or a timeout occurs, the **Circuit Breaker** trips.
    - The system automatically redirects the query to `fallback_keyword_search()` in ClickHouse.
    - **Mechanism**: It performs a rapid SQL-based keyword search on titles and content in the `sentiment_results` table. While less semantically accurate than vectors, it ensures the user receives actual relevant news articles instead of a "Service Unavailable" error.

## 2. Request Phase (Conversational Logic)

### Step A: Query Transformation (Contextual Synthesis)
**Purpose**: Bridging the gap between human conversation and vector search requirements.
- **The Problem**: Users often ask relative questions like *"What about their stock?"*. Vector databases cannot resolve "their" without context.
- **Process**: 
    - The API retrieves the last 6 messages from PostgreSQL.
    - A specialized prompt instructs `GPT-4` to synthesize the conversation history and the new user query into a single **Standalone Search Query**.
    - If history is empty, the original query is used.
- **Example Trace**: 
    - **History**: `User: Tell me about Nvidia's revenue last year.`
    - **New Query**: `User: Did Intel do better?`
    - **Synthesized Search Query**: *"Comparison of Intel and Nvidia's revenue for the last fiscal year"*

### Step B: Intent-Based Search (Hybrid & Recency)
**Purpose**: Combining semantic depth with keyword precision and news freshness.

#### 1. Hybrid Vector Synthesis
- **Dense Layer**: Uses `BAAI/bge-m3` to produce a 1024-dimensional semantic embedding. This handles "vibe" and "concept" matching (e.g., matching "financial struggle" to "bankruptcy").
- **Sparse Layer**: Prepares slots for keyword-specific vectors (e.g., Splade or BGE-M3 Sparse). This handles exact entities, ticker symbols (e.g., "NVDA"), or specific dates that dense embeddings might blur.

#### 2. Temporal Decay (Recency Boosting)
- **Logic**: News is a deteriorating asset. The system applies a **Recency Multiplier** during the retrieval collection phase.
- **Formula**: `Score = Base_Similarity * (1.0 - (days_old / 60))`.
- **Constraint**: The multiplier never drops below `0.5`, ensuring that very relevant historical news is still retrievable but newer coverage is naturally prioritized.
- **Example**: 
    - Article A (Identical match, 60 days old): `Final Score = 0.9 * 0.5 = 0.45`
    - Article B (Close match, today): `Final Score = 0.8 * 1.0 = 0.8`
    - **Result**: Article B is ranked higher despite slightly lower semantic similarity.

## 3. Retrieval Refinement (The "Precision" Layer)

### Step C: Cross-Encoder Reranking (Relevance Grading)
**Purpose**: Moving from "Bi-Encoder" (fast but broad) to "Cross-Encoder" (slow but highly accurate).
- **The Problem**: Dense embeddings (Bi-Encoders) are great at finding "similar" text but often struggle with fine-grained nuances or contradictory statements.
- **Process**: 
    - The system takes the **Top 20** results from the broad search.
    - Each [Query, Chunk] pair is passed through the `CrossEncoder` model (`ms-marco-TinyBERT-L-2-v2`).
    - The model produces a raw relevance score. This is significantly more accurate than pure cosine similarity from the vector search.

### Step D: Diversity Filtering - MMR (Information Density)
**Purpose**: Preventing "Echo Chambers" or redundant context windows.
- **The Problem**: Five news articles starting with the same AP wire sentence will fill the LLM context with redundant text.
- **Process**: 
    - Implemented **Maximal Marginal Relevance (MMR)**.
    - Logic selects documents that have high relevance but **low similarity** to already selected documents.
- **Example**: 
    - *Selection 1*: A factual report of a merger.
    - *Selection 2 (Rejected)*: Another factual report of the same merger.
    - *Selection 2 (Accepted)*: A financial analyst's opinion on the same merger.

### Step E: Parent Document Retrieval (Context Expansion)
**Purpose**: Providing the "Full Picture" when a snippet isn't enough.
- **Process**: 
    - Small chunks (~500 chars) are indexed for surgical search accuracy.
    - If a chunk's rerank score is **> 0.8**, its unique `doc_id` is used to fetch the full parent article body from ClickHouse/Qdrant.
    - This allows the LLM to see the surrounding context that might have been lost in the chunking process.

---

## 4. Generation & Enrichment

### Step F: ClickHouse Trend Fusion (External Intelligence)
**Purpose**: Grounding the LLM in real-time metadata.
- **Process**: 
    - Parallel to the LLM call, the system queries the **ClickHouse Data Warehouse**.
    - It extracts trending entities and sentiment scores for the last 3 days relevant to the query.
    - This "Trend Knowledge" is injected into the system prompt.
- **Benefit**: The LLM can say: *"Retrieval articles show X, but ClickHouse trends show that sentiment for this topic is currently shifting negative."*

### Step G: Streaming Generation - SSE (Real-Time UX)
**Purpose**: Minimizing "Perceived Latency".
- **Process**: 
    - Uses FastAPI `StreamingResponse` and Server-Sent Events (SSE).
    - Instead of waiting 5 seconds for a full paragraph, the first token is displayed within **200-400ms**.
    - Tokens are pushed to the client in real-time as the LLM predicts them.

---

## 5. Traceability & Feedback Loop

### Step H: Interaction Logging (Audit Trail)
- **Traceability**: Every AI response logs the exact list of `retrieved_doc_ids` (Source IDs) in PostgreSQL.
- **Learning Loop**: When a user gives a "Thumbs Down", developers can query the database to see exactly which sources were used. This allows for **Negative Sampling** (identifying which articles cause hallucination or bad answers).

---

## Technical Stack Overview

| Stage | Tool/Model |
| :--- | :--- |
| **Embeddings** | `BAAI/bge-m3` (BAAI) |
| **Reranking** | `ms-marco-TinyBERT-L-2-v2` (CrossEncoder) |
| **Diversity** | Custom MMR Implementation |
| **Vector DB** | Qdrant |
| **Data Warehouse**| ClickHouse |
| **Token Control** | `tiktoken` (cl100k_base) |
| **LLM** | OpenAI `gpt-4` |

---

## Full Data Flow Visual

```mermaid
graph TD
    User((User)) -->|Query| API[RAG API]
    API -->|Prompt| LLM_Rewriter[LLM Rewriter]
    LLM_Rewriter -->|Standalone Query| API
    
    API -.->|Circuit Breaker Check| VDB{Qdrant Online?}
    VDB -->|No| CH_FB[ClickHouse Keyword Fallback]
    VDB -->|Yes| V_Search[Hybrid Vector Search]
    
    V_Search -->|Top 20| Rerank[Cross-Encoder Reranker]
    Rerank -->|Diversity Pass| MMR[MMR Filter]
    MMR -->|Top K| Parent_Fetch[Parent Doc Retrieval]
    
    Parent_Fetch -->|Context| Prompt_Build[Prompt Construction]
    Prompt_Build -->|Inject| CH_Trends[ClickHouse Trends]
    
    CH_Trends -->|Full Prompt| LLM_Stream[LLM Streaming]
    LLM_Stream -->|SSE Tokens| User
    
    LLM_Stream -->|Trace| Postgres[(Interaction DB)]
```