Spaces:
Running
Running
File size: 8,290 Bytes
a63c61f | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 | # State-of-the-Art (SOTA) RAG Retrieval Data Flow
This document details the end-to-end data flow of the News Pipeline RAG API, incorporating advanced patterns for accuracy, diversity, and production resilience.
## 1. Pre-Processing & Infrastructure (The "Cold-Start" Layer)
To ensure **zero-latency** during the initial user interaction, the system implements a preemptive resource loading strategy.
### A. Async Pre-warming (Hidden Latency Absorption)
- **Challenge**: Large Transformer models (like BGE-M3 and Cross-Encoders) typically take 5–15 seconds to load from disk to RAM/VRAM. Lazy-loading these on the first request creates an unacceptable user experience.
- **Process**:
- In `main.py`, the `@app.on_event("startup")` hook triggers a non-blocking `threading.Thread`.
- This background thread immediately initializes `EmbedderService` and `RerankerService`.
- By the time the web server is live and the user types their first query, the models are fully resident in memory, resulting in sub-second response times for the very first request.
### B. Circuit Breaker: ClickHouse Fallback (Always-On Reliability)
- **Challenge**: Vector databases like Qdrant can occasionally experience network partitions or downtime. In a naive RAG, this would crash the conversation.
- **Process**:
- The `VectorStore.search` method is wrapped in a robust `try-except` block.
- If the Qdrant client connection fails or a timeout occurs, the **Circuit Breaker** trips.
- The system automatically redirects the query to `fallback_keyword_search()` in ClickHouse.
- **Mechanism**: It performs a rapid SQL-based keyword search on titles and content in the `sentiment_results` table. While less semantically accurate than vectors, it ensures the user receives actual relevant news articles instead of a "Service Unavailable" error.
## 2. Request Phase (Conversational Logic)
### Step A: Query Transformation (Contextual Synthesis)
**Purpose**: Bridging the gap between human conversation and vector search requirements.
- **The Problem**: Users often ask relative questions like *"What about their stock?"*. Vector databases cannot resolve "their" without context.
- **Process**:
- The API retrieves the last 6 messages from PostgreSQL.
- A specialized prompt instructs `GPT-4` to synthesize the conversation history and the new user query into a single **Standalone Search Query**.
- If history is empty, the original query is used.
- **Example Trace**:
- **History**: `User: Tell me about Nvidia's revenue last year.`
- **New Query**: `User: Did Intel do better?`
- **Synthesized Search Query**: *"Comparison of Intel and Nvidia's revenue for the last fiscal year"*
### Step B: Intent-Based Search (Hybrid & Recency)
**Purpose**: Combining semantic depth with keyword precision and news freshness.
#### 1. Hybrid Vector Synthesis
- **Dense Layer**: Uses `BAAI/bge-m3` to produce a 1024-dimensional semantic embedding. This handles "vibe" and "concept" matching (e.g., matching "financial struggle" to "bankruptcy").
- **Sparse Layer**: Prepares slots for keyword-specific vectors (e.g., Splade or BGE-M3 Sparse). This handles exact entities, ticker symbols (e.g., "NVDA"), or specific dates that dense embeddings might blur.
#### 2. Temporal Decay (Recency Boosting)
- **Logic**: News is a deteriorating asset. The system applies a **Recency Multiplier** during the retrieval collection phase.
- **Formula**: `Score = Base_Similarity * (1.0 - (days_old / 60))`.
- **Constraint**: The multiplier never drops below `0.5`, ensuring that very relevant historical news is still retrievable but newer coverage is naturally prioritized.
- **Example**:
- Article A (Identical match, 60 days old): `Final Score = 0.9 * 0.5 = 0.45`
- Article B (Close match, today): `Final Score = 0.8 * 1.0 = 0.8`
- **Result**: Article B is ranked higher despite slightly lower semantic similarity.
## 3. Retrieval Refinement (The "Precision" Layer)
### Step C: Cross-Encoder Reranking (Relevance Grading)
**Purpose**: Moving from "Bi-Encoder" (fast but broad) to "Cross-Encoder" (slow but highly accurate).
- **The Problem**: Dense embeddings (Bi-Encoders) are great at finding "similar" text but often struggle with fine-grained nuances or contradictory statements.
- **Process**:
- The system takes the **Top 20** results from the broad search.
- Each [Query, Chunk] pair is passed through the `CrossEncoder` model (`ms-marco-TinyBERT-L-2-v2`).
- The model produces a raw relevance score. This is significantly more accurate than pure cosine similarity from the vector search.
### Step D: Diversity Filtering - MMR (Information Density)
**Purpose**: Preventing "Echo Chambers" or redundant context windows.
- **The Problem**: Five news articles starting with the same AP wire sentence will fill the LLM context with redundant text.
- **Process**:
- Implemented **Maximal Marginal Relevance (MMR)**.
- Logic selects documents that have high relevance but **low similarity** to already selected documents.
- **Example**:
- *Selection 1*: A factual report of a merger.
- *Selection 2 (Rejected)*: Another factual report of the same merger.
- *Selection 2 (Accepted)*: A financial analyst's opinion on the same merger.
### Step E: Parent Document Retrieval (Context Expansion)
**Purpose**: Providing the "Full Picture" when a snippet isn't enough.
- **Process**:
- Small chunks (~500 chars) are indexed for surgical search accuracy.
- If a chunk's rerank score is **> 0.8**, its unique `doc_id` is used to fetch the full parent article body from ClickHouse/Qdrant.
- This allows the LLM to see the surrounding context that might have been lost in the chunking process.
---
## 4. Generation & Enrichment
### Step F: ClickHouse Trend Fusion (External Intelligence)
**Purpose**: Grounding the LLM in real-time metadata.
- **Process**:
- Parallel to the LLM call, the system queries the **ClickHouse Data Warehouse**.
- It extracts trending entities and sentiment scores for the last 3 days relevant to the query.
- This "Trend Knowledge" is injected into the system prompt.
- **Benefit**: The LLM can say: *"Retrieval articles show X, but ClickHouse trends show that sentiment for this topic is currently shifting negative."*
### Step G: Streaming Generation - SSE (Real-Time UX)
**Purpose**: Minimizing "Perceived Latency".
- **Process**:
- Uses FastAPI `StreamingResponse` and Server-Sent Events (SSE).
- Instead of waiting 5 seconds for a full paragraph, the first token is displayed within **200-400ms**.
- Tokens are pushed to the client in real-time as the LLM predicts them.
---
## 5. Traceability & Feedback Loop
### Step H: Interaction Logging (Audit Trail)
- **Traceability**: Every AI response logs the exact list of `retrieved_doc_ids` (Source IDs) in PostgreSQL.
- **Learning Loop**: When a user gives a "Thumbs Down", developers can query the database to see exactly which sources were used. This allows for **Negative Sampling** (identifying which articles cause hallucination or bad answers).
---
## Technical Stack Overview
| Stage | Tool/Model |
| :--- | :--- |
| **Embeddings** | `BAAI/bge-m3` (BAAI) |
| **Reranking** | `ms-marco-TinyBERT-L-2-v2` (CrossEncoder) |
| **Diversity** | Custom MMR Implementation |
| **Vector DB** | Qdrant |
| **Data Warehouse**| ClickHouse |
| **Token Control** | `tiktoken` (cl100k_base) |
| **LLM** | OpenAI `gpt-4` |
---
## Full Data Flow Visual
```mermaid
graph TD
User((User)) -->|Query| API[RAG API]
API -->|Prompt| LLM_Rewriter[LLM Rewriter]
LLM_Rewriter -->|Standalone Query| API
API -.->|Circuit Breaker Check| VDB{Qdrant Online?}
VDB -->|No| CH_FB[ClickHouse Keyword Fallback]
VDB -->|Yes| V_Search[Hybrid Vector Search]
V_Search -->|Top 20| Rerank[Cross-Encoder Reranker]
Rerank -->|Diversity Pass| MMR[MMR Filter]
MMR -->|Top K| Parent_Fetch[Parent Doc Retrieval]
Parent_Fetch -->|Context| Prompt_Build[Prompt Construction]
Prompt_Build -->|Inject| CH_Trends[ClickHouse Trends]
CH_Trends -->|Full Prompt| LLM_Stream[LLM Streaming]
LLM_Stream -->|SSE Tokens| User
LLM_Stream -->|Trace| Postgres[(Interaction DB)]
```
|