File size: 8,290 Bytes
a63c61f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
# State-of-the-Art (SOTA) RAG Retrieval Data Flow

This document details the end-to-end data flow of the News Pipeline RAG API, incorporating advanced patterns for accuracy, diversity, and production resilience.

## 1. Pre-Processing & Infrastructure (The "Cold-Start" Layer)
To ensure **zero-latency** during the initial user interaction, the system implements a preemptive resource loading strategy.

### A. Async Pre-warming (Hidden Latency Absorption)
- **Challenge**: Large Transformer models (like BGE-M3 and Cross-Encoders) typically take 5–15 seconds to load from disk to RAM/VRAM. Lazy-loading these on the first request creates an unacceptable user experience.
- **Process**: 
    - In `main.py`, the `@app.on_event("startup")` hook triggers a non-blocking `threading.Thread`.
    - This background thread immediately initializes `EmbedderService` and `RerankerService`.
    - By the time the web server is live and the user types their first query, the models are fully resident in memory, resulting in sub-second response times for the very first request.

### B. Circuit Breaker: ClickHouse Fallback (Always-On Reliability)
- **Challenge**: Vector databases like Qdrant can occasionally experience network partitions or downtime. In a naive RAG, this would crash the conversation.
- **Process**: 
    - The `VectorStore.search` method is wrapped in a robust `try-except` block.
    - If the Qdrant client connection fails or a timeout occurs, the **Circuit Breaker** trips.
    - The system automatically redirects the query to `fallback_keyword_search()` in ClickHouse.
    - **Mechanism**: It performs a rapid SQL-based keyword search on titles and content in the `sentiment_results` table. While less semantically accurate than vectors, it ensures the user receives actual relevant news articles instead of a "Service Unavailable" error.

## 2. Request Phase (Conversational Logic)

### Step A: Query Transformation (Contextual Synthesis)
**Purpose**: Bridging the gap between human conversation and vector search requirements.
- **The Problem**: Users often ask relative questions like *"What about their stock?"*. Vector databases cannot resolve "their" without context.
- **Process**: 
    - The API retrieves the last 6 messages from PostgreSQL.
    - A specialized prompt instructs `GPT-4` to synthesize the conversation history and the new user query into a single **Standalone Search Query**.
    - If history is empty, the original query is used.
- **Example Trace**: 
    - **History**: `User: Tell me about Nvidia's revenue last year.`
    - **New Query**: `User: Did Intel do better?`
    - **Synthesized Search Query**: *"Comparison of Intel and Nvidia's revenue for the last fiscal year"*

### Step B: Intent-Based Search (Hybrid & Recency)
**Purpose**: Combining semantic depth with keyword precision and news freshness.

#### 1. Hybrid Vector Synthesis
- **Dense Layer**: Uses `BAAI/bge-m3` to produce a 1024-dimensional semantic embedding. This handles "vibe" and "concept" matching (e.g., matching "financial struggle" to "bankruptcy").
- **Sparse Layer**: Prepares slots for keyword-specific vectors (e.g., Splade or BGE-M3 Sparse). This handles exact entities, ticker symbols (e.g., "NVDA"), or specific dates that dense embeddings might blur.

#### 2. Temporal Decay (Recency Boosting)
- **Logic**: News is a deteriorating asset. The system applies a **Recency Multiplier** during the retrieval collection phase.
- **Formula**: `Score = Base_Similarity * (1.0 - (days_old / 60))`.
- **Constraint**: The multiplier never drops below `0.5`, ensuring that very relevant historical news is still retrievable but newer coverage is naturally prioritized.
- **Example**: 
    - Article A (Identical match, 60 days old): `Final Score = 0.9 * 0.5 = 0.45`
    - Article B (Close match, today): `Final Score = 0.8 * 1.0 = 0.8`
    - **Result**: Article B is ranked higher despite slightly lower semantic similarity.

## 3. Retrieval Refinement (The "Precision" Layer)

### Step C: Cross-Encoder Reranking (Relevance Grading)
**Purpose**: Moving from "Bi-Encoder" (fast but broad) to "Cross-Encoder" (slow but highly accurate).
- **The Problem**: Dense embeddings (Bi-Encoders) are great at finding "similar" text but often struggle with fine-grained nuances or contradictory statements.
- **Process**: 
    - The system takes the **Top 20** results from the broad search.
    - Each [Query, Chunk] pair is passed through the `CrossEncoder` model (`ms-marco-TinyBERT-L-2-v2`).
    - The model produces a raw relevance score. This is significantly more accurate than pure cosine similarity from the vector search.

### Step D: Diversity Filtering - MMR (Information Density)
**Purpose**: Preventing "Echo Chambers" or redundant context windows.
- **The Problem**: Five news articles starting with the same AP wire sentence will fill the LLM context with redundant text.
- **Process**: 
    - Implemented **Maximal Marginal Relevance (MMR)**.
    - Logic selects documents that have high relevance but **low similarity** to already selected documents.
- **Example**: 
    - *Selection 1*: A factual report of a merger.
    - *Selection 2 (Rejected)*: Another factual report of the same merger.
    - *Selection 2 (Accepted)*: A financial analyst's opinion on the same merger.

### Step E: Parent Document Retrieval (Context Expansion)
**Purpose**: Providing the "Full Picture" when a snippet isn't enough.
- **Process**: 
    - Small chunks (~500 chars) are indexed for surgical search accuracy.
    - If a chunk's rerank score is **> 0.8**, its unique `doc_id` is used to fetch the full parent article body from ClickHouse/Qdrant.
    - This allows the LLM to see the surrounding context that might have been lost in the chunking process.

---

## 4. Generation & Enrichment

### Step F: ClickHouse Trend Fusion (External Intelligence)
**Purpose**: Grounding the LLM in real-time metadata.
- **Process**: 
    - Parallel to the LLM call, the system queries the **ClickHouse Data Warehouse**.
    - It extracts trending entities and sentiment scores for the last 3 days relevant to the query.
    - This "Trend Knowledge" is injected into the system prompt.
- **Benefit**: The LLM can say: *"Retrieval articles show X, but ClickHouse trends show that sentiment for this topic is currently shifting negative."*

### Step G: Streaming Generation - SSE (Real-Time UX)
**Purpose**: Minimizing "Perceived Latency".
- **Process**: 
    - Uses FastAPI `StreamingResponse` and Server-Sent Events (SSE).
    - Instead of waiting 5 seconds for a full paragraph, the first token is displayed within **200-400ms**.
    - Tokens are pushed to the client in real-time as the LLM predicts them.

---

## 5. Traceability & Feedback Loop

### Step H: Interaction Logging (Audit Trail)
- **Traceability**: Every AI response logs the exact list of `retrieved_doc_ids` (Source IDs) in PostgreSQL.
- **Learning Loop**: When a user gives a "Thumbs Down", developers can query the database to see exactly which sources were used. This allows for **Negative Sampling** (identifying which articles cause hallucination or bad answers).

---

## Technical Stack Overview

| Stage | Tool/Model |
| :--- | :--- |
| **Embeddings** | `BAAI/bge-m3` (BAAI) |
| **Reranking** | `ms-marco-TinyBERT-L-2-v2` (CrossEncoder) |
| **Diversity** | Custom MMR Implementation |
| **Vector DB** | Qdrant |
| **Data Warehouse**| ClickHouse |
| **Token Control** | `tiktoken` (cl100k_base) |
| **LLM** | OpenAI `gpt-4` |

---

## Full Data Flow Visual

```mermaid
graph TD
    User((User)) -->|Query| API[RAG API]
    API -->|Prompt| LLM_Rewriter[LLM Rewriter]
    LLM_Rewriter -->|Standalone Query| API
    
    API -.->|Circuit Breaker Check| VDB{Qdrant Online?}
    VDB -->|No| CH_FB[ClickHouse Keyword Fallback]
    VDB -->|Yes| V_Search[Hybrid Vector Search]
    
    V_Search -->|Top 20| Rerank[Cross-Encoder Reranker]
    Rerank -->|Diversity Pass| MMR[MMR Filter]
    MMR -->|Top K| Parent_Fetch[Parent Doc Retrieval]
    
    Parent_Fetch -->|Context| Prompt_Build[Prompt Construction]
    Prompt_Build -->|Inject| CH_Trends[ClickHouse Trends]
    
    CH_Trends -->|Full Prompt| LLM_Stream[LLM Streaming]
    LLM_Stream -->|SSE Tokens| User
    
    LLM_Stream -->|Trace| Postgres[(Interaction DB)]
```