Spaces:

Peterase
/

rag-api-node-1

Running

App Files Files Community

rag-api-node-1 / docs /RAG_RETRIEVAL_FLOW.md

Peterase

feat(rag): implement hybrid search with live sources and production-grade intent classification

a63c61f 16 days ago

preview code

raw

history blame contribute delete

8.29 kB

	# State-of-the-Art (SOTA) RAG Retrieval Data Flow

	This document details the end-to-end data flow of the News Pipeline RAG API, incorporating advanced patterns for accuracy, diversity, and production resilience.

	## 1. Pre-Processing & Infrastructure (The "Cold-Start" Layer)
	To ensure zero-latency during the initial user interaction, the system implements a preemptive resource loading strategy.

	### A. Async Pre-warming (Hidden Latency Absorption)
	- Challenge: Large Transformer models (like BGE-M3 and Cross-Encoders) typically take 5–15 seconds to load from disk to RAM/VRAM. Lazy-loading these on the first request creates an unacceptable user experience.
	- Process:
	- In `main.py`, the `@app.on_event("startup")` hook triggers a non-blocking `threading.Thread`.
	- This background thread immediately initializes `EmbedderService` and `RerankerService`.
	- By the time the web server is live and the user types their first query, the models are fully resident in memory, resulting in sub-second response times for the very first request.

	### B. Circuit Breaker: ClickHouse Fallback (Always-On Reliability)
	- Challenge: Vector databases like Qdrant can occasionally experience network partitions or downtime. In a naive RAG, this would crash the conversation.
	- Process:
	- The `VectorStore.search` method is wrapped in a robust `try-except` block.
	- If the Qdrant client connection fails or a timeout occurs, the Circuit Breaker trips.
	- The system automatically redirects the query to `fallback_keyword_search()` in ClickHouse.
	- Mechanism: It performs a rapid SQL-based keyword search on titles and content in the `sentiment_results` table. While less semantically accurate than vectors, it ensures the user receives actual relevant news articles instead of a "Service Unavailable" error.

	## 2. Request Phase (Conversational Logic)

	### Step A: Query Transformation (Contextual Synthesis)
	Purpose: Bridging the gap between human conversation and vector search requirements.
	- The Problem: Users often ask relative questions like "What about their stock?". Vector databases cannot resolve "their" without context.
	- Process:
	- The API retrieves the last 6 messages from PostgreSQL.
	- A specialized prompt instructs `GPT-4` to synthesize the conversation history and the new user query into a single Standalone Search Query.
	- If history is empty, the original query is used.
	- Example Trace:
	- History: `User: Tell me about Nvidia's revenue last year.`
	- New Query: `User: Did Intel do better?`
	- Synthesized Search Query: "Comparison of Intel and Nvidia's revenue for the last fiscal year"

	### Step B: Intent-Based Search (Hybrid & Recency)
	Purpose: Combining semantic depth with keyword precision and news freshness.

	#### 1. Hybrid Vector Synthesis
	- Dense Layer: Uses `BAAI/bge-m3` to produce a 1024-dimensional semantic embedding. This handles "vibe" and "concept" matching (e.g., matching "financial struggle" to "bankruptcy").
	- Sparse Layer: Prepares slots for keyword-specific vectors (e.g., Splade or BGE-M3 Sparse). This handles exact entities, ticker symbols (e.g., "NVDA"), or specific dates that dense embeddings might blur.

	#### 2. Temporal Decay (Recency Boosting)
	- Logic: News is a deteriorating asset. The system applies a Recency Multiplier during the retrieval collection phase.
	- Formula: `Score = Base_Similarity * (1.0 - (days_old / 60))`.
	- Constraint: The multiplier never drops below `0.5`, ensuring that very relevant historical news is still retrievable but newer coverage is naturally prioritized.
	- Example:
	- Article A (Identical match, 60 days old): `Final Score = 0.9 * 0.5 = 0.45`
	- Article B (Close match, today): `Final Score = 0.8 * 1.0 = 0.8`
	- Result: Article B is ranked higher despite slightly lower semantic similarity.

	## 3. Retrieval Refinement (The "Precision" Layer)

	### Step C: Cross-Encoder Reranking (Relevance Grading)
	Purpose: Moving from "Bi-Encoder" (fast but broad) to "Cross-Encoder" (slow but highly accurate).
	- The Problem: Dense embeddings (Bi-Encoders) are great at finding "similar" text but often struggle with fine-grained nuances or contradictory statements.
	- Process:
	- The system takes the Top 20 results from the broad search.
	- Each [Query, Chunk] pair is passed through the `CrossEncoder` model (`ms-marco-TinyBERT-L-2-v2`).
	- The model produces a raw relevance score. This is significantly more accurate than pure cosine similarity from the vector search.

	### Step D: Diversity Filtering - MMR (Information Density)
	Purpose: Preventing "Echo Chambers" or redundant context windows.
	- The Problem: Five news articles starting with the same AP wire sentence will fill the LLM context with redundant text.
	- Process:
	- Implemented Maximal Marginal Relevance (MMR).
	- Logic selects documents that have high relevance but low similarity to already selected documents.
	- Example:
	- Selection 1: A factual report of a merger.
	- Selection 2 (Rejected): Another factual report of the same merger.
	- Selection 2 (Accepted): A financial analyst's opinion on the same merger.

	### Step E: Parent Document Retrieval (Context Expansion)
	Purpose: Providing the "Full Picture" when a snippet isn't enough.
	- Process:
	- Small chunks (~500 chars) are indexed for surgical search accuracy.
	- If a chunk's rerank score is > 0.8, its unique `doc_id` is used to fetch the full parent article body from ClickHouse/Qdrant.
	- This allows the LLM to see the surrounding context that might have been lost in the chunking process.

	---

	## 4. Generation & Enrichment

	### Step F: ClickHouse Trend Fusion (External Intelligence)
	Purpose: Grounding the LLM in real-time metadata.
	- Process:
	- Parallel to the LLM call, the system queries the ClickHouse Data Warehouse.
	- It extracts trending entities and sentiment scores for the last 3 days relevant to the query.
	- This "Trend Knowledge" is injected into the system prompt.
	- Benefit: The LLM can say: "Retrieval articles show X, but ClickHouse trends show that sentiment for this topic is currently shifting negative."

	### Step G: Streaming Generation - SSE (Real-Time UX)
	Purpose: Minimizing "Perceived Latency".
	- Process:
	- Uses FastAPI `StreamingResponse` and Server-Sent Events (SSE).
	- Instead of waiting 5 seconds for a full paragraph, the first token is displayed within 200-400ms.
	- Tokens are pushed to the client in real-time as the LLM predicts them.

	---

	## 5. Traceability & Feedback Loop

	### Step H: Interaction Logging (Audit Trail)
	- Traceability: Every AI response logs the exact list of `retrieved_doc_ids` (Source IDs) in PostgreSQL.
	- Learning Loop: When a user gives a "Thumbs Down", developers can query the database to see exactly which sources were used. This allows for Negative Sampling (identifying which articles cause hallucination or bad answers).

	---

	## Technical Stack Overview

	\| Stage \| Tool/Model \|
	\| :--- \| :--- \|
	\| Embeddings \| `BAAI/bge-m3` (BAAI) \|
	\| Reranking \| `ms-marco-TinyBERT-L-2-v2` (CrossEncoder) \|
	\| Diversity \| Custom MMR Implementation \|
	\| Vector DB \| Qdrant \|
	\| Data Warehouse\| ClickHouse \|
	\| Token Control \| `tiktoken` (cl100k_base) \|
	\| LLM \| OpenAI `gpt-4` \|

	---

	## Full Data Flow Visual

	```mermaid
	graph TD
	User((User)) -->\|Query\| API[RAG API]
	API -->\|Prompt\| LLM_Rewriter[LLM Rewriter]
	LLM_Rewriter -->\|Standalone Query\| API

	API -.->\|Circuit Breaker Check\| VDB{Qdrant Online?}
	VDB -->\|No\| CH_FB[ClickHouse Keyword Fallback]
	VDB -->\|Yes\| V_Search[Hybrid Vector Search]

	V_Search -->\|Top 20\| Rerank[Cross-Encoder Reranker]
	Rerank -->\|Diversity Pass\| MMR[MMR Filter]
	MMR -->\|Top K\| Parent_Fetch[Parent Doc Retrieval]

	Parent_Fetch -->\|Context\| Prompt_Build[Prompt Construction]
	Prompt_Build -->\|Inject\| CH_Trends[ClickHouse Trends]

	CH_Trends -->\|Full Prompt\| LLM_Stream[LLM Streaming]
	LLM_Stream -->\|SSE Tokens\| User

	LLM_Stream -->\|Trace\| Postgres[(Interaction DB)]
	```

	# State-of-the-Art (SOTA) RAG Retrieval Data Flow

	This document details the end-to-end data flow of the News Pipeline RAG API, incorporating advanced patterns for accuracy, diversity, and production resilience.

	## 1. Pre-Processing & Infrastructure (The "Cold-Start" Layer)
	To ensure zero-latency during the initial user interaction, the system implements a preemptive resource loading strategy.

	### A. Async Pre-warming (Hidden Latency Absorption)
	- Challenge: Large Transformer models (like BGE-M3 and Cross-Encoders) typically take 5–15 seconds to load from disk to RAM/VRAM. Lazy-loading these on the first request creates an unacceptable user experience.
	- Process:
	- In `main.py`, the `@app.on_event("startup")` hook triggers a non-blocking `threading.Thread`.
	- This background thread immediately initializes `EmbedderService` and `RerankerService`.
	- By the time the web server is live and the user types their first query, the models are fully resident in memory, resulting in sub-second response times for the very first request.

	### B. Circuit Breaker: ClickHouse Fallback (Always-On Reliability)
	- Challenge: Vector databases like Qdrant can occasionally experience network partitions or downtime. In a naive RAG, this would crash the conversation.
	- Process:
	- The `VectorStore.search` method is wrapped in a robust `try-except` block.
	- If the Qdrant client connection fails or a timeout occurs, the Circuit Breaker trips.
	- The system automatically redirects the query to `fallback_keyword_search()` in ClickHouse.
	- Mechanism: It performs a rapid SQL-based keyword search on titles and content in the `sentiment_results` table. While less semantically accurate than vectors, it ensures the user receives actual relevant news articles instead of a "Service Unavailable" error.

	## 2. Request Phase (Conversational Logic)

	### Step A: Query Transformation (Contextual Synthesis)
	Purpose: Bridging the gap between human conversation and vector search requirements.
	- The Problem: Users often ask relative questions like "What about their stock?". Vector databases cannot resolve "their" without context.
	- Process:
	- The API retrieves the last 6 messages from PostgreSQL.
	- A specialized prompt instructs `GPT-4` to synthesize the conversation history and the new user query into a single Standalone Search Query.
	- If history is empty, the original query is used.
	- Example Trace:
	- History: `User: Tell me about Nvidia's revenue last year.`
	- New Query: `User: Did Intel do better?`
	- Synthesized Search Query: "Comparison of Intel and Nvidia's revenue for the last fiscal year"

	### Step B: Intent-Based Search (Hybrid & Recency)
	Purpose: Combining semantic depth with keyword precision and news freshness.

	#### 1. Hybrid Vector Synthesis
	- Dense Layer: Uses `BAAI/bge-m3` to produce a 1024-dimensional semantic embedding. This handles "vibe" and "concept" matching (e.g., matching "financial struggle" to "bankruptcy").
	- Sparse Layer: Prepares slots for keyword-specific vectors (e.g., Splade or BGE-M3 Sparse). This handles exact entities, ticker symbols (e.g., "NVDA"), or specific dates that dense embeddings might blur.

	#### 2. Temporal Decay (Recency Boosting)
	- Logic: News is a deteriorating asset. The system applies a Recency Multiplier during the retrieval collection phase.
	- Formula: `Score = Base_Similarity * (1.0 - (days_old / 60))`.
	- Constraint: The multiplier never drops below `0.5`, ensuring that very relevant historical news is still retrievable but newer coverage is naturally prioritized.
	- Example:
	- Article A (Identical match, 60 days old): `Final Score = 0.9 * 0.5 = 0.45`
	- Article B (Close match, today): `Final Score = 0.8 * 1.0 = 0.8`
	- Result: Article B is ranked higher despite slightly lower semantic similarity.

	## 3. Retrieval Refinement (The "Precision" Layer)

	### Step C: Cross-Encoder Reranking (Relevance Grading)
	Purpose: Moving from "Bi-Encoder" (fast but broad) to "Cross-Encoder" (slow but highly accurate).
	- The Problem: Dense embeddings (Bi-Encoders) are great at finding "similar" text but often struggle with fine-grained nuances or contradictory statements.
	- Process:
	- The system takes the Top 20 results from the broad search.
	- Each [Query, Chunk] pair is passed through the `CrossEncoder` model (`ms-marco-TinyBERT-L-2-v2`).
	- The model produces a raw relevance score. This is significantly more accurate than pure cosine similarity from the vector search.

	### Step D: Diversity Filtering - MMR (Information Density)
	Purpose: Preventing "Echo Chambers" or redundant context windows.
	- The Problem: Five news articles starting with the same AP wire sentence will fill the LLM context with redundant text.
	- Process:
	- Implemented Maximal Marginal Relevance (MMR).
	- Logic selects documents that have high relevance but low similarity to already selected documents.
	- Example:
	- Selection 1: A factual report of a merger.
	- Selection 2 (Rejected): Another factual report of the same merger.
	- Selection 2 (Accepted): A financial analyst's opinion on the same merger.

	### Step E: Parent Document Retrieval (Context Expansion)
	Purpose: Providing the "Full Picture" when a snippet isn't enough.
	- Process:
	- Small chunks (~500 chars) are indexed for surgical search accuracy.
	- If a chunk's rerank score is > 0.8, its unique `doc_id` is used to fetch the full parent article body from ClickHouse/Qdrant.
	- This allows the LLM to see the surrounding context that might have been lost in the chunking process.

	---

	## 4. Generation & Enrichment

	### Step F: ClickHouse Trend Fusion (External Intelligence)
	Purpose: Grounding the LLM in real-time metadata.
	- Process:
	- Parallel to the LLM call, the system queries the ClickHouse Data Warehouse.
	- It extracts trending entities and sentiment scores for the last 3 days relevant to the query.
	- This "Trend Knowledge" is injected into the system prompt.
	- Benefit: The LLM can say: "Retrieval articles show X, but ClickHouse trends show that sentiment for this topic is currently shifting negative."

	### Step G: Streaming Generation - SSE (Real-Time UX)
	Purpose: Minimizing "Perceived Latency".
	- Process:
	- Uses FastAPI `StreamingResponse` and Server-Sent Events (SSE).
	- Instead of waiting 5 seconds for a full paragraph, the first token is displayed within 200-400ms.
	- Tokens are pushed to the client in real-time as the LLM predicts them.

	---

	## 5. Traceability & Feedback Loop

	### Step H: Interaction Logging (Audit Trail)
	- Traceability: Every AI response logs the exact list of `retrieved_doc_ids` (Source IDs) in PostgreSQL.
	- Learning Loop: When a user gives a "Thumbs Down", developers can query the database to see exactly which sources were used. This allows for Negative Sampling (identifying which articles cause hallucination or bad answers).

	---

	## Technical Stack Overview

	\| Stage \| Tool/Model \|
	\| :--- \| :--- \|
	\| Embeddings \| `BAAI/bge-m3` (BAAI) \|
	\| Reranking \| `ms-marco-TinyBERT-L-2-v2` (CrossEncoder) \|
	\| Diversity \| Custom MMR Implementation \|
	\| Vector DB \| Qdrant \|
	\| Data Warehouse\| ClickHouse \|
	\| Token Control \| `tiktoken` (cl100k_base) \|
	\| LLM \| OpenAI `gpt-4` \|

	---

	## Full Data Flow Visual

	```mermaid
	graph TD
	User((User)) -->\|Query\| API[RAG API]
	API -->\|Prompt\| LLM_Rewriter[LLM Rewriter]
	LLM_Rewriter -->\|Standalone Query\| API

	API -.->\|Circuit Breaker Check\| VDB{Qdrant Online?}
	VDB -->\|No\| CH_FB[ClickHouse Keyword Fallback]
	VDB -->\|Yes\| V_Search[Hybrid Vector Search]

	V_Search -->\|Top 20\| Rerank[Cross-Encoder Reranker]
	Rerank -->\|Diversity Pass\| MMR[MMR Filter]
	MMR -->\|Top K\| Parent_Fetch[Parent Doc Retrieval]

	Parent_Fetch -->\|Context\| Prompt_Build[Prompt Construction]
	Prompt_Build -->\|Inject\| CH_Trends[ClickHouse Trends]

	CH_Trends -->\|Full Prompt\| LLM_Stream[LLM Streaming]
	LLM_Stream -->\|SSE Tokens\| User

	LLM_Stream -->\|Trace\| Postgres[(Interaction DB)]
	```