---
marp: true
theme: default
paginate: true
header: 'Enterprise RAG Retrieval Architecture'
footer: 'Hexagonal Architecture Data Flow'
---
# 🚀 The Enterprise RAG Retrieval Logic
### Step-by-Step Data Flow Analysis
This presentation covers the exact 9-step semantic retrieval and orchestration sequence used by the API to process complex user queries.
**Case Study Query**: *"What happened with Apple stock recently?"*
---
# 1️⃣ Step 1: Ingestion & Intent Routing
The front door of our architecture. Every request is intercepted by the **Agent Router** to prevent unnecessary Vector Database queries.
- **Component**: `agent_router_use_case.py`
- **Input Object**: `ChatRequest(query="What happened with Apple stock recently?", top_k=5)`
- **LLM Classification Prompt**: *"Is this a NEWS search or an ACCOUNT search?"*
- **Action**: The LLM analyzes the text and confidently outputs `NEWS`.
- **Output Routing**: The Router dynamically forwards the payload to the specialized `RagChatUseCase`.
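
A minimal sketch of how this routing might look, assuming an async `llm_port.complete()` method and the two use cases named above (all signatures here are illustrative, not lifted from the repo):

```python
from enum import Enum

class Intent(str, Enum):
    NEWS = "NEWS"
    ACCOUNT = "ACCOUNT"

CLASSIFY_PROMPT = (
    "Classify the user query as NEWS or ACCOUNT. Reply with one word.\n"
    "Query: {query}"
)

async def route(request, llm_port, rag_chat_use_case, account_use_case):
    # Ask the LLM for a one-word label, then dispatch to the matching use case.
    label = (await llm_port.complete(CLASSIFY_PROMPT.format(query=request.query))).strip().upper()
    if label == Intent.NEWS:
        return await rag_chat_use_case.execute(request)   # RAG pipeline, Steps 2-9
    return await account_use_case.execute(request)        # account lookup path
```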
---
# 2️⃣ Step 2: Semantic Caching Layer
Before spending LLM tokens or Cloud Compute, we check if this exact question has been asked and answered recently.
- **Component**: `redis_adapter.py`
- **Action**: `cache_port.generate_exact_hash()` deterministically calculates a SHA-256 hash representing the query string.
- **Cache Check**: Does the key exist in the Redis cluster?
- **Fast-Path**: If **Yes**, it returns the cached generation instantly, resulting in 0ms LLM time and $0 cost.
- **Deep-Path**: If **No**, the query proceeds down the expensive RAG pipeline.
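
A minimal sketch of the cache check, assuming a `redis.asyncio` client; the key prefix and TTL are illustrative:

```python
import hashlib
import json

def generate_exact_hash(query: str) -> str:
    # Deterministic SHA-256 over the normalized query string.
    return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

async def get_or_compute(redis, query: str, run_rag_pipeline, ttl_seconds: int = 3600):
    key = f"rag:answer:{generate_exact_hash(query)}"
    cached = await redis.get(key)               # fast path: no LLM call, no vector search
    if cached is not None:
        return json.loads(cached)
    answer = await run_rag_pipeline(query)      # deep path: Steps 3-9
    await redis.set(key, json.dumps(answer), ex=ttl_seconds)
    return answer
```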
---
# 3️⃣ Step 3: Self-Query Extraction
We translate the user's natural language into strict physical constraints and metadata filters for the database.
- **Component**: `rag_chat_use_case.py -> _extract_intents()`
- **Action**: The LLM parses the user text against available metadata schemas.
- **Execution Insight**: The LLM identifies the word *"recently"* and maps it to a physical timeframe.
- **LLM Output (JSON)**:
```json
{ "days_back": 3, "source": null }
```
- **Mapping**: `RagChatUseCase` builds a Qdrant `models.Filter` from this JSON, excluding stale documents before any vector search runs (see the sketch below).
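
A minimal sketch of that mapping, assuming payloads carry a `published_at` unix timestamp and an optional `source` field (both field names are assumptions):

```python
from datetime import datetime, timedelta, timezone
from qdrant_client import models

def build_filter(intents: dict) -> models.Filter | None:
    conditions = []
    if intents.get("days_back"):
        cutoff = datetime.now(timezone.utc) - timedelta(days=intents["days_back"])
        # Keep only documents published after the cutoff (stored as a unix timestamp).
        conditions.append(
            models.FieldCondition(key="published_at", range=models.Range(gte=cutoff.timestamp()))
        )
    if intents.get("source"):
        conditions.append(
            models.FieldCondition(key="source", match=models.MatchValue(value=intents["source"]))
        )
    return models.Filter(must=conditions) if conditions else None

# {"days_back": 3, "source": null} -> filter on published_at >= now - 3 days
```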
---
# 4️⃣ Step 4: Text Vectorization
We convert the query string into vector representations using the BGE-M3 embedding model.
- **Component**: `bge_embedder_adapter.py`
- **Action**: `encode_query()` passes the text through the embedding model.
- **Model Processing**: The text is encoded into both dense and sparse representations.
- **Output Architecture**:
- **Dense Array**: `[0.123, -0.456, 0.789, ... 1024 dimensions]`
- **Sparse Lexical**: `{"indices": [102, 451, ...], "values": [0.92, 0.44, ...]}`
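
A minimal sketch using the `FlagEmbedding` library's `BGEM3FlagModel`, which the adapter presumably wraps (the adapter's exact interface is an assumption):

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

def encode_query(text: str) -> dict:
    out = model.encode([text], return_dense=True, return_sparse=True)
    dense = out["dense_vecs"][0]             # 1024-dim dense vector
    lexical = out["lexical_weights"][0]      # {token_id: weight} sparse lexical map
    return {
        "dense": dense.tolist(),
        "sparse": {
            "indices": [int(token_id) for token_id in lexical],
            "values": [float(weight) for weight in lexical.values()],
        },
    }
```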
---
# 5️⃣ Step 5: Hybrid Vector Search
We execute a high-performance database search combining math and exact keyword matching.
- **Component**: `qdrant_adapter.py`
- **Action**: Sends `query_vectors` and the extracted `days_back=3` physical filter to Qdrant via `vector_store_port.search()`.
- **Database Processing**: Qdrant executes a **Reciprocal Rank Fusion (RRF)** query. It searches simultaneously for Semantic Meaning (Dense) and Exact Keyword Hits (Sparse).
- **Yield**: Returns the top 20 nearest neighbor `SearchResult` documents.
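
A minimal sketch of the RRF query via `qdrant-client`'s `query_points` API; the collection name and the `dense`/`sparse` named-vector labels are assumptions:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

def hybrid_search(dense_vector, sparse_vector, qdrant_filter, limit: int = 20):
    response = client.query_points(
        collection_name="news_articles",                      # assumed collection name
        prefetch=[
            models.Prefetch(query=dense_vector, using="dense", limit=limit),
            models.Prefetch(
                query=models.SparseVector(
                    indices=sparse_vector["indices"], values=sparse_vector["values"]
                ),
                using="sparse",
                limit=limit,
            ),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),   # Reciprocal Rank Fusion
        query_filter=qdrant_filter,                           # days_back constraint from Step 3
        limit=limit,
        with_payload=True,
    )
    return response.points
```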
---
# 6️⃣ Step 6: Temporal Bias Scoring
Preventing answers built on stale news by mathematically prioritizing fresh articles over old ones.
- **Component**: `rag_chat_use_case.py -> _build_context()`
- **Action**: Iterates over every returned document and examines its `published_at` timestamp.
- **Mathematical Decay**:
- `score_multiplier = max(0.5, 1.0 - (days_old / 60))`
  - The older the article, the lower its multiplier, down to a floor of 0.5 once it is 30 days old.
- **Output**: A freshly re-scored list where newer, slightly less-relevant articles can outrank old, highly-relevant articles.
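
A minimal sketch of the decay pass, assuming each result carries a timezone-aware ISO-8601 `published_at` payload field and a raw Qdrant `score`:

```python
from datetime import datetime, timezone

def apply_temporal_decay(results):
    now = datetime.now(timezone.utc)
    rescored = []
    for doc in results:
        published_at = datetime.fromisoformat(doc.payload["published_at"])
        days_old = (now - published_at).days
        # Linear decay: reaches the 0.5 floor once an article is 30 days old.
        multiplier = max(0.5, 1.0 - (days_old / 60))
        rescored.append((doc, doc.score * multiplier))
    # Re-sort so fresher articles can leapfrog older, higher-similarity ones.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```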
---
# 7️⃣ Step 7: Cross-Encoder Reranking
Applying a brute-force semantic check to correct for imprecise vector-similarity scores.
- **Component**: `bge_reranker_adapter.py`
- **Action**: Takes the top 20 decay-scored documents and pairs the query with each document's text, one pair per candidate.
- `[[query, doc1_text], [query, doc2_text], ...]`
- **Model Processing**: The FlagReranker cross-encoder scores the semantic relevance of each query-document pair.
- **Output**: Only the strict Top 5 (`top_k`) highest-scoring documents survive.
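
A minimal sketch with `FlagEmbedding`'s `FlagReranker`; the checkpoint name and the `text` payload key are assumptions:

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)   # checkpoint name assumed

def rerank(query: str, docs: list, top_k: int = 5) -> list:
    # One [query, document] pair per candidate; the cross-encoder scores each pair jointly.
    pairs = [[query, doc.payload["text"]] for doc in docs]
    scores = reranker.compute_score(pairs)
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```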
---
# 8️⃣ Step 8: Contextual Compression
Squashing massive strings to fit gracefully into limited LLM context windows.
- **Component**: `rag_chat_use_case.py -> _limit_context()`
- **Action**: Uses `tiktoken` to calculate the total length of the surviving Top 5 documents.
- **Compression Loop**: If the size exceeds 3000 tokens, it pipes overflowing documents individually to an LLM via `_compress_document()`.
- **Extraction**: The LLM digests 800 words and outputs only bulleted facts relevant to "Apple Stock".
- **Output**: A high-density, tightly packed `context_text` string.
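
A minimal sketch of the token-budget check and compression loop; the `cl100k_base` encoding, the `text` payload key, and the async `llm_port.complete()` helper are assumptions:

```python
import tiktoken

MAX_CONTEXT_TOKENS = 3000
encoding = tiktoken.get_encoding("cl100k_base")     # assumed tokenizer

async def limit_context(docs, query: str, llm_port) -> str:
    texts = [doc.payload["text"] for doc in docs]
    if sum(len(encoding.encode(text)) for text in texts) <= MAX_CONTEXT_TOKENS:
        return "\n\n".join(texts)                   # already fits: skip compression
    compressed = []
    for text in texts:
        prompt = (
            "Extract only the facts relevant to the question as bullet points.\n"
            f"Question: {query}\n\nDocument:\n{text}"
        )
        compressed.append(await llm_port.complete(prompt))   # hypothetical port method
    return "\n\n".join(compressed)
```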
---
# 9️⃣ Step 9: Final Answer Generation
The orchestrator fuses every pipeline output to deliver a grounded, source-backed answer.
- **Component**: `llm_port.py`
- **Action**: The packed `context_text`, the original `query`, and the user's `Chat History` are injected into a single prompt template.
- **Generation**: The LLM interprets the verified facts.
- *"Apple stock surged 4% after the latest earnings report..."*
- **Final Cleanup**: The new answer string is permanently logged into Postgres (`chat_history`) and cached into Redis (`cache`) before being returned via the API.
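
A minimal sketch of the final assembly and write-back, with every port and repository name illustrative rather than taken from the codebase:

```python
PROMPT_TEMPLATE = """You are a financial news assistant.
Answer the question using ONLY the context below. If the context is insufficient, say so.

Context:
{context}

Chat history:
{history}

Question: {query}
Answer:"""

async def generate_answer(query, context_text, history, llm_port, chat_repo, cache):
    prompt = PROMPT_TEMPLATE.format(context=context_text, history=history, query=query)
    answer = await llm_port.complete(prompt)              # hypothetical port method
    await chat_repo.save(query=query, answer=answer)      # persist to Postgres chat_history
    await cache.set_answer(query, answer)                 # write-through to Redis (Step 2)
    return answer
```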