---
marp: true
theme: default
paginate: true
header: Enterprise RAG Retrieval Architecture
footer: Hexagonal Architecture Data Flow
---
# 🚀 The Enterprise RAG Retrieval Logic

## Step-by-Step Data Flow Analysis
This presentation covers the exact 9-step semantic retrieval and orchestration sequence used by the API to process complex user queries.
**Case Study Query:** *"What happened with Apple stock recently?"*
---

## 1️⃣ Step 1: Ingestion & Intent Routing
The front door of our architecture. Every request is intercepted by the Agent Router to prevent unnecessary Vector Database queries.
- **Component:** `agent_router_use_case.py`
- **Input Object:** `ChatRequest(query="What happened with Apple stock recently?", top_k=5)`
- **LLM Classification Prompt:** *"Is this a NEWS search or an ACCOUNT search?"*
- **Action:** The LLM analyzes the text and confidently outputs `NEWS`.
- **Output Routing:** The Router dynamically forwards the payload to the specialized `RagChatUseCase` (sketched below).
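A minimal sketch of this routing step, assuming a hypothetical `llm.complete()` helper; the real `agent_router_use_case.py` interface may differ:

```python
from dataclasses import dataclass

@dataclass
class ChatRequest:
    query: str
    top_k: int = 5

def route(request: ChatRequest, llm) -> str:
    # Classify the query before any Vector Database work happens.
    prompt = (
        "Is this a NEWS search or an ACCOUNT search? Answer with one word.\n\n"
        f"Query: {request.query}"
    )
    label = llm.complete(prompt).strip().upper()
    # NEWS queries go to the specialized RAG pipeline; everything else is routed away.
    return "RagChatUseCase" if label == "NEWS" else "AccountUseCase"
```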
---

## 2️⃣ Step 2: Semantic Caching Layer
Before spending LLM tokens or Cloud Compute, we check if this exact question has been asked and answered recently.
- **Component:** `redis_adapter.py`
- **Action:** `cache_port.generate_exact_hash()` deterministically calculates a SHA-256 hash representing the query string.
- **Cache Check:** Does the key exist in the Redis cluster?
- **Fast-Path:** If yes, the cached generation is returned instantly: 0 ms LLM time and $0 cost.
- **Deep-Path:** If no, the query proceeds down the expensive RAG pipeline (see the sketch below).
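A minimal sketch of the fast-path check, assuming a local Redis instance and an illustrative `rag:answer:` key prefix (not confirmed from `redis_adapter.py`):

```python
import hashlib

import redis

r = redis.Redis(host="localhost", port=6379)  # connection details assumed

def generate_exact_hash(query: str) -> str:
    # Deterministic SHA-256 key for the exact query string.
    return hashlib.sha256(query.encode("utf-8")).hexdigest()

def cached_answer(query: str) -> str | None:
    # Fast-path on a hit (0 ms LLM time), deep-path on a miss.
    raw = r.get(f"rag:answer:{generate_exact_hash(query)}")
    return raw.decode("utf-8") if raw is not None else None
```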
---

## 3️⃣ Step 3: Self-Query Extraction
We translate the user's natural language into strict physical constraints and metadata filters for the database.
- **Component:** `rag_chat_use_case.py` -> `_extract_intents()`
- **Action:** The LLM parses the user text against the available metadata schemas.
- **Execution Insight:** The LLM identifies the word "recently" and maps it to a physical timeframe.
- **LLM Output (JSON):** `{ "days_back": 3, "source": null }`
- **Mapping:** `RagChatUseCase` creates a Qdrant `models.Filter` from this JSON (sketched below), excluding old documents before any vector math occurs.
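A sketch of that JSON-to-filter mapping using the `qdrant-client` models, assuming `published_at` is stored as a unix timestamp in the document payload:

```python
from datetime import datetime, timedelta, timezone

from qdrant_client import models

def filter_from_intents(intents: dict) -> models.Filter | None:
    # intents is the LLM output, e.g. {"days_back": 3, "source": None}
    conditions = []
    if intents.get("days_back") is not None:
        cutoff = datetime.now(timezone.utc) - timedelta(days=intents["days_back"])
        conditions.append(models.FieldCondition(
            key="published_at",
            range=models.Range(gte=cutoff.timestamp()),
        ))
    if intents.get("source"):
        conditions.append(models.FieldCondition(
            key="source",
            match=models.MatchValue(value=intents["source"]),
        ))
    return models.Filter(must=conditions) if conditions else None
```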
---

## 4️⃣ Step 4: Text Vectorization
We convert the query string into a mathematical representation using the massive BGE-M3 model.
- **Component:** `bge_embedder_adapter.py`
- **Action:** `encode_query()` passes the text into the embedded ML model.
- **Model Processing:** The text is tokenized and encoded into both Dense and Sparse representations.
- **Output Architecture:**
  - **Dense Array:** `[0.123, -0.456, 0.789, ...]` (1024 dimensions)
  - **Sparse Lexical:** `{"indices": [102, 451, ...], "values": [0.92, 0.44, ...]}`
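A sketch of what `encode_query()` likely wraps, using the public `FlagEmbedding` API for BGE-M3 (the adapter's exact wiring is an assumption):

```python
from FlagEmbedding import BGEM3FlagModel

# One-time model load; fp16 assumed to cut GPU memory.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

def encode_query(text: str) -> dict:
    out = model.encode([text], return_dense=True, return_sparse=True)
    return {
        "dense": out["dense_vecs"][0],        # 1024-dimensional float array
        "sparse": out["lexical_weights"][0],  # {token_id: weight} lexical map
    }
```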
---

## 5️⃣ Step 5: Hybrid Vector Search
We execute a high-performance database search combining math and exact keyword matching.
- **Component:** `qdrant_adapter.py`
- **Action:** Sends the `query_vectors` and the extracted `days_back=3` physical filter to Qdrant via `vector_store_port.search()`.
- **Database Processing:** Qdrant executes a Reciprocal Rank Fusion (RRF) query, searching simultaneously for Semantic Meaning (Dense) and Exact Keyword Hits (Sparse), as sketched below.
- **Yield:** Returns the top 20 nearest-neighbor `SearchResult` documents.
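A sketch of the RRF query via `qdrant-client`'s Query API; the collection and named-vector labels (`news`, `dense`, `sparse`) are assumptions:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # endpoint assumed

def hybrid_search(dense_vec, lexical_weights, query_filter, limit=20):
    # Two prefetch branches (semantic + lexical) fused with Reciprocal Rank Fusion.
    sparse = models.SparseVector(
        indices=[int(i) for i in lexical_weights],
        values=list(lexical_weights.values()),
    )
    return client.query_points(
        collection_name="news",
        prefetch=[
            models.Prefetch(query=list(dense_vec), using="dense", limit=limit),
            models.Prefetch(query=sparse, using="sparse", limit=limit),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        query_filter=query_filter,  # e.g. the days_back=3 constraint from Step 3
        limit=limit,
    ).points
```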
---

## 6️⃣ Step 6: Temporal Bias Scoring
Preventing historical hallucination by mathematically prioritizing fresh news over old news.
- **Component:** `rag_chat_use_case.py` -> `_build_context()`
- **Action:** Iterates over every returned document and examines its `published_at` timestamp.
- **Mathematical Decay:** `score_multiplier = max(0.5, 1.0 - (days_old / 60))`; the older the article, the lower its multiplier goes (see the sketch below).
- **Output:** A freshly re-scored list where newer, slightly less-relevant articles can outrank old, highly-relevant ones.
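The decay formula above is simple enough to show directly; this sketch assumes each `SearchResult` exposes a `score` and a timezone-aware `published_at`:

```python
from datetime import datetime, timezone

def apply_temporal_decay(results):
    now = datetime.now(timezone.utc)
    for doc in results:
        days_old = (now - doc.published_at).days
        # Linear decay floored at 0.5: old articles are dampened, never erased.
        doc.score *= max(0.5, 1.0 - (days_old / 60))
    # Re-rank so fresh, slightly less-relevant articles can move up.
    return sorted(results, key=lambda d: d.score, reverse=True)
```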
---

## 7️⃣ Step 7: Cross-Encoder Reranking

Applying a brute-force semantic check to eliminate false matches that vector distance alone lets through.
- **Component:** `bge_reranker_adapter.py`
- **Action:** Takes the top 20 decayed documents and physically pairs the Query against each Document's text: `[[query, doc1_text], [query, doc2_text], ...]`
- **Model Processing:** The HuggingFace FlagReranker calculates the exact semantic overlap of each pair.
- **Output:** Only the strict Top 5 (`top_k`) highest-scoring documents survive (see the sketch below).
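A sketch of the reranking pass via `FlagEmbedding`'s `FlagReranker`; the exact checkpoint is an assumption:

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)  # checkpoint assumed

def rerank(query: str, docs: list, top_k: int = 5) -> list:
    # Score every [query, document] pair with the cross-encoder.
    pairs = [[query, doc.text] for doc in docs]
    scores = reranker.compute_score(pairs)
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    # Only the strict Top 5 survive.
    return [doc for _, doc in ranked[:top_k]]
```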
---

## 8️⃣ Step 8: Contextual Compression
Squashing massive strings to fit gracefully into limited LLM context windows.
- **Component:** `rag_chat_use_case.py` -> `_limit_context()`
- **Action:** Uses `tiktoken` to calculate the total token count of the surviving Top 5 documents.
- **Compression Loop:** If the size exceeds 3000 tokens, overflowing documents are piped individually to an LLM via `_compress_document()` (sketched below).
- **Extraction:** The LLM digests ~800 words and outputs only the bulleted facts relevant to "Apple Stock".
- **Output:** A high-density, tightly packed `context_text` string.
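A sketch of the token-budget loop with `tiktoken`; the encoding name and the `compress_document` callable are assumptions standing in for `_compress_document()`:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding choice assumed
TOKEN_BUDGET = 3000

def limit_context(docs, compress_document) -> str:
    parts, used = [], 0
    for doc in docs:
        text = doc.text
        n = len(enc.encode(text))
        if used + n > TOKEN_BUDGET:
            # Overflowing document: ask the LLM for bulleted facts only.
            text = compress_document(text)
            n = len(enc.encode(text))
        parts.append(text)
        used += n
    return "\n\n".join(parts)  # the high-density context_text
```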
---

## 9️⃣ Step 9: Final Generation

The Orchestrator fuses the outputs of every prior stage to deliver an accurate, grounded answer.
- **Component:** `llm_port.py`
- **Action:** The packed `context_text`, the original `query`, and the user's Chat History are injected into a single Prompt Template.
- **Generation:** The LLM interprets the verified facts:
  - *"Apple stock surged 4% after the latest earnings report..."*
- **Final Cleanup:** The new answer string is permanently logged into Postgres (`chat_history`) and cached into Redis before being returned via the API (see the sketch below).
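A closing sketch of the assembly and persistence step; the template text and the `db`/`cache` helpers are illustrative, not the actual `llm_port.py` API:

```python
PROMPT_TEMPLATE = """You are a financial news assistant.
Answer using ONLY the verified context below.

Context:
{context}

Chat history:
{history}

Question: {query}
"""

def generate_answer(llm, db, cache, context_text: str, history: str, query: str) -> str:
    prompt = PROMPT_TEMPLATE.format(context=context_text, history=history, query=query)
    answer = llm.complete(prompt)          # hypothetical LLM call
    db.insert_chat_history(query, answer)  # logged to Postgres (chat_history)
    cache.set(query, answer)               # cached in Redis for the Step 2 fast-path
    return answer
```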