---
marp: true
theme: default
paginate: true
header: 'Enterprise RAG Retrieval Architecture'
footer: 'Hexagonal Architecture Data Flow'
---
# 🚀 The Enterprise RAG Retrieval Logic

### Step-by-Step Data Flow Analysis

This presentation covers the exact 9-step semantic retrieval and orchestration sequence used by the API to process complex user queries.

**Case Study Query**: *"What happened with Apple stock recently?"*

---
# 1️⃣ Step 1: Ingestion & Intent Routing

The front door of our architecture. Every request is intercepted by the **Agent Router** to prevent unnecessary Vector Database queries.

- **Component**: `agent_router_use_case.py`
- **Input Object**: `ChatRequest(query="What happened with Apple stock recently?", top_k=5)`
- **LLM Classification Prompt**: *"Is this a NEWS search or an ACCOUNT search?"*
- **Action**: The LLM analyzes the text and confidently outputs `NEWS`.
- **Output Routing**: The Router dynamically forwards the payload to the specialized `RagChatUseCase`, as sketched below.
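
A minimal sketch of this routing step, assuming a synchronous `llm` callable and plain handler functions (the repo's actual port interfaces may differ):

```python
ROUTER_PROMPT = (
    "Is this a NEWS search or an ACCOUNT search? "
    "Reply with exactly one word: NEWS or ACCOUNT.\n"
    "Query: {query}"
)

def route(query: str, llm, rag_chat_use_case, account_use_case):
    """Classify the query, then forward it to the matching use case."""
    label = llm(ROUTER_PROMPT.format(query=query)).strip().upper()
    # Default to the RAG pipeline when the classifier output is unexpected.
    handler = account_use_case if label == "ACCOUNT" else rag_chat_use_case
    return handler(query)
```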
---

# 2️⃣ Step 2: Semantic Caching Layer

Before spending LLM tokens or cloud compute, we check whether this exact question has been asked and answered recently.

- **Component**: `redis_adapter.py`
- **Action**: `cache_port.generate_exact_hash()` deterministically computes a SHA-256 hash of the query string.
- **Cache Check**: Does the key exist in the Redis cluster?
- **Fast-Path**: If **Yes**, the cached generation is returned instantly: 0 ms of LLM time and $0 of cost.
- **Deep-Path**: If **No**, the query proceeds down the expensive RAG pipeline.
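
A minimal sketch of the fast-path/deep-path decision using `redis-py`; the key prefix, connection details, and one-hour TTL are assumptions:

```python
import hashlib

import redis

r = redis.Redis(host="localhost", port=6379)  # assumed connection details

def generate_exact_hash(query: str) -> str:
    # Deterministic: the same query string always maps to the same key.
    return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

def get_or_compute(query: str, run_rag_pipeline) -> str:
    key = f"rag:answer:{generate_exact_hash(query)}"
    cached = r.get(key)
    if cached is not None:            # fast path: 0 ms LLM time, $0 cost
        return cached.decode("utf-8")
    answer = run_rag_pipeline(query)  # deep path: full RAG pipeline
    r.set(key, answer, ex=3600)       # assumed 1 h TTL
    return answer
```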
---

# 3️⃣ Step 3: Self-Query Extraction

We translate the user's natural language into strict physical constraints and metadata filters for the database.

- **Component**: `rag_chat_use_case.py -> _extract_intents()`
- **Action**: The LLM parses the user text against the available metadata schemas.
- **Execution Insight**: The LLM identifies the word *"recently"* and maps it to a physical timeframe.
- **LLM Output (JSON)**:

  ```json
  { "days_back": 3, "source": null }
  ```

- **Mapping**: `RagChatUseCase` builds a Qdrant `models.Filter` from this JSON, excluding old documents before any vector math runs.
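
A sketch of that mapping using the `qdrant-client` models; the payload field names `published_at` and `source` are assumptions:

```python
from datetime import datetime, timedelta, timezone

from qdrant_client import models

def build_filter(days_back: int, source: str | None) -> models.Filter:
    cutoff = datetime.now(timezone.utc) - timedelta(days=days_back)
    conditions = [
        # Keep only documents published inside the extracted timeframe.
        models.FieldCondition(
            key="published_at",
            range=models.DatetimeRange(gte=cutoff),
        )
    ]
    if source is not None:
        conditions.append(
            models.FieldCondition(key="source", match=models.MatchValue(value=source))
        )
    return models.Filter(must=conditions)

time_filter = build_filter(days_back=3, source=None)
```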
---

# 4️⃣ Step 4: Text Vectorization

We convert the query string into a mathematical representation using the BGE-M3 model.

- **Component**: `bge_embedder_adapter.py`
- **Action**: `encode_query()` passes the text through the in-process ML model.
- **Model Processing**: The text is encoded into both Dense and Sparse representations.
- **Output Architecture**:
  - **Dense Array**: `[0.123, -0.456, 0.789, ... 1024 dimensions]`
  - **Sparse Lexical**: `{"indices": [102, 451, ...], "values": [0.92, 0.44, ...]}`
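
A minimal sketch using the `FlagEmbedding` package, one common way to run BGE-M3 (the adapter's actual wrapping may differ):

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

out = model.encode(
    ["What happened with Apple stock recently?"],
    return_dense=True,
    return_sparse=True,
)
dense_vector = out["dense_vecs"][0]          # 1024-dim float array
lexical_weights = out["lexical_weights"][0]  # {token_id: weight} sparse map
```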
---

# 5️⃣ Step 5: Hybrid Vector Search

We execute a high-performance database search that combines semantic similarity with exact keyword matching.

- **Component**: `qdrant_adapter.py`
- **Action**: Sends the `query_vectors` and the extracted `days_back=3` physical filter to Qdrant via `vector_store_port.search()`.
- **Database Processing**: Qdrant executes a **Reciprocal Rank Fusion (RRF)** query, searching simultaneously for semantic meaning (Dense) and exact keyword hits (Sparse).
- **Yield**: Returns the top 20 nearest-neighbor `SearchResult` documents.
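
A sketch of the fused query via `qdrant-client`'s `query_points` API, reusing `dense_vector` and `lexical_weights` from the Step 4 sketch; the collection name and named-vector labels (`news`, `dense`, `sparse`) are assumptions:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed endpoint

# Convert BGE-M3 lexical weights into Qdrant's sparse format.
indices = [int(token_id) for token_id in lexical_weights]
values = [float(w) for w in lexical_weights.values()]

hits = client.query_points(
    collection_name="news",
    prefetch=[
        # Branch 1: dense semantic similarity.
        models.Prefetch(query=dense_vector.tolist(), using="dense", limit=20),
        # Branch 2: sparse exact-keyword matching.
        models.Prefetch(
            query=models.SparseVector(indices=indices, values=values),
            using="sparse",
            limit=20,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # fuse both rankings
    query_filter=time_filter,  # the days_back=3 filter from Step 3
    limit=20,
).points
```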
---

# 6️⃣ Step 6: Temporal Bias Scoring

Preventing historical hallucination by mathematically prioritizing fresh news over old news.

- **Component**: `rag_chat_use_case.py -> _build_context()`
- **Action**: Iterates over every returned document and examines its `published_at` timestamp.
- **Mathematical Decay**:
  - `score_multiplier = max(0.5, 1.0 - (days_old / 60))`
  - The older the article, the lower its multiplier goes.
- **Output**: A freshly re-scored list where newer, slightly less-relevant articles can outrank old, highly-relevant articles.
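
The decay as a small sketch; the formula comes from the slide, while the result attributes (`.score`, `.published_at`) are assumptions:

```python
from datetime import datetime, timezone

def score_multiplier(published_at: datetime) -> float:
    days_old = (datetime.now(timezone.utc) - published_at).days
    # Linear decay with a 0.5 floor: old articles are dampened, never erased.
    return max(0.5, 1.0 - days_old / 60)

def apply_temporal_bias(results: list) -> list:
    # Each result is assumed to expose .score and .published_at.
    for r in results:
        r.score *= score_multiplier(r.published_at)
    return sorted(results, key=lambda r: r.score, reverse=True)
```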
---

# 7️⃣ Step 7: Cross-Encoder Reranking

Applying a brute-force semantic check to correct the approximate scores produced by the vector search.

- **Component**: `bge_reranker_adapter.py`
- **Action**: Takes the top 20 decayed documents and pairs the query against each document's full text:
  - `[[query, doc1_text], [query, doc2_text], ...]`
- **Model Processing**: The FlagReranker cross-encoder (loaded from HuggingFace) scores each query-document pair directly.
- **Output**: Only the strict Top 5 (`top_k`) highest-scoring documents survive.
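
A sketch with `FlagEmbedding`'s `FlagReranker`; the checkpoint name is an assumption:

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)  # assumed checkpoint

def rerank(query: str, docs: list[str], top_k: int = 5) -> list[str]:
    pairs = [[query, doc] for doc in docs]  # brute-force query/doc pairing
    scores = reranker.compute_score(pairs, normalize=True)
    ranked = sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]  # strict top_k survivors
```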
---

# 8️⃣ Step 8: Contextual Compression

Squashing massive strings so they fit gracefully into limited LLM context windows.

- **Component**: `rag_chat_use_case.py -> _limit_context()`
- **Action**: Uses `tiktoken` to count the total tokens across the surviving Top 5 documents.
- **Compression Loop**: If the total exceeds 3000 tokens, overflowing documents are piped individually to an LLM via `_compress_document()`.
- **Extraction**: The LLM digests 800 words and outputs only the bulleted facts relevant to "Apple stock".
- **Output**: A high-density, tightly packed `context_text` string.
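
A sketch of the token-budget check with `tiktoken`; the 3000-token budget mirrors the slide, while the encoding name, per-document overflow threshold, and `compress_document` callable are assumptions:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding

def limit_context(docs: list[str], compress_document, budget: int = 3000) -> str:
    total = sum(len(enc.encode(d)) for d in docs)
    if total <= budget:
        return "\n\n".join(docs)  # everything fits untouched
    # Over budget: compress each overflowing document down to bulleted facts.
    per_doc_budget = budget // len(docs)
    return "\n\n".join(
        compress_document(d) if len(enc.encode(d)) > per_doc_budget else d
        for d in docs
    )
```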
---

# 9️⃣ Step 9: Final Generation

The Orchestrator fuses all pipeline outputs to deliver a grounded, hallucination-resistant answer.

- **Component**: `llm_port.py`
- **Action**: The packed `context_text`, the original `query`, and the user's `Chat History` are injected into a single prompt template.
- **Generation**: The LLM interprets the verified facts:
  - *"Apple stock surged 4% after the latest earnings report..."*
- **Final Cleanup**: The new answer is permanently logged to Postgres (`chat_history`) and cached in Redis (`cache`) before being returned via the API.
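
A minimal sketch of the final fusion step; the template wording and the `llm` callable are assumptions:

```python
PROMPT_TEMPLATE = """You are a financial news assistant.
Answer ONLY from the context below. If the context is insufficient, say so.

Context:
{context}

Chat history:
{history}

Question: {query}
Answer:"""

def generate_answer(llm, context_text: str, history: str, query: str) -> str:
    prompt = PROMPT_TEMPLATE.format(context=context_text, history=history, query=query)
    answer = llm(prompt)
    # Persistence (Postgres chat_history) and caching (Redis) would follow here.
    return answer
```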