---
marp: true
theme: default
paginate: true
header: Enterprise RAG Retrieval Architecture
footer: Hexagonal Architecture Data Flow
---

🚀 The Enterprise RAG Retrieval Logic

Step-by-Step Data Flow Analysis

This presentation covers the exact 9-step semantic retrieval and orchestration sequence used by the API to process complex user queries.

Case Study Query: "What happened with Apple stock recently?"


1️⃣ Step 1: Ingestion & Intent Routing

The front door of our architecture. Every request is intercepted by the Agent Router to prevent unnecessary Vector Database queries.

  • Component: agent_router_use_case.py
  • Input Object: ChatRequest(query="What happened with Apple stock recently?", top_k=5)
  • LLM Classification Prompt: "Is this a NEWS search or an ACCOUNT search?"
  • Action: The LLM analyzes the text and confidently outputs NEWS.
  • Output Routing: The Router dynamically forwards the payload to the specialized RagChatUseCase.
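
To make the routing concrete, here is a minimal sketch of the classify-and-dispatch logic. The `ChatRequest` shape and the `llm_port` / use-case interfaces are illustrative assumptions, not the project's actual API.

```python
# Minimal sketch of the intent-routing step (agent_router_use_case.py).
# The ChatRequest fields and the llm_port / use-case interfaces are assumptions.
from dataclasses import dataclass

@dataclass
class ChatRequest:
    query: str
    top_k: int = 5

CLASSIFICATION_PROMPT = (
    "Is this a NEWS search or an ACCOUNT search? "
    "Answer with exactly one word.\n\nQuery: {query}"
)

def route(request: ChatRequest, llm_port, rag_chat_use_case, account_use_case):
    """Classify the query with the LLM, then forward to the matching use case."""
    label = llm_port.complete(CLASSIFICATION_PROMPT.format(query=request.query))
    if label.strip().upper() == "NEWS":
        # e.g. "What happened with Apple stock recently?" -> RagChatUseCase
        return rag_chat_use_case.execute(request)
    return account_use_case.execute(request)
```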

2️⃣ Step 2: Semantic Caching Layer

Before spending LLM tokens or Cloud Compute, we check if this exact question has been asked and answered recently.

  • Component: redis_adapter.py
  • Action: cache_port.generate_exact_hash() deterministically calculates a SHA-256 hash representing the query string.
  • Cache Check: Does the key exist in the Redis cluster?
  • Fast-Path: If Yes, it returns the cached generation instantly, resulting in 0ms LLM time and $0 cost.
  • Deep-Path: If No, the query proceeds down the expensive RAG pipeline.
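
A hedged sketch of the exact-match cache layer, assuming the standard redis-py client; the key prefix and TTL are illustrative, and `generate_exact_hash()` simply mirrors the behaviour described above.

```python
# Sketch of the Redis exact-match cache (redis_adapter.py). Key naming and TTL
# are assumptions for illustration.
import hashlib
import redis  # redis-py

client = redis.Redis(host="localhost", port=6379, decode_responses=True)

def generate_exact_hash(query: str) -> str:
    # Deterministic SHA-256 hash of the normalised query string.
    return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

def get_cached_answer(query: str) -> str | None:
    # Fast-path: a hit costs no LLM tokens and no vector search.
    return client.get(f"rag:answer:{generate_exact_hash(query)}")

def cache_answer(query: str, answer: str, ttl_seconds: int = 3600) -> None:
    client.set(f"rag:answer:{generate_exact_hash(query)}", answer, ex=ttl_seconds)
```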

3️⃣ Step 3: Self-Query Extraction

We translate the user's natural language into strict physical constraints and metadata filters for the database.

  • Component: rag_chat_use_case.py -> _extract_intents()
  • Action: The LLM parses the user text against available metadata schemas.
  • Execution Insight: The LLM identifies the word "recently" and maps it to a physical timeframe.
  • LLM Output (JSON):
    { "days_back": 3, "source": null }
    
  • Mapping: RagChatUseCase creates a Qdrant models.Filter from this JSON, so stale documents are excluded before any vector math runs (see the sketch below).
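
A minimal sketch of turning that JSON into a Qdrant filter, assuming `published_at` is stored as a unix-timestamp payload field; the field names and JSON schema here are illustrative.

```python
# Sketch of building a Qdrant filter from the self-query JSON.
# Assumes published_at is a unix timestamp in the point payload.
import json
import time
from qdrant_client import models

def build_filter(llm_json: str) -> models.Filter | None:
    intents = json.loads(llm_json)  # e.g. {"days_back": 3, "source": null}
    conditions = []
    if intents.get("days_back"):
        cutoff = time.time() - intents["days_back"] * 86_400
        conditions.append(
            models.FieldCondition(key="published_at", range=models.Range(gte=cutoff))
        )
    if intents.get("source"):
        conditions.append(
            models.FieldCondition(key="source", match=models.MatchValue(value=intents["source"]))
        )
    return models.Filter(must=conditions) if conditions else None
```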

4️⃣ Step 4: Text Vectorization

We convert the query string into a mathematical representation using the massive BGE-M3 model.

  • Component: bge_embedder_adapter.py
  • Action: encode_query() passes the text through the in-process embedding model.
  • Model Processing: The text is encoded into both Dense and Sparse representations.
  • Output Architecture:
    • Dense Array: [0.123, -0.456, 0.789, ... 1024 dimensions]
    • Sparse Lexical: {"indices": [102, 451, ...], "values": [0.92, 0.44, ...]}
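
For illustration, a sketch of `encode_query()` built directly on the FlagEmbedding BGE-M3 model; the adapter's real wrapping and return type may differ.

```python
# Hedged sketch of encode_query() (bge_embedder_adapter.py) using FlagEmbedding.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

def encode_query(text: str) -> dict:
    out = model.encode([text], return_dense=True, return_sparse=True)
    dense = out["dense_vecs"][0]          # 1024-dim dense vector
    lexical = out["lexical_weights"][0]   # {token_id: weight} sparse map
    return {
        "dense": dense.tolist(),
        "sparse": {
            "indices": [int(i) for i in lexical.keys()],
            "values": [float(v) for v in lexical.values()],
        },
    }
```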

5️⃣ Step 5: Hybrid Vector Search

We execute a high-performance database search combining math and exact keyword matching.

  • Component: qdrant_adapter.py
  • Action: Sends query_vectors and the extracted days_back=3 physical filter to Qdrant via vector_store_port.search().
  • Database Processing: Qdrant executes a Reciprocal Rank Fusion (RRF) query. It searches simultaneously for Semantic Meaning (Dense) and Exact Keyword Hits (Sparse).
  • Yield: Returns the top 20 nearest neighbor SearchResult documents.
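
A sketch of the hybrid RRF call using the Qdrant Query API; the collection name and the "dense" / "sparse" vector names are assumptions, and `query_vectors` is the dict produced by the previous step's sketch.

```python
# Sketch of the hybrid search call (qdrant_adapter.py) with RRF fusion.
# Collection and vector names are illustrative assumptions.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

def hybrid_search(query_vectors: dict, qdrant_filter: models.Filter | None, limit: int = 20):
    sparse = models.SparseVector(**query_vectors["sparse"])
    return client.query_points(
        collection_name="news",
        prefetch=[
            models.Prefetch(query=query_vectors["dense"], using="dense",
                            filter=qdrant_filter, limit=limit),
            models.Prefetch(query=sparse, using="sparse",
                            filter=qdrant_filter, limit=limit),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),  # reciprocal rank fusion
        limit=limit,
    ).points
```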

6️⃣ Step 6: Temporal Bias Scoring

Preventing historical hallucination by mathematically prioritizing fresh news over old news.

  • Component: rag_chat_use_case.py -> _build_context()
  • Action: Iterates over every returned document and examines its published_at timestamp.
  • Mathematical Decay:
    • score_multiplier = max(0.5, 1.0 - (days_old / 60))
    • The older the article, the lower its multiplier goes.
  • Output: A freshly re-scored list where newer, slightly less-relevant articles can outrank old, highly-relevant articles.
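
The decay itself is a one-liner; an illustrative version (the document attributes `score` and `published_at` are assumed) looks like this.

```python
# Illustrative version of the freshness decay in _build_context().
from datetime import datetime, timezone

def apply_temporal_decay(documents: list) -> list:
    now = datetime.now(timezone.utc)
    for doc in documents:
        days_old = (now - doc.published_at).days
        multiplier = max(0.5, 1.0 - (days_old / 60))  # floor at 0.5
        doc.score *= multiplier
    # Newer, slightly less relevant articles can now outrank stale ones.
    return sorted(documents, key=lambda d: d.score, reverse=True)
```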

7️⃣ Step 7: Cross-Encoder Reranking

Applying a brute-force semantic check to correct misleading matches that slip through approximate vector distances.

  • Component: bge_reranker_adapter.py
  • Action: Takes the top 20 decayed documents and pairs the query with each document's full text, one pair per candidate.
    • [[query, doc1_text], [query, doc2_text], ...]
  • Model Processing: The FlagReranker cross-encoder (loaded from Hugging Face) scores each pair's true semantic overlap.
  • Output: Only the strict Top 5 (top_k) highest-scoring documents survive.
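
A hedged sketch of the reranking pass with the FlagEmbedding cross-encoder; the exact checkpoint used by bge_reranker_adapter.py is an assumption.

```python
# Sketch of the cross-encoder reranking step; model checkpoint assumed.
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

def rerank(query: str, documents: list, top_k: int = 5) -> list:
    pairs = [[query, doc.text] for doc in documents]  # [[query, doc1_text], ...]
    scores = reranker.compute_score(pairs)            # one relevance score per pair
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]         # strict top_k survivors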

8️⃣ Step 8: Contextual Compression

Squashing massive strings to fit gracefully into limited LLM context windows.

  • Component: rag_chat_use_case.py -> _limit_context()
  • Action: Uses tiktoken to calculate the total length of the surviving Top 5 documents.
  • Compression Loop: If the size exceeds 3000 tokens, it pipes overflowing documents individually to an LLM via _compress_document().
  • Extraction: The LLM digests 800 words and outputs only bulleted facts relevant to "Apple Stock".
  • Output: A high-density, tightly packed context_text string.
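
A sketch of the token-budget loop, assuming tiktoken's cl100k_base encoding; `_compress_document()` is the project's own LLM helper and is only stubbed here.

```python
# Sketch of _limit_context(); the 3000-token budget comes from the slide above.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def _compress_document(doc) -> str:
    """Placeholder: the real helper asks the LLM for bulleted, query-relevant facts."""
    return doc.text[:1000]

def limit_context(documents: list, max_tokens: int = 3000) -> str:
    chunks, used = [], 0
    for doc in documents:
        tokens = len(encoding.encode(doc.text))
        if used + tokens > max_tokens:
            # Overflowing documents are compressed individually; the bulleted
            # facts are assumed to fit within the remaining budget.
            text = _compress_document(doc)
            tokens = len(encoding.encode(text))
        else:
            text = doc.text
        chunks.append(text)
        used += tokens
    return "\n\n".join(chunks)  # the high-density context_text string
```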

9️⃣ Step 9: Final Generation

The Orchestrator fuses all pipelines to deliver an accurate, well-grounded answer.

  • Component: llm_port.py
  • Action: The packed context_text, the original query, and the user's Chat History are injected into a singular Prompt Template.
  • Generation: The LLM interprets the verified facts.
    • "Apple stock surged 4% after the latest earnings report..."
  • Final Cleanup: The new answer string is persisted to Postgres (chat_history) and cached in Redis before being returned via the API.
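
Putting it together, a minimal sketch of the final generation and persistence calls; the prompt wording and the port/repository interfaces are assumptions based on the component names in this deck.

```python
# Sketch of the final orchestration step; interfaces and prompt are illustrative.
PROMPT_TEMPLATE = """You are a financial news assistant. Answer using ONLY the context.

Chat history:
{history}

Context:
{context}

Question: {query}
Answer:"""

def generate_answer(query: str, context_text: str, history: str,
                    llm_port, chat_repository, cache_port) -> str:
    prompt = PROMPT_TEMPLATE.format(history=history, context=context_text, query=query)
    answer = llm_port.complete(prompt)       # e.g. "Apple stock surged 4% after ..."
    chat_repository.save(query, answer)      # persisted to Postgres chat_history
    cache_port.cache_answer(query, answer)   # warm the Redis cache for the next hit
    return answer
```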