
# RAG API Data Flow & Retrieval Architecture

This document describes the detailed data flow of the RAG (Retrieval-Augmented Generation) API, with a specific focus on the retrieval logic. Rather than just listing HTTP endpoints, it explains the underlying methods, the conceptual flow, and how the Domain Models, Ports, Use Cases, and Infrastructure Adapters interact to fetch, rerank, and summarize enterprise news data.


πŸ—οΈ 1. Architecture Overview (Hexagonal Architecture)

The RAG API follows the Hexagonal Architecture (Ports and Adapters) pattern, which strictly separates business logic from infrastructure frameworks; a minimal sketch follows the list below.

- Domain/Models: The central, pure data structures representing state (e.g., ChatRequest, User).
- Ports (Interfaces): Abstract definitions of what the system needs to do (e.g., VectorStorePort, LlmPort).
- Use Cases: The actual business logic where the retrieval steps, filtering, and flow occur.
- Adapters: The concrete implementations of Ports using external technologies (e.g., Qdrant, OpenAI, Redis, Postgres).
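
A minimal sketch of the pattern, with simplified, hypothetical signatures (the real interfaces live in src/core/ports/):

```python
from abc import ABC, abstractmethod

# Port: the domain declares what it needs, with no mention of Qdrant.
class VectorStorePort(ABC):
    @abstractmethod
    def search(self, query_vectors: dict, top_k: int) -> list[dict]: ...

# Adapter: the infrastructure layer implements the port with a concrete technology.
class QdrantAdapter(VectorStorePort):
    def search(self, query_vectors: dict, top_k: int) -> list[dict]:
        return []  # the real adapter calls the Qdrant client here

# Use case: business logic depends only on the abstraction, never the adapter.
class RagChatUseCase:
    def __init__(self, vector_store: VectorStorePort):
        self.vector_store = vector_store
```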

## 📂 2. File Directory Breakdown & Responsibilities

### src/api/ (Primary Adapters / The Front Door)

- routes/rag.py: Exposes the /chat and /chat/stream endpoints. Role: accepts the incoming HTTP payload, validates the JWT token (via Depends(get_current_user)), and forwards the request to the AgentRouterUseCase.
- dependencies.py: The dependency-injection container. Role: wires the concrete infrastructure adapters (e.g., QdrantAdapter, BgeEmbedderAdapter) to their respective Ports and injects them into the Use Cases, ensuring each component is instantiated only once (sketched below).
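
A hypothetical sketch of that wiring, reusing the stub classes from the Section 1 sketch (the real container may use FastAPI's Depends machinery instead):

```python
from functools import lru_cache

# lru_cache turns each provider into a process-wide singleton, so heavy
# components (BGE-M3, the reranker) are instantiated exactly once.
@lru_cache(maxsize=1)
def get_vector_store() -> VectorStorePort:
    return QdrantAdapter()

@lru_cache(maxsize=1)
def get_rag_chat_use_case() -> RagChatUseCase:
    return RagChatUseCase(vector_store=get_vector_store())
```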

### src/core/domain/ (Core Data)

- schemas.py: Defines the Pydantic validation models. Role: houses ChatRequest (query, top_k, session_id, source_filter, etc.), which acts as the transport object through the system; see the sketch below.
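
A minimal sketch of the model; only the field names are taken from this document, while the types and defaults are assumptions (top_k=5 matches the example in Section 3):

```python
from typing import Optional

from pydantic import BaseModel

class ChatRequest(BaseModel):
    query: str
    top_k: int = 5                        # assumed default
    session_id: Optional[str] = None
    source_filter: Optional[str] = None   # e.g., "reuters" (hypothetical value)
```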

### src/core/ports/ (The Interfaces)

- embedder_port.py: Defines encode_query().
- vector_store_port.py: Defines search().
- reranker_port.py: Defines rerank().
- llm_port.py: Defines generate() and generate_stream().
- cache_port.py: Defines get(), set(), and generate_exact_hash().

### src/core/use_cases/ (The Business Logic Engine)

- agent_router_use_case.py: Role: the gateway. Analyzes the user's intent and routes the request to AccountUseCase (questions about personal profile data) or RagChatUseCase (questions about news).
- rag_chat_use_case.py: Role: the heavy lifter, responsible for the entire retrieval flow. Contains methods such as _extract_intents, _build_context, _limit_context, and _compress_document.
- account_use_case.py: Role: a secondary flow handling user-specific DB aggregations (billing, history) rather than vector-store searches.

### src/infrastructure/adapters/ (Concrete Infrastructure)

- redis_adapter.py: Role: connects to the caching layer to avoid duplicate LLM calls.
- qdrant_adapter.py: Role: orchestrates the query_points API call to Qdrant, fusing dense and sparse vector retrieval (hybrid search).
- bge_embedder_adapter.py: Role: instantiates the large BGE-M3 model (via FlagEmbedding) and converts text strings into dense vectors and lexical sparse weights.
- bge_reranker_adapter.py: Role: uses a cross-encoder to score each (query, document) pair for high-precision relevance.
- openai_adapter.py / ollama_adapter.py: Role: connects to the external OpenAI API or a local Llama-3 instance for text generation.

## 🌊 3. The Retrieval Logic: Step-by-Step Data Flow Example

Scenario: A user submits the query: "What happened with Apple stock recently?"

### Step 1: Ingestion & Intent Routing (agent_router_use_case.py)

  1. Input: ChatRequest(query="What happened with Apple stock recently?", top_k=5)
  2. Action: The API endpoint passes this to the AgentRouterUseCase.
  3. LLM Classification: The Router asks the LLM: "Is this a NEWS search or an ACCOUNT search?"
  4. Output: The LLM outputs NEWS. The Router forwards the request to the RagChatUseCase.
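
A hypothetical sketch of this classification step, assuming the generate() method from llm_port.py (the prompt wording and the execute() method are illustrative):

```python
ROUTER_PROMPT = (
    "Classify the user's query as NEWS or ACCOUNT. "
    "Answer with exactly one word.\n\nQuery: {query}"
)

def route(request, llm, rag_use_case, account_use_case):
    # Ask the LLM for a single-word intent label.
    label = llm.generate(ROUTER_PROMPT.format(query=request.query)).strip().upper()
    # Default to the news pipeline unless the intent is clearly account-related.
    target = account_use_case if label == "ACCOUNT" else rag_use_case
    return target.execute(request)
```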

### Step 2: Semantic Caching (redis_adapter.py)

  1. Action: cache_port.generate_exact_hash() computes a SHA-256 hash of the query string to use as a deterministic cache key.
  2. Check: Does this key exist in Redis?
  3. If Yes: Return the answer instantly (0ms LLM time).
  4. If No: Proceed with the expensive pipeline.
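
A minimal sketch of this check with a standard redis-py client (the key prefix, normalization, and TTL are assumptions):

```python
import hashlib
import json

import redis

r = redis.Redis()  # assumed local default connection

def generate_exact_hash(query: str) -> str:
    # Deterministic key: identical query strings always map to the same hash.
    return "rag:cache:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()

def get_cached_answer(query: str):
    raw = r.get(generate_exact_hash(query))
    return json.loads(raw) if raw else None

def cache_answer(query: str, answer: dict, ttl_seconds: int = 3600) -> None:
    r.set(generate_exact_hash(query), json.dumps(answer), ex=ttl_seconds)
```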

### Step 3: Self-Query Extraction (rag_chat_use_case.py -> _extract_intents())

  1. Action: The LLM analyzes the user's natural-language query to dynamically extract metadata filters and search parameters for the vector database.
  2. Example Prompting: The LLM is provided with a system prompt like: "Extract the temporal constraints and target sources from the user query into JSON format. Valid sources: ['reuters', 'bloomberg']."
  3. Execution: The LLM analyzes "What happened with Apple stock recently?"
  4. Output Deduction: From the word "recently", it deduces the temporal boundary and constructs the following JSON structure:

     ```json
     {
         "days_back": 3,
         "source": null
     }
     ```

  5. Mapping: The RagChatUseCase parses this JSON. If days_back is present, it constructs a Qdrant models.Filter that excludes older documents from the search space before the costly vector math runs (see the sketch below).
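
A sketch of that mapping with the qdrant-client filter models, assuming the payload field is named published_at (as in Step 6):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

from qdrant_client import models

def build_temporal_filter(days_back: Optional[int]) -> Optional[models.Filter]:
    if not days_back:
        return None  # no temporal constraint extracted: search the full space
    cutoff = datetime.now(timezone.utc) - timedelta(days=days_back)
    # Keep only points whose published_at payload is newer than the cutoff.
    return models.Filter(
        must=[
            models.FieldCondition(
                key="published_at",
                range=models.DatetimeRange(gte=cutoff),
            )
        ]
    )
```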

### Step 4: Embedding / Vectorization (bge_embedder_adapter.py)

  1. Action: encode_query() is called.
  2. Model Processing: The BGE-M3 model tokenizes the string.
  3. Output: Returns a dict containing:
     - dense: [0.123, -0.456, 0.789, ...] (1,024 dimensions)
     - sparse: {"indices": [102, 451, ...], "values": [0.92, 0.44, ...]}
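
A sketch of encode_query() with the FlagEmbedding package named above; the exact post-processing in the real adapter may differ:

```python
from FlagEmbedding import BGEM3FlagModel

# Loading is expensive; the DI container keeps one instance alive per process.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

def encode_query(text: str) -> dict:
    out = model.encode([text], return_dense=True, return_sparse=True)
    # lexical_weights maps token ids to per-token importance weights.
    lexical = out["lexical_weights"][0]
    return {
        "dense": out["dense_vecs"][0].tolist(),  # 1024-dim list of floats
        "sparse": {
            "indices": [int(token_id) for token_id in lexical],
            "values": [float(weight) for weight in lexical.values()],
        },
    }
```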

### Step 5: Hybrid Vector Search (qdrant_adapter.py)

  1. Action: Passes the query_vectors and the days_back=3 filter into vector_store_port.search().
  2. Qdrant Processing: Qdrant performs a fusion query using Reciprocal Rank Fusion (RRF): it fetches the top 20 nearest neighbors from both the dense (semantic) space and the sparse (keyword) space, then merges the two ranked lists.
  3. Output: Returns a List of raw SearchResult documents.
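
A sketch of that call using the qdrant-client Query API; the collection name and the named-vector labels ("dense", "sparse") are assumptions:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed endpoint

def hybrid_search(vectors: dict, temporal_filter, top_k: int = 5):
    response = client.query_points(
        collection_name="news_articles",  # assumed collection name
        prefetch=[
            # Branch 1: dense semantic neighbors, pre-filtered by recency.
            models.Prefetch(query=vectors["dense"], using="dense",
                            limit=20, filter=temporal_filter),
            # Branch 2: sparse lexical matches under the same filter.
            models.Prefetch(query=models.SparseVector(**vectors["sparse"]),
                            using="sparse", limit=20, filter=temporal_filter),
        ],
        # Fuse the two ranked lists with Reciprocal Rank Fusion.
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=top_k,
    )
    return response.points
```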

### Step 6: Temporal Bias Scoring (rag_chat_use_case.py -> _build_context())

  1. Action: Evaluates the published_at metadata of every hit.
  2. Calculation: It deliberately decays the score of older articles via a mathematical multiplier (e.g., score_multiplier = max(0.5, 1.0 - (days_old / 60))).
  3. Output: A dynamically re-scored list, preferring fresh data.
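
A sketch of that decay, matching the example multiplier above (linear fade with a floor of 0.5; published_at is assumed to be an ISO-8601 timestamp with timezone):

```python
from datetime import datetime, timezone

def apply_temporal_bias(score: float, published_at: str) -> float:
    published = datetime.fromisoformat(published_at)
    days_old = (datetime.now(timezone.utc) - published).days
    # A 30-day-old article keeps half its score; nothing decays below 0.5x.
    multiplier = max(0.5, 1.0 - days_old / 60)
    return score * multiplier
```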

### Step 7: Cross-Encoder Reranking (bge_reranker_adapter.py)

  1. Action: For the top 20 remaining documents, the Reranker pairs the Query + Document Text together ([[query, doc1], [query, doc2]]).
  2. Model Processing: The HuggingFace FlagReranker scores the exact semantic overlap of each pair.
  3. Output: Returns the top 5 (top_k) documents with the highest cross-encoder scores.
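
A sketch using FlagEmbedding's FlagReranker; the checkpoint name is an assumption:

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)  # assumed checkpoint

def rerank(query: str, docs: list[str], top_k: int = 5) -> list[str]:
    # Score every (query, document) pair with the cross-encoder.
    scores = reranker.compute_score([[query, doc] for doc in docs])
    if isinstance(scores, float):  # a single pair returns a bare float
        scores = [scores]
    # Keep the top_k highest-scoring documents.
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```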

### Step 8: Contextual Compression (rag_chat_use_case.py -> _limit_context())

  1. Action: _limit_context uses tiktoken to count how many tokens the Top 5 documents contain.
  2. Check: Is the total over the 3,000-token limit?
  3. Compression Loop: If they are over the limit, it calls _compress_document().
  4. LLM Summarization: Passes the overflowing document string to the LLM with the instruction: "Extract pure facts... relevant to the query." The massive document strings are squashed down to bullet-point facts.
  5. Output: A tightly packed context_text string ready for generation.
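
A sketch of the token-budget check; the encoding name is an assumption, and the limit matches the 3,000-token figure above:

```python
import tiktoken

ENCODING = tiktoken.get_encoding("cl100k_base")  # assumed encoding
TOKEN_LIMIT = 3000

def limit_context(docs: list[str], compress_document) -> str:
    total = sum(len(ENCODING.encode(doc)) for doc in docs)
    if total > TOKEN_LIMIT:
        # Over budget: have the LLM squash each document down to bullet-point facts.
        docs = [compress_document(doc) for doc in docs]
    return "\n\n".join(docs)
```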

### Step 9: Final Generation (llm_port.py)

  1. Action: The packed context_text, the User query, and the recent Chat History are combined into the Final Prompt.
  2. Model Processing: The LLM interprets the compressed context.
  3. Output: The Final string ("Apple stock surged 4% after the latest earnings report...").
  4. Cleanup: This answer is saved to both Postgres (chat_history_db) and Redis (cache), and returned to the API client.
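
A hypothetical sketch of the prompt assembly (the real template is not shown in this document):

```python
def build_final_prompt(context_text: str, history: list[str], query: str) -> str:
    return (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context_text}\n\n"
        "Recent conversation:\n" + "\n".join(history) + "\n\n"
        f"Question: {query}"
    )
```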

## 📈 4. A4 Analysis and Future Updates

### A4 Analysis (Current System Standing)

| Dimension | Analysis & Findings |
| --- | --- |
| Resilience & Scalability | High. The Hexagonal architecture successfully decouples Qdrant, Postgres, and the LLMs: OpenAiAdapter can be swapped for OllamaAdapter by changing a single dependency provider, without touching the business logic. When optional dependencies (e.g., FlagEmbedding) are missing, the system falls back to dummy implementations instead of crashing the API. |
| Retrieval Accuracy | Exceptional. A 3-stage filtering mechanism is used: semantic similarity (dense), lexical accuracy (sparse), and cross-encoder context alignment (reranker). Dynamic temporal biasing prevents historical news from being presented as current events. |
| Cost & Latency Management | Optimized. Redis caching ensures that repeated identical queries skip the LLM round-trip entirely, and the AgentRouterUseCase keeps account questions (billing, history) away from expensive vector-database queries. |
| Memory Constraint Handling | Innovative. _compress_document prevents context-window truncation, ensuring that critical tail-end entities still influence the LLM's final generation. |

### Proposed Future Updates (Roadmap)

  1. Semantic Cache Refinement: Currently, the RedisAdapter relies on an exact SHA-256 string hash. Update: compute an actual embedding of the prompt (dense vector) and store it in Redis, then use a cosine-similarity threshold (>0.95) to intercept semantically identical but textually different questions (e.g., "Apple stock" vs. "AAPL share price"); a sketch follows this list.
  2. Analytic Trend Fusion Enhancement: In _build_context, we fetch trending entities from ClickHouse. Update: Send these trending entities into the Agent Router so the system can proactively recommend or correlate user interactions with macroeconomic spikes before they ask.
  3. Ollama Deployment Readiness: Test the bge_embedder_adapter and bge_reranker_adapter simultaneously against an active OllamaAdapter container to benchmark hardware-level VRAM bottlenecks on local inference machines.
  4. Knowledge Graph Integration: Extract Triples (Subject-Predicate-Object) during the _compress_document step to progressively construct a Graph Database (Neo4j) alongside the Vector DB (Qdrant) for Multi-Hop reasoning queries in the future.
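
A minimal sketch of the similarity check proposed in item 1; every name here is hypothetical, and a production version would use a Redis vector index (e.g., RediSearch with HNSW) rather than a linear scan:

```python
from typing import Optional

import numpy as np

SIMILARITY_THRESHOLD = 0.95  # threshold proposed in item 1

def semantic_cache_lookup(
    query_vec: np.ndarray,
    cached_entries: list[tuple[np.ndarray, str]],
) -> Optional[str]:
    for stored_vec, stored_answer in cached_entries:
        # Cosine similarity between the new query and a cached query embedding.
        sim = float(np.dot(query_vec, stored_vec)
                    / (np.linalg.norm(query_vec) * np.linalg.norm(stored_vec)))
        if sim >= SIMILARITY_THRESHOLD:
            return stored_answer  # semantically identical question: reuse the answer
    return None
```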