# RAG API Data Flow & Retrieval Architecture

This document describes the detailed data flow of the RAG (Retrieval-Augmented Generation) API, with a specific focus on the **Retrieval Logic**. Rather than just listing HTTP endpoints, it explains the underlying methods, the conceptual flow, and how the Domain Models, Ports, Use Cases, and Infrastructure Adapters interact to fetch, rerank, and summarize enterprise news data.

---

## 1. Architecture Overview (Hexagonal Architecture)

The RAG API follows **Hexagonal Architecture** (Ports and Adapters), which strictly separates business logic from infrastructure concerns.

- **Domain/Models**: The central, pure data structures representing state (e.g., `ChatRequest`, `User`).
- **Ports (Interfaces)**: Abstract definitions of what the system *needs* to do (e.g., `VectorStorePort`, `LlmPort`).
- **Use Cases**: The actual business logic where the retrieval steps, filtering, and flow occur.
- **Adapters**: The concrete implementations of Ports using external technologies (e.g., Qdrant, OpenAI, Redis, Postgres).

---

## 2. File Directory Breakdown & Responsibilities

### `src/api/` (Primary Adapters / The Front Door)

- **`routes/rag.py`**: Exposes the `/chat` and `/chat/stream` endpoints. **Role**: Accepts the incoming HTTP payload, validates the JWT token (via `Depends(get_current_user)`), and forwards the request to the `AgentRouterUseCase`.
- **`dependencies.py`**: The Dependency Injection container. **Role**: Wires the concrete Infrastructure Adapters (e.g., `QdrantAdapter`, `BgeEmbedderAdapter`) to their respective Ports and injects them into the Use Cases. Ensures components are instantiated only once.
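The singleton wiring described for `dependencies.py` can be sketched as follows. This is a minimal illustration, not the actual code: the class bodies and the provider names `get_vector_store` / `get_rag_chat_use_case` are assumptions.

```python
from functools import lru_cache

# Hypothetical stand-ins for the real adapter and use case classes.
class QdrantAdapter:
    """Concrete implementation of the VectorStorePort (illustrative)."""
    def search(self, query_vectors, top_k=20):
        return []

class RagChatUseCase:
    """Business logic; it only sees the port, never the concrete adapter."""
    def __init__(self, vector_store):
        self.vector_store = vector_store

@lru_cache(maxsize=1)  # ensures the adapter is instantiated only once
def get_vector_store() -> QdrantAdapter:
    return QdrantAdapter()

def get_rag_chat_use_case() -> RagChatUseCase:
    # In FastAPI this provider would be resolved via Depends().
    return RagChatUseCase(vector_store=get_vector_store())
```

Swapping Qdrant for another store then means changing only `get_vector_store`, leaving the use case untouched.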
### `src/core/domain/` (Core Data)

- **`schemas.py`**: Defines the Pydantic validation models. **Role**: Houses `ChatRequest` (containing `query`, `top_k`, `session_id`, `source_filter`, etc.), which acts as the transport object through the system.

### `src/core/ports/` (The Interfaces)

- **`embedder_port.py`**: Defines `encode_query()`.
- **`vector_store_port.py`**: Defines `search()`.
- **`reranker_port.py`**: Defines `rerank()`.
- **`llm_port.py`**: Defines `generate()` and `generate_stream()`.
- **`cache_port.py`**: Defines `get()`, `set()`, and `generate_exact_hash()`.
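A sketch of how these ports might look as structural interfaces. The signatures below are assumptions inferred from the method names above, not the actual definitions:

```python
from typing import Any, Protocol

class EmbedderPort(Protocol):
    def encode_query(self, text: str) -> dict: ...

class VectorStorePort(Protocol):
    def search(self, query_vectors: dict, top_k: int, query_filter: Any = None) -> list: ...

class RerankerPort(Protocol):
    def rerank(self, query: str, documents: list, top_k: int) -> list: ...

# Any adapter with matching methods satisfies the port structurally,
# so a trivial fallback implementation plugs in without inheritance:
class DummyReranker:
    def rerank(self, query: str, documents: list, top_k: int) -> list:
        return documents[:top_k]  # pass-through: keep the first top_k docs
```

Using `typing.Protocol` (rather than ABCs) keeps the adapters entirely decoupled from the port module, which matches the hexagonal goal of dependency-free domain code.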
### `src/core/use_cases/` (The Business Logic Engine)

- **`agent_router_use_case.py`**: **Role**: The gateway. Analyzes the user's intent and routes the request to `AccountUseCase` (if the user is asking about personal profile data) or `RagChatUseCase` (if asking about news).
- **`rag_chat_use_case.py`**: **Role**: The heavy lifter, responsible for the entire Retrieval Logic flow. Contains methods such as `_extract_intents`, `_build_context`, `_limit_context`, and `_compress_document`.
- **`account_use_case.py`**: **Role**: A secondary flow for handling user-specific DB aggregations (billing, history) rather than searching the Vector DB.

### `src/infrastructure/adapters/` (Concrete Infrastructure)

- **`redis_adapter.py`**: **Role**: Connects to the caching layer to prevent duplicate LLM processing calls.
- **`qdrant_adapter.py`**: **Role**: Orchestrates the `query_points` API call to Qdrant, fusing Dense and Sparse vector retrieval (Hybrid Search).
- **`bge_embedder_adapter.py`**: **Role**: Instantiates the BGE-M3 model (via FlagEmbedding) and converts text strings into vector representations (dense embeddings and lexical sparse weights).
- **`bge_reranker_adapter.py`**: **Role**: Uses a Cross-Encoder that scores each query-document pair jointly for fine-grained semantic precision.
- **`openai_adapter.py` / `ollama_adapter.py`**: **Role**: Connects to the external OpenAI API or a local Llama-3 instance to generate text.

---
## 3. The Retrieval Logic: Step-by-Step Data Flow Example

**Scenario**: A user submits the query: *"What happened with Apple stock recently?"*

### Step 1: Ingestion & Intent Routing (`agent_router_use_case.py`)

1. **Input**: `ChatRequest(query="What happened with Apple stock recently?", top_k=5)`
2. **Action**: The API endpoint passes this to the `AgentRouterUseCase`.
3. **LLM Classification**: The Router asks the LLM: "Is this a NEWS search or an ACCOUNT search?"
4. **Output**: The LLM outputs `NEWS`, and the Router forwards the request to the `RagChatUseCase`.
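The routing decision above can be sketched as follows. `classify_intent` is a hypothetical keyword stub standing in for the actual LLM classification call; the prompt text is illustrative:

```python
# Illustrative router prompt; the real prompt lives in agent_router_use_case.py.
ROUTER_PROMPT = (
    "Classify the user query as NEWS (market/news search) or "
    "ACCOUNT (personal profile, billing, history). Reply with one word."
)

def classify_intent(query: str) -> str:
    # Keyword stub standing in for llm_port.generate(ROUTER_PROMPT + query).
    account_terms = ("billing", "account", "profile", "subscription", "history")
    return "ACCOUNT" if any(t in query.lower() for t in account_terms) else "NEWS"

def route(query: str) -> str:
    # NEWS -> retrieval pipeline; ACCOUNT -> user-specific DB aggregations.
    return "RagChatUseCase" if classify_intent(query) == "NEWS" else "AccountUseCase"
```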
### Step 2: Exact-Match Response Caching (`redis_adapter.py`)

1. **Action**: `cache_port.generate_exact_hash()` computes a deterministic key (a SHA-256 hash) for the query string.
2. **Check**: Does this key exist in Redis?
3. **If yes**: Return the cached answer instantly, with zero LLM latency.
4. **If no**: Proceed with the expensive pipeline.
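The cache check can be sketched as below. The normalization and key prefix are assumptions (the real adapter presumably scopes keys per user or session), and the dict stands in for the Redis connection:

```python
import hashlib
import json
from typing import Optional

def generate_exact_hash(query: str, user_id: str = "") -> str:
    # Normalize so trivially different strings ("Apple?" vs " apple? ") collide.
    payload = json.dumps({"q": query.strip().lower(), "u": user_id}, sort_keys=True)
    return "rag:cache:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()

fake_redis: dict = {}  # stands in for the actual Redis connection

def cached_answer(query: str) -> Optional[str]:
    return fake_redis.get(generate_exact_hash(query))
```

Note that this is a purely lexical match; the roadmap at the end of this document proposes upgrading it to embedding-based similarity.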
### Step 3: Self-Query Extraction (`rag_chat_use_case.py -> _extract_intents()`)

1. **Action**: The LLM analyzes the user's natural-language query to dynamically extract metadata filters and search parameters for the vector database.
2. **Example Prompting**: The LLM is given a system prompt such as: *"Extract the temporal constraints and target sources from the user query into JSON format. Valid sources: ['reuters', 'bloomberg']."*
3. **Execution**: The LLM analyzes *"What happened with Apple stock recently?"*
4. **Output Deduction**: From the word "recently", it deduces a temporal boundary and constructs the following JSON structure:

   ```json
   {
     "days_back": 3,
     "source": null
   }
   ```

5. **Mapping**: The `RagChatUseCase` parses this JSON. If `days_back` is present, it constructs a Qdrant `models.Filter` to exclude older documents from the search space *before* the costly vector math occurs.
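The `days_back` mapping can be sketched as below. The real code builds a `qdrant_client` `models.Filter`; the dict here mirrors Qdrant's JSON filter shape for illustration, and the `published_at` payload key is taken from Step 6:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def build_temporal_filter(days_back: int, now: Optional[datetime] = None) -> dict:
    """Translate the LLM's days_back into a range condition on published_at."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days_back)
    # Dict equivalent of models.Filter(must=[FieldCondition(key=..., range=...)]).
    return {
        "must": [
            {"key": "published_at", "range": {"gte": cutoff.timestamp()}}
        ]
    }
```

Applying the filter before the vector search means older points never enter the nearest-neighbor computation at all.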
### Step 4: Embedding / Vectorization (`bge_embedder_adapter.py`)

1. **Action**: `encode_query()` is called.
2. **Model Processing**: The BGE-M3 model tokenizes and encodes the string.
3. **Output**: Returns a `Dict` containing:
   - `dense`: `[0.123, -0.456, 0.789, ...]` (1024 dimensions)
   - `sparse`: `{"indices": [102, 451, ...], "values": [0.92, 0.44, ...]}`

### Step 5: Hybrid Vector Search (`qdrant_adapter.py`)

1. **Action**: Passes the `query_vectors` and the `days_back=3` filter into `vector_store_port.search()`.
2. **Qdrant Processing**: Qdrant performs a fusion query using Reciprocal Rank Fusion (RRF): it fetches the top 20 nearest neighbors from BOTH the dense semantic space AND the sparse keyword space, then merges the two rankings.
3. **Output**: Returns a list of raw `SearchResult` documents.
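The fusion itself happens server-side in Qdrant, but the RRF formula is simple enough to show in pure Python: each document's fused score is the sum of `1 / (k + rank)` over every ranking it appears in, with `k = 60` as the conventional constant.

```python
def rrf_fuse(ranked_lists: list, k: int = 60) -> list:
    """Reciprocal Rank Fusion over several ranked lists of document IDs."""
    scores: dict = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by *both* the dense and sparse lists outscores one ranked at the top of only a single list, which is exactly why hybrid search is more robust than either space alone.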
### Step 6: Temporal Bias Scoring (`rag_chat_use_case.py -> _build_context()`)

1. **Action**: Evaluates the `published_at` metadata of every hit.
2. **Calculation**: Deliberately decays the score of older articles via a multiplier (e.g., `score_multiplier = max(0.5, 1.0 - (days_old / 60))`).
3. **Output**: A re-scored list that prefers fresh data.
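The decay formula quoted above, applied to each hit. The `hits` structure is simplified to plain dicts for illustration:

```python
def temporal_multiplier(days_old: float) -> float:
    # Linear decay, floored at 0.5 so old-but-relevant articles survive.
    return max(0.5, 1.0 - days_old / 60)

def rescore(hits: list) -> list:
    """hits: [{"score": float, "days_old": float}, ...] -> freshest-biased order."""
    for h in hits:
        h["score"] *= temporal_multiplier(h["days_old"])
    return sorted(hits, key=lambda h: h["score"], reverse=True)
```

With this curve an article from today keeps its full score, a 15-day-old one keeps 75%, and anything 30 days or older bottoms out at 50%.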
### Step 7: Cross-Encoder Reranking (`bge_reranker_adapter.py`)

1. **Action**: For the top 20 remaining documents, the Reranker pairs the query with each document's text (`[[query, doc1], [query, doc2], ...]`).
2. **Model Processing**: The HuggingFace FlagReranker scores each pair's semantic relevance.
3. **Output**: Returns the strict top 5 (`top_k`) documents, ordered by direct semantic relevance to the query.
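The pairing-and-truncation step can be sketched as below. `score_pair` is a toy word-overlap proxy standing in for the cross-encoder's `compute_score`; only the pair construction and top-k selection reflect the described flow:

```python
def score_pair(query: str, doc: str) -> float:
    # Toy proxy: fraction of query words present in the doc. The real
    # cross-encoder attends over the full token sequence of both texts.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query: str, docs: list, top_k: int = 5) -> list:
    pairs = [[query, doc] for doc in docs]  # [[query, doc1], [query, doc2], ...]
    scored = [(score_pair(q, d), d) for q, d in pairs]
    return [d for _, d in sorted(scored, key=lambda x: x[0], reverse=True)[:top_k]]
```

The key difference from Step 5 is that here the query and document are scored *together* in one forward pass per pair, which is far more precise (and far more expensive) than comparing precomputed vectors, hence it runs only on the top 20 candidates.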
### Step 8: Contextual Compression (`rag_chat_use_case.py -> _limit_context()`)

1. **Action**: `_limit_context` uses `tiktoken` to count how many tokens the top 5 documents contain.
2. **Check**: Are they over the 3000-token limit?
3. **Compression Loop**: If they are over the limit, it calls `_compress_document()`.
4. **LLM Summarization**: Passes the overflowing document string to the LLM with an instruction such as: *"Extract pure facts... relevant to the query."* Long documents are condensed into bullet-point facts.
5. **Output**: A tightly packed `context_text` string ready for generation.
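The budget loop can be sketched as below. `count_tokens` is a whitespace proxy for tiktoken's `len(encoding.encode(text))`, and `compress` is a placeholder for the LLM summarization call; the 3000-token limit comes from the step above:

```python
TOKEN_LIMIT = 3000

def count_tokens(text: str) -> int:
    return len(text.split())  # whitespace proxy for tiktoken

def compress(doc: str) -> str:
    # Placeholder: the real _compress_document asks the LLM for bullet-point
    # facts; here we just truncate to guarantee the loop shrinks the input.
    return doc[: len(doc) // 2]

def limit_context(docs: list) -> str:
    # Repeatedly compress the longest document until the budget is met.
    while sum(count_tokens(d) for d in docs) > TOKEN_LIMIT:
        longest = max(range(len(docs)), key=lambda i: count_tokens(docs[i]))
        docs[longest] = compress(docs[longest])
    return "\n\n".join(docs)
```

Compressing the longest document first preserves as many distinct sources as possible instead of dropping whole documents from the tail.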
### Step 9: Final Generation (`llm_port.py`)

1. **Action**: The packed `context_text`, the user `query`, and the recent chat history are combined into the final prompt.
2. **Model Processing**: The LLM interprets the compressed context.
3. **Output**: The final answer string (e.g., "Apple stock surged 4% after the latest earnings report...").
4. **Cleanup**: The answer is saved to both Postgres (`chat_history_db`) and Redis (`cache`), then returned to the API client.
---

## 4. A4 Analysis and Future Updates

### A4 Analysis (Current System Standing)

| Dimension | Analysis & Findings |
| :--- | :--- |
| **Resilience & Scalability** | **High**. The Hexagonal architecture successfully decouples Qdrant, Postgres, and the LLMs. We can swap `OpenAiAdapter` for `OllamaAdapter` simply by changing one dependency provider, without touching the business logic. Missing dependencies (e.g., `FlagEmbedding`) fall back gracefully to dummy implementations, avoiding hard API crashes. |
| **Retrieval Accuracy** | **Exceptional**. We use a 3-stage filtering mechanism: semantic similarity (Dense), lexical accuracy (Sparse), and contextual alignment (Reranker). Dynamic temporal biasing prevents historical news from being presented as current events. |
| **Cost & Latency Management** | **Optimized**. Redis caching ensures repeated identical queries avoid LLM round-trip costs. The `AgentRouterUseCase` ensures unrelated questions (Account, Billing) never touch the expensive Vector DB pipeline. |
| **Memory Constraint Handling** | **Innovative**. By employing `_compress_document`, the system prevents context-window truncation, ensuring critical tail-end entities still influence the LLM's final generation. |

### Proposed Future Updates (Roadmap)

1. **Semantic Cache Refinement**: Currently, the `RedisAdapter` relies on an exact SHA-256 string hash. **Update**: Compute an embedding (dense vector) of the prompt and store it in Redis; use a cosine-similarity threshold (`> 0.95`) to intercept semantically identical but textually different questions (e.g., "Apple stock" vs. "AAPL share price").
2. **Analytic Trend Fusion Enhancement**: In `_build_context`, we fetch trending entities from ClickHouse. **Update**: Feed these trending entities into the Agent Router so the system can proactively recommend or correlate user interactions with macroeconomic spikes before users ask.
3. **Ollama Deployment Readiness**: Benchmark the `bge_embedder_adapter` and `bge_reranker_adapter` running simultaneously against an active `OllamaAdapter` container to identify VRAM bottlenecks on local inference hardware.
4. **Knowledge Graph Integration**: Extract triples (`Subject-Predicate-Object`) during the `_compress_document` step to progressively build a graph database (Neo4j) alongside the Vector DB (Qdrant), enabling future multi-hop reasoning queries.
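The semantic-cache refinement in roadmap item 1 could look like the sketch below. The embeddings and cache layout are toy placeholders (the real version would store BGE-M3 dense vectors in Redis); only the cosine-threshold decision is the point:

```python
import math

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_lookup(query_vec: list, cache: list, threshold: float = 0.95):
    """cache: list of (embedding, cached_answer) pairs. Return a hit only if
    the best-matching cached query clears the similarity threshold."""
    best = max(cache, key=lambda entry: cosine(query_vec, entry[0]), default=None)
    if best is not None and cosine(query_vec, best[0]) >= threshold:
        return best[1]
    return None
```

Unlike the exact-hash key in Step 2, this would let "Apple stock" and "AAPL share price" resolve to the same cached answer, at the cost of one embedding call per lookup.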