# RAG API Analysis & Critique
This document provides a critical evaluation of the current RAG (Retrieval-Augmented Generation) API implementation and outlines a path toward a fully optimized production system.
## Current Status: "Basic RAG"
The current implementation is a functional **"Naive RAG"** pipeline. It successfully connects the core components (Embedding -> Vector DB -> LLM), but it lacks the advanced optimizations required for a high-quality production system.
**Is it fully implemented?**
- **Technically: Yes.** It performs retrieval and generation.
- **Strategically: No.** It lacks query refinement, re-ranking, and context optimization.
---
## Critical Weaknesses & Solutions
### 1. Simple Vector Retrieval (Naive Search)
- **Problem**: It relies solely on dense embeddings (BGE-M3). While powerful, dense search often fails on specific keywords, acronyms, or names that weren't frequent in the model's training data.
- **Reason**: Pure semantic search can return "false positives": text that is semantically similar to the query but factually irrelevant to it.
- **Solution**: Implement **Hybrid Search**. Combine dense vector search with sparse keyword search (e.g., BM25/Elasticsearch/Qdrant sparse vectors).
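A common way to combine the two result lists is Reciprocal Rank Fusion (RRF). The sketch below assumes each retriever (dense and BM25/sparse) already returns document IDs in ranked order; the fusion step itself needs no scores, only ranks:

```python
def rrf_fuse(dense_ids, sparse_ids, k=60):
    """Merge two ranked result lists with Reciprocal Rank Fusion.

    Each document's fused score is the sum of 1 / (k + rank) over the
    lists it appears in; k=60 is the constant from the original RRF paper.
    Documents found by both retrievers rise to the top.
    """
    scores = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Dense search favours "a"; keyword search (BM25) favours "c".
dense = ["a", "b", "c"]
sparse = ["c", "a", "d"]
fused = rrf_fuse(dense, sparse)
print(fused)  # "a" wins: it ranks highly in both lists
```

Qdrant can also fuse dense and sparse vectors server-side; the point here is only to show why rank fusion rewards documents that both retrievers agree on.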
### 2. Multi-turn Query "Drift"
- **Problem**: The query sent to the vector database is the raw user input.
- **Reason**: In a chat, a user might say "Tell me more about it." The word "it" has no semantic meaning for a vector search without the previous context.
- **Solution**: **Query Transformation**. Before retrieval, use an LLM to "rewrite" the user's query into a standalone, descriptive search query based on the chat history.
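A minimal sketch of the rewriting step: build a prompt that gives the LLM the chat history plus the follow-up, and ask for a standalone query. The prompt wording and the `(role, text)` history format are illustrative assumptions; the LLM call itself is elided:

```python
def build_rewrite_prompt(history, follow_up):
    """Build an instruction asking an LLM to turn a follow-up like
    'Tell me more about it' into a standalone, descriptive search query."""
    turns = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Rewrite the final user message as a standalone, descriptive "
        "search query. Resolve pronouns like 'it' using the chat history. "
        "Reply with the query only.\n\n"
        f"Chat history:\n{turns}\n\n"
        f"Final user message: {follow_up}\n"
        "Standalone query:"
    )

history = [
    ("user", "What is BGE-M3?"),
    ("assistant", "BGE-M3 is a multilingual embedding model."),
]
prompt = build_rewrite_prompt(history, "Tell me more about it")
# The LLM's reply (e.g. "BGE-M3 embedding model capabilities and details")
# replaces the raw user input as the vector-search query.
```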
### 3. Lack of Re-ranking
- **Problem**: The top $K$ results from the vector database are passed directly to the LLM.
- **Reason**: Vector databases optimize for speed, not absolute precision. The "Top 1" result might not be the most relevant answer to the specific question.
- **Solution**: Add a **Re-ranker** (e.g., Cohere Rerank or a Cross-Encoder model). Retrieve 20 chunks, re-score them, and pass only the top 5 most relevant ones to the LLM.
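The control flow of the retrieve-then-rerank stage can be sketched as below. The scoring function is injected so the example stays self-contained; in a real pipeline it would be a Cross-Encoder (for instance, sentence-transformers' `CrossEncoder(...).predict` over `(query, chunk)` pairs):

```python
def rerank(query, chunks, score_fn, top_k=5):
    """Second-stage re-ranking: score every (query, chunk) pair with
    score_fn and keep only the top_k highest-scoring chunks."""
    scored = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_k]

# Toy stand-in scorer: word overlap between query and chunk.
def overlap(query, chunk):
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = [
    "Qdrant stores vectors",
    "BM25 ranks by keywords",
    "Re-ranking uses cross-encoders",
]
best = rerank("how does re-ranking work", chunks, overlap, top_k=2)
```

Retrieving 20 candidates and keeping 5 trades a little latency for precision: the vector index only has to get the right answer *somewhere* in the top 20, and the re-ranker does the fine-grained ordering.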
### 4. Context Overflow & Noise
- **Problem**: Chunks are concatenated without token validation or noise reduction.
- **Reason**: Passing too much irrelevant context ("Noise") confuses the LLM and increases latency/cost.
- **Solution**: Implement **Context Filtering** and **Token Counting**. Use `tiktoken` to ensure the prompt stays within limits and use the LLM to filter out chunks that don't actually help answer the question.
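The budgeting logic can be sketched as a greedy loop over chunks in relevance order. The token counter is injected to keep the example dependency-free; in the real pipeline it would wrap `tiktoken`, e.g. `len(tiktoken.get_encoding("cl100k_base").encode(text))`, and a final truncation step (omitted here) would trim the last chunk to fit exactly:

```python
def fit_to_budget(chunks, count_tokens, budget=3000):
    """Keep chunks (already sorted by relevance) until the token
    budget is exhausted; later, less relevant chunks are dropped."""
    kept, used = [], 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n > budget:
            break
        kept.append(chunk)
        used += n
    return kept

# Toy counter: whitespace tokens stand in for tiktoken.
words = lambda text: len(text.split())
chunks = ["one two three", "four five", "six seven eight nine"]
kept = fit_to_budget(chunks, words, budget=6)
# Only the first two chunks fit a 6-token budget.
```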
---
## Proposed Enhancement Plan
### Phase 1: Robustness (Immediate)
- [x] Add `tiktoken` for context window management.
- [x] Implement query rewriting for better multi-turn retrieval.
- [x] Add explicit error handling for embedding model loading failures.
### Phase 2: Retrieval Quality (Intermediate)
- [x] Increase Qdrant retrieval depth (fetch more candidates for re-ranking).
- [x] Integrate a Cross-Encoder for Re-ranking retrieved articles.
### Phase 3: Developer Experience
- [ ] Add an evaluation pipeline (e.g., Ragas) to measure "Faithfulness" and "Answer Relevancy".
---
## Conclusion
The RAG API has been upgraded from a **Proof of Concept (PoC)** to an **Advanced RAG** implementation. It now handles complex, multi-turn questions with high precision and robust context management.
---
## Current Implementation & Solutions
As of the latest update, the following solutions have been implemented to address the weaknesses identified above:
### 1. Search Precision (Depth + Rank)
- **Status**: **Implemented**
- **Solution**: Increased initial retrieval depth (20 candidates) and integrated a second-stage re-ranking process. This way, even if semantic search does not rank the best result first, the re-ranker can surface it.
### 2. Query Transformation
- **Status**: **Implemented**
- **Solution**: Added an LLM-based query rewriting step that uses chat history to rephrase user follow-ups into standalone search queries. This eliminates "query drift" in multi-turn conversations.
### 3. Cross-Encoder Re-ranking
- **Status**: **Implemented**
- **Solution**: Integrated a dedicated `RerankerService` using a Cross-Encoder model. This re-evaluates the relevance of retrieved chunks against the actual query.
### 4. Token-Aware Context Management
- **Status**: **Implemented**
- **Solution**: Integrated `tiktoken` for precise token counting. Implemented logic to prune and truncate retrieved chunks to fit within a 3000-token budget, preventing prompt overflow.