Comprehensive RAG API Analysis


1. Architecture & API Design

The Problem (Critique)

The current RAG implementation in src/api/routes/rag.py suffers from extreme tight coupling. The routing function (chat_with_rag) handles HTTP request parsing, conversation history retrieval from the database, query transformation via LLM, searching the vector database, applying temporal biases, executing reranking, managing token limits, prompting the final LLM, mixing in warehouse data, and finally saving the interaction back to the database. This monolithic design violates the Single Responsibility Principle, making the code hard to read, exceptionally difficult to unit test, and prone to breaking during feature additions.

The Reason

During rapid prototyping and initial development phases, it is common to build "fat controllers." Developers prioritize getting the feature working end-to-end quickly rather than designing for long-term maintainability. The focus was on chaining the LangChain, Qdrant, and database operations together to prove the RAG concept works, rather than building a scalable backend architecture.

The Solution

To improve this for a real-world, production-ready environment, the RAG API needs to adopt a strict Controller-Service-Repository pattern.

  1. Routing Layer (rag.py): Should only handle request validation (Pydantic), calling the appropriate service, and formatting the HTTP output.
  2. Service Layer (rag_service.py): A dedicated service class that orchestrates the RAG pipeline. This service would coordinate the embedder, the vector_store, an llm_manager, and the interaction_db, receiving each as an injected dependency so that it can be replaced by a fake in unit tests.
  3. Discrete Workflows: Complex steps like query transformation, context formatting, and token management should be separated into their own testable functions or classes (e.g., QueryTransformer, ContextManager). This decoupling allows developers to swap out components (like changing the LLM provider or vector DB) without rewriting the core business logic.
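
The layering above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the names RagService, Retriever, Generator, InteractionStore, and RagAnswer are hypothetical, and the fakes at the bottom stand in for Qdrant, the LLM, and the database. The route handler would only validate the request, call service.answer(...), and serialize the result.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


# Hypothetical interfaces: the service depends on these shapes, not on
# concrete Qdrant, LLM, or database clients.
class Retriever(Protocol):
    def search(self, query: str, limit: int) -> list[str]: ...


class Generator(Protocol):
    def complete(self, prompt: str) -> str: ...


class InteractionStore(Protocol):
    def save(self, session_id: str, question: str, answer: str) -> None: ...


@dataclass
class RagAnswer:
    answer: str
    sources: list[str] = field(default_factory=list)


class RagService:
    """Orchestrates retrieval, generation, and persistence; knows nothing about HTTP."""

    def __init__(self, retriever: Retriever, generator: Generator,
                 store: InteractionStore, top_k: int = 5) -> None:
        self._retriever = retriever
        self._generator = generator
        self._store = store
        self._top_k = top_k

    def answer(self, session_id: str, question: str) -> RagAnswer:
        sources = self._retriever.search(question, limit=self._top_k)
        context = "\n\n".join(sources)
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
        answer = self._generator.complete(prompt)
        self._store.save(session_id, question, answer)
        return RagAnswer(answer=answer, sources=sources)


# In-memory fakes (illustrative only) show that the pipeline can be
# exercised without any network service.
class FakeRetriever:
    def search(self, query: str, limit: int) -> list[str]:
        return ["doc about " + query][:limit]


class EchoGenerator:
    def complete(self, prompt: str) -> str:
        return f"answered with {len(prompt)} prompt characters"


class MemoryStore:
    def __init__(self) -> None:
        self.rows: list[tuple[str, str, str]] = []

    def save(self, session_id: str, question: str, answer: str) -> None:
        self.rows.append((session_id, question, answer))


store = MemoryStore()
service = RagService(FakeRetriever(), EchoGenerator(), store, top_k=3)
result = service.answer("session-1", "inflation")
```

Because every collaborator is passed in through the constructor, swapping the LLM provider or the vector database means writing a new class with the same method, while RagService and its tests stay unchanged.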

2. Data Retrieval & DB Interaction

The Problem (Critique)

The current retrieval mechanism relies entirely on dense vector representations. The embedder.py script specifically mentions BGE-M3 but returns a dummy None value for sparse vectors, and vector_store.py calls Qdrant using only the dense query vector. Consequently, the system performs a plain nearest-neighbor search over dense embeddings and has no keyword awareness (no BM25 and no sparse embedding representation), so exact terms such as names, tickers, or identifiers can be missed. Furthermore, the fallback search queries sentiment_results from ClickHouse via data_warehouse.query and is rudimentary: it returns mocked hits with a flat 0.5 score instead of a true relevance score.

The Reason

Implementing true Hybrid Search (combining the semantic matching of dense embeddings with the lexical keyword matching of sparse embeddings) is complex. BGE-M3 generates both, but the Qdrant collection must be specifically configured, indexed, and queried to hold a dense and a sparse vector for each point. The developers opted for the simpler dense-only retrieval path to guarantee functionality initially, leaving sparse vectors as a "TODO" placeholder.

The Solution

To build a "Real World" robust RAG search:

  1. Activate Sparse Embeddings: Update embedder.py to extract BGE-M3's sparse lexical weights (the per-token weight dictionaries, not the ColBERT vectors, which are dense multi-vector outputs) and format them as index/value pairs for Qdrant.
  2. Implement Hybrid Search in Qdrant: Update vector_store.py's search method to use Qdrant's Query API, running one dense and one sparse candidate search and fusing the two rankings with Reciprocal Rank Fusion (RRF) or explicit weighted scoring. Note that the BGE-M3 sparse vector is a learned lexical signal that plays the role BM25 would play; it is not BM25 itself.
  3. Enhance Fallback: Improve the ClickHouse SQL fallback to filter on the query terms with text-matching predicates (LIKE or hasToken) instead of relying on basic ordering, so that it yields relevant results when the vector database is unreachable.
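
The first two steps can be sketched in plain Python. This is an illustrative sketch, not the project's code: to_sparse_vector and reciprocal_rank_fusion are hypothetical helper names, the input is assumed to be a mapping from token id to weight (the shape BGE-M3's lexical weights are assumed to take here), and k = 60 is only the commonly used RRF constant. Qdrant can also perform the fusion server-side, in which case only the conversion step would be needed on the client.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple, Union


def to_sparse_vector(
    lexical_weights: Dict[Union[str, int], float],
) -> Tuple[List[int], List[float]]:
    """Convert a {token_id: weight} mapping into parallel index/value lists.

    Assumption: keys are token ids (as strings or ints). Sorting by index
    keeps the output deterministic.
    """
    items = sorted((int(token_id), float(weight))
                   for token_id, weight in lexical_weights.items())
    indices = [index for index, _ in items]
    values = [value for _, value in items]
    return indices, values


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of document ids.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document id.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Example: "a" and "c" appear in both lists, so they outrank "b" and "d".
dense_hits = ["a", "b", "c"]
sparse_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

RRF uses only rank positions, so the dense cosine scores and the sparse dot-product scores never need to be normalized onto a common scale. That is the main reason to prefer it over weighted scoring as a first implementation.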

3. Prompt Engineering & Context Management

The Problem (Critique)

The prompt strings (RAG_PROMPT and QUERY_REWRITE_PROMPT) are hardcoded directly within src/api/routes/rag.py. Furthermore, the token limits are managed by a custom limit_context_tokens function that truncates by character count (truncated = content[:remaining * 4], which assumes roughly four characters per token) to force-fit text into an arbitrary 3000-token limit. This approach is highly destructive: it cuts strings mid-word, breaks Markdown formatting, and severs sentences. Additionally, 'Trending News' is injected ad hoc by fetching from data_warehouse.py and blindly prepending it to the context string.

The Reason

Embedding prompts directly in routing files is a common shortcut during early MVP stages. Likewise, trimming text accurately requires a real tokenizer and a boundary-aware splitter, so a naive characters-per-token approximation was used to avoid exceeding the context window of the OpenAI API.

The Solution

For real-world scaling and better response quality:

  1. Prompt Management: Move all prompt templates into a centralized src/core/prompts.py file or load them from versioned YAML/JSON configurations. This allows tuning the AI persona without altering Python backend logic.
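  2. Intelligent Text Splitting: Replace limit_context_tokens with a boundary-aware approach, for example LangChain's RecursiveCharacterTextSplitter configured with paragraph and sentence separators (\n\n, \n, ". ") and a tokenizer-based length function. The splitter alone does not enforce a budget, so whole chunks should then be kept or dropped until the token limit is reached. This ensures text is cut cleanly at paragraph or sentence boundaries, preserving meaning.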
  3. Context Construction: Formally separate the "Trending Data" injection from the standard document context injection, explicitly mapping out system instructions versus retrieved context sources. This yields cleaner behavior from large language models.
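
Points 2 and 3 can be illustrated with the standard library alone. This is a rough sketch under stated assumptions, not a replacement for a real splitter: fit_to_budget, build_context, and the [TRENDING] / [DOCUMENTS] labels are hypothetical, the sentence regex is deliberately simple, and approx_token_count is a placeholder that should be swapped for a real tokenizer in practice. The budget here applies to the retrieved documents only.

```python
import re
from typing import Callable, List


def approx_token_count(text: str) -> int:
    """Placeholder counter (about 4 characters per token); use a real tokenizer in practice."""
    return max(1, len(text) // 4) if text else 0


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' followed by whitespace (a simple heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def fit_to_budget(documents: List[str], max_tokens: int,
                  count_tokens: Callable[[str], int] = approx_token_count) -> List[str]:
    """Keep whole documents while they fit; trim the first overflowing one
    at a sentence boundary and stop. Nothing is ever cut mid-sentence."""
    kept: List[str] = []
    remaining = max_tokens
    for doc in documents:
        cost = count_tokens(doc)
        if cost <= remaining:
            kept.append(doc)
            remaining -= cost
            continue
        partial: List[str] = []
        for sentence in split_sentences(doc):
            sentence_cost = count_tokens(sentence)
            if sentence_cost > remaining:
                break
            partial.append(sentence)
            remaining -= sentence_cost
        if partial:
            kept.append(" ".join(partial))
        break
    return kept


def build_context(trending: List[str], documents: List[str], max_tokens: int,
                  count_tokens: Callable[[str], int] = approx_token_count) -> str:
    """Emit trending data and retrieved documents as separately labeled sections."""
    sections: List[str] = []
    if trending:
        sections.append("[TRENDING]\n" + "\n".join(trending))
    kept = fit_to_budget(documents, max_tokens, count_tokens)
    if kept:
        sections.append("[DOCUMENTS]\n" + "\n\n".join(kept))
    return "\n\n".join(sections)


# Example with a word-count "tokenizer" so the arithmetic is easy to follow.
def word_count(text: str) -> int:
    return len(text.split())


docs = ["One two three.", "Four five. Six seven eight. Nine."]
trimmed = fit_to_budget(docs, max_tokens=6, count_tokens=word_count)
context = build_context(["Rates up."], ["One two three."], 10, word_count)
```

In the example, the first document costs three words and fits; the second does not, so only its first sentence is kept and the rest is dropped whole rather than sliced.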

4. Error Handling, Logging, and Security

The Problem (Critique)

The current RAG implementation uses extremely broad exception catching (except Exception as e:). In rag.py, if Qdrant throws an error, it is merely printed (print(f"Error searching vector store: {e}")) and an empty result set is passed to the LLM. If query rewriting fails, the error is printed and the pipeline proceeds with the original query. Important operations therefore fail silently, and the user interface receives generic or poor answers with no indication that backend components have degraded. The built-in print function is used instead of the standard library logging module, so errors carry no severity level or context and are not easily searchable in production logs.

The Reason

Defensive programming is often implemented this way to prevent the entire API from crashing (returning an HTTP 500) if a non-critical component like temporal bias or reranking fails. However, the side effect is that failures become silent and system health cannot be monitored. The print statements were left over from local development debugging.

The Solution

In a production-ready ("Real World") backend:

  1. Structured Logging: Replace all instances of print() with calls on a module-level logger obtained from Python's standard logging.getLogger(__name__). Emit logs as JSON so log aggregation platforms (Datadog, ELK) can parse context fields (session_id, user_id).
  2. Targeted Exception Handling: Catch specific exceptions (e.g., TimeoutError, qdrant_client.http.exceptions.UnexpectedResponse). Decide explicitly which errors are fatal (raise HTTPException(status_code=500)) and which are degradable.
  3. Telemetry & Client Feedback: When a degradation occurs (e.g., Qdrant is down, using ClickHouse fallback), include a warnings or metadata dict in the HTTP JSON response so the client application knows the data might be suboptimal.
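
The three points can be combined in one small sketch using only the standard library. It is illustrative rather than definitive: JsonFormatter, retrieve_with_fallback, the "rag" logger name, and the vector_store_unavailable warning code are hypothetical, and the two built-in exception types stand in for whichever client-specific exceptions the real vector store raises.

```python
import json
import logging
from typing import Any, Callable, Dict, List


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including selected context fields."""

    CONTEXT_FIELDS = ("session_id", "component")

    def format(self, record: logging.LogRecord) -> str:
        payload: Dict[str, Any] = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


logger = logging.getLogger("rag")


def retrieve_with_fallback(
    query: str,
    session_id: str,
    primary: Callable[[str], List[str]],
    fallback: Callable[[str], List[str]],
) -> Dict[str, Any]:
    """Try the primary search; on a known, degradable failure, log it,
    use the fallback, and report the degradation to the caller."""
    warnings: List[str] = []
    try:
        hits = primary(query)
    except (TimeoutError, ConnectionError) as exc:
        # Only these errors are treated as degradable; anything else propagates.
        logger.warning(
            "vector search failed, using fallback: %s", exc,
            extra={"session_id": session_id, "component": "vector_store"},
        )
        warnings.append("vector_store_unavailable")
        hits = fallback(query)
    return {"hits": hits, "warnings": warnings}


# Example wiring: attach the formatter once at application start-up.
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)


def broken_primary(query: str) -> List[str]:
    raise TimeoutError("qdrant timed out")


def warehouse_fallback(query: str) -> List[str]:
    return ["fallback hit for " + query]


degraded = retrieve_with_fallback("rates", "s1", broken_primary, warehouse_fallback)
```

Returning the warnings alongside the hits lets the route include them in the JSON response, so the client can tell a degraded answer from a normal one instead of silently receiving weaker results.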