rag-api-node-1 / docs /Back end Arctecture /scalable_architecture.md
Peterase's picture
feat(rag): implement hybrid search with live sources and production-grade intent classification
a63c61f

RAG API Design: Retrieval Architecture

This document focuses specifically on the API layer designed to retrieve data from our existing, highly optimized data pipeline. Because the heavy lifting of processing, vectorization (BGE-M3 Dense + Sparse), and indexing is already handled by the Kafka and Qdrant workers, this API is designed purely for scalable, high-performance retrieval and generation.


🎯 1. Core API Philosophy

The RAG API acts as the bridge between user queries (from the frontend) and the populated Qdrant vector database.

  1. Read-Only Operations: This API does not write to Qdrant or ClickHouse. It assumes the databases are already hydrated by the Kafka workers.
  2. Symmetry with Ingestion: The API must use the exact same BGE-M3 model for hashing user queries that the Embedding Service uses to hash news articles.
  3. Statelessness: The API nodes hold no session state, allowing infinite horizontal scaling behind a Load Balancer.

🌐 2. Core API Endpoints

2.1 POST /api/v1/search (Hybrid Search Only)

  • Purpose: The fastest way to find relevant articles without generating an LLM response. Useful for standard "News Search" bars.
  • Input Request:
    {
      "query": "Quantum computing breakthroughs in 2026",
      "limit": 10,
      "filters": {
        "source": ["TechCrunch", "Wired"],
        "date_range": { "start": "2026-01-01", "end": "2026-12-31" }
      }
    }
    
  • Internal Flow:
    1. Passes the query text through the BGE-M3 Tokenizer & Model (synchronously or via lightweight async executor).
    2. Extracts the Dense vector (1024-dim) and Sparse lexical weights.
    3. Queries Qdrant using a Prefetch query (combining Dense + Sparse scoring).
    4. Extracts the Qdrant payload (article metadata) and returns it.
  • Response: A JSON list of articles sorted by relevance score.

2.2 POST /api/v1/rag/ask (Full RAG Flow)

  • Purpose: The endpoint for natural language Q&A. This hits Qdrant first, then sends the context to the LLM.
  • Input Request:
    {
      "question": "What did Google recently announce regarding quantum processors?",
      "stream": true, // Critical for UX
      "top_k": 5
    }
    
  • Internal Flow:
    1. Retrieve: Performs the exact same Hybrid Search as /api/v1/search to get the top 5 article chunks.
    2. Prompt Assembly: Constructs a structured prompt template: "Use the following news articles to answer the question...\n\nCONTEXT:\n[Article 1 Text...]\n[Article 2 Text...]\n\nQUESTION: What did Google recently announce..."
    3. Generate: Sends the assembled prompt to the LLM (OpenAI, local Llama-3, etc.).
    4. Stream: Uses Server-Sent Events (SSE) to yield tokens to the frontend as they are generated.

🧠 3. Query Vectorization Pipeline (Symmetry)

For Qdrant search to work perfectly, the API must emulate Step 4 of the Data Flow Pipeline exactly.

# RAG API Vectorization Logic
def vectorize_query(query_text: str):
    # Uses the SAME FlagEmbedding configuration as the ingestor
    embeddings = model.encode(
        sentences=[query_text],
        batch_size=1,
        max_length=512, # Queries are shorter than articles
        return_dense=True,
        return_sparse=True,
        return_colbert_vecs=False
    )
    
    return {
        "dense": embeddings['dense_vecs'][0].tolist(),
        "sparse": {
            "indices": list(embeddings['lexical_weights'][0].keys()),
            "values": list(embeddings['lexical_weights'][0].values())
        }
    }

⚑ 4. Scalability at the Retrieval Layer

Since the Heavy ETL is done by the pipelines, the API's main bottleneck is waiting for Qdrant and the LLM.

4.1 Async FastAPI

  • The API is built purely on async def endpoints.
  • When the API queries Qdrant (await qdrant_client.async_search(...)), it yields the thread back to the event loop.
  • A single FastAPI container can handle thousands of concurrent searches while waiting for Qdrant to respond.

4.2 Semantic Query Caching (Redis)

To save LLM compute and Qdrant load:

  • We implement Redis Semantic Caching.
  • If User A asks: "What is Tesla's stock doing?" and User B asks "How is the Tesla stock performing?", the semantic cache recognizes the queries are identical in meaning (High Cosine Similarity) and instantly returns User A's cached LLM response to User B.

4.3 Streaming (SSE) for LLMs

  • Generating a 500-word RAG answer might take the LLM 3 seconds. Instead of a loading spinner for 3 seconds, the API uses StreamingResponse. The user sees the first word in 200ms, creating a "Real-Time" feel.

πŸ“Š 5. Integration with Pipeline Analytics

If the RAG API needs to answer questions like "How many articles mentioned AI today?", it should NOT query Qdrant. Qdrant is a Vector Search engine, not an Analytics database.

For structured analytics, the API connects directly to ClickHouse (which the Kafka sink worker hydrates), allowing real-time aggregations without disturbing the vector search performance.