Spaces:
Running
RAG API Design: Retrieval Architecture
This document focuses specifically on the API layer designed to retrieve data from our existing, highly optimized data pipeline. Because the heavy lifting of processing, vectorization (BGE-M3 Dense + Sparse), and indexing is already handled by the Kafka and Qdrant workers, this API is designed purely for scalable, high-performance retrieval and generation.
π― 1. Core API Philosophy
The RAG API acts as the bridge between user queries (from the frontend) and the populated Qdrant vector database.
- Read-Only Operations: This API does not write to Qdrant or ClickHouse. It assumes the databases are already hydrated by the Kafka workers.
- Symmetry with Ingestion: The API must use the exact same BGE-M3 model for hashing user queries that the Embedding Service uses to hash news articles.
- Statelessness: The API nodes hold no session state, allowing infinite horizontal scaling behind a Load Balancer.
π 2. Core API Endpoints
2.1 POST /api/v1/search (Hybrid Search Only)
- Purpose: The fastest way to find relevant articles without generating an LLM response. Useful for standard "News Search" bars.
- Input Request:
{ "query": "Quantum computing breakthroughs in 2026", "limit": 10, "filters": { "source": ["TechCrunch", "Wired"], "date_range": { "start": "2026-01-01", "end": "2026-12-31" } } } - Internal Flow:
- Passes the
querytext through the BGE-M3 Tokenizer & Model (synchronously or via lightweight async executor). - Extracts the
Densevector (1024-dim) andSparselexical weights. - Queries Qdrant using a
Prefetchquery (combining Dense + Sparse scoring). - Extracts the Qdrant
payload(article metadata) and returns it.
- Passes the
- Response: A JSON list of articles sorted by relevance score.
2.2 POST /api/v1/rag/ask (Full RAG Flow)
- Purpose: The endpoint for natural language Q&A. This hits Qdrant first, then sends the context to the LLM.
- Input Request:
{ "question": "What did Google recently announce regarding quantum processors?", "stream": true, // Critical for UX "top_k": 5 } - Internal Flow:
- Retrieve: Performs the exact same Hybrid Search as
/api/v1/searchto get the top 5 article chunks. - Prompt Assembly: Constructs a structured prompt template:
"Use the following news articles to answer the question...\n\nCONTEXT:\n[Article 1 Text...]\n[Article 2 Text...]\n\nQUESTION: What did Google recently announce..." - Generate: Sends the assembled prompt to the LLM (OpenAI, local Llama-3, etc.).
- Stream: Uses Server-Sent Events (SSE) to yield tokens to the frontend as they are generated.
- Retrieve: Performs the exact same Hybrid Search as
π§ 3. Query Vectorization Pipeline (Symmetry)
For Qdrant search to work perfectly, the API must emulate Step 4 of the Data Flow Pipeline exactly.
# RAG API Vectorization Logic
def vectorize_query(query_text: str):
# Uses the SAME FlagEmbedding configuration as the ingestor
embeddings = model.encode(
sentences=[query_text],
batch_size=1,
max_length=512, # Queries are shorter than articles
return_dense=True,
return_sparse=True,
return_colbert_vecs=False
)
return {
"dense": embeddings['dense_vecs'][0].tolist(),
"sparse": {
"indices": list(embeddings['lexical_weights'][0].keys()),
"values": list(embeddings['lexical_weights'][0].values())
}
}
β‘ 4. Scalability at the Retrieval Layer
Since the Heavy ETL is done by the pipelines, the API's main bottleneck is waiting for Qdrant and the LLM.
4.1 Async FastAPI
- The API is built purely on
async defendpoints. - When the API queries Qdrant (
await qdrant_client.async_search(...)), it yields the thread back to the event loop. - A single FastAPI container can handle thousands of concurrent searches while waiting for Qdrant to respond.
4.2 Semantic Query Caching (Redis)
To save LLM compute and Qdrant load:
- We implement Redis Semantic Caching.
- If User A asks: "What is Tesla's stock doing?" and User B asks "How is the Tesla stock performing?", the semantic cache recognizes the queries are identical in meaning (High Cosine Similarity) and instantly returns User A's cached LLM response to User B.
4.3 Streaming (SSE) for LLMs
- Generating a 500-word RAG answer might take the LLM 3 seconds. Instead of a loading spinner for 3 seconds, the API uses
StreamingResponse. The user sees the first word in 200ms, creating a "Real-Time" feel.
π 5. Integration with Pipeline Analytics
If the RAG API needs to answer questions like "How many articles mentioned AI today?", it should NOT query Qdrant. Qdrant is a Vector Search engine, not an Analytics database.
For structured analytics, the API connects directly to ClickHouse (which the Kafka sink worker hydrates), allowing real-time aggregations without disturbing the vector search performance.