Spaces:

Peterase
/

rag-api-node-1

Running

App Files Files Community

rag-api-node-1 / docs /Back end Arctecture /scalable_architecture.md

Peterase

feat(rag): implement hybrid search with live sources and production-grade intent classification

a63c61f 29 days ago

preview code

raw

history blame contribute delete

5.2 kB

	# RAG API Design: Retrieval Architecture

	This document focuses specifically on the API layer designed to retrieve data from our existing, highly optimized data pipeline. Because the heavy lifting of processing, vectorization (BGE-M3 Dense + Sparse), and indexing is already handled by the Kafka and Qdrant workers, this API is designed purely for scalable, high-performance retrieval and generation.

	---

	## 🎯 1. Core API Philosophy

	The RAG API acts as the bridge between user queries (from the frontend) and the populated Qdrant vector database.

	1. Read-Only Operations: This API does not write to Qdrant or ClickHouse. It assumes the databases are already hydrated by the Kafka workers.
	2. Symmetry with Ingestion: The API must use the exact same BGE-M3 model for hashing user queries that the Embedding Service uses to hash news articles.
	3. Statelessness: The API nodes hold no session state, allowing infinite horizontal scaling behind a Load Balancer.

	---

	## 🌐 2. Core API Endpoints

	### 2.1 `POST /api/v1/search` (Hybrid Search Only)
	* Purpose: The fastest way to find relevant articles without generating an LLM response. Useful for standard "News Search" bars.
	* Input Request:
	```json
	{
	"query": "Quantum computing breakthroughs in 2026",
	"limit": 10,
	"filters": {
	"source": ["TechCrunch", "Wired"],
	"date_range": { "start": "2026-01-01", "end": "2026-12-31" }
	}
	}
	```
	* Internal Flow:
	1. Passes the `query` text through the BGE-M3 Tokenizer & Model (synchronously or via lightweight async executor).
	2. Extracts the `Dense` vector (1024-dim) and `Sparse` lexical weights.
	3. Queries Qdrant using a `Prefetch` query (combining Dense + Sparse scoring).
	4. Extracts the Qdrant `payload` (article metadata) and returns it.
	* Response: A JSON list of articles sorted by relevance score.

	### 2.2 `POST /api/v1/rag/ask` (Full RAG Flow)
	* Purpose: The endpoint for natural language Q&A. This hits Qdrant first, then sends the context to the LLM.
	* Input Request:
	```json
	{
	"question": "What did Google recently announce regarding quantum processors?",
	"stream": true, // Critical for UX
	"top_k": 5
	}
	```
	* Internal Flow:
	1. Retrieve: Performs the exact same Hybrid Search as `/api/v1/search` to get the top 5 article chunks.
	2. Prompt Assembly: Constructs a structured prompt template:
	`"Use the following news articles to answer the question...\n\nCONTEXT:\n[Article 1 Text...]\n[Article 2 Text...]\n\nQUESTION: What did Google recently announce..."`
	3. Generate: Sends the assembled prompt to the LLM (OpenAI, local Llama-3, etc.).
	4. Stream: Uses Server-Sent Events (SSE) to yield tokens to the frontend as they are generated.

	---

	## 🧠 3. Query Vectorization Pipeline (Symmetry)

	For Qdrant search to work perfectly, the API must emulate Step 4 of the Data Flow Pipeline exactly.

	```python
	# RAG API Vectorization Logic
	def vectorize_query(query_text: str):
	# Uses the SAME FlagEmbedding configuration as the ingestor
	embeddings = model.encode(
	sentences=[query_text],
	batch_size=1,
	max_length=512, # Queries are shorter than articles
	return_dense=True,
	return_sparse=True,
	return_colbert_vecs=False
	)

	return {
	"dense": embeddings['dense_vecs'][0].tolist(),
	"sparse": {
	"indices": list(embeddings['lexical_weights'][0].keys()),
	"values": list(embeddings['lexical_weights'][0].values())
	}
	}
	```

	---

	## ⚡ 4. Scalability at the Retrieval Layer

	Since the Heavy ETL is done by the pipelines, the API's main bottleneck is waiting for Qdrant and the LLM.

	### 4.1 Async FastAPI
	* The API is built purely on `async def` endpoints.
	* When the API queries Qdrant (`await qdrant_client.async_search(...)`), it yields the thread back to the event loop.
	* A single FastAPI container can handle thousands of concurrent searches while waiting for Qdrant to respond.

	### 4.2 Semantic Query Caching (Redis)
	To save LLM compute and Qdrant load:
	* We implement Redis Semantic Caching.
	* If User A asks: "What is Tesla's stock doing?" and User B asks "How is the Tesla stock performing?", the semantic cache recognizes the queries are identical in meaning (High Cosine Similarity) and instantly returns User A's cached LLM response to User B.

	### 4.3 Streaming (SSE) for LLMs
	* Generating a 500-word RAG answer might take the LLM 3 seconds. Instead of a loading spinner for 3 seconds, the API uses `StreamingResponse`. The user sees the first word in 200ms, creating a "Real-Time" feel.

	---

	## 📊 5. Integration with Pipeline Analytics
	If the RAG API needs to answer questions like "How many articles mentioned AI today?", it should NOT query Qdrant.
	Qdrant is a Vector Search engine, not an Analytics database.

	For structured analytics, the API connects directly to ClickHouse (which the Kafka `sink` worker hydrates), allowing real-time aggregations without disturbing the vector search performance.