Spaces:
Running
Running
| # RAG API Design: Retrieval Architecture | |
| This document focuses specifically on the API layer designed to **retrieve** data from our existing, highly optimized data pipeline. Because the heavy lifting of processing, vectorization (BGE-M3 Dense + Sparse), and indexing is already handled by the Kafka and Qdrant workers, this API is designed purely for **scalable, high-performance retrieval and generation**. | |
| --- | |
| ## π― 1. Core API Philosophy | |
| The RAG API acts as the bridge between user queries (from the frontend) and the populated Qdrant vector database. | |
| 1. **Read-Only Operations:** This API does *not* write to Qdrant or ClickHouse. It assumes the databases are already hydrated by the Kafka workers. | |
| 2. **Symmetry with Ingestion:** The API must use the exact same BGE-M3 model for hashing user queries that the Embedding Service uses to hash news articles. | |
| 3. **Statelessness:** The API nodes hold no session state, allowing infinite horizontal scaling behind a Load Balancer. | |
| --- | |
| ## π 2. Core API Endpoints | |
| ### 2.1 `POST /api/v1/search` (Hybrid Search Only) | |
| * **Purpose:** The fastest way to find relevant articles without generating an LLM response. Useful for standard "News Search" bars. | |
| * **Input Request:** | |
| ```json | |
| { | |
| "query": "Quantum computing breakthroughs in 2026", | |
| "limit": 10, | |
| "filters": { | |
| "source": ["TechCrunch", "Wired"], | |
| "date_range": { "start": "2026-01-01", "end": "2026-12-31" } | |
| } | |
| } | |
| ``` | |
| * **Internal Flow:** | |
| 1. Passes the `query` text through the BGE-M3 Tokenizer & Model (synchronously or via lightweight async executor). | |
| 2. Extracts the `Dense` vector (1024-dim) and `Sparse` lexical weights. | |
| 3. Queries Qdrant using a `Prefetch` query (combining Dense + Sparse scoring). | |
| 4. Extracts the Qdrant `payload` (article metadata) and returns it. | |
| * **Response:** A JSON list of articles sorted by relevance score. | |
| ### 2.2 `POST /api/v1/rag/ask` (Full RAG Flow) | |
| * **Purpose:** The endpoint for natural language Q&A. This hits Qdrant first, then sends the context to the LLM. | |
| * **Input Request:** | |
| ```json | |
| { | |
| "question": "What did Google recently announce regarding quantum processors?", | |
| "stream": true, // Critical for UX | |
| "top_k": 5 | |
| } | |
| ``` | |
| * **Internal Flow:** | |
| 1. **Retrieve:** Performs the exact same Hybrid Search as `/api/v1/search` to get the top 5 article chunks. | |
| 2. **Prompt Assembly:** Constructs a structured prompt template: | |
| `"Use the following news articles to answer the question...\n\nCONTEXT:\n[Article 1 Text...]\n[Article 2 Text...]\n\nQUESTION: What did Google recently announce..."` | |
| 3. **Generate:** Sends the assembled prompt to the LLM (OpenAI, local Llama-3, etc.). | |
| 4. **Stream:** Uses Server-Sent Events (SSE) to yield tokens to the frontend as they are generated. | |
| --- | |
| ## π§ 3. Query Vectorization Pipeline (Symmetry) | |
| For Qdrant search to work perfectly, the API must emulate Step 4 of the *Data Flow Pipeline* exactly. | |
| ```python | |
| # RAG API Vectorization Logic | |
| def vectorize_query(query_text: str): | |
| # Uses the SAME FlagEmbedding configuration as the ingestor | |
| embeddings = model.encode( | |
| sentences=[query_text], | |
| batch_size=1, | |
| max_length=512, # Queries are shorter than articles | |
| return_dense=True, | |
| return_sparse=True, | |
| return_colbert_vecs=False | |
| ) | |
| return { | |
| "dense": embeddings['dense_vecs'][0].tolist(), | |
| "sparse": { | |
| "indices": list(embeddings['lexical_weights'][0].keys()), | |
| "values": list(embeddings['lexical_weights'][0].values()) | |
| } | |
| } | |
| ``` | |
| --- | |
| ## β‘ 4. Scalability at the Retrieval Layer | |
| Since the Heavy ETL is done by the pipelines, the API's main bottleneck is **waiting** for Qdrant and the LLM. | |
| ### 4.1 Async FastAPI | |
| * The API is built purely on `async def` endpoints. | |
| * When the API queries Qdrant (`await qdrant_client.async_search(...)`), it yields the thread back to the event loop. | |
| * A single FastAPI container can handle thousands of concurrent searches while waiting for Qdrant to respond. | |
| ### 4.2 Semantic Query Caching (Redis) | |
| To save LLM compute and Qdrant load: | |
| * We implement Redis **Semantic Caching**. | |
| * If User A asks: *"What is Tesla's stock doing?"* and User B asks *"How is the Tesla stock performing?"*, the semantic cache recognizes the queries are identical in meaning (High Cosine Similarity) and instantly returns User A's cached LLM response to User B. | |
| ### 4.3 Streaming (SSE) for LLMs | |
| * Generating a 500-word RAG answer might take the LLM 3 seconds. Instead of a loading spinner for 3 seconds, the API uses `StreamingResponse`. The user sees the first word in 200ms, creating a "Real-Time" feel. | |
| --- | |
| ## π 5. Integration with Pipeline Analytics | |
| If the RAG API needs to answer questions like *"How many articles mentioned AI today?"*, it should NOT query Qdrant. | |
| Qdrant is a Vector Search engine, not an Analytics database. | |
| For structured analytics, the API connects directly to **ClickHouse** (which the Kafka `sink` worker hydrates), allowing real-time aggregations without disturbing the vector search performance. | |