Spaces:
Running
Running
Commit ·
a2f2da3
1
Parent(s): 9293399
Add conversational memory
Browse files- ARCHITECTURE_EXPLANATION.md +118 -0
- api/main.py +43 -6
- api/schemas.py +5 -0
- generation/generator.py +127 -8
- ui/static/script.js +31 -8
ARCHITECTURE_EXPLANATION.md
ADDED
|
@@ -0,0 +1,118 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Cortex RAG — Architecture & Implementation Guide
|
| 2 |
+
|
| 3 |
+
This document provides a deep dive into the architecture of **Cortex**, a production-grade Retrieval-Augmented Generation (RAG) system. This guide is structured to help you explain the "what", "how", and "why" of each layer during your GenAI Engineer interview.
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## 🏗️ 1. High-Level Architecture Overview
|
| 8 |
+
|
| 9 |
+
Cortex follows a modular, multi-layer RAG architecture designed for high precision, scalability, and reliability. It moves beyond "naive RAG" by implementing:
|
| 10 |
+
- **Semantic Data Ingestion** (instead of fixed-size chunking)
|
| 11 |
+
- **Hybrid Multi-Strategy Retrieval** (Dense + Sparse + Knowledge Graph)
|
| 12 |
+
- **Corrective Gating (CRAG)** (to handle retrieval failures)
|
| 13 |
+
- **Reference-Free Evaluation** (using RAGAS)
|
| 14 |
+
|
| 15 |
+
---
|
| 16 |
+
|
| 17 |
+
## 📥 2. Ingestion Layer: "Context-Aware Processing"
|
| 18 |
+
|
| 19 |
+
### **Document Loading**
|
| 20 |
+
- Supports multiple formats: PDF, HTML, and TXT.
|
| 21 |
+
- **Implementation:** `DocumentLoader` handles parsing and basic cleaning.
|
| 22 |
+
|
| 23 |
+
### **Semantic Chunker (`ingestion/chunker.py`)**
|
| 24 |
+
- **The Problem:** Fixed-size chunking (e.g., 512 tokens) often splits mid-sentence or mid-concept, losing semantic coherence.
|
| 25 |
+
- **The Solution:** We use **Sentence-Level Semantic Boundary Detection**.
|
| 26 |
+
- **How it works:**
|
| 27 |
+
1. Split text into individual sentences.
|
| 28 |
+
2. Embed each sentence using `BAAI/bge-small-en-v1.5`.
|
| 29 |
+
3. Compute the **cosine similarity** between consecutive sentence embeddings.
|
| 30 |
+
4. Insert a chunk boundary whenever the similarity drops below a certain threshold (e.g., 0.82) or the token limit is reached.
|
| 31 |
+
- **Why?** This ensures each chunk contains a single, coherent topic.
|
| 32 |
+
|
| 33 |
+
### **Parent-Child Hierarchy**
|
| 34 |
+
- **The Problem:** Small chunks are better for retrieval precision, but large chunks provide better context for generation.
|
| 35 |
+
- **The Implementation:**
|
| 36 |
+
- **Child Chunks (~256 tokens):** These are the units indexed in the vector database. They represent a specific "nugget" of information.
|
| 37 |
+
- **Parent Chunks (~1024 tokens):** A wider window of text centered on the child. When a child is retrieved, its **parent text** is what gets sent to the LLM.
|
| 38 |
+
- **Why?** It decouples **retrieval granularity** (find exactly what you need) from **context width** (give the LLM enough room to understand).
|
| 39 |
+
|
| 40 |
+
---
|
| 41 |
+
|
| 42 |
+
## 🔍 3. Retrieval Layer: "Multi-Strategy Orchestration"
|
| 43 |
+
|
| 44 |
+
Cortex doesn't just rely on vector search; it uses a `MultiStrategyRetriever` to combine different search paradigms.
|
| 45 |
+
|
| 46 |
+
### **A. Dense Retrieval (Milvus)**
|
| 47 |
+
- **Embeddings:** `bge-small-en-v1.5` (384-dim).
|
| 48 |
+
- **Vector DB:** Milvus (Dockerized).
|
| 49 |
+
- **Indexing:** `IVF_FLAT` with `COSINE` similarity metric.
|
| 50 |
+
- **Why?** Captures semantic meaning (e.g., "puppy" matches "dog").
|
| 51 |
+
|
| 52 |
+
### **B. Sparse Retrieval (BM25)**
|
| 53 |
+
- **Implementation:** `rank_bm25` library.
|
| 54 |
+
- **Why?** Essential for exact keyword matching, acronyms, and specific names (e.g., "Project Cortex-X1") where vector search might be too "fuzzy".
|
| 55 |
+
|
| 56 |
+
### **C. Knowledge Graph (GraphRAG)**
|
| 57 |
+
- **Extraction:** During ingestion, we use **spaCy** for Named Entity Recognition (NER) and **REBEL** (or LLM) for relation extraction.
|
| 58 |
+
- **Storage:** A NetworkX graph storing triples: `(Subject) --[Predicate]--> (Object)`.
|
| 59 |
+
- **Retrieval:**
|
| 60 |
+
1. Extract entities from the user query.
|
| 61 |
+
2. Traverse the graph to find related nodes (multi-hop traversal).
|
| 62 |
+
3. Retrieve the chunks associated with those nodes.
|
| 63 |
+
- **Why?** Solves "multi-hop" queries where the answer requires connecting disparate pieces of information across the document.
|
| 64 |
+
|
| 65 |
+
### **D. Fusion & Reranking (`retrieval/fusion.py`)**
|
| 66 |
+
- **RRF (Reciprocal Rank Fusion):** Combines the ranked lists from Milvus, BM25, and the Graph into one unified list.
|
| 67 |
+
- **Cross-Encoder Reranker:** We take the top-15 fused candidates and run them through a Cross-Encoder (e.g., `BAAI/bge-reranker-base`).
|
| 68 |
+
- **Why?** Cross-encoders are much more accurate (but slower) than vector search because they look at the query and chunk simultaneously. Using them as a final "filter" boosts precision significantly.
|
| 69 |
+
|
| 70 |
+
---
|
| 71 |
+
|
| 72 |
+
## 🧠 4. Generation Layer: "Corrective RAG (CRAG)"
|
| 73 |
+
|
| 74 |
+
The `CRAGGate` (`generation/crag.py`) acts as a "quality control" layer between retrieval and the LLM.
|
| 75 |
+
|
| 76 |
+
### **The CRAG Workflow**
|
| 77 |
+
1. **Grading:** An LLM-as-judge assesses if the retrieved chunks are relevant to the query.
|
| 78 |
+
2. **Action Categories:**
|
| 79 |
+
- **GOOD:** Chunks are relevant. Proceed to generation.
|
| 80 |
+
- **POOR:** Chunks are partially relevant. **Rewrite the query** (using CoT) and re-retrieve to find better results.
|
| 81 |
+
- **ABSENT:** Knowledge base doesn't have the answer. **Fallback to Web Search** (Tavily/DuckDuckGo).
|
| 82 |
+
3. **LLM Generation:** Uses Groq (Llama 3), OpenAI, or NVIDIA NIM to generate the final answer with **inline citations** (e.g., "The sky is blue [1].").
|
| 83 |
+
|
| 84 |
+
---
|
| 85 |
+
|
| 86 |
+
## 📊 5. Evaluation Layer: "Reference-Free Metrics"
|
| 87 |
+
|
| 88 |
+
Since production RAG systems often lack "ground truth" answers, we use the **RAGAS** framework (`evaluation/ragas_eval.py`).
|
| 89 |
+
|
| 90 |
+
### **Key Metrics**
|
| 91 |
+
- **Faithfulness:** Does the answer stay true to the retrieved context? (Prevents hallucinations).
|
| 92 |
+
- **Answer Relevancy:** Does the answer actually address the user's question?
|
| 93 |
+
- **Context Precision:** Were the retrieved chunks actually useful?
|
| 94 |
+
- **Context Utilisation:** What % of retrieved chunks were actually cited?
|
| 95 |
+
|
| 96 |
+
### **Implementation**
|
| 97 |
+
- Evaluations run **asynchronously** in background threads so they don't slow down the user's response time.
|
| 98 |
+
- Results are stored in a local SQLite DB for monitoring.
|
| 99 |
+
|
| 100 |
+
---
|
| 101 |
+
|
| 102 |
+
## 🛠️ 6. System & Infrastructure
|
| 103 |
+
|
| 104 |
+
- **API:** FastAPI for high-performance, asynchronous endpoints.
|
| 105 |
+
- **UI:** Streamlit for a clean, interactive dashboard (Ask, Ingest, Monitor).
|
| 106 |
+
- **Cache:** Redis for caching query results (TTL-based) to save LLM costs and latency.
|
| 107 |
+
- **Deployment:** Full **Docker Compose** setup for Milvus, Redis, API, and UI.
|
| 108 |
+
|
| 109 |
+
---
|
| 110 |
+
|
| 111 |
+
## 💡 Interview Tip: "Why this architecture?"
|
| 112 |
+
|
| 113 |
+
If asked why you built it this way, emphasize these three points:
|
| 114 |
+
1. **Precision:** By using **Semantic Chunking** and **Cross-Encoder Reranking**, we ensure only the most relevant context reaches the LLM.
|
| 115 |
+
2. **Reliability:** **CRAG** ensures the system doesn't hallucinate when the knowledge base is missing information.
|
| 116 |
+
3. **Observability:** By integrating **RAGAS**, we have an automated way to track performance and catch regressions.
|
| 117 |
+
|
| 118 |
+
Good luck with your interview! 🚀
|
api/main.py
CHANGED
|
@@ -32,6 +32,7 @@ from fastapi.middleware.cors import CORSMiddleware
|
|
| 32 |
from fastapi.responses import StreamingResponse
|
| 33 |
|
| 34 |
from api.schemas import (
|
|
|
|
| 35 |
HealthResponse,
|
| 36 |
IngestRequest,
|
| 37 |
IngestResponse,
|
|
@@ -324,8 +325,25 @@ async def query(req: QueryRequest) -> QueryResponse:
|
|
| 324 |
import time as _time
|
| 325 |
_t0 = _time.perf_counter()
|
| 326 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 327 |
try:
|
| 328 |
-
retrieval = _retriever.retrieve(
|
| 329 |
except Exception as exc:
|
| 330 |
logger.exception("Retrieval error")
|
| 331 |
raise HTTPException(status_code=500, detail=f"Retrieval failed: {exc}")
|
|
@@ -359,7 +377,7 @@ async def query(req: QueryRequest) -> QueryResponse:
|
|
| 359 |
try:
|
| 360 |
result = _generator.generate(
|
| 361 |
GenerationRequest(
|
| 362 |
-
query=
|
| 363 |
provider=llm_provider, model=llm_model,
|
| 364 |
api_key=llm_api_key, base_url=llm_base_url,
|
| 365 |
)
|
|
@@ -395,6 +413,7 @@ async def query(req: QueryRequest) -> QueryResponse:
|
|
| 395 |
return QueryResponse(
|
| 396 |
query=req.query,
|
| 397 |
answer=result.answer,
|
|
|
|
| 398 |
citations=[
|
| 399 |
CitationResponse(
|
| 400 |
number=c.number,
|
|
@@ -436,11 +455,27 @@ async def query_stream(req: QueryRequest):
|
|
| 436 |
cfg = get_settings()
|
| 437 |
k = req.top_k or cfg.retrieval_top_k
|
| 438 |
print(req)
|
|
|
|
| 439 |
async def event_stream() -> AsyncGenerator[str, None]:
|
| 440 |
try:
|
| 441 |
-
# 1.
|
| 442 |
-
|
| 443 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 444 |
final_chunks = result.chunks
|
| 445 |
|
| 446 |
# 2. Emit chunk metadata + routing decision so UI shows sources + strategy info immediately
|
|
@@ -458,6 +493,7 @@ async def query_stream(req: QueryRequest):
|
|
| 458 |
yield _sse_event({
|
| 459 |
"type": "chunk_meta",
|
| 460 |
"chunks": chunk_meta,
|
|
|
|
| 461 |
"routing": {
|
| 462 |
"intent": result.decision.intent.value,
|
| 463 |
"strategies": result.decision.strategies,
|
|
@@ -495,7 +531,8 @@ async def query_stream(req: QueryRequest):
|
|
| 495 |
# 4. Stream answer tokens
|
| 496 |
_llm = req.llm or {}
|
| 497 |
gen_request = GenerationRequest(
|
| 498 |
-
query=
|
|
|
|
| 499 |
provider=getattr(_llm, 'provider', None),
|
| 500 |
model=getattr(_llm, 'model', None),
|
| 501 |
api_key=getattr(_llm, 'api_key', None),
|
|
|
|
| 32 |
from fastapi.responses import StreamingResponse
|
| 33 |
|
| 34 |
from api.schemas import (
|
| 35 |
+
ConversationTurn,
|
| 36 |
HealthResponse,
|
| 37 |
IngestRequest,
|
| 38 |
IngestResponse,
|
|
|
|
| 325 |
import time as _time
|
| 326 |
_t0 = _time.perf_counter()
|
| 327 |
|
| 328 |
+
# ── Short-term memory: rewrite ambiguous follow-ups ────
|
| 329 |
+
conversation = [{"role": t.role, "content": t.content} for t in req.conversation]
|
| 330 |
+
effective_query = req.query
|
| 331 |
+
memory_rewritten = None
|
| 332 |
+
if conversation:
|
| 333 |
+
_llm_rw = req.llm or {}
|
| 334 |
+
_rewritten = _generator.rewrite_query(
|
| 335 |
+
query=req.query,
|
| 336 |
+
conversation=conversation,
|
| 337 |
+
provider=getattr(_llm_rw, 'provider', None),
|
| 338 |
+
model=getattr(_llm_rw, 'model', None),
|
| 339 |
+
api_key=getattr(_llm_rw, 'api_key', None),
|
| 340 |
+
)
|
| 341 |
+
if _rewritten != req.query:
|
| 342 |
+
effective_query = _rewritten
|
| 343 |
+
memory_rewritten = _rewritten
|
| 344 |
+
|
| 345 |
try:
|
| 346 |
+
retrieval = _retriever.retrieve(effective_query, top_k_candidates=k, final_top_k=cfg.final_top_k)
|
| 347 |
except Exception as exc:
|
| 348 |
logger.exception("Retrieval error")
|
| 349 |
raise HTTPException(status_code=500, detail=f"Retrieval failed: {exc}")
|
|
|
|
| 377 |
try:
|
| 378 |
result = _generator.generate(
|
| 379 |
GenerationRequest(
|
| 380 |
+
query=effective_query, chunks=final_chunks, conversation=conversation,
|
| 381 |
provider=llm_provider, model=llm_model,
|
| 382 |
api_key=llm_api_key, base_url=llm_base_url,
|
| 383 |
)
|
|
|
|
| 413 |
return QueryResponse(
|
| 414 |
query=req.query,
|
| 415 |
answer=result.answer,
|
| 416 |
+
memory_rewritten_query=memory_rewritten,
|
| 417 |
citations=[
|
| 418 |
CitationResponse(
|
| 419 |
number=c.number,
|
|
|
|
| 455 |
cfg = get_settings()
|
| 456 |
k = req.top_k or cfg.retrieval_top_k
|
| 457 |
print(req)
|
| 458 |
+
|
| 459 |
async def event_stream() -> AsyncGenerator[str, None]:
|
| 460 |
try:
|
| 461 |
+
# 1. Short-term memory: rewrite ambiguous follow-ups
|
| 462 |
+
_conv = [{"role": t.role, "content": t.content} for t in req.conversation]
|
| 463 |
+
_eff_query = req.query
|
| 464 |
+
_mem_rewrite = None
|
| 465 |
+
if _conv:
|
| 466 |
+
_llm_rw2 = req.llm or {}
|
| 467 |
+
_rw2 = _generator.rewrite_query(
|
| 468 |
+
query=req.query, conversation=_conv,
|
| 469 |
+
provider=getattr(_llm_rw2, 'provider', None),
|
| 470 |
+
model=getattr(_llm_rw2, 'model', None),
|
| 471 |
+
api_key=getattr(_llm_rw2, 'api_key', None),
|
| 472 |
+
)
|
| 473 |
+
if _rw2 != req.query:
|
| 474 |
+
_eff_query = _rw2
|
| 475 |
+
_mem_rewrite = _rw2
|
| 476 |
+
|
| 477 |
+
# 2. Multi-strategy retrieval: router → dense+BM25 → RRF → cross-encoder
|
| 478 |
+
result = _retriever.retrieve(_eff_query, top_k_candidates=k, final_top_k=cfg.final_top_k)
|
| 479 |
final_chunks = result.chunks
|
| 480 |
|
| 481 |
# 2. Emit chunk metadata + routing decision so UI shows sources + strategy info immediately
|
|
|
|
| 493 |
yield _sse_event({
|
| 494 |
"type": "chunk_meta",
|
| 495 |
"chunks": chunk_meta,
|
| 496 |
+
"memory_rewritten_query": _mem_rewrite,
|
| 497 |
"routing": {
|
| 498 |
"intent": result.decision.intent.value,
|
| 499 |
"strategies": result.decision.strategies,
|
|
|
|
| 531 |
# 4. Stream answer tokens
|
| 532 |
_llm = req.llm or {}
|
| 533 |
gen_request = GenerationRequest(
|
| 534 |
+
query=_eff_query, chunks=final_chunks, stream=True,
|
| 535 |
+
conversation=_conv,
|
| 536 |
provider=getattr(_llm, 'provider', None),
|
| 537 |
model=getattr(_llm, 'model', None),
|
| 538 |
api_key=getattr(_llm, 'api_key', None),
|
api/schemas.py
CHANGED
|
@@ -13,6 +13,10 @@ class LLMConfig(BaseModel):
|
|
| 13 |
api_key: Optional[str] = Field(default=None, description="API key override for this request")
|
| 14 |
base_url: Optional[str] = Field(default=None, description="Base URL (custom provider only)")
|
| 15 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 16 |
|
| 17 |
class QueryRequest(BaseModel):
|
| 18 |
query: str = Field(..., min_length=3, max_length=2048, description="User question")
|
|
@@ -55,6 +59,7 @@ class QueryResponse(BaseModel):
|
|
| 55 |
routing: Optional[RoutingResponse] = None
|
| 56 |
crag_grade: Optional[str] = None
|
| 57 |
crag_rewritten_query: Optional[str] = None
|
|
|
|
| 58 |
web_search_used: bool = False
|
| 59 |
model: str
|
| 60 |
usage: dict
|
|
|
|
| 13 |
api_key: Optional[str] = Field(default=None, description="API key override for this request")
|
| 14 |
base_url: Optional[str] = Field(default=None, description="Base URL (custom provider only)")
|
| 15 |
|
| 16 |
+
class ConversationTurn(BaseModel):
|
| 17 |
+
"""One turn of conversation history — sent from the UI for short-term memory."""
|
| 18 |
+
role: str # "user" | "assistant"
|
| 19 |
+
content: str # raw text (no markdown HTML)
|
| 20 |
|
| 21 |
class QueryRequest(BaseModel):
|
| 22 |
query: str = Field(..., min_length=3, max_length=2048, description="User question")
|
|
|
|
| 59 |
routing: Optional[RoutingResponse] = None
|
| 60 |
crag_grade: Optional[str] = None
|
| 61 |
crag_rewritten_query: Optional[str] = None
|
| 62 |
+
memory_rewritten_query: Optional[str] = None # set when rewritten for context resolution
|
| 63 |
web_search_used: bool = False
|
| 64 |
model: str
|
| 65 |
usage: dict
|
generation/generator.py
CHANGED
|
@@ -96,6 +96,8 @@ Rules you MUST follow:
|
|
| 96 |
"I don't have sufficient information in the knowledge base to answer this."
|
| 97 |
4. Keep your answer focused and precise. Use markdown formatting where helpful.
|
| 98 |
5. At the end of your response, list the cited sources under a "## Sources" heading.
|
|
|
|
|
|
|
| 99 |
"""
|
| 100 |
|
| 101 |
USER_PROMPT_TEMPLATE = """\
|
|
@@ -112,6 +114,21 @@ USER_PROMPT_TEMPLATE = """\
|
|
| 112 |
Answer based strictly on the context passages above. Include inline [N] citations.
|
| 113 |
"""
|
| 114 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 115 |
|
| 116 |
# ── Data classes ──────────────────────────────────────────────
|
| 117 |
|
|
@@ -120,8 +137,7 @@ class GenerationRequest:
|
|
| 120 |
query: str
|
| 121 |
chunks: list[RetrievedChunk]
|
| 122 |
stream: bool = True
|
| 123 |
-
#
|
| 124 |
-
provider: Optional[str] = None # e.g. "groq", "nvidia_nim", "openai", "custom"
|
| 125 |
model: Optional[str] = None # model id string
|
| 126 |
api_key: Optional[str] = None # override .env key for this request
|
| 127 |
base_url: Optional[str] = None # only used when provider == "custom"
|
|
@@ -156,6 +172,13 @@ class Generator:
|
|
| 156 |
and cached in a small dict to avoid redundant instantiation across
|
| 157 |
requests that share the same settings.
|
| 158 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 159 |
Streaming example:
|
| 160 |
gen = Generator()
|
| 161 |
for token in gen.stream(GenerationRequest(query, chunks)):
|
|
@@ -197,14 +220,14 @@ class Generator:
|
|
| 197 |
|
| 198 |
messages = self._build_messages(request)
|
| 199 |
|
| 200 |
-
|
| 201 |
model=resolved["model"],
|
| 202 |
messages=messages,
|
| 203 |
temperature=resolved["temperature"],
|
| 204 |
max_tokens=resolved["max_tokens"],
|
| 205 |
stream=True,
|
| 206 |
)
|
| 207 |
-
for chunk in
|
| 208 |
# Guard against empty choices — the final [DONE] sentinel chunk
|
| 209 |
# from some providers (e.g. NVIDIA NIM) arrives as choices:[].
|
| 210 |
if not chunk.choices:
|
|
@@ -214,6 +237,73 @@ class Generator:
|
|
| 214 |
if delta and delta.content:
|
| 215 |
yield delta.content
|
| 216 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 217 |
def build_sources_block(self, chunks: list[RetrievedChunk]) -> str:
|
| 218 |
"""
|
| 219 |
Returns a markdown sources block for appending after the streamed answer.
|
|
@@ -299,6 +389,35 @@ class Generator:
|
|
| 299 |
|
| 300 |
@staticmethod
|
| 301 |
def _build_messages(request: GenerationRequest) -> list[dict]:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 302 |
context_parts: list[str] = []
|
| 303 |
for i, chunk in enumerate(request.chunks, start=1):
|
| 304 |
# Use parent_text for LLM context (wider context window),
|
|
@@ -313,10 +432,10 @@ class Generator:
|
|
| 313 |
context=context_str,
|
| 314 |
query=request.query,
|
| 315 |
)
|
| 316 |
-
|
| 317 |
-
|
| 318 |
-
|
| 319 |
-
|
| 320 |
|
| 321 |
@staticmethod
|
| 322 |
def _build_citations(chunks: list[RetrievedChunk]) -> list[Citation]:
|
|
|
|
| 96 |
"I don't have sufficient information in the knowledge base to answer this."
|
| 97 |
4. Keep your answer focused and precise. Use markdown formatting where helpful.
|
| 98 |
5. At the end of your response, list the cited sources under a "## Sources" heading.
|
| 99 |
+
6. You have access to the conversation history above. Use it to resolve follow-up
|
| 100 |
+
references but always ground factual claims in the provided context passages.
|
| 101 |
"""
|
| 102 |
|
| 103 |
USER_PROMPT_TEMPLATE = """\
|
|
|
|
| 114 |
Answer based strictly on the context passages above. Include inline [N] citations.
|
| 115 |
"""
|
| 116 |
|
| 117 |
+
REWRITE_PROMPT = """\
|
| 118 |
+
You are a query rewriter for a retrieval system.
|
| 119 |
+
Given a conversation history and a follow-up question, rewrite the follow-up as a \
|
| 120 |
+
fully self-contained question that makes sense without the conversation history.
|
| 121 |
+
|
| 122 |
+
Rules:
|
| 123 |
+
- Resolve all pronouns (it, this, they, that, those, them) to their actual referents
|
| 124 |
+
- Expand vague references like "the first one", "that paper", "the approach above"
|
| 125 |
+
- If the question is already standalone and unambiguous, return it EXACTLY as-is
|
| 126 |
+
- Return ONLY the rewritten question — no explanation, no preamble
|
| 127 |
+
|
| 128 |
+
Conversation history:
|
| 129 |
+
{history}
|
| 130 |
+
|
| 131 |
+
Follow-up question: {query}"""
|
| 132 |
|
| 133 |
# ── Data classes ──────────────────────────────────────────────
|
| 134 |
|
|
|
|
| 137 |
query: str
|
| 138 |
chunks: list[RetrievedChunk]
|
| 139 |
stream: bool = True
|
| 140 |
+
conversation: list[dict] = field(default_factory=list) # [{role, content}, ...] provider: Optional[str] = None # e.g. "groq", "nvidia_nim", "openai", "custom"
|
|
|
|
| 141 |
model: Optional[str] = None # model id string
|
| 142 |
api_key: Optional[str] = None # override .env key for this request
|
| 143 |
base_url: Optional[str] = None # only used when provider == "custom"
|
|
|
|
| 172 |
and cached in a small dict to avoid redundant instantiation across
|
| 173 |
requests that share the same settings.
|
| 174 |
|
| 175 |
+
Memory is injected as prior conversation turns in the message list:
|
| 176 |
+
[system] → [user turn 1] → [assistant turn 1] → ... → [user + context]
|
| 177 |
+
|
| 178 |
+
The retrieval context (RAG passages) is attached only to the FINAL
|
| 179 |
+
user message. Prior turns are plain Q&A without context — the LLM
|
| 180 |
+
uses them purely to resolve pronouns and follow-up references.
|
| 181 |
+
|
| 182 |
Streaming example:
|
| 183 |
gen = Generator()
|
| 184 |
for token in gen.stream(GenerationRequest(query, chunks)):
|
|
|
|
| 220 |
|
| 221 |
messages = self._build_messages(request)
|
| 222 |
|
| 223 |
+
stream_obj = client.chat.completions.create(
|
| 224 |
model=resolved["model"],
|
| 225 |
messages=messages,
|
| 226 |
temperature=resolved["temperature"],
|
| 227 |
max_tokens=resolved["max_tokens"],
|
| 228 |
stream=True,
|
| 229 |
)
|
| 230 |
+
for chunk in stream_obj:
|
| 231 |
# Guard against empty choices — the final [DONE] sentinel chunk
|
| 232 |
# from some providers (e.g. NVIDIA NIM) arrives as choices:[].
|
| 233 |
if not chunk.choices:
|
|
|
|
| 237 |
if delta and delta.content:
|
| 238 |
yield delta.content
|
| 239 |
|
| 240 |
+
def rewrite_query(
|
| 241 |
+
self,
|
| 242 |
+
query: str,
|
| 243 |
+
conversation: list[dict],
|
| 244 |
+
provider: Optional[str] = None,
|
| 245 |
+
model: Optional[str] = None,
|
| 246 |
+
api_key: Optional[str] = None,
|
| 247 |
+
) -> str:
|
| 248 |
+
"""
|
| 249 |
+
Rewrite a follow-up query into a standalone question using conversation
|
| 250 |
+
history. Returns the original query unchanged if:
|
| 251 |
+
- There is no prior conversation (nothing to resolve)
|
| 252 |
+
- The rewrite call fails (safe fallback)
|
| 253 |
+
- The rewritten text is empty
|
| 254 |
+
|
| 255 |
+
Uses temperature=0 and max_tokens=200 — the cheapest possible call.
|
| 256 |
+
|
| 257 |
+
Example:
|
| 258 |
+
conversation = [
|
| 259 |
+
{"role": "user", "content": "What is the attention mechanism?"},
|
| 260 |
+
{"role": "assistant", "content": "Attention allows the model to ..."},
|
| 261 |
+
]
|
| 262 |
+
query = "Who invented it?"
|
| 263 |
+
→ "Who invented the attention mechanism?"
|
| 264 |
+
"""
|
| 265 |
+
if not conversation or len(conversation) < 2:
|
| 266 |
+
return query # no history — nothing to resolve
|
| 267 |
+
|
| 268 |
+
# Build a compact history string from the last 4 turns (2 exchanges)
|
| 269 |
+
# to keep the rewrite prompt short and fast
|
| 270 |
+
recent = conversation[-4:]
|
| 271 |
+
history_str = "\n".join(
|
| 272 |
+
f"{t['role'].upper()}: {t['content'][:300]}"
|
| 273 |
+
for t in recent
|
| 274 |
+
)
|
| 275 |
+
|
| 276 |
+
prompt = REWRITE_PROMPT.format(history=history_str, query=query)
|
| 277 |
+
|
| 278 |
+
try:
|
| 279 |
+
# Build a minimal request just for the rewrite call
|
| 280 |
+
class _MinimalReq:
|
| 281 |
+
provider = provider
|
| 282 |
+
model = model
|
| 283 |
+
api_key = api_key
|
| 284 |
+
base_url = None
|
| 285 |
+
|
| 286 |
+
client, resolved = self._resolve_client(_MinimalReq())
|
| 287 |
+
response = client.chat.completions.create(
|
| 288 |
+
model=resolved["model"],
|
| 289 |
+
messages=[{"role": "user", "content": prompt}],
|
| 290 |
+
temperature=0.0,
|
| 291 |
+
max_tokens=200,
|
| 292 |
+
stream=False,
|
| 293 |
+
)
|
| 294 |
+
rewritten = (response.choices[0].message.content or "").strip()
|
| 295 |
+
|
| 296 |
+
if rewritten and rewritten != query:
|
| 297 |
+
logger.info(
|
| 298 |
+
"Memory rewrite: '%s' → '%s'", query[:60], rewritten[:60]
|
| 299 |
+
)
|
| 300 |
+
return rewritten
|
| 301 |
+
|
| 302 |
+
except Exception as exc:
|
| 303 |
+
logger.debug("Query rewrite failed (%s) — using original query", exc)
|
| 304 |
+
|
| 305 |
+
return query
|
| 306 |
+
|
| 307 |
def build_sources_block(self, chunks: list[RetrievedChunk]) -> str:
|
| 308 |
"""
|
| 309 |
Returns a markdown sources block for appending after the streamed answer.
|
|
|
|
| 389 |
|
| 390 |
@staticmethod
|
| 391 |
def _build_messages(request: GenerationRequest) -> list[dict]:
|
| 392 |
+
"""
|
| 393 |
+
Build the full message list for the LLM call.
|
| 394 |
+
|
| 395 |
+
Structure with conversation history:
|
| 396 |
+
[system]
|
| 397 |
+
[user: prior question 1] ← conversation turns (no context)
|
| 398 |
+
[assistant: prior answer 1]
|
| 399 |
+
[user: prior question 2]
|
| 400 |
+
[assistant: prior answer 2]
|
| 401 |
+
...
|
| 402 |
+
[user: current question + RAG context passages]
|
| 403 |
+
|
| 404 |
+
Without conversation history (or first turn):
|
| 405 |
+
[system]
|
| 406 |
+
[user: current question + RAG context passages]
|
| 407 |
+
|
| 408 |
+
The RAG context is ONLY attached to the final user message.
|
| 409 |
+
Prior turns are plain Q&A — they exist solely so the LLM can
|
| 410 |
+
resolve pronouns and follow-up references from prior exchanges.
|
| 411 |
+
"""
|
| 412 |
+
messages: list[dict] = [{"role": "system", "content": SYSTEM_PROMPT}]
|
| 413 |
+
|
| 414 |
+
# Insert prior conversation turns (without context — plain Q&A)
|
| 415 |
+
for turn in request.conversation:
|
| 416 |
+
messages.append({"role": turn["role"], "content": turn["content"]})
|
| 417 |
+
|
| 418 |
+
# Final user message: current question + retrieved context
|
| 419 |
+
context_parts = []
|
| 420 |
+
|
| 421 |
context_parts: list[str] = []
|
| 422 |
for i, chunk in enumerate(request.chunks, start=1):
|
| 423 |
# Use parent_text for LLM context (wider context window),
|
|
|
|
| 432 |
context=context_str,
|
| 433 |
query=request.query,
|
| 434 |
)
|
| 435 |
+
messages.append({"role": "user", "content": user_content})
|
| 436 |
+
|
| 437 |
+
return messages
|
| 438 |
+
|
| 439 |
|
| 440 |
@staticmethod
|
| 441 |
def _build_citations(chunks: list[RetrievedChunk]) -> list[Citation]:
|
ui/static/script.js
CHANGED
|
@@ -138,6 +138,7 @@ const streamStatus=document.getElementById('streamStatus');
|
|
| 138 |
const sourcesList=document.getElementById('sourcesList');
|
| 139 |
let isStreaming=false;
|
| 140 |
let currentChunks=[];
|
|
|
|
| 141 |
|
| 142 |
chatInput.addEventListener('input',()=>{
|
| 143 |
chatInput.style.height='auto';
|
|
@@ -149,7 +150,7 @@ sendBtn.addEventListener('click',sendMessage);
|
|
| 149 |
document.getElementById('clearChatBtn').addEventListener('click',()=>{
|
| 150 |
chatMessages.innerHTML='<div class="message"><div class="msg-avatar ai">cx</div><div class="msg-body"><div class="msg-role">CORTEX</div><div class="msg-text">Cleared. Ask anything.</div></div></div>';
|
| 151 |
sourcesList.innerHTML='<div class="empty-sources"><svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.5"><path d="M9 12h6m-6 4h6m2 5H7a2 2 0 01-2-2V5a2 2 0 012-2h5.586a1 1 0 01.707.293l5.414 5.414a1 1 0 01.293.707V19a2 2 0 01-2 2z"/></svg><span>Retrieved passages will appear here</span></div>';
|
| 152 |
-
streamStatus.textContent='';currentChunks=[];
|
| 153 |
});
|
| 154 |
|
| 155 |
function renderSourceCards(chunks){
|
|
@@ -205,13 +206,22 @@ async function sendMessage(){
|
|
| 205 |
streamStatus.textContent='…';
|
| 206 |
|
| 207 |
try{
|
| 208 |
-
|
| 209 |
-
|
| 210 |
-
|
| 211 |
-
|
| 212 |
-
/
|
| 213 |
-
|
| 214 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 215 |
if(!resp.ok) throw new Error('HTTP '+resp.status);
|
| 216 |
const reader=resp.body.getReader();
|
| 217 |
const decoder=new TextDecoder();
|
|
@@ -232,6 +242,11 @@ async function sendMessage(){
|
|
| 232 |
const routing=evt.routing||{};
|
| 233 |
renderSourceCards(chunks);
|
| 234 |
streamStatus.textContent='generating…';
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 235 |
if(routing.intent) addBadge(liveBadges,routing.intent,'amber');
|
| 236 |
(routing.strategies||[]).forEach(s=>addBadge(liveBadges,s.toUpperCase(),'blue'));
|
| 237 |
}
|
|
@@ -242,6 +257,7 @@ async function sendMessage(){
|
|
| 242 |
if(evt.rewritten_query) streamStatus.textContent='rewritten: "'+evt.rewritten_query.slice(0,50)+'…"';
|
| 243 |
}
|
| 244 |
else if(evt.type==='token'){
|
|
|
|
| 245 |
const tok=evt.text||'';
|
| 246 |
rawText+=tok;
|
| 247 |
cursor.before(document.createTextNode(tok));
|
|
@@ -280,6 +296,13 @@ async function sendMessage(){
|
|
| 280 |
}
|
| 281 |
|
| 282 |
liveText.removeAttribute('id');liveBadges.removeAttribute('id');
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 283 |
isStreaming=false;sendBtn.disabled=false;sendBtn.textContent='send';
|
| 284 |
chatMessages.scrollTop=chatMessages.scrollHeight;
|
| 285 |
}
|
|
|
|
| 138 |
const sourcesList=document.getElementById('sourcesList');
|
| 139 |
let isStreaming=false;
|
| 140 |
let currentChunks=[];
|
| 141 |
+
let chatHistory=[]; // [{role,content}] — short-term memory sent to API
|
| 142 |
|
| 143 |
chatInput.addEventListener('input',()=>{
|
| 144 |
chatInput.style.height='auto';
|
|
|
|
| 150 |
document.getElementById('clearChatBtn').addEventListener('click',()=>{
|
| 151 |
chatMessages.innerHTML='<div class="message"><div class="msg-avatar ai">cx</div><div class="msg-body"><div class="msg-role">CORTEX</div><div class="msg-text">Cleared. Ask anything.</div></div></div>';
|
| 152 |
sourcesList.innerHTML='<div class="empty-sources"><svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.5"><path d="M9 12h6m-6 4h6m2 5H7a2 2 0 01-2-2V5a2 2 0 012-2h5.586a1 1 0 01.707.293l5.414 5.414a1 1 0 01.293.707V19a2 2 0 01-2 2z"/></svg><span>Retrieved passages will appear here</span></div>';
|
| 153 |
+
streamStatus.textContent='';currentChunks=[];chatHistory=[];
|
| 154 |
});
|
| 155 |
|
| 156 |
function renderSourceCards(chunks){
|
|
|
|
| 206 |
streamStatus.textContent='…';
|
| 207 |
|
| 208 |
try{
|
| 209 |
+
// Send last 6 turns (3 exchanges) as short-term memory
|
| 210 |
+
const historyWindow=chatHistory.slice(-6);
|
| 211 |
+
const resp=await fetch('/query/stream',{
|
| 212 |
+
method:'POST',
|
| 213 |
+
headers:{'Content-Type':'application/json'},
|
| 214 |
+
body:JSON.stringify({
|
| 215 |
+
query, top_k:10, stream:true,
|
| 216 |
+
conversation: historyWindow,
|
| 217 |
+
llm:{
|
| 218 |
+
provider: llmConfig.provider||null,
|
| 219 |
+
model: llmConfig.model||null,
|
| 220 |
+
api_key: llmConfig.api_key||null,
|
| 221 |
+
base_url: (llmConfig.provider==='custom'&&llmConfig.base_url)?llmConfig.base_url:null,
|
| 222 |
+
}
|
| 223 |
+
})
|
| 224 |
+
});
|
| 225 |
if(!resp.ok) throw new Error('HTTP '+resp.status);
|
| 226 |
const reader=resp.body.getReader();
|
| 227 |
const decoder=new TextDecoder();
|
|
|
|
| 242 |
const routing=evt.routing||{};
|
| 243 |
renderSourceCards(chunks);
|
| 244 |
streamStatus.textContent='generating…';
|
| 245 |
+
// Show memory rewrite notification if query was rewritten
|
| 246 |
+
if(evt.memory_rewritten_query){
|
| 247 |
+
addBadge(liveBadges,'↺ context resolved','purple');
|
| 248 |
+
streamStatus.textContent='rewritten: "'+evt.memory_rewritten_query.slice(0,55)+'…"';
|
| 249 |
+
}
|
| 250 |
if(routing.intent) addBadge(liveBadges,routing.intent,'amber');
|
| 251 |
(routing.strategies||[]).forEach(s=>addBadge(liveBadges,s.toUpperCase(),'blue'));
|
| 252 |
}
|
|
|
|
| 257 |
if(evt.rewritten_query) streamStatus.textContent='rewritten: "'+evt.rewritten_query.slice(0,50)+'…"';
|
| 258 |
}
|
| 259 |
else if(evt.type==='token'){
|
| 260 |
+
// Append text node directly before cursor — true per-token streaming
|
| 261 |
const tok=evt.text||'';
|
| 262 |
rawText+=tok;
|
| 263 |
cursor.before(document.createTextNode(tok));
|
|
|
|
| 296 |
}
|
| 297 |
|
| 298 |
liveText.removeAttribute('id');liveBadges.removeAttribute('id');
|
| 299 |
+
// Store this exchange in short-term memory (keep max 10 turns = 5 exchanges)
|
| 300 |
+
if(rawText){
|
| 301 |
+
chatHistory.push({role:'user', content:query});
|
| 302 |
+
chatHistory.push({role:'assistant',content:rawText});
|
| 303 |
+
if(chatHistory.length>10) chatHistory=chatHistory.slice(-10);
|
| 304 |
+
}
|
| 305 |
+
|
| 306 |
isStreaming=false;sendBtn.disabled=false;sendBtn.textContent='send';
|
| 307 |
chatMessages.scrollTop=chatMessages.scrollHeight;
|
| 308 |
}
|