aditya-joshi-05 commited on
Commit
a2f2da3
·
1 Parent(s): 9293399

Add conversational memory

Browse files
ARCHITECTURE_EXPLANATION.md ADDED
@@ -0,0 +1,118 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Cortex RAG — Architecture & Implementation Guide
2
+
3
+ This document provides a deep dive into the architecture of **Cortex**, a production-grade Retrieval-Augmented Generation (RAG) system. This guide is structured to help you explain the "what", "how", and "why" of each layer during your GenAI Engineer interview.
4
+
5
+ ---
6
+
7
+ ## 🏗️ 1. High-Level Architecture Overview
8
+
9
+ Cortex follows a modular, multi-layer RAG architecture designed for high precision, scalability, and reliability. It moves beyond "naive RAG" by implementing:
10
+ - **Semantic Data Ingestion** (instead of fixed-size chunking)
11
+ - **Hybrid Multi-Strategy Retrieval** (Dense + Sparse + Knowledge Graph)
12
+ - **Corrective Gating (CRAG)** (to handle retrieval failures)
13
+ - **Reference-Free Evaluation** (using RAGAS)
14
+
15
+ ---
16
+
17
+ ## 📥 2. Ingestion Layer: "Context-Aware Processing"
18
+
19
+ ### **Document Loading**
20
+ - Supports multiple formats: PDF, HTML, and TXT.
21
+ - **Implementation:** `DocumentLoader` handles parsing and basic cleaning.
22
+
23
+ ### **Semantic Chunker (`ingestion/chunker.py`)**
24
+ - **The Problem:** Fixed-size chunking (e.g., 512 tokens) often splits mid-sentence or mid-concept, losing semantic coherence.
25
+ - **The Solution:** We use **Sentence-Level Semantic Boundary Detection**.
26
+ - **How it works:**
27
+ 1. Split text into individual sentences.
28
+ 2. Embed each sentence using `BAAI/bge-small-en-v1.5`.
29
+ 3. Compute the **cosine similarity** between consecutive sentence embeddings.
30
+ 4. Insert a chunk boundary whenever the similarity drops below a certain threshold (e.g., 0.82) or the token limit is reached.
31
+ - **Why?** This ensures each chunk contains a single, coherent topic.
32
+
33
+ ### **Parent-Child Hierarchy**
34
+ - **The Problem:** Small chunks are better for retrieval precision, but large chunks provide better context for generation.
35
+ - **The Implementation:**
36
+ - **Child Chunks (~256 tokens):** These are the units indexed in the vector database. They represent a specific "nugget" of information.
37
+ - **Parent Chunks (~1024 tokens):** A wider window of text centered on the child. When a child is retrieved, its **parent text** is what gets sent to the LLM.
38
+ - **Why?** It decouples **retrieval granularity** (find exactly what you need) from **context width** (give the LLM enough room to understand).
39
+
40
+ ---
41
+
42
+ ## 🔍 3. Retrieval Layer: "Multi-Strategy Orchestration"
43
+
44
+ Cortex doesn't just rely on vector search; it uses a `MultiStrategyRetriever` to combine different search paradigms.
45
+
46
+ ### **A. Dense Retrieval (Milvus)**
47
+ - **Embeddings:** `bge-small-en-v1.5` (384-dim).
48
+ - **Vector DB:** Milvus (Dockerized).
49
+ - **Indexing:** `IVF_FLAT` with `COSINE` similarity metric.
50
+ - **Why?** Captures semantic meaning (e.g., "puppy" matches "dog").
51
+
52
+ ### **B. Sparse Retrieval (BM25)**
53
+ - **Implementation:** `rank_bm25` library.
54
+ - **Why?** Essential for exact keyword matching, acronyms, and specific names (e.g., "Project Cortex-X1") where vector search might be too "fuzzy".
55
+
56
+ ### **C. Knowledge Graph (GraphRAG)**
57
+ - **Extraction:** During ingestion, we use **spaCy** for Named Entity Recognition (NER) and **REBEL** (or LLM) for relation extraction.
58
+ - **Storage:** A NetworkX graph storing triples: `(Subject) --[Predicate]--> (Object)`.
59
+ - **Retrieval:**
60
+ 1. Extract entities from the user query.
61
+ 2. Traverse the graph to find related nodes (multi-hop traversal).
62
+ 3. Retrieve the chunks associated with those nodes.
63
+ - **Why?** Solves "multi-hop" queries where the answer requires connecting disparate pieces of information across the document.
64
+
65
+ ### **D. Fusion & Reranking (`retrieval/fusion.py`)**
66
+ - **RRF (Reciprocal Rank Fusion):** Combines the ranked lists from Milvus, BM25, and the Graph into one unified list.
67
+ - **Cross-Encoder Reranker:** We take the top-15 fused candidates and run them through a Cross-Encoder (e.g., `BAAI/bge-reranker-base`).
68
+ - **Why?** Cross-encoders are much more accurate (but slower) than vector search because they look at the query and chunk simultaneously. Using them as a final "filter" boosts precision significantly.
69
+
70
+ ---
71
+
72
+ ## 🧠 4. Generation Layer: "Corrective RAG (CRAG)"
73
+
74
+ The `CRAGGate` (`generation/crag.py`) acts as a "quality control" layer between retrieval and the LLM.
75
+
76
+ ### **The CRAG Workflow**
77
+ 1. **Grading:** An LLM-as-judge assesses if the retrieved chunks are relevant to the query.
78
+ 2. **Action Categories:**
79
+ - **GOOD:** Chunks are relevant. Proceed to generation.
80
+ - **POOR:** Chunks are partially relevant. **Rewrite the query** (using CoT) and re-retrieve to find better results.
81
+ - **ABSENT:** Knowledge base doesn't have the answer. **Fallback to Web Search** (Tavily/DuckDuckGo).
82
+ 3. **LLM Generation:** Uses Groq (Llama 3), OpenAI, or NVIDIA NIM to generate the final answer with **inline citations** (e.g., "The sky is blue [1].").
83
+
84
+ ---
85
+
86
+ ## 📊 5. Evaluation Layer: "Reference-Free Metrics"
87
+
88
+ Since production RAG systems often lack "ground truth" answers, we use the **RAGAS** framework (`evaluation/ragas_eval.py`).
89
+
90
+ ### **Key Metrics**
91
+ - **Faithfulness:** Does the answer stay true to the retrieved context? (Prevents hallucinations).
92
+ - **Answer Relevancy:** Does the answer actually address the user's question?
93
+ - **Context Precision:** Were the retrieved chunks actually useful?
94
+ - **Context Utilisation:** What % of retrieved chunks were actually cited?
95
+
96
+ ### **Implementation**
97
+ - Evaluations run **asynchronously** in background threads so they don't slow down the user's response time.
98
+ - Results are stored in a local SQLite DB for monitoring.
99
+
100
+ ---
101
+
102
+ ## 🛠️ 6. System & Infrastructure
103
+
104
+ - **API:** FastAPI for high-performance, asynchronous endpoints.
105
+ - **UI:** Streamlit for a clean, interactive dashboard (Ask, Ingest, Monitor).
106
+ - **Cache:** Redis for caching query results (TTL-based) to save LLM costs and latency.
107
+ - **Deployment:** Full **Docker Compose** setup for Milvus, Redis, API, and UI.
108
+
109
+ ---
110
+
111
+ ## 💡 Interview Tip: "Why this architecture?"
112
+
113
+ If asked why you built it this way, emphasize these three points:
114
+ 1. **Precision:** By using **Semantic Chunking** and **Cross-Encoder Reranking**, we ensure only the most relevant context reaches the LLM.
115
+ 2. **Reliability:** **CRAG** ensures the system doesn't hallucinate when the knowledge base is missing information.
116
+ 3. **Observability:** By integrating **RAGAS**, we have an automated way to track performance and catch regressions.
117
+
118
+ Good luck with your interview! 🚀
api/main.py CHANGED
@@ -32,6 +32,7 @@ from fastapi.middleware.cors import CORSMiddleware
32
  from fastapi.responses import StreamingResponse
33
 
34
  from api.schemas import (
 
35
  HealthResponse,
36
  IngestRequest,
37
  IngestResponse,
@@ -324,8 +325,25 @@ async def query(req: QueryRequest) -> QueryResponse:
324
  import time as _time
325
  _t0 = _time.perf_counter()
326
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
327
  try:
328
- retrieval = _retriever.retrieve(req.query, top_k_candidates=k, final_top_k=cfg.final_top_k)
329
  except Exception as exc:
330
  logger.exception("Retrieval error")
331
  raise HTTPException(status_code=500, detail=f"Retrieval failed: {exc}")
@@ -359,7 +377,7 @@ async def query(req: QueryRequest) -> QueryResponse:
359
  try:
360
  result = _generator.generate(
361
  GenerationRequest(
362
- query=req.query, chunks=final_chunks,
363
  provider=llm_provider, model=llm_model,
364
  api_key=llm_api_key, base_url=llm_base_url,
365
  )
@@ -395,6 +413,7 @@ async def query(req: QueryRequest) -> QueryResponse:
395
  return QueryResponse(
396
  query=req.query,
397
  answer=result.answer,
 
398
  citations=[
399
  CitationResponse(
400
  number=c.number,
@@ -436,11 +455,27 @@ async def query_stream(req: QueryRequest):
436
  cfg = get_settings()
437
  k = req.top_k or cfg.retrieval_top_k
438
  print(req)
 
439
  async def event_stream() -> AsyncGenerator[str, None]:
440
  try:
441
- # 1. Retrieve
442
- # 1. Multi-strategy retrieval: router dense+BM25 RRF cross-encoder
443
- result = _retriever.retrieve(req.query, top_k_candidates=k, final_top_k=cfg.final_top_k)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
444
  final_chunks = result.chunks
445
 
446
  # 2. Emit chunk metadata + routing decision so UI shows sources + strategy info immediately
@@ -458,6 +493,7 @@ async def query_stream(req: QueryRequest):
458
  yield _sse_event({
459
  "type": "chunk_meta",
460
  "chunks": chunk_meta,
 
461
  "routing": {
462
  "intent": result.decision.intent.value,
463
  "strategies": result.decision.strategies,
@@ -495,7 +531,8 @@ async def query_stream(req: QueryRequest):
495
  # 4. Stream answer tokens
496
  _llm = req.llm or {}
497
  gen_request = GenerationRequest(
498
- query=req.query, chunks=final_chunks, stream=True,
 
499
  provider=getattr(_llm, 'provider', None),
500
  model=getattr(_llm, 'model', None),
501
  api_key=getattr(_llm, 'api_key', None),
 
32
  from fastapi.responses import StreamingResponse
33
 
34
  from api.schemas import (
35
+ ConversationTurn,
36
  HealthResponse,
37
  IngestRequest,
38
  IngestResponse,
 
325
  import time as _time
326
  _t0 = _time.perf_counter()
327
 
328
+ # ── Short-term memory: rewrite ambiguous follow-ups ────
329
+ conversation = [{"role": t.role, "content": t.content} for t in req.conversation]
330
+ effective_query = req.query
331
+ memory_rewritten = None
332
+ if conversation:
333
+ _llm_rw = req.llm or {}
334
+ _rewritten = _generator.rewrite_query(
335
+ query=req.query,
336
+ conversation=conversation,
337
+ provider=getattr(_llm_rw, 'provider', None),
338
+ model=getattr(_llm_rw, 'model', None),
339
+ api_key=getattr(_llm_rw, 'api_key', None),
340
+ )
341
+ if _rewritten != req.query:
342
+ effective_query = _rewritten
343
+ memory_rewritten = _rewritten
344
+
345
  try:
346
+ retrieval = _retriever.retrieve(effective_query, top_k_candidates=k, final_top_k=cfg.final_top_k)
347
  except Exception as exc:
348
  logger.exception("Retrieval error")
349
  raise HTTPException(status_code=500, detail=f"Retrieval failed: {exc}")
 
377
  try:
378
  result = _generator.generate(
379
  GenerationRequest(
380
+ query=effective_query, chunks=final_chunks, conversation=conversation,
381
  provider=llm_provider, model=llm_model,
382
  api_key=llm_api_key, base_url=llm_base_url,
383
  )
 
413
  return QueryResponse(
414
  query=req.query,
415
  answer=result.answer,
416
+ memory_rewritten_query=memory_rewritten,
417
  citations=[
418
  CitationResponse(
419
  number=c.number,
 
455
  cfg = get_settings()
456
  k = req.top_k or cfg.retrieval_top_k
457
  print(req)
458
+
459
  async def event_stream() -> AsyncGenerator[str, None]:
460
  try:
461
+ # 1. Short-term memory: rewrite ambiguous follow-ups
462
+ _conv = [{"role": t.role, "content": t.content} for t in req.conversation]
463
+ _eff_query = req.query
464
+ _mem_rewrite = None
465
+ if _conv:
466
+ _llm_rw2 = req.llm or {}
467
+ _rw2 = _generator.rewrite_query(
468
+ query=req.query, conversation=_conv,
469
+ provider=getattr(_llm_rw2, 'provider', None),
470
+ model=getattr(_llm_rw2, 'model', None),
471
+ api_key=getattr(_llm_rw2, 'api_key', None),
472
+ )
473
+ if _rw2 != req.query:
474
+ _eff_query = _rw2
475
+ _mem_rewrite = _rw2
476
+
477
+ # 2. Multi-strategy retrieval: router → dense+BM25 → RRF → cross-encoder
478
+ result = _retriever.retrieve(_eff_query, top_k_candidates=k, final_top_k=cfg.final_top_k)
479
  final_chunks = result.chunks
480
 
481
  # 2. Emit chunk metadata + routing decision so UI shows sources + strategy info immediately
 
493
  yield _sse_event({
494
  "type": "chunk_meta",
495
  "chunks": chunk_meta,
496
+ "memory_rewritten_query": _mem_rewrite,
497
  "routing": {
498
  "intent": result.decision.intent.value,
499
  "strategies": result.decision.strategies,
 
531
  # 4. Stream answer tokens
532
  _llm = req.llm or {}
533
  gen_request = GenerationRequest(
534
+ query=_eff_query, chunks=final_chunks, stream=True,
535
+ conversation=_conv,
536
  provider=getattr(_llm, 'provider', None),
537
  model=getattr(_llm, 'model', None),
538
  api_key=getattr(_llm, 'api_key', None),
api/schemas.py CHANGED
@@ -13,6 +13,10 @@ class LLMConfig(BaseModel):
13
  api_key: Optional[str] = Field(default=None, description="API key override for this request")
14
  base_url: Optional[str] = Field(default=None, description="Base URL (custom provider only)")
15
 
 
 
 
 
16
 
17
  class QueryRequest(BaseModel):
18
  query: str = Field(..., min_length=3, max_length=2048, description="User question")
@@ -55,6 +59,7 @@ class QueryResponse(BaseModel):
55
  routing: Optional[RoutingResponse] = None
56
  crag_grade: Optional[str] = None
57
  crag_rewritten_query: Optional[str] = None
 
58
  web_search_used: bool = False
59
  model: str
60
  usage: dict
 
13
  api_key: Optional[str] = Field(default=None, description="API key override for this request")
14
  base_url: Optional[str] = Field(default=None, description="Base URL (custom provider only)")
15
 
16
+ class ConversationTurn(BaseModel):
17
+ """One turn of conversation history — sent from the UI for short-term memory."""
18
+ role: str # "user" | "assistant"
19
+ content: str # raw text (no markdown HTML)
20
 
21
  class QueryRequest(BaseModel):
22
  query: str = Field(..., min_length=3, max_length=2048, description="User question")
 
59
  routing: Optional[RoutingResponse] = None
60
  crag_grade: Optional[str] = None
61
  crag_rewritten_query: Optional[str] = None
62
+ memory_rewritten_query: Optional[str] = None # set when rewritten for context resolution
63
  web_search_used: bool = False
64
  model: str
65
  usage: dict
generation/generator.py CHANGED
@@ -96,6 +96,8 @@ Rules you MUST follow:
96
  "I don't have sufficient information in the knowledge base to answer this."
97
  4. Keep your answer focused and precise. Use markdown formatting where helpful.
98
  5. At the end of your response, list the cited sources under a "## Sources" heading.
 
 
99
  """
100
 
101
  USER_PROMPT_TEMPLATE = """\
@@ -112,6 +114,21 @@ USER_PROMPT_TEMPLATE = """\
112
  Answer based strictly on the context passages above. Include inline [N] citations.
113
  """
114
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
115
 
116
  # ── Data classes ──────────────────────────────────────────────
117
 
@@ -120,8 +137,7 @@ class GenerationRequest:
120
  query: str
121
  chunks: list[RetrievedChunk]
122
  stream: bool = True
123
- # Runtime overrides sent from the UI model selector
124
- provider: Optional[str] = None # e.g. "groq", "nvidia_nim", "openai", "custom"
125
  model: Optional[str] = None # model id string
126
  api_key: Optional[str] = None # override .env key for this request
127
  base_url: Optional[str] = None # only used when provider == "custom"
@@ -156,6 +172,13 @@ class Generator:
156
  and cached in a small dict to avoid redundant instantiation across
157
  requests that share the same settings.
158
 
 
 
 
 
 
 
 
159
  Streaming example:
160
  gen = Generator()
161
  for token in gen.stream(GenerationRequest(query, chunks)):
@@ -197,14 +220,14 @@ class Generator:
197
 
198
  messages = self._build_messages(request)
199
 
200
- stream = client.chat.completions.create(
201
  model=resolved["model"],
202
  messages=messages,
203
  temperature=resolved["temperature"],
204
  max_tokens=resolved["max_tokens"],
205
  stream=True,
206
  )
207
- for chunk in stream:
208
  # Guard against empty choices — the final [DONE] sentinel chunk
209
  # from some providers (e.g. NVIDIA NIM) arrives as choices:[].
210
  if not chunk.choices:
@@ -214,6 +237,73 @@ class Generator:
214
  if delta and delta.content:
215
  yield delta.content
216
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
217
  def build_sources_block(self, chunks: list[RetrievedChunk]) -> str:
218
  """
219
  Returns a markdown sources block for appending after the streamed answer.
@@ -299,6 +389,35 @@ class Generator:
299
 
300
  @staticmethod
301
  def _build_messages(request: GenerationRequest) -> list[dict]:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
302
  context_parts: list[str] = []
303
  for i, chunk in enumerate(request.chunks, start=1):
304
  # Use parent_text for LLM context (wider context window),
@@ -313,10 +432,10 @@ class Generator:
313
  context=context_str,
314
  query=request.query,
315
  )
316
- return [
317
- {"role": "system", "content": SYSTEM_PROMPT},
318
- {"role": "user", "content": user_content},
319
- ]
320
 
321
  @staticmethod
322
  def _build_citations(chunks: list[RetrievedChunk]) -> list[Citation]:
 
96
  "I don't have sufficient information in the knowledge base to answer this."
97
  4. Keep your answer focused and precise. Use markdown formatting where helpful.
98
  5. At the end of your response, list the cited sources under a "## Sources" heading.
99
+ 6. You have access to the conversation history above. Use it to resolve follow-up
100
+ references but always ground factual claims in the provided context passages.
101
  """
102
 
103
  USER_PROMPT_TEMPLATE = """\
 
114
  Answer based strictly on the context passages above. Include inline [N] citations.
115
  """
116
 
117
+ REWRITE_PROMPT = """\
118
+ You are a query rewriter for a retrieval system.
119
+ Given a conversation history and a follow-up question, rewrite the follow-up as a \
120
+ fully self-contained question that makes sense without the conversation history.
121
+
122
+ Rules:
123
+ - Resolve all pronouns (it, this, they, that, those, them) to their actual referents
124
+ - Expand vague references like "the first one", "that paper", "the approach above"
125
+ - If the question is already standalone and unambiguous, return it EXACTLY as-is
126
+ - Return ONLY the rewritten question — no explanation, no preamble
127
+
128
+ Conversation history:
129
+ {history}
130
+
131
+ Follow-up question: {query}"""
132
 
133
  # ── Data classes ──────────────────────────────────────────────
134
 
 
137
  query: str
138
  chunks: list[RetrievedChunk]
139
  stream: bool = True
140
+ conversation: list[dict] = field(default_factory=list) # [{role, content}, ...] provider: Optional[str] = None # e.g. "groq", "nvidia_nim", "openai", "custom"
 
141
  model: Optional[str] = None # model id string
142
  api_key: Optional[str] = None # override .env key for this request
143
  base_url: Optional[str] = None # only used when provider == "custom"
 
172
  and cached in a small dict to avoid redundant instantiation across
173
  requests that share the same settings.
174
 
175
+ Memory is injected as prior conversation turns in the message list:
176
+ [system] → [user turn 1] → [assistant turn 1] → ... → [user + context]
177
+
178
+ The retrieval context (RAG passages) is attached only to the FINAL
179
+ user message. Prior turns are plain Q&A without context — the LLM
180
+ uses them purely to resolve pronouns and follow-up references.
181
+
182
  Streaming example:
183
  gen = Generator()
184
  for token in gen.stream(GenerationRequest(query, chunks)):
 
220
 
221
  messages = self._build_messages(request)
222
 
223
+ stream_obj = client.chat.completions.create(
224
  model=resolved["model"],
225
  messages=messages,
226
  temperature=resolved["temperature"],
227
  max_tokens=resolved["max_tokens"],
228
  stream=True,
229
  )
230
+ for chunk in stream_obj:
231
  # Guard against empty choices — the final [DONE] sentinel chunk
232
  # from some providers (e.g. NVIDIA NIM) arrives as choices:[].
233
  if not chunk.choices:
 
237
  if delta and delta.content:
238
  yield delta.content
239
 
240
+ def rewrite_query(
241
+ self,
242
+ query: str,
243
+ conversation: list[dict],
244
+ provider: Optional[str] = None,
245
+ model: Optional[str] = None,
246
+ api_key: Optional[str] = None,
247
+ ) -> str:
248
+ """
249
+ Rewrite a follow-up query into a standalone question using conversation
250
+ history. Returns the original query unchanged if:
251
+ - There is no prior conversation (nothing to resolve)
252
+ - The rewrite call fails (safe fallback)
253
+ - The rewritten text is empty
254
+
255
+ Uses temperature=0 and max_tokens=200 — the cheapest possible call.
256
+
257
+ Example:
258
+ conversation = [
259
+ {"role": "user", "content": "What is the attention mechanism?"},
260
+ {"role": "assistant", "content": "Attention allows the model to ..."},
261
+ ]
262
+ query = "Who invented it?"
263
+ → "Who invented the attention mechanism?"
264
+ """
265
+ if not conversation or len(conversation) < 2:
266
+ return query # no history — nothing to resolve
267
+
268
+ # Build a compact history string from the last 4 turns (2 exchanges)
269
+ # to keep the rewrite prompt short and fast
270
+ recent = conversation[-4:]
271
+ history_str = "\n".join(
272
+ f"{t['role'].upper()}: {t['content'][:300]}"
273
+ for t in recent
274
+ )
275
+
276
+ prompt = REWRITE_PROMPT.format(history=history_str, query=query)
277
+
278
+ try:
279
+ # Build a minimal request just for the rewrite call
280
+ class _MinimalReq:
281
+ provider = provider
282
+ model = model
283
+ api_key = api_key
284
+ base_url = None
285
+
286
+ client, resolved = self._resolve_client(_MinimalReq())
287
+ response = client.chat.completions.create(
288
+ model=resolved["model"],
289
+ messages=[{"role": "user", "content": prompt}],
290
+ temperature=0.0,
291
+ max_tokens=200,
292
+ stream=False,
293
+ )
294
+ rewritten = (response.choices[0].message.content or "").strip()
295
+
296
+ if rewritten and rewritten != query:
297
+ logger.info(
298
+ "Memory rewrite: '%s' → '%s'", query[:60], rewritten[:60]
299
+ )
300
+ return rewritten
301
+
302
+ except Exception as exc:
303
+ logger.debug("Query rewrite failed (%s) — using original query", exc)
304
+
305
+ return query
306
+
307
  def build_sources_block(self, chunks: list[RetrievedChunk]) -> str:
308
  """
309
  Returns a markdown sources block for appending after the streamed answer.
 
389
 
390
  @staticmethod
391
  def _build_messages(request: GenerationRequest) -> list[dict]:
392
+ """
393
+ Build the full message list for the LLM call.
394
+
395
+ Structure with conversation history:
396
+ [system]
397
+ [user: prior question 1] ← conversation turns (no context)
398
+ [assistant: prior answer 1]
399
+ [user: prior question 2]
400
+ [assistant: prior answer 2]
401
+ ...
402
+ [user: current question + RAG context passages]
403
+
404
+ Without conversation history (or first turn):
405
+ [system]
406
+ [user: current question + RAG context passages]
407
+
408
+ The RAG context is ONLY attached to the final user message.
409
+ Prior turns are plain Q&A — they exist solely so the LLM can
410
+ resolve pronouns and follow-up references from prior exchanges.
411
+ """
412
+ messages: list[dict] = [{"role": "system", "content": SYSTEM_PROMPT}]
413
+
414
+ # Insert prior conversation turns (without context — plain Q&A)
415
+ for turn in request.conversation:
416
+ messages.append({"role": turn["role"], "content": turn["content"]})
417
+
418
+ # Final user message: current question + retrieved context
419
+ context_parts = []
420
+
421
  context_parts: list[str] = []
422
  for i, chunk in enumerate(request.chunks, start=1):
423
  # Use parent_text for LLM context (wider context window),
 
432
  context=context_str,
433
  query=request.query,
434
  )
435
+ messages.append({"role": "user", "content": user_content})
436
+
437
+ return messages
438
+
439
 
440
  @staticmethod
441
  def _build_citations(chunks: list[RetrievedChunk]) -> list[Citation]:
ui/static/script.js CHANGED
@@ -138,6 +138,7 @@ const streamStatus=document.getElementById('streamStatus');
138
  const sourcesList=document.getElementById('sourcesList');
139
  let isStreaming=false;
140
  let currentChunks=[];
 
141
 
142
  chatInput.addEventListener('input',()=>{
143
  chatInput.style.height='auto';
@@ -149,7 +150,7 @@ sendBtn.addEventListener('click',sendMessage);
149
  document.getElementById('clearChatBtn').addEventListener('click',()=>{
150
  chatMessages.innerHTML='<div class="message"><div class="msg-avatar ai">cx</div><div class="msg-body"><div class="msg-role">CORTEX</div><div class="msg-text">Cleared. Ask anything.</div></div></div>';
151
  sourcesList.innerHTML='<div class="empty-sources"><svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.5"><path d="M9 12h6m-6 4h6m2 5H7a2 2 0 01-2-2V5a2 2 0 012-2h5.586a1 1 0 01.707.293l5.414 5.414a1 1 0 01.293.707V19a2 2 0 01-2 2z"/></svg><span>Retrieved passages will appear here</span></div>';
152
- streamStatus.textContent='';currentChunks=[];
153
  });
154
 
155
  function renderSourceCards(chunks){
@@ -205,13 +206,22 @@ async function sendMessage(){
205
  streamStatus.textContent='…';
206
 
207
  try{
208
- const resp=await fetch('/query/stream',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({query,top_k:10,stream:true,llm:{
209
- provider: llmConfig.provider||null,
210
- model: llmConfig.model||null,
211
- api_key: llmConfig.api_key||null,
212
- // Only send base_url for custom — server ignores it for known providers
213
- base_url: (llmConfig.provider==='custom' && llmConfig.base_url) ? llmConfig.base_url : null,
214
- }})});
 
 
 
 
 
 
 
 
 
215
  if(!resp.ok) throw new Error('HTTP '+resp.status);
216
  const reader=resp.body.getReader();
217
  const decoder=new TextDecoder();
@@ -232,6 +242,11 @@ async function sendMessage(){
232
  const routing=evt.routing||{};
233
  renderSourceCards(chunks);
234
  streamStatus.textContent='generating…';
 
 
 
 
 
235
  if(routing.intent) addBadge(liveBadges,routing.intent,'amber');
236
  (routing.strategies||[]).forEach(s=>addBadge(liveBadges,s.toUpperCase(),'blue'));
237
  }
@@ -242,6 +257,7 @@ async function sendMessage(){
242
  if(evt.rewritten_query) streamStatus.textContent='rewritten: "'+evt.rewritten_query.slice(0,50)+'…"';
243
  }
244
  else if(evt.type==='token'){
 
245
  const tok=evt.text||'';
246
  rawText+=tok;
247
  cursor.before(document.createTextNode(tok));
@@ -280,6 +296,13 @@ async function sendMessage(){
280
  }
281
 
282
  liveText.removeAttribute('id');liveBadges.removeAttribute('id');
 
 
 
 
 
 
 
283
  isStreaming=false;sendBtn.disabled=false;sendBtn.textContent='send';
284
  chatMessages.scrollTop=chatMessages.scrollHeight;
285
  }
 
138
  const sourcesList=document.getElementById('sourcesList');
139
  let isStreaming=false;
140
  let currentChunks=[];
141
+ let chatHistory=[]; // [{role,content}] — short-term memory sent to API
142
 
143
  chatInput.addEventListener('input',()=>{
144
  chatInput.style.height='auto';
 
150
  document.getElementById('clearChatBtn').addEventListener('click',()=>{
151
  chatMessages.innerHTML='<div class="message"><div class="msg-avatar ai">cx</div><div class="msg-body"><div class="msg-role">CORTEX</div><div class="msg-text">Cleared. Ask anything.</div></div></div>';
152
  sourcesList.innerHTML='<div class="empty-sources"><svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.5"><path d="M9 12h6m-6 4h6m2 5H7a2 2 0 01-2-2V5a2 2 0 012-2h5.586a1 1 0 01.707.293l5.414 5.414a1 1 0 01.293.707V19a2 2 0 01-2 2z"/></svg><span>Retrieved passages will appear here</span></div>';
153
+ streamStatus.textContent='';currentChunks=[];chatHistory=[];
154
  });
155
 
156
  function renderSourceCards(chunks){
 
206
  streamStatus.textContent='…';
207
 
208
  try{
209
+ // Send last 6 turns (3 exchanges) as short-term memory
210
+ const historyWindow=chatHistory.slice(-6);
211
+ const resp=await fetch('/query/stream',{
212
+ method:'POST',
213
+ headers:{'Content-Type':'application/json'},
214
+ body:JSON.stringify({
215
+ query, top_k:10, stream:true,
216
+ conversation: historyWindow,
217
+ llm:{
218
+ provider: llmConfig.provider||null,
219
+ model: llmConfig.model||null,
220
+ api_key: llmConfig.api_key||null,
221
+ base_url: (llmConfig.provider==='custom'&&llmConfig.base_url)?llmConfig.base_url:null,
222
+ }
223
+ })
224
+ });
225
  if(!resp.ok) throw new Error('HTTP '+resp.status);
226
  const reader=resp.body.getReader();
227
  const decoder=new TextDecoder();
 
242
  const routing=evt.routing||{};
243
  renderSourceCards(chunks);
244
  streamStatus.textContent='generating…';
245
+ // Show memory rewrite notification if query was rewritten
246
+ if(evt.memory_rewritten_query){
247
+ addBadge(liveBadges,'↺ context resolved','purple');
248
+ streamStatus.textContent='rewritten: "'+evt.memory_rewritten_query.slice(0,55)+'…"';
249
+ }
250
  if(routing.intent) addBadge(liveBadges,routing.intent,'amber');
251
  (routing.strategies||[]).forEach(s=>addBadge(liveBadges,s.toUpperCase(),'blue'));
252
  }
 
257
  if(evt.rewritten_query) streamStatus.textContent='rewritten: "'+evt.rewritten_query.slice(0,50)+'…"';
258
  }
259
  else if(evt.type==='token'){
260
+ // Append text node directly before cursor — true per-token streaming
261
  const tok=evt.text||'';
262
  rawText+=tok;
263
  cursor.before(document.createTextNode(tok));
 
296
  }
297
 
298
  liveText.removeAttribute('id');liveBadges.removeAttribute('id');
299
+ // Store this exchange in short-term memory (keep max 10 turns = 5 exchanges)
300
+ if(rawText){
301
+ chatHistory.push({role:'user', content:query});
302
+ chatHistory.push({role:'assistant',content:rawText});
303
+ if(chatHistory.length>10) chatHistory=chatHistory.slice(-10);
304
+ }
305
+
306
  isStreaming=false;sendBtn.disabled=false;sendBtn.textContent='send';
307
  chatMessages.scrollTop=chatMessages.scrollHeight;
308
  }