Spaces: aditya-joshi-05/Cortex

aditya-joshi-05 committed 14 days ago

Commit

a2f2da3

1 Parent(s): 9293399

Add conversational memory

Files changed (5)

ARCHITECTURE_EXPLANATION.md +118 -0
api/main.py +43 -6
api/schemas.py +5 -0
generation/generator.py +127 -8
ui/static/script.js +31 -8

ARCHITECTURE_EXPLANATION.md ADDED

	@@ -0,0 +1,118 @@

+# Cortex RAG — Architecture & Implementation Guide
+This document provides a deep dive into the architecture of **Cortex**, a production-grade Retrieval-Augmented Generation (RAG) system. This guide is structured to help you explain the "what", "how", and "why" of each layer during your GenAI Engineer interview.
+---
+## 🏗️ 1. High-Level Architecture Overview
+Cortex follows a modular, multi-layer RAG architecture designed for high precision, scalability, and reliability. It moves beyond "naive RAG" by implementing:
+- **Semantic Data Ingestion** (instead of fixed-size chunking)
+- **Hybrid Multi-Strategy Retrieval** (Dense + Sparse + Knowledge Graph)
+- **Corrective Gating (CRAG)** (to handle retrieval failures)
+- **Reference-Free Evaluation** (using RAGAS)
+---
+## 📥 2. Ingestion Layer: "Context-Aware Processing"
+### **Document Loading**
+- Supports multiple formats: PDF, HTML, and TXT.
+- **Implementation:** `DocumentLoader` handles parsing and basic cleaning.
+### **Semantic Chunker (`ingestion/chunker.py`)**
+- **The Problem:** Fixed-size chunking (e.g., 512 tokens) often splits mid-sentence or mid-concept, losing semantic coherence.
+- **The Solution:** We use **Sentence-Level Semantic Boundary Detection**.
+- **How it works:**
+    1. Split text into individual sentences.
+    2. Embed each sentence using `BAAI/bge-small-en-v1.5`.
+    3. Compute the **cosine similarity** between consecutive sentence embeddings.
+    4. Insert a chunk boundary whenever the similarity drops below a threshold (e.g., 0.82) or the chunk reaches its token limit.
+- **Why?** This ensures each chunk contains a single, coherent topic.
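The boundary-detection steps above can be sketched as follows. This is a minimal illustration, not the code in `ingestion/chunker.py`: the function names, the toy 2-d vectors (standing in for `bge-small-en-v1.5` embeddings), and the whitespace word count used as a token estimate are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, embeddings, threshold=0.82, max_tokens=256):
    """Group consecutive sentences into chunks.

    A new chunk starts when similarity to the previous sentence drops below
    `threshold`, or when adding the sentence would exceed `max_tokens`.
    Tokens are approximated by whitespace-separated words (an assumption).
    """
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    tokens = len(sentences[0].split())
    for i in range(1, len(sentences)):
        n = len(sentences[i].split())
        sim = cosine(embeddings[i - 1], embeddings[i])
        if sim < threshold or tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, tokens = [], 0
        current.append(sentences[i])
        tokens += n
    chunks.append(" ".join(current))
    return chunks


sentences = ["Cats purr.", "Cats nap often.", "Stocks fell today."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy vectors, not real model output
chunks = semantic_chunks(sentences, embeddings)
```

With these toy vectors the first two sentences stay together and the third starts a new chunk, because its similarity to the previous sentence falls well below 0.82.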
+### **Parent-Child Hierarchy**
+- **The Problem:** Small chunks are better for retrieval precision, but large chunks provide better context for generation.
+- **The Implementation:**
+    - **Child Chunks (~256 tokens):** These are the units indexed in the vector database. They represent a specific "nugget" of information.
+    - **Parent Chunks (~1024 tokens):** A wider window of text centered on the child. When a child is retrieved, its **parent text** is what gets sent to the LLM.
+- **Why?** It decouples **retrieval granularity** (find exactly what you need) from **context width** (give the LLM enough room to understand).
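One way to picture the parent-child lookup is the sketch below. The `Child` dataclass, the `parents` dictionary, and the de-duplication of repeated parents are illustrative assumptions; the guide only states that the parent text is what reaches the LLM.

```python
from dataclasses import dataclass


@dataclass
class Child:
    chunk_id: str
    text: str        # small unit that is embedded and indexed
    parent_id: str   # key of the wider window that surrounds it


def expand_to_parents(hits, parents):
    """Swap retrieved children for their parent texts.

    Rank order is preserved, and a parent shared by several children is
    emitted only once (an assumption made here to avoid duplicate context).
    """
    seen, context = set(), []
    for child in hits:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            context.append(parents[child.parent_id])
    return context


parents = {"p1": "wide window around A and B", "p2": "wide window around C"}
hits = [Child("c1", "A", "p1"), Child("c3", "C", "p2"), Child("c2", "B", "p1")]
context = expand_to_parents(hits, parents)
```

Here three child hits collapse into two parent passages, so the LLM sees each wider window once.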
+---
+## 🔍 3. Retrieval Layer: "Multi-Strategy Orchestration"
+Cortex doesn't just rely on vector search; it uses a `MultiStrategyRetriever` to combine different search paradigms.
+### **A. Dense Retrieval (Milvus)**
+- **Embeddings:** `bge-small-en-v1.5` (384-dim).
+- **Vector DB:** Milvus (Dockerized).
+- **Indexing:** `IVF_FLAT` with `COSINE` similarity metric.
+- **Why?** Captures semantic meaning (e.g., "puppy" matches "dog").
+### **B. Sparse Retrieval (BM25)**
+- **Implementation:** `rank_bm25` library.
+- **Why?** Essential for exact keyword matching, acronyms, and specific names (e.g., "Project Cortex-X1") where vector search might be too "fuzzy".
+### **C. Knowledge Graph (GraphRAG)**
+- **Extraction:** During ingestion, we use **spaCy** for Named Entity Recognition (NER) and **REBEL** (or LLM) for relation extraction.
+- **Storage:** A NetworkX graph storing triples: `(Subject) --[Predicate]--> (Object)`.
+- **Retrieval:**
+    1. Extract entities from the user query.
+    2. Traverse the graph to find related nodes (multi-hop traversal).
+    3. Retrieve the chunks associated with those nodes.
+- **Why?** Solves "multi-hop" queries where the answer requires connecting disparate pieces of information across the document.
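The three retrieval steps can be illustrated with a breadth-first traversal. This sketch uses plain dictionaries rather than NetworkX so that it runs without dependencies; the function name, the `node_chunks` mapping, the undirected traversal, and the `max_hops` limit are assumptions, not details taken from the repository.

```python
from collections import deque


def multi_hop_chunks(triples, node_chunks, seeds, max_hops=2):
    """Collect chunk ids for every entity within `max_hops` of the seed entities."""
    # Build an undirected adjacency map from (subject, predicate, object) triples.
    adjacency = {}
    for subject, _predicate, obj in triples:
        adjacency.setdefault(subject, set()).add(obj)
        adjacency.setdefault(obj, set()).add(subject)

    visited = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, depth + 1))

    chunk_ids = set()
    for node in visited:
        chunk_ids.update(node_chunks.get(node, ()))
    return sorted(chunk_ids)


triples = [
    ("Ada", "founded", "Acme"),
    ("Acme", "acquired", "Beta"),
    ("Beta", "based_in", "Oslo"),
]
node_chunks = {"Ada": ["c1"], "Acme": ["c1", "c2"], "Beta": ["c3"], "Oslo": ["c4"]}
result = multi_hop_chunks(triples, node_chunks, seeds=["Ada"], max_hops=2)
```

Starting from "Ada", two hops reach "Acme" and "Beta" but not "Oslo", so the chunk attached only to "Oslo" is left out.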
+### **D. Fusion & Reranking (`retrieval/fusion.py`)**
+- **RRF (Reciprocal Rank Fusion):** Combines the ranked lists from Milvus, BM25, and the Graph into one unified list.
+- **Cross-Encoder Reranker:** We take the top-15 fused candidates and run them through a Cross-Encoder (e.g., `BAAI/bge-reranker-base`).
+- **Why?** Cross-encoders are more accurate but slower than bi-encoder vector search because they encode the query and the chunk together, so token-level interactions between the two are scored directly. Running them only on the fused shortlist keeps latency bounded while raising precision.
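Reciprocal Rank Fusion scores each document as the sum of `1 / (k + rank)` over every list it appears in. The sketch below is a generic version of that formula, not the contents of `retrieval/fusion.py`; the constant `k=60` is a commonly used default and an assumption here, as is the alphabetical tie-break.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each appearance of a document at 1-based position `rank` contributes
    1 / (k + rank) to its score. Ties are broken by document id so the
    output is deterministic.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["a", "b", "c"]    # e.g. vector search results
sparse = ["b", "a", "d"]   # e.g. BM25 results
graph = ["b", "e"]         # e.g. knowledge-graph results
fused = reciprocal_rank_fusion([dense, sparse, graph])
```

Document "b" wins because it appears near the top of all three lists, even though it is not first in the dense ranking. A shortlist such as `fused[:15]` would then go to the cross-encoder.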
+---
+## 🧠 4. Generation Layer: "Corrective RAG (CRAG)"
+The `CRAGGate` (`generation/crag.py`) acts as a "quality control" layer between retrieval and the LLM.
+### **The CRAG Workflow**
+1. **Grading:** An LLM-as-judge assesses if the retrieved chunks are relevant to the query.
+2. **Action Categories:**
+    - **GOOD:** Chunks are relevant. Proceed to generation.
+    - **POOR:** Chunks are partially relevant. **Rewrite the query** (using CoT) and re-retrieve to find better results.
+    - **ABSENT:** Knowledge base doesn't have the answer. **Fallback to Web Search** (Tavily/DuckDuckGo).
+3. **LLM Generation:** Uses Groq (Llama 3), OpenAI, or NVIDIA NIM to generate the final answer with **inline citations** (e.g., "The sky is blue [1].").
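The grade-to-action mapping in the workflow above can be sketched as a small dispatcher. The callables `grade_fn`, `retrieve_fn`, `rewrite_fn`, and `web_search_fn` are hypothetical stand-ins for the LLM judge, the retriever, the query rewriter, and the web search client; the real `CRAGGate` interface is not shown in this commit.

```python
from enum import Enum


class Grade(Enum):
    GOOD = "GOOD"
    POOR = "POOR"
    ABSENT = "ABSENT"


def crag_step(query, chunks, grade_fn, retrieve_fn, rewrite_fn, web_search_fn):
    """Return (chunks, query_used, web_search_used) after applying the gate once."""
    grade = grade_fn(query, chunks)
    if grade is Grade.GOOD:
        return chunks, query, False            # proceed straight to generation
    if grade is Grade.POOR:
        new_query = rewrite_fn(query)          # rewrite, then retrieve again
        return retrieve_fn(new_query), new_query, False
    return web_search_fn(query), query, True   # ABSENT: fall back to web search


# Stub callables so the sketch runs without an LLM or a search API.
poor_result = crag_step(
    "q",
    ["weak chunk"],
    grade_fn=lambda q, c: Grade.POOR,
    retrieve_fn=lambda q: ["retrieved for: " + q],
    rewrite_fn=lambda q: q + " (rewritten)",
    web_search_fn=lambda q: ["web result"],
)
```

The returned tuple lines up with the `crag_rewritten_query` and `web_search_used` fields that the API response already exposes.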
+---
+## 📊 5. Evaluation Layer: "Reference-Free Metrics"
+Since production RAG systems often lack "ground truth" answers, we use the **RAGAS** framework (`evaluation/ragas_eval.py`).
+### **Key Metrics**
+- **Faithfulness:** Does the answer stay true to the retrieved context? (Flags hallucinations.)
+- **Answer Relevancy:** Does the answer actually address the user's question?
+- **Context Precision:** Were the retrieved chunks actually useful?
+- **Context Utilisation:** What % of retrieved chunks were actually cited?
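Of the four metrics, Context Utilisation is simple enough to compute without an LLM. The sketch below assumes citations appear as inline `[N]` markers numbered from 1, as in the generation prompt; it is an illustration of the idea, not the implementation in `evaluation/ragas_eval.py`.

```python
import re


def context_utilisation(answer: str, num_chunks: int) -> float:
    """Fraction of retrieved chunks that the answer cites at least once."""
    if num_chunks == 0:
        return 0.0
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    # Ignore citation numbers that do not correspond to a retrieved chunk.
    valid = {n for n in cited if 1 <= n <= num_chunks}
    return len(valid) / num_chunks


answer = "The sky is blue [1]. Rayleigh scattering explains it [3][1]."
score = context_utilisation(answer, num_chunks=4)
```

Two of four chunks are cited, so the score is 0.5; repeated citations of the same chunk count once.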
+### **Implementation**
+- Evaluations run **asynchronously** in background threads so they don't slow down the user's response time.
+- Results are stored in a local SQLite DB for monitoring.
+---
+## 🛠️ 6. System & Infrastructure
+- **API:** FastAPI for high-performance, asynchronous endpoints.
+- **UI:** Streamlit for a clean, interactive dashboard (Ask, Ingest, Monitor).
+- **Cache:** Redis for caching query results (TTL-based) to save LLM costs and latency.
+- **Deployment:** Full **Docker Compose** setup for Milvus, Redis, API, and UI.
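The TTL-based caching idea can be illustrated with an in-memory stand-in. This sketch does not use Redis; the `TTLCache` class, the `cache_key` helper, and the key normalisation are assumptions made for illustration, with an injectable clock so that expiry can be demonstrated without sleeping.

```python
import hashlib
import time


def cache_key(query: str, top_k: int) -> str:
    """Hash a normalised query so trivially different spellings share an entry."""
    normalised = " ".join(query.lower().split())
    return hashlib.sha256(f"{top_k}:{normalised}".encode("utf-8")).hexdigest()


class TTLCache:
    """Minimal in-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self._store[key] = (self._clock() + self._ttl, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # drop the stale entry
            return None
        return value


now = [0.0]  # fake clock, advanced manually below
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
key = cache_key("  What is  RAG? ", top_k=10)
cache.set(key, "cached answer")
hit = cache.get(cache_key("what is rag?", top_k=10))
now[0] = 61.0
miss = cache.get(key)
```

A cache hit skips both retrieval and the LLM call, which is where the cost and latency savings come from.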
+---
+## 💡 Interview Tip: "Why this architecture?"
+If asked why you built it this way, emphasize these three points:
+1. **Precision:** By using **Semantic Chunking** and **Cross-Encoder Reranking**, we ensure only the most relevant context reaches the LLM.
+2. **Reliability:** **CRAG** reduces the risk of hallucination when the knowledge base is missing information, by rewriting the query or falling back to web search instead of answering from weak context.
+3. **Observability:** By integrating **RAGAS**, we have an automated way to track performance and catch regressions.
+Good luck with your interview! 🚀

api/main.py CHANGED

@@ -32,6 +32,7 @@ from fastapi.middleware.cors import CORSMiddleware
 from fastapi.responses import StreamingResponse
 from api.schemas import (
     HealthResponse,
     IngestRequest,
     IngestResponse,
@@ -324,8 +325,25 @@ async def query(req: QueryRequest) -> QueryResponse:
     import time as _time
     _t0 = _time.perf_counter()
     try:
-        retrieval = _retriever.retrieve(req.query, top_k_candidates=k, final_top_k=cfg.final_top_k)
     except Exception as exc:
         logger.exception("Retrieval error")
         raise HTTPException(status_code=500, detail=f"Retrieval failed: {exc}")
@@ -359,7 +377,7 @@ async def query(req: QueryRequest) -> QueryResponse:
     try:
         result = _generator.generate(
             GenerationRequest(
-                query=req.query, chunks=final_chunks,
                 provider=llm_provider, model=llm_model,
                 api_key=llm_api_key,   base_url=llm_base_url,
             )
@@ -395,6 +413,7 @@ async def query(req: QueryRequest) -> QueryResponse:
     return QueryResponse(
         query=req.query,
         answer=result.answer,
         citations=[
             CitationResponse(
                 number=c.number,
@@ -436,11 +455,27 @@ async def query_stream(req: QueryRequest):
     cfg = get_settings()
     k = req.top_k or cfg.retrieval_top_k
     print(req)
     async def event_stream() -> AsyncGenerator[str, None]:
         try:
-            # 1. Retrieve
-            # 1. Multi-strategy retrieval: router → dense+BM25 → RRF → cross-encoder
-            result = _retriever.retrieve(req.query, top_k_candidates=k, final_top_k=cfg.final_top_k)
             final_chunks = result.chunks
             # 2. Emit chunk metadata + routing decision so UI shows sources + strategy info immediately
@@ -458,6 +493,7 @@ async def query_stream(req: QueryRequest):
             yield _sse_event({
                 "type": "chunk_meta",
                 "chunks": chunk_meta,
                 "routing": {
                     "intent": result.decision.intent.value,
                     "strategies": result.decision.strategies,
@@ -495,7 +531,8 @@ async def query_stream(req: QueryRequest):
             # 4. Stream answer tokens
             _llm = req.llm or {}
             gen_request = GenerationRequest(
-                query=req.query, chunks=final_chunks, stream=True,
                 provider=getattr(_llm, 'provider', None),
                 model=getattr(_llm, 'model', None),
                 api_key=getattr(_llm, 'api_key', None),

 from fastapi.responses import StreamingResponse
 from api.schemas import (
+    ConversationTurn,
     HealthResponse,
     IngestRequest,
     IngestResponse,
     import time as _time
     _t0 = _time.perf_counter()
+    # ── Short-term memory: rewrite ambiguous follow-ups ────
+    conversation = [{"role": t.role, "content": t.content} for t in req.conversation]
+    effective_query  = req.query
+    memory_rewritten = None
+    if conversation:
+        _llm_rw = req.llm or {}
+        _rewritten = _generator.rewrite_query(
+            query=req.query,
+            conversation=conversation,
+            provider=getattr(_llm_rw, 'provider', None),
+            model=getattr(_llm_rw, 'model', None),
+            api_key=getattr(_llm_rw, 'api_key', None),
+        )
+        if _rewritten != req.query:
+            effective_query  = _rewritten
+            memory_rewritten = _rewritten
     try:
+        retrieval = _retriever.retrieve(effective_query, top_k_candidates=k, final_top_k=cfg.final_top_k)
     except Exception as exc:
         logger.exception("Retrieval error")
         raise HTTPException(status_code=500, detail=f"Retrieval failed: {exc}")
     try:
         result = _generator.generate(
             GenerationRequest(
+                query=effective_query, chunks=final_chunks, conversation=conversation,
                 provider=llm_provider, model=llm_model,
                 api_key=llm_api_key,   base_url=llm_base_url,
             )
     return QueryResponse(
         query=req.query,
         answer=result.answer,
+        memory_rewritten_query=memory_rewritten,
         citations=[
             CitationResponse(
                 number=c.number,
     cfg = get_settings()
     k = req.top_k or cfg.retrieval_top_k
     print(req)
     async def event_stream() -> AsyncGenerator[str, None]:
         try:
+            # 1. Short-term memory: rewrite ambiguous follow-ups
+            _conv = [{"role": t.role, "content": t.content} for t in req.conversation]
+            _eff_query   = req.query
+            _mem_rewrite = None
+            if _conv:
+                _llm_rw2 = req.llm or {}
+                _rw2 = _generator.rewrite_query(
+                    query=req.query, conversation=_conv,
+                    provider=getattr(_llm_rw2, 'provider', None),
+                    model=getattr(_llm_rw2, 'model', None),
+                    api_key=getattr(_llm_rw2, 'api_key', None),
+                )
+                if _rw2 != req.query:
+                    _eff_query   = _rw2
+                    _mem_rewrite = _rw2
+            # 2. Multi-strategy retrieval: router → dense+BM25 → RRF → cross-encoder
+            result = _retriever.retrieve(_eff_query, top_k_candidates=k, final_top_k=cfg.final_top_k)
             final_chunks = result.chunks
             # 2. Emit chunk metadata + routing decision so UI shows sources + strategy info immediately
             yield _sse_event({
                 "type": "chunk_meta",
                 "chunks": chunk_meta,
+                "memory_rewritten_query": _mem_rewrite,
                 "routing": {
                     "intent": result.decision.intent.value,
                     "strategies": result.decision.strategies,
             # 4. Stream answer tokens
             _llm = req.llm or {}
             gen_request = GenerationRequest(
+                query=_eff_query, chunks=final_chunks, stream=True,
+                conversation=_conv,
                 provider=getattr(_llm, 'provider', None),
                 model=getattr(_llm, 'model', None),
                 api_key=getattr(_llm, 'api_key', None),
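The streaming endpoint now carries `memory_rewritten_query` inside the `chunk_meta` event. The body of `_sse_event` is not part of this diff, so the sketch below is a hypothetical version that follows the standard Server-Sent Events framing (`data: <payload>` followed by a blank line), together with a small parser of the kind a client would need.

```python
import json


def sse_event(payload: dict) -> str:
    """Serialise one event in Server-Sent Events framing (hypothetical helper)."""
    return f"data: {json.dumps(payload)}\n\n"


def parse_sse(buffer: str) -> list:
    """Decode every complete `data:` line in a buffer back into a dict."""
    events = []
    for block in buffer.split("\n\n"):
        for line in block.splitlines():
            if line.startswith("data: "):
                events.append(json.loads(line[len("data: "):]))
    return events


stream = sse_event({
    "type": "chunk_meta",
    "chunks": [],
    "memory_rewritten_query": "Who invented the attention mechanism?",
}) + sse_event({"type": "token", "text": "Attention"})

events = parse_sse(stream)
```

Sending the rewritten query in the first event lets the UI show the resolved question before any answer tokens arrive.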

api/schemas.py CHANGED

@@ -13,6 +13,10 @@ class LLMConfig(BaseModel):
     api_key:  Optional[str] = Field(default=None, description="API key override for this request")
     base_url: Optional[str] = Field(default=None, description="Base URL (custom provider only)")
 class QueryRequest(BaseModel):
     query:  str = Field(..., min_length=3, max_length=2048, description="User question")
@@ -55,6 +59,7 @@ class QueryResponse(BaseModel):
     routing: Optional[RoutingResponse] = None
     crag_grade: Optional[str] = None
     crag_rewritten_query: Optional[str] = None
     web_search_used: bool = False
     model: str
     usage: dict

     api_key:  Optional[str] = Field(default=None, description="API key override for this request")
     base_url: Optional[str] = Field(default=None, description="Base URL (custom provider only)")
+class ConversationTurn(BaseModel):
+    """One turn of conversation history — sent from the UI for short-term memory."""
+    role:    str   # "user" | "assistant"
+    content: str   # raw text (no markdown HTML)
 class QueryRequest(BaseModel):
     query:  str = Field(..., min_length=3, max_length=2048, description="User question")
     routing: Optional[RoutingResponse] = None
     crag_grade: Optional[str] = None
     crag_rewritten_query: Optional[str] = None
+    memory_rewritten_query: Optional[str] = None  # set when rewritten for context resolution
     web_search_used: bool = False
     model: str
     usage: dict

generation/generator.py CHANGED

@@ -96,6 +96,8 @@ Rules you MUST follow:
    "I don't have sufficient information in the knowledge base to answer this."
 4. Keep your answer focused and precise. Use markdown formatting where helpful.
 5. At the end of your response, list the cited sources under a "## Sources" heading.
 """
 USER_PROMPT_TEMPLATE = """\
@@ -112,6 +114,21 @@ USER_PROMPT_TEMPLATE = """\
 Answer based strictly on the context passages above. Include inline [N] citations.
 """
 # ── Data classes ──────────────────────────────────────────────
@@ -120,8 +137,7 @@ class GenerationRequest:
     query:    str
     chunks:   list[RetrievedChunk]
     stream:   bool = True
-    # Runtime overrides — sent from the UI model selector
-    provider: Optional[str] = None   # e.g. "groq", "nvidia_nim", "openai", "custom"
     model:    Optional[str] = None   # model id string
     api_key:  Optional[str] = None   # override .env key for this request
     base_url: Optional[str] = None   # only used when provider == "custom"
@@ -156,6 +172,13 @@ class Generator:
     and cached in a small dict to avoid redundant instantiation across
     requests that share the same settings.
     Streaming example:
         gen = Generator()
         for token in gen.stream(GenerationRequest(query, chunks)):
@@ -197,14 +220,14 @@ class Generator:
         messages = self._build_messages(request)
-        stream = client.chat.completions.create(
             model=resolved["model"],
             messages=messages,
             temperature=resolved["temperature"],
             max_tokens=resolved["max_tokens"],
             stream=True,
         )
-        for chunk in stream:
             # Guard against empty choices — the final [DONE] sentinel chunk
             # from some providers (e.g. NVIDIA NIM) arrives as choices:[].
             if not chunk.choices:
@@ -214,6 +237,73 @@ class Generator:
             if delta and delta.content:
                 yield delta.content
     def build_sources_block(self, chunks: list[RetrievedChunk]) -> str:
         """
         Returns a markdown sources block for appending after the streamed answer.
@@ -299,6 +389,35 @@ class Generator:
     @staticmethod
     def _build_messages(request: GenerationRequest) -> list[dict]:
         context_parts: list[str] = []
         for i, chunk in enumerate(request.chunks, start=1):
             # Use parent_text for LLM context (wider context window),
@@ -313,10 +432,10 @@ class Generator:
             context=context_str,
             query=request.query,
         )
-        return [
-            {"role": "system", "content": SYSTEM_PROMPT},
-            {"role": "user",   "content": user_content},
-        ]
     @staticmethod
     def _build_citations(chunks: list[RetrievedChunk]) -> list[Citation]:

    "I don't have sufficient information in the knowledge base to answer this."
 4. Keep your answer focused and precise. Use markdown formatting where helpful.
 5. At the end of your response, list the cited sources under a "## Sources" heading.
+6. You have access to the conversation history above. Use it to resolve follow-up
+   references but always ground factual claims in the provided context passages.
 """
 USER_PROMPT_TEMPLATE = """\
 Answer based strictly on the context passages above. Include inline [N] citations.
 """
+REWRITE_PROMPT = """\
+You are a query rewriter for a retrieval system.
+Given a conversation history and a follow-up question, rewrite the follow-up as a \
+fully self-contained question that makes sense without the conversation history.
+Rules:
+- Resolve all pronouns (it, this, they, that, those, them) to their actual referents
+- Expand vague references like "the first one", "that paper", "the approach above"
+- If the question is already standalone and unambiguous, return it EXACTLY as-is
+- Return ONLY the rewritten question — no explanation, no preamble
+Conversation history:
+{history}
+Follow-up question: {query}"""
 # ── Data classes ──────────────────────────────────────────────
     query:    str
     chunks:   list[RetrievedChunk]
     stream:   bool = True
+    conversation: list[dict] = field(default_factory=list)  # [{role, content}, ...]
+    provider: Optional[str] = None   # e.g. "groq", "nvidia_nim", "openai", "custom"
     model:    Optional[str] = None   # model id string
     api_key:  Optional[str] = None   # override .env key for this request
     base_url: Optional[str] = None   # only used when provider == "custom"
     and cached in a small dict to avoid redundant instantiation across
     requests that share the same settings.
+    Memory is injected as prior conversation turns in the message list:
+    [system] → [user turn 1] → [assistant turn 1] → ... → [user + context]
+    The retrieval context (RAG passages) is attached only to the FINAL
+    user message. Prior turns are plain Q&A without context — the LLM
+    uses them purely to resolve pronouns and follow-up references.
     Streaming example:
         gen = Generator()
         for token in gen.stream(GenerationRequest(query, chunks)):
         messages = self._build_messages(request)
+        stream_obj = client.chat.completions.create(
             model=resolved["model"],
             messages=messages,
             temperature=resolved["temperature"],
             max_tokens=resolved["max_tokens"],
             stream=True,
         )
+        for chunk in stream_obj:
             # Guard against empty choices — the final [DONE] sentinel chunk
             # from some providers (e.g. NVIDIA NIM) arrives as choices:[].
             if not chunk.choices:
             if delta and delta.content:
                 yield delta.content
+    def rewrite_query(
+        self,
+        query: str,
+        conversation: list[dict],
+        provider: Optional[str] = None,
+        model:    Optional[str] = None,
+        api_key:  Optional[str] = None,
+    ) -> str:
+        """
+        Rewrite a follow-up query into a standalone question using conversation
+        history. Returns the original query unchanged if:
+          - There is no prior conversation (nothing to resolve)
+          - The rewrite call fails (safe fallback)
+          - The rewritten text is empty
+        Uses temperature=0 and max_tokens=200 to keep the call deterministic and cheap.
+        Example:
+            conversation = [
+                {"role": "user",      "content": "What is the attention mechanism?"},
+                {"role": "assistant", "content": "Attention allows the model to ..."},
+            ]
+            query = "Who invented it?"
+            → "Who invented the attention mechanism?"
+        """
+        if not conversation or len(conversation) < 2:
+            return query   # no history — nothing to resolve
+        # Build a compact history string from the last 4 turns (2 exchanges)
+        # to keep the rewrite prompt short and fast
+        recent = conversation[-4:]
+        history_str = "\n".join(
+            f"{t['role'].upper()}: {t['content'][:300]}"
+            for t in recent
+        )
+        prompt = REWRITE_PROMPT.format(history=history_str, query=query)
+        try:
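+            # Build a minimal request just for the rewrite call.
+            # A SimpleNamespace is used instead of a local class: inside a class
+            # body, `provider = provider` raises NameError because the lookup
+            # skips the enclosing function's locals, and the except below would
+            # then swallow the error and silently disable rewriting.
+            from types import SimpleNamespace
+            _req = SimpleNamespace(
+                provider=provider, model=model, api_key=api_key, base_url=None,
+            )
+            client, resolved = self._resolve_client(_req)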
+            response = client.chat.completions.create(
+                model=resolved["model"],
+                messages=[{"role": "user", "content": prompt}],
+                temperature=0.0,
+                max_tokens=200,
+                stream=False,
+            )
+            rewritten = (response.choices[0].message.content or "").strip()
+            if rewritten and rewritten != query:
+                logger.info(
+                    "Memory rewrite: '%s' → '%s'", query[:60], rewritten[:60]
+                )
+                return rewritten
+        except Exception as exc:
+            logger.debug("Query rewrite failed (%s) — using original query", exc)
+        return query
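The history windowing and the safe-fallback behaviour of `rewrite_query` can be exercised without an LLM. In the sketch below, `build_history` mirrors the slicing and truncation shown above, while `safe_rewrite`, the shortened `REWRITE_TEMPLATE`, and the `call_llm` callable are hypothetical stand-ins for the real client call.

```python
REWRITE_TEMPLATE = "Conversation history:\n{history}\nFollow-up question: {query}"


def build_history(conversation, max_turns=4, max_chars=300):
    """Format the last `max_turns` turns, truncating each to `max_chars`."""
    recent = conversation[-max_turns:]
    return "\n".join(
        f"{turn['role'].upper()}: {turn['content'][:max_chars]}" for turn in recent
    )


def safe_rewrite(query, conversation, call_llm):
    """Return a standalone query, or the original on any failure or empty reply."""
    if len(conversation) < 2:
        return query  # no completed exchange, nothing to resolve
    prompt = REWRITE_TEMPLATE.format(history=build_history(conversation), query=query)
    try:
        rewritten = (call_llm(prompt) or "").strip()
    except Exception:
        return query
    return rewritten or query


conversation = [
    {"role": "user", "content": "What is the attention mechanism?"},
    {"role": "assistant", "content": "Attention allows the model to " + "x" * 400},
]
history = build_history(conversation)


def failing_llm(prompt):
    raise RuntimeError("provider unavailable")


fallback = safe_rewrite("Who invented it?", conversation, failing_llm)
resolved = safe_rewrite(
    "Who invented it?", conversation,
    lambda prompt: " Who invented the attention mechanism? ",
)
```

Because every failure path returns the original query, a broken rewrite step degrades to the pre-memory behaviour rather than failing the request.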
     def build_sources_block(self, chunks: list[RetrievedChunk]) -> str:
         """
         Returns a markdown sources block for appending after the streamed answer.
     @staticmethod
     def _build_messages(request: GenerationRequest) -> list[dict]:
+        """
+        Build the full message list for the LLM call.
+        Structure with conversation history:
+          [system]
+          [user: prior question 1]         ← conversation turns (no context)
+          [assistant: prior answer 1]
+          [user: prior question 2]
+          [assistant: prior answer 2]
+          ...
+          [user: current question + RAG context passages]
+        Without conversation history (or first turn):
+          [system]
+          [user: current question + RAG context passages]
+        The RAG context is ONLY attached to the final user message.
+        Prior turns are plain Q&A — they exist solely so the LLM can
+        resolve pronouns and follow-up references from prior exchanges.
+        """
+        messages: list[dict] = [{"role": "system", "content": SYSTEM_PROMPT}]
+        # Insert prior conversation turns (without context — plain Q&A)
+        for turn in request.conversation:
+            messages.append({"role": turn["role"], "content": turn["content"]})
+        # Final user message: current question + retrieved context
+        context_parts = []
         context_parts: list[str] = []
         for i, chunk in enumerate(request.chunks, start=1):
             # Use parent_text for LLM context (wider context window),
             context=context_str,
             query=request.query,
         )
+        messages.append({"role": "user", "content": user_content})
+        return messages
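The message layout described in the docstring can be reproduced in a standalone function. The `[N]` passage numbering and the wording of the final user message are assumptions, since `USER_PROMPT_TEMPLATE` is only partly visible in this diff.

```python
def build_messages(system_prompt, conversation, query, passages):
    """Assemble [system] -> prior turns -> [user: question + context]."""
    # Number passages from 1 so the model can cite them as [1], [2], ...
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    user_content = f"Context passages:\n{context}\n\nQuestion: {query}"

    messages = [{"role": "system", "content": system_prompt}]
    for turn in conversation:
        # Prior turns are plain Q&A; no retrieved context is attached to them.
        messages.append({"role": turn["role"], "content": turn["content"]})
    messages.append({"role": "user", "content": user_content})
    return messages


messages = build_messages(
    system_prompt="Answer strictly from the context.",
    conversation=[
        {"role": "user", "content": "What is the attention mechanism?"},
        {"role": "assistant", "content": "It weights tokens by relevance [1]."},
    ],
    query="Who invented the attention mechanism?",
    passages=["Passage about attention.", "Passage about its authors."],
)
roles = [m["role"] for m in messages]
```

Attaching context only to the final user message keeps the prompt size roughly constant per turn, at the cost that earlier answers can no longer be checked against the passages they originally cited.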
     @staticmethod
     def _build_citations(chunks: list[RetrievedChunk]) -> list[Citation]:

ui/static/script.js CHANGED

@@ -138,6 +138,7 @@ const streamStatus=document.getElementById('streamStatus');
 const sourcesList=document.getElementById('sourcesList');
 let isStreaming=false;
 let currentChunks=[];
 chatInput.addEventListener('input',()=>{
   chatInput.style.height='auto';
@@ -149,7 +150,7 @@ sendBtn.addEventListener('click',sendMessage);
 document.getElementById('clearChatBtn').addEventListener('click',()=>{
   chatMessages.innerHTML='<div class="message"><div class="msg-avatar ai">cx</div><div class="msg-body"><div class="msg-role">CORTEX</div><div class="msg-text">Cleared. Ask anything.</div></div></div>';
   sourcesList.innerHTML='<div class="empty-sources"><svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.5"><path d="M9 12h6m-6 4h6m2 5H7a2 2 0 01-2-2V5a2 2 0 012-2h5.586a1 1 0 01.707.293l5.414 5.414a1 1 0 01.293.707V19a2 2 0 01-2 2z"/></svg><span>Retrieved passages will appear here</span></div>';
-  streamStatus.textContent='';currentChunks=[];
 });
 function renderSourceCards(chunks){
@@ -205,13 +206,22 @@ async function sendMessage(){
   streamStatus.textContent='…';
   try{
-    const resp=await fetch('/query/stream',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({query,top_k:10,stream:true,llm:{
-      provider: llmConfig.provider||null,
-      model:    llmConfig.model||null,
-      api_key:  llmConfig.api_key||null,
-      // Only send base_url for custom — server ignores it for known providers
-      base_url: (llmConfig.provider==='custom' && llmConfig.base_url) ? llmConfig.base_url : null,
-    }})});
     if(!resp.ok) throw new Error('HTTP '+resp.status);
     const reader=resp.body.getReader();
     const decoder=new TextDecoder();
@@ -232,6 +242,11 @@ async function sendMessage(){
           const routing=evt.routing||{};
           renderSourceCards(chunks);
           streamStatus.textContent='generating…';
           if(routing.intent) addBadge(liveBadges,routing.intent,'amber');
           (routing.strategies||[]).forEach(s=>addBadge(liveBadges,s.toUpperCase(),'blue'));
         }
@@ -242,6 +257,7 @@ async function sendMessage(){
           if(evt.rewritten_query) streamStatus.textContent='rewritten: "'+evt.rewritten_query.slice(0,50)+'…"';
         }
         else if(evt.type==='token'){
           const tok=evt.text||'';
           rawText+=tok;
           cursor.before(document.createTextNode(tok));
@@ -280,6 +296,13 @@ async function sendMessage(){
   }
   liveText.removeAttribute('id');liveBadges.removeAttribute('id');
   isStreaming=false;sendBtn.disabled=false;sendBtn.textContent='send';
   chatMessages.scrollTop=chatMessages.scrollHeight;
 }

 const sourcesList=document.getElementById('sourcesList');
 let isStreaming=false;
 let currentChunks=[];
+let chatHistory=[];   // [{role,content}] — short-term memory sent to API
 chatInput.addEventListener('input',()=>{
   chatInput.style.height='auto';
 document.getElementById('clearChatBtn').addEventListener('click',()=>{
   chatMessages.innerHTML='<div class="message"><div class="msg-avatar ai">cx</div><div class="msg-body"><div class="msg-role">CORTEX</div><div class="msg-text">Cleared. Ask anything.</div></div></div>';
   sourcesList.innerHTML='<div class="empty-sources"><svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.5"><path d="M9 12h6m-6 4h6m2 5H7a2 2 0 01-2-2V5a2 2 0 012-2h5.586a1 1 0 01.707.293l5.414 5.414a1 1 0 01.293.707V19a2 2 0 01-2 2z"/></svg><span>Retrieved passages will appear here</span></div>';
+  streamStatus.textContent='';currentChunks=[];chatHistory=[];
 });
 function renderSourceCards(chunks){
   streamStatus.textContent='…';
   try{
+    // Send last 6 turns (3 exchanges) as short-term memory
+    const historyWindow=chatHistory.slice(-6);
+    const resp=await fetch('/query/stream',{
+      method:'POST',
+      headers:{'Content-Type':'application/json'},
+      body:JSON.stringify({
+        query, top_k:10, stream:true,
+        conversation: historyWindow,
+        llm:{
+          provider: llmConfig.provider||null,
+          model:    llmConfig.model||null,
+          api_key:  llmConfig.api_key||null,
+          base_url: (llmConfig.provider==='custom'&&llmConfig.base_url)?llmConfig.base_url:null,
+        }
+      })
+    });
     if(!resp.ok) throw new Error('HTTP '+resp.status);
     const reader=resp.body.getReader();
     const decoder=new TextDecoder();
           const routing=evt.routing||{};
           renderSourceCards(chunks);
           streamStatus.textContent='generating…';
+          // Show memory rewrite notification if query was rewritten
+          if(evt.memory_rewritten_query){
+            addBadge(liveBadges,'↺ context resolved','purple');
+            streamStatus.textContent='rewritten: "'+evt.memory_rewritten_query.slice(0,55)+'…"';
+          }
           if(routing.intent) addBadge(liveBadges,routing.intent,'amber');
           (routing.strategies||[]).forEach(s=>addBadge(liveBadges,s.toUpperCase(),'blue'));
         }
           if(evt.rewritten_query) streamStatus.textContent='rewritten: "'+evt.rewritten_query.slice(0,50)+'…"';
         }
         else if(evt.type==='token'){
+          // Append text node directly before cursor — true per-token streaming
           const tok=evt.text||'';
           rawText+=tok;
           cursor.before(document.createTextNode(tok));
   }
   liveText.removeAttribute('id');liveBadges.removeAttribute('id');
+  // Store this exchange in short-term memory (keep max 10 turns = 5 exchanges)
+  if(rawText){
+    chatHistory.push({role:'user',    content:query});
+    chatHistory.push({role:'assistant',content:rawText});
+    if(chatHistory.length>10) chatHistory=chatHistory.slice(-10);
+  }
   isStreaming=false;sendBtn.disabled=false;sendBtn.textContent='send';
   chatMessages.scrollTop=chatMessages.scrollHeight;
 }