# 🤖 Chatbot Architecture Overview: Krishna's Personal AI Assistant

This document details the architecture of **Krishna Vamsi Dhulipalla’s** personal AI assistant, implemented with **LangGraph** for orchestrated state management and tool execution. The system is designed for **retrieval-augmented, memory-grounded, and multi-turn conversational intelligence**, integrating **OpenAI GPT-4o**, **Hugging Face embeddings**, and **cross-encoder reranking**.

---

## 🧱 Core Components

### 1. **Models & Their Roles**

| Purpose                    | Model Name                               | Role Description                                 |
| -------------------------- | ---------------------------------------- | ------------------------------------------------ |
| **Main Chat Model**        | `gpt-4o`                                 | Handles conversation, tool calls, and reasoning  |
| **Retriever Embeddings**   | `sentence-transformers/all-MiniLM-L6-v2` | Embedding generation for FAISS vector search     |
| **Cross-Encoder Reranker** | `cross-encoder/ms-marco-MiniLM-L-6-v2`   | Reranks retrieval results for semantic relevance |
| **BM25 Retriever**         | (LangChain `BM25Retriever`)              | Keyword-based search complementing vector search |

All models are bound to LangGraph **StateGraph** nodes for structured execution.

---

## 🔍 Retrieval System

### ✅ **Hybrid Retrieval**

- **FAISS vector search** with normalized embeddings
- **BM25Retriever** for lexical keyword matching
- Combined using **Reciprocal Rank Fusion (RRF)**

### 📊 **Reranking & Diversity**

1. Initial retrieval with FAISS & BM25 (top-K per retriever)
2. Fusion via RRF scoring
3. **Cross-encoder reranking** of the top-N candidates
4. **Maximal Marginal Relevance (MMR)** selection for diversity

### 🔎 Retriever Tool (`@tool retriever`)

- Returns top passages with minimal duplication
- Called, as directed by the system prompt, to fetch accurate facts about Krishna

---

## 🧠 Memory System

### Long-Term Memory

- **FAISS-based memory vector store** persisted at `backend/data/memory_faiss`
- Stores conversation summaries per thread ID

### Memory Search Tool (`@tool memory_search`)

- Retrieves relevant conversation snippets by semantic similarity
- Supports **thread-scoped** search for contextual continuity

### Memory Write Node

- After each AI response, stores a `[Q]: ... [A]: ...` summary
- Autosaves after every `MEM_AUTOSAVE_EVERY` turns or on thread end

---

## 🧭 Orchestration Flow (LangGraph)

```mermaid
graph TD
    A[START] --> B[agent node]
    B -->|tool call| C[tools node]
    B -->|no tool| D[memory_write]
    C --> B
    D --> E[END]
```

### **Nodes**

- **agent**: Calls the main LLM with the conversation window + system prompt
- **tools**: Executes the retriever or memory search tools
- **memory_write**: Persists summaries to long-term memory

### **Conditional Edges**

- **agent** → `tools` if a tool call is detected
- **agent** → `memory_write` if no tool call is made

---

## 💬 System Prompt

The assistant:

- Uses the retriever and memory search tools to gather facts about Krishna
- Avoids fabrication and requests clarification when needed
- Responds humorously when off-topic, but steers back to Krishna’s expertise
- Formats answers with Markdown, headings, and bullet points

An embedded **Krishna’s Bio** section provides static grounding context.
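The conditional edge from the **agent** node can be sketched as a small routing function. This is a minimal illustration using plain dicts; in the real graph the state holds LangChain message objects, and the names `route_after_agent` and `state["messages"]` are assumptions, not the repository's actual code:

```python
def route_after_agent(state: dict) -> str:
    """Decide the next node after the agent runs: 'tools' if the last
    AI message requested a tool call, otherwise 'memory_write'."""
    last_message = state["messages"][-1]
    if last_message.get("tool_calls"):
        return "tools"
    return "memory_write"

# In LangGraph, a function like this would back the conditional edge, e.g.:
# graph.add_conditional_edges("agent", route_after_agent,
#                             {"tools": "tools", "memory_write": "memory_write"})
```

Keeping the routing decision in one pure function makes the control flow explicit and easy to unit-test independently of the LLM.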
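The RRF fusion step in the retrieval pipeline can be sketched in a few lines. This is a self-contained illustration; the function name `rrf_fuse`, the smoothing constant `k=60`, and the toy document IDs are assumptions, not the repository's code:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each document scores the sum of
    1 / (k + rank) over every ranked list it appears in."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: fuse a vector-search ranking with a BM25 ranking.
faiss_hits = ["d1", "d3", "d2"]
bm25_hits = ["d2", "d1", "d4"]
fused = rrf_fuse([faiss_hits, bm25_hits])  # → ['d1', 'd2', 'd3', 'd4']
```

Documents ranked highly by both retrievers (here `d1`) rise to the top, which is why RRF works well for combining lexical and semantic signals without score normalization.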
---

## 🌐 API & Streaming

- **Backend**: FastAPI (`backend/api.py`)
  - `/chat` SSE endpoint streams tokens in real time
  - Passes `thread_id` & `is_final` to LangGraph for stateful conversations
- **Frontend**: React + Tailwind (custom chat UI)
  - Threaded conversation storage in browser `localStorage`
  - Real-time token rendering via `EventSource`
  - Features: new chat, clear chat, delete thread, suggestions

---

## 🧩 Design Improvements

- **LangGraph StateGraph** ensures explicit control of message flow
- **Thread-scoped memory** enables multi-session personalization
- **Hybrid RRF + Cross-Encoder + MMR** retrieval pipeline improves relevance & diversity
- **SSE streaming** for low-latency feedback
- **Decoupled retrieval** and **memory** as separate tools for modularity
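On the wire, each streamed token travels as one Server-Sent Events frame: `data:` lines terminated by a blank line. A minimal formatter sketches this framing; the helper name `sse_event` and the exact frame layout are illustrative assumptions, not the backend's actual serialization code:

```python
def sse_event(data: str, event: str = "") -> str:
    """Serialize one Server-Sent Events frame as sent over /chat."""
    frame = []
    if event:  # optional named event, e.g. "done" when is_final is set
        frame.append(f"event: {event}")
    frame.append(f"data: {data}")
    return "\n".join(frame) + "\n\n"  # blank line terminates the frame

# A FastAPI StreamingResponse would yield sse_event(token) per generated
# token; the frontend's EventSource fires a message event for each frame.
```

Because the blank line is the frame delimiter, the frontend can render tokens as they arrive without any custom parsing beyond the browser's built-in `EventSource`.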