# 🤖 Chatbot Architecture Overview: Krishna's Personal AI Assistant

This document details the architecture of **Krishna Vamsi Dhulipalla’s** personal AI assistant, implemented with **LangGraph** for orchestrated state management and tool execution. The system is designed for **retrieval-augmented, memory-grounded, and multi-turn conversational intelligence**, integrating **OpenAI GPT-4o**, **Hugging Face embeddings**, and **cross-encoder reranking**.

---

## 🧱 Core Components

### 1. **Models & Their Roles**

| Purpose                    | Model Name                               | Role Description                                 |
| -------------------------- | ---------------------------------------- | ------------------------------------------------ |
| **Main Chat Model**        | `gpt-4o`                                 | Handles conversation, tool calls, and reasoning  |
| **Retriever Embeddings**   | `sentence-transformers/all-MiniLM-L6-v2` | Embedding generation for FAISS vector search     |
| **Cross-Encoder Reranker** | `cross-encoder/ms-marco-MiniLM-L-6-v2`   | Reranks retrieval results for semantic relevance |
| **BM25 Retriever**         | (LangChain BM25Retriever)                | Keyword-based search complementing vector search |

All models are bound to LangGraph **StateGraph** nodes for structured execution.

---

## 🔍 Retrieval System

### ✅ **Hybrid Retrieval**

- **FAISS Vector Search** with normalized embeddings
- **BM25Retriever** for lexical keyword matching
- Combined using **Reciprocal Rank Fusion (RRF)**
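The RRF fusion step can be sketched in plain Python (function and parameter names here are illustrative, not taken from the codebase; `k=60` is the conventional RRF constant):

```python
def rrf_fuse(rankings, k=60):
    """Merge ranked lists of document IDs with Reciprocal Rank Fusion.

    A document's fused score is the sum of 1 / (k + rank) over every
    list in which it appears (rank is 1-based); k dampens the weight
    of top positions so no single retriever dominates.
    """
    scores = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: FAISS and BM25 each return a ranked list of doc IDs
faiss_hits = ["d1", "d2", "d3"]
bm25_hits = ["d3", "d1", "d4"]
fused = rrf_fuse([faiss_hits, bm25_hits])
```

Documents that appear high in both lists (here `d1` and `d3`) float to the top of the fused ranking.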

### 📊 **Reranking & Diversity**

1. Initial retrieval with FAISS & BM25 (top-K per retriever)
2. Fusion via RRF scoring
3. **Cross-Encoder reranking** (top-N candidates)
4. **Maximal Marginal Relevance (MMR)** selection for diversity
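The final MMR step can be sketched as a greedy loop that trades relevance against redundancy (a generic sketch of the standard algorithm, not the project's actual code; `lam` is the relevance/diversity trade-off):

```python
def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def mmr_select(query_vec, candidates, k=3, lam=0.7):
    """Greedy Maximal Marginal Relevance over (doc_id, embedding) pairs.

    Each round picks the candidate maximizing
    lam * relevance(query) - (1 - lam) * max similarity to picks so far;
    lam=1.0 reduces to pure relevance ranking.
    """
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            redundancy = max((cosine(vec, s) for _, s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _ in selected]
```

With a low `lam`, a near-duplicate of an already-selected passage is skipped in favour of a less similar one, which is the diversity behaviour this stage exists for.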

### 🔎 Retriever Tool (`@tool retriever`)

- Returns top passages with minimal duplication
- Invoked via the system prompt to fetch accurate facts about Krishna
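The "minimal duplication" behaviour can be approximated by normalizing passage text before keeping the top results (an illustrative stand-in; the real tool's dedup logic may differ):

```python
def top_passages(passages, limit=4):
    """Keep the highest-ranked passages, dropping duplicates that differ
    only in case or whitespace, up to `limit` unique results."""
    seen, unique = set(), []
    for text in passages:
        key = " ".join(text.lower().split())  # normalize for comparison
        if key not in seen:
            seen.add(key)
            unique.append(text)
        if len(unique) == limit:
            break
    return unique
```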

---

## 🧠 Memory System

### Long-Term Memory

- **FAISS-based memory vector store** stored at `backend/data/memory_faiss`
- Stores conversation summaries per thread ID

### Memory Search Tool (`@tool memory_search`)

- Retrieves relevant conversation snippets by semantic similarity
- Supports **thread-scoped** search for contextual continuity
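Thread-scoped semantic search amounts to filtering on metadata before ranking by similarity. A minimal sketch, using plain dicts as a stand-in for FAISS documents with metadata (field names are assumptions):

```python
def memory_search(query_vec, memories, thread_id=None, top_k=3):
    """Rank stored memory snippets by cosine similarity to the query,
    optionally restricted to a single conversation thread.

    memories: list of dicts with 'text', 'vec', and 'thread_id' keys.
    """
    def cosine(u, v):
        num = sum(a * b for a, b in zip(u, v))
        den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
        return num / den if den else 0.0

    pool = [m for m in memories if thread_id is None or m["thread_id"] == thread_id]
    pool.sort(key=lambda m: cosine(query_vec, m["vec"]), reverse=True)
    return [m["text"] for m in pool[:top_k]]
```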

### Memory Write Node

- After each AI response, stores a `[Q]: ... [A]: ...` summary
- Autosaves after every `MEM_AUTOSAVE_EVERY` turns or on thread end

---

## 🧭 Orchestration Flow (LangGraph)

```mermaid
graph TD
    A[START] --> B[agent node]
    B -->|tool call| C[tools node]
    B -->|no tool| D[memory_write]
    C --> B
    D --> E[END]
```

### **Nodes**:

- **agent**: Calls main LLM with conversation window + system prompt
- **tools**: Executes retriever or memory search tools
- **memory_write**: Persists summaries to long-term memory

### **Conditional Edges**:

- From **agent** → `tools` if tool call detected
- From **agent** → `memory_write` if no tool call
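The conditional edge reduces to a small routing predicate over the graph state. A plain-Python sketch (the state shape mirrors LangGraph's `messages` convention, but this is not the project's actual code):

```python
def route_after_agent(state):
    """Route the agent's last message: 'tools' when it carries tool calls,
    otherwise 'memory_write'.

    state is assumed to be a dict with a 'messages' list whose entries
    may carry a 'tool_calls' list (as on LangChain AIMessage objects)."""
    last = state["messages"][-1]
    has_calls = getattr(last, "tool_calls", None) or (
        isinstance(last, dict) and last.get("tool_calls")
    )
    return "tools" if has_calls else "memory_write"
```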

---

## 💬 System Prompt

The assistant:

- Uses retriever and memory search tools to gather facts about Krishna
- Avoids fabrication and requests clarification when needed
- Responds humorously when off-topic but steers back to Krishna’s expertise
- Formats with Markdown, headings, and bullet points

An embedded **Krishna's Bio** block provides static grounding context.

---

## 🌐 API & Streaming

- **Backend**: FastAPI (`backend/api.py`)
  - `/chat` SSE endpoint streams tokens in real time
  - Passes `thread_id` & `is_final` to LangGraph for stateful conversations
- **Frontend**: React + Tailwind (custom chat UI)
  - Threaded conversation storage in browser `localStorage`
  - Real-time token rendering via `EventSource`
  - Features: new chat, clear chat, delete thread, suggestions
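On the wire, SSE is just `data:` lines separated by blank lines, which is what `EventSource` consumes on the frontend. A minimal sketch of how the backend might frame streamed tokens (the payload fields echo `thread_id` and `is_final` from above, but the exact shape is an assumption):

```python
import json

def sse_frames(tokens, thread_id):
    """Yield Server-Sent Events frames for a stream of tokens, closing
    with a final frame flagged is_final=True."""
    for tok in tokens:
        payload = {"thread_id": thread_id, "token": tok, "is_final": False}
        yield f"data: {json.dumps(payload)}\n\n"
    yield f"data: {json.dumps({'thread_id': thread_id, 'is_final': True})}\n\n"
```

In FastAPI, a generator like this would typically be wrapped in a `StreamingResponse` with the `text/event-stream` media type.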

---

## 🧩 Design Improvements

- **LangGraph StateGraph** ensures explicit control of message flow
- **Thread-scoped memory** enables multi-session personalization
- **Hybrid RRF + Cross-Encoder + MMR** retrieval pipeline improves relevance & diversity
- **SSE streaming** for low-latency feedback
- **Decoupled retrieval** and **memory** as separate tools for modularity