Conversation Memory System
Purpose
Long-running chats can easily exceed the model context window. MiniSearch addresses this by keeping a rolling, extractive summary of prior turns and only feeding the freshest messages into the model alongside that summary. All context handling happens locally in the browser to preserve privacy. @client/modules/textGeneration.ts#262-370
Components
- Token Budgeting – `generateChatResponse` measures the system prompt and a stub "Ok!" assistant reply, then caps the remaining user/assistant turns at 75% of the default 4096-token window (≈3072 tokens) to leave headroom for the response. A GPT tokenizer counts tokens per message before inclusion. @client/modules/textGeneration.ts#262-303 @client/modules/textGenerationUtilities.ts#13-74
- Rolling Summary Storage – The latest summary plus a conversation identifier live in a lightweight pub/sub store, so any component can read or write them without prop drilling. @client/modules/pubSub.ts#249-268
- Summarization Engine – When older turns must be dropped, `createLlmSummary` asks the configured inference backend (OpenAI, AI Horde, internal API, WebLLM, or Wllama) to condense the removed messages under an 800-token limit. If the LLM call fails, the system falls back to an extractive tokenizer-based summarizer to guarantee progress. @client/modules/textGeneration.ts#66-177
- Persistence Hooks – After a search run completes, `saveLlmResponseForQuery` stores the assistant reply in IndexedDB so history restores can reload it. The conversation summary itself stays in memory and resets whenever a new search run begins. @client/modules/history.ts#288-333 @client/modules/textGeneration.ts#179-247
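The budgeting step described above can be sketched as follows. This is an illustrative reconstruction, not the actual MiniSearch code: `splitByBudget` and `countTokens` are hypothetical names, and the whitespace token count stands in for the real GPT tokenizer.

```typescript
// Illustrative sketch of the token-budgeting step; names and the
// tokenizer heuristic are assumptions, not the real implementation.
const defaultContextSize = 4096;

// Stand-in for the GPT tokenizer: count whitespace-separated words.
function countTokens(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Keep the newest turns until 75% of the window (minus tokens already
// reserved for the system prompt and the "Ok!" stub) is spent; every
// older turn becomes part of the "dropped" set sent to the summarizer.
function splitByBudget(
  turns: ChatMessage[],
  reservedTokens: number,
): { kept: ChatMessage[]; dropped: ChatMessage[] } {
  const budget = Math.floor(defaultContextSize * 0.75) - reservedTokens;
  const kept: ChatMessage[] = [];
  let used = 0;
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = countTokens(turns[i].content);
    if (used + cost > budget) {
      return { kept, dropped: turns.slice(0, i + 1) };
    }
    used += cost;
    kept.unshift(turns[i]);
  }
  return { kept, dropped: [] };
}
```

Walking backwards from the newest turn ensures the freshest context is always preserved verbatim, while the oldest turns are the first to be summarized away.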
Flow
- User sends a chat message.
- The system prompt is regenerated by `getSystemPrompt` and augmented with any stored summary ("Conversation context: ..."). @client/modules/textGeneration.ts#270-329
- Recent turns are appended until the budget is exhausted; older ones become "dropped messages".
- Dropped messages are summarized and the digest is saved back to the pub/sub store with the current conversation ID. @client/modules/textGeneration.ts#313-330
- The final prompt sent to the model always starts with the refreshed system prompt, followed by the stub assistant reply and the kept turn list to encourage immediate streaming.
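The final prompt assembly described in the steps above can be sketched like this. The `Message` shape and `buildPrompt` helper are simplified stand-ins for illustration, not the actual module API.

```typescript
// Illustrative sketch of final prompt assembly; buildPrompt and the
// Message shape are assumptions, not the real MiniSearch API.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

function buildPrompt(
  systemPrompt: string,
  summary: string | null,
  keptTurns: Message[],
): Message[] {
  // Any stored summary is appended to the refreshed system prompt,
  // mirroring the "Conversation context: ..." augmentation.
  const system = summary
    ? `${systemPrompt}\n\nConversation context: ${summary}`
    : systemPrompt;
  return [
    { role: "system", content: system },
    // Stub assistant reply that primes the model to stream immediately.
    { role: "assistant", content: "Ok!" },
    ...keptTurns,
  ];
}
```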
Settings & Extensibility
- All inference types share the same summarization contract—no provider-specific logic beyond selecting the backend module at runtime. @client/modules/textGeneration.ts#95-135
- Changing the global context window (e.g., via OpenAI settings) automatically adjusts the available budget, because the logic derives from the default context size exported by `textGenerationUtilities`. @client/modules/textGenerationUtilities.ts#13-74
- Future settings (e.g., toggling memory or adjusting the 75% ratio) should hook into the same budgeting helpers to keep behavior predictable.
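Because the budget is a pure function of the context size and the ratio, the scaling behavior is easy to see in isolation. `turnBudget` below is a hypothetical helper illustrating the derivation, not an exported function:

```typescript
// Hypothetical helper: the turn budget is derived from whatever
// context size is in effect, so raising the window in settings
// scales the budget automatically without provider-specific logic.
function turnBudget(contextSize: number, ratio = 0.75): number {
  return Math.floor(contextSize * ratio);
}
```

A future "memory aggressiveness" setting would only need to vary the `ratio` argument; nothing downstream has to change.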
Failure Modes & Logging
- Every summarization attempt is wrapped in try/catch; failures emit `addLogEntry` notifications and fall back to extractive summaries so the chat loop never stalls. @client/modules/textGeneration.ts#97-138
- If generation is interrupted (user stop), a custom `ChatGenerationError` ensures the loop exits gracefully without corrupting the stored summary. @client/modules/textGeneration.ts#360-369 @client/modules/textGenerationUtilities.ts#19-26
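The try/catch-with-fallback pattern can be sketched as below. `summarize` and its naive extractive fallback (keep the leading words of each dropped turn) are illustrative assumptions, not the behavior of `createLlmSummary`:

```typescript
// Sketch of the fallback pattern: try the LLM summary first, then
// fall back to a naive extractive digest so the chat loop always
// makes progress. summarizeWithLlm is a hypothetical stand-in for
// the real backend call.
async function summarize(
  droppedTurns: string[],
  summarizeWithLlm: (text: string) => Promise<string>,
): Promise<string> {
  try {
    return await summarizeWithLlm(droppedTurns.join("\n"));
  } catch {
    // Extractive fallback: keep the first few words of each dropped
    // turn, guaranteeing some digest even when the backend is down.
    return droppedTurns
      .map((turn) => turn.split(/\s+/).slice(0, 5).join(" "))
      .join(" ");
  }
}
```

Because the fallback is purely local, a backend outage degrades summary quality but never blocks the conversation.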
Reset Rules
- Starting a new top-level search clears the summary, chat history, and cached results to avoid context leakage across unrelated conversations. @client/modules/textGeneration.ts#179-207
- Restoring a run from history repopulates chat state from IndexedDB; the memory system will rebuild summaries on demand once the user resumes chatting. @client/modules/history.ts#335-365 @client/hooks/useHistoryRestore.ts#32-105
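The reset behavior hinges on the pub/sub store mentioned under Components. The minimal cell below is an illustrative sketch (not the actual `pubSub` module) showing how clearing the summary on a new search prevents context leakage:

```typescript
// Minimal pub/sub cell sketch; createCell, summaryCell, and
// startNewSearch are illustrative names, not the real pubSub API.
function createCell<T>(initial: T) {
  let value = initial;
  const listeners = new Set<(v: T) => void>();
  return {
    get: () => value,
    set: (next: T) => {
      value = next;
      listeners.forEach((fn) => fn(next));
    },
    subscribe: (fn: (v: T) => void) => {
      listeners.add(fn);
      return () => listeners.delete(fn);
    },
  };
}

// The rolling summary lives in memory only; a new top-level search
// clears it so unrelated conversations never share context.
const summaryCell = createCell<string | null>(null);

function startNewSearch(): void {
  summaryCell.set(null);
}
```

Subscribers (e.g., the prompt builder) observe the reset immediately, so the next chat turn starts from a clean system prompt.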
Related Topics
- AI Integration: `docs/ai-integration.md` – Detailed inference options
- Search History: `docs/search-history.md` – History and persistence
- Overview: `docs/overview.md` – System architecture
- Configuration: `docs/configuration.md` – Settings for the context window