| --- |
| tags: |
| - ml-intern |
| --- |
| # On-Device Memory Improvements for LLM Chat |
|
|
| Drop-in improvements for on-device chat apps using small language models with semantic memory (e.g., Llama-3.2-3B + EmbeddingGemma-300M + SQLite vector store). |
|
|
| ## Files |
|
|
| | File | Purpose | |
| |------|---------| |
| | `src/types.ts` | Type definitions & configuration | |
| | `src/deduplication.ts` | **Improvement 1:** Mem0-style ADD/UPDATE/NOOP deduplication | |
| | `src/memoryDecay.ts` | **Improvement 2:** Heat-based decay & eviction (MemoryOS) | |
| | `src/typedMemory.ts` | **Improvement 3:** ENGRAM-style episodic/semantic/procedural router | |
| | `src/assistantMemory.ts` | **Improvement 4:** Store AI replies as memories | |
| | `src/smartFilter.ts` | **Improvement 5:** Only store declarative/factual content | |
| | `src/dynamicRetrieval.ts` | **Improvement 6:** Adaptive top-k, threshold, type filtering | |
| | `src/schema.ts` | SQLite schema migration | |
| | `src/index.ts` | `ImprovedMemoryService` class (combines all improvements) | |
| | `src/integrationGuide.ts` | Exact code changes for each file in your app | |
| | `src/beforeAfterWalkthrough.ts` | Side-by-side comparison of old vs new behavior | |
| | `tests/demo.py` | Validated Python demo showing all improvements | |
|
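To make the ADD/UPDATE/NOOP idea behind Improvement 1 concrete, here is a minimal sketch. It is not the contents of `src/deduplication.ts`: every name in it (`StoredMemory`, `decideDedup`, `NOOP_THRESHOLD`) is hypothetical. It also swaps in a plain cosine-similarity rule, whereas Mem0 itself lets an LLM choose the operation. The 0.85 cutoff is the deduplication threshold quoted in this README; the 0.97 "effectively identical" cutoff is an assumption made only for illustration.

```typescript
// Illustrative sketch only: all names here are hypothetical, not the repo's API.
type DedupAction = "ADD" | "UPDATE" | "NOOP";

interface StoredMemory {
  id: number;
  text: string;
  embedding: number[];
}

interface DedupDecision {
  action: DedupAction;
  targetId?: number; // id of the memory to overwrite when action is "UPDATE"
}

const DEDUP_THRESHOLD = 0.85; // threshold quoted in this README
const NOOP_THRESHOLD = 0.97; // assumed cutoff for "effectively identical"

// Cosine similarity of two vectors; returns 0 if either has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const length = Math.max(a.length, b.length);
  for (let i = 0; i < length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Compare a candidate embedding with its nearest stored memory and pick an action.
function decideDedup(candidate: number[], existing: StoredMemory[]): DedupDecision {
  let best: StoredMemory | undefined;
  let bestScore = -Infinity;
  for (const memory of existing) {
    const score = cosineSimilarity(candidate, memory.embedding);
    if (score > bestScore) {
      bestScore = score;
      best = memory;
    }
  }
  // Nothing similar enough: store as a new memory.
  if (best === undefined || bestScore < DEDUP_THRESHOLD) return { action: "ADD" };
  // Almost identical: storing it again would only add noise.
  if (bestScore >= NOOP_THRESHOLD) return { action: "NOOP" };
  // Similar but not identical: refresh the existing memory instead of adding.
  return { action: "UPDATE", targetId: best.id };
}
```

With a brute-force scan like this, the cost grows linearly with the number of stored memories; a real vector store would normally return the nearest neighbour directly.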
|
| ## Quick Start |
|
|
| ```typescript |
| import { ImprovedMemoryService, DEFAULT_MEMORY_CONFIG } from './src'; |
| |
| // db, embedFn, userMessage, embedding, and assistantReply are supplied by your app |
| const memoryService = new ImprovedMemoryService(db, embedFn, DEFAULT_MEMORY_CONFIG); |
| |
| // Retrieve memories (replaces your current searchMemories) |
| const { context, memoryIds } = await memoryService.retrieve(userMessage, embedding); |
| |
| // Store user message (with smart filtering + dedup) |
| await memoryService.storeUserMessage(userMessage, embedding); |
| |
| // Store assistant reply (new!) |
| await memoryService.storeAssistantReply(assistantReply); |
| ``` |
|
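The `context` value returned by `retrieve` is described in this README as using typed sections rather than a flat list. The sketch below shows one way such a string could be assembled from episodic, semantic, and procedural memories. The function name `buildTypedContext`, the `TypedMemory` shape, and the section titles are all assumptions for illustration, not the actual output format of `ImprovedMemoryService`.

```typescript
// Illustrative sketch only: names and section titles are hypothetical.
type MemoryType = "episodic" | "semantic" | "procedural";

interface TypedMemory {
  type: MemoryType;
  text: string;
}

// Assumed headings for each memory type.
const SECTION_TITLES: Record<MemoryType, string> = {
  semantic: "Facts about the user",
  episodic: "Past events",
  procedural: "Preferences and how-tos",
};

// Group memories by type and render one bulleted section per non-empty type.
function buildTypedContext(memories: TypedMemory[]): string {
  const order: MemoryType[] = ["semantic", "episodic", "procedural"];
  const sections: string[] = [];
  for (const type of order) {
    const items = memories.filter((m) => m.type === type);
    if (items.length === 0) continue; // skip empty sections to save tokens
    const lines = items.map((m) => `- ${m.text}`);
    sections.push(`${SECTION_TITLES[type]}:\n${lines.join("\n")}`);
  }
  return sections.join("\n\n");
}
```

Omitting empty sections keeps the prompt short, which matters for a small on-device model with a limited context window.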
|
| ## Before vs After |
|
|
| | Aspect | Before | After | |
| |--------|--------|-------| |
| | Memories stored | 200+ (noisy) | ~80 (clean, deduplicated) | |
| | Context slots | 5 (3 wasted on duplicates) | 4-12 (all unique, relevant) | |
| | Token budget | Fixed | Dynamic (100-500 tokens, based on query) | |
| | Prompt structure | Flat list | Typed sections | |
| | Assistant recall | None | Tracks recommendations/promises | |
| | Stale memory handling | Never cleaned | Heat-based eviction | |
| | Duplicate handling | None | Automatic deduplication at a 0.85 similarity threshold | |
|
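The "heat-based eviction" row can be illustrated with a small sketch. It assumes a simplified heat score: access count multiplied by an exponential recency decay with a 7-day half-life. That formula, the half-life, and the names `HeatRecord`, `heat`, and `selectForEviction` are assumptions for illustration; they are neither the MemoryOS definition nor the contents of `src/memoryDecay.ts`.

```typescript
// Illustrative sketch only: the heat formula and all names are assumptions.
interface HeatRecord {
  id: number;
  accessCount: number; // how many times the memory has been retrieved
  lastAccessedMs: number; // epoch milliseconds of the last retrieval
}

const HALF_LIFE_MS = 7 * 24 * 60 * 60 * 1000; // assumed: heat halves every 7 days

// Heat grows with use and decays exponentially with time since last access.
function heat(record: HeatRecord, nowMs: number): number {
  const ageMs = Math.max(0, nowMs - record.lastAccessedMs);
  const recency = Math.pow(0.5, ageMs / HALF_LIFE_MS);
  return (1 + record.accessCount) * recency;
}

// Return the ids of the coldest memories that must go to fit within capacity.
function selectForEviction(records: HeatRecord[], capacity: number, nowMs: number): number[] {
  if (records.length <= capacity) return [];
  const coldestFirst = [...records].sort((a, b) => heat(a, nowMs) - heat(b, nowMs));
  return coldestFirst.slice(0, records.length - capacity).map((r) => r.id);
}
```

Passing `nowMs` explicitly instead of calling `Date.now()` inside the functions keeps the scoring deterministic and easy to test.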
|
| ## Research Basis |
|
|
| - **ENGRAM** (arXiv 2511.12960): Typed memory stores with router |
| - **Mem0** (arXiv 2504.19413): ADD/UPDATE/DELETE operations with deduplication |
| - **MemoryOS** (arXiv 2506.06326): Heat-based decay and hierarchical storage |
| - **MemLoRA** (arXiv 2512.04763): Distilled memory adapters for small models |
|
|
| <!-- ml-intern-provenance --> |
| ## Generated by ML Intern |
|
|
| This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub. |
|
|
| - Try ML Intern: https://smolagents-ml-intern.hf.space |
| - Source code: https://github.com/huggingface/ml-intern |
|
|
| ## Usage |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| model_id = "loudiman/memory-improvements" |
| tokenizer = AutoTokenizer.from_pretrained(model_id) |
| model = AutoModelForCausalLM.from_pretrained(model_id) |
| ``` |
|
|
| For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class. |
|
|