On-Device Memory Improvements for LLM Chat
Drop-in improvements for on-device chat apps using small language models with semantic memory (e.g., Llama-3.2-3B + EmbeddingGemma-300M + SQLite vector store).
Files
| File | Purpose |
|---|---|
| src/types.ts | Type definitions & configuration |
| src/deduplication.ts | Improvement 1: Mem0-style ADD/UPDATE/NOOP deduplication |
| src/memoryDecay.ts | Improvement 2: Heat-based decay & eviction (MemoryOS) |
| src/typedMemory.ts | Improvement 3: ENGRAM-style episodic/semantic/procedural router |
| src/assistantMemory.ts | Improvement 4: Store AI replies as memories |
| src/smartFilter.ts | Improvement 5: Only store declarative/factual content |
| src/dynamicRetrieval.ts | Improvement 6: Adaptive top-k, threshold, type filtering |
| src/schema.ts | SQLite schema migration |
| src/index.ts | ImprovedMemoryService class (combines all improvements) |
| src/integrationGuide.ts | Exact code changes for each file in your app |
| src/beforeAfterWalkthrough.ts | Side-by-side comparison of old vs new behavior |
| tests/demo.py | Validated Python demo showing all improvements |
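
To make Improvement 1 concrete, here is a minimal sketch of an ADD/UPDATE/NOOP decision based on cosine similarity. The 0.85 cutoff is the dedup threshold this README lists in its comparison table. Everything else is an assumption made for illustration and not the contents of src/deduplication.ts: the `StoredMemory` shape, the `decideOperation` name, and the rule that separates UPDATE from NOOP (identical text after trimming and lowercasing counts as NOOP).

```typescript
// Hypothetical shapes for illustration; the real types live in src/types.ts.
interface StoredMemory {
  id: number;
  text: string;
  embedding: number[];
}

type DedupDecision =
  | { op: "ADD" }
  | { op: "UPDATE"; targetId: number }
  | { op: "NOOP"; targetId: number };

// Cosine similarity of two equal-length vectors (0 if either is all zeros).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

function decideOperation(
  text: string,
  embedding: number[],
  existing: StoredMemory[],
  threshold = 0.85,
): DedupDecision {
  // Find the single most similar stored memory.
  let best: StoredMemory | undefined;
  let bestScore = -Infinity;
  for (const m of existing) {
    const score = cosineSimilarity(embedding, m.embedding);
    if (score > bestScore) {
      bestScore = score;
      best = m;
    }
  }
  // Nothing close enough: store as a new memory.
  if (!best || bestScore < threshold) return { op: "ADD" };
  // Assumed rule: same text means nothing to do, otherwise refresh the old entry.
  const normalize = (s: string) => s.trim().toLowerCase();
  return normalize(best.text) === normalize(text)
    ? { op: "NOOP", targetId: best.id }
    : { op: "UPDATE", targetId: best.id };
}
```

A linear scan like this is usually acceptable for a few hundred memories on device; a larger store would likely push the nearest-neighbour lookup into the vector store itself.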
Quick Start
```typescript
import { ImprovedMemoryService, DEFAULT_MEMORY_CONFIG } from './src';

const memoryService = new ImprovedMemoryService(db, embedFn, DEFAULT_MEMORY_CONFIG);

// Retrieve memories (replaces your current searchMemories)
const { context, memoryIds } = await memoryService.retrieve(userMessage, embedding);

// Store user message (with smart filtering + dedup)
await memoryService.storeUserMessage(userMessage, embedding);

// Store assistant reply (new!)
await memoryService.storeAssistantReply(assistantReply);
```
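
The three calls above could be wired into a single chat turn roughly as follows. This is a sketch under stated assumptions: `MemoryServiceLike` mirrors only the method shapes visible in the Quick Start (the return types of the store methods and the element type of `memoryIds` are guesses), and `handleTurn`, `embed`, and `generate` are hypothetical names standing in for your app's own code.

```typescript
// Structural type covering only the Quick Start calls; the real
// ImprovedMemoryService may expose more and may return richer values.
interface MemoryServiceLike {
  retrieve(
    message: string,
    embedding: number[],
  ): Promise<{ context: string; memoryIds: Array<number | string> }>;
  storeUserMessage(message: string, embedding: number[]): Promise<unknown>;
  storeAssistantReply(reply: string): Promise<unknown>;
}

async function handleTurn(
  memory: MemoryServiceLike,
  embed: (text: string) => Promise<number[]>, // hypothetical embedding call
  generate: (prompt: string) => Promise<string>, // hypothetical LLM call
  userMessage: string,
): Promise<string> {
  const embedding = await embed(userMessage);

  // Retrieve before storing so the current message cannot match itself.
  const { context } = await memory.retrieve(userMessage, embedding);

  // Assumed prompt layout: memory context first, then the user message.
  const prompt = context
    ? `${context}\n\nUser: ${userMessage}`
    : `User: ${userMessage}`;
  const reply = await generate(prompt);

  // Persist both sides of the exchange once the reply exists.
  await memory.storeUserMessage(userMessage, embedding);
  await memory.storeAssistantReply(reply);
  return reply;
}
```

Reusing one embedding for both retrieval and storage avoids running the embedding model twice per turn, which tends to matter on device.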
Before vs After
| Aspect | Before | After |
|---|---|---|
| Memories stored | 200+ (noisy) | ~80 (clean, deduplicated) |
| Context slots | 5 (3 wasted on duplicates) | 4-12 (all unique, relevant) |
| Token budget | Fixed | Dynamic (100-500 based on query) |
| Prompt structure | Flat list | Typed sections |
| Assistant recall | None | Tracks recommendations/promises |
| Stale memory handling | Never cleaned | Heat-based eviction |
| Duplicate handling | None | 0.85 threshold auto-dedup |
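
The "4-12 context slots" and "100-500 token budget" rows imply retrieval parameters that scale with the query. One way this could look is sketched below. The ranges come from the table; the heuristic itself (word count as a proxy for query complexity, saturating at 30 words) and the names `chooseRetrievalParams` and `lerpClamped` are assumptions, not the logic in src/dynamicRetrieval.ts.

```typescript
interface RetrievalParams {
  topK: number; // how many memories to fetch
  tokenBudget: number; // max tokens of memory context in the prompt
}

// Linear interpolation between min and max, with t clamped to [0, 1].
function lerpClamped(min: number, max: number, t: number): number {
  const clamped = Math.min(1, Math.max(0, t));
  return min + (max - min) * clamped;
}

// Assumption: longer queries need more context, up to a saturation point.
function chooseRetrievalParams(query: string, saturationWords = 30): RetrievalParams {
  const words = query.trim().split(/\s+/).filter(Boolean).length;
  const t = words / saturationWords;
  return {
    topK: Math.round(lerpClamped(4, 12, t)),
    tokenBudget: Math.round(lerpClamped(100, 500, t)),
  };
}
```

With these defaults, an empty query maps to 4 slots and 100 tokens, a 15-word query to 8 slots and 300 tokens, and anything at or above 30 words to 12 slots and 500 tokens.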
Research Basis
- ENGRAM (arXiv 2511.12960): Typed memory stores with router
- Mem0 (arXiv 2504.19413): ADD/UPDATE/DELETE operations with deduplication
- MemoryOS (arXiv 2506.06326): Heat-based decay and hierarchical storage
- MemLoRA (arXiv 2512.04763): Distilled memory adapters for small models
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "loudiman/memory-improvements"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```
For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class.