{"id": "mem-001", "title": "Memory-GPT: Episodic Retrieval for Long-Context Agents", "category": "Architecture", "summary": "Proposes a two-tier memory architecture where short-term working memory is backed by a compressed episodic store. Retrieval is performed via learned dense embeddings over event summaries, achieving 34% F1 improvement on multi-hop QA tasks spanning >100k tokens.", "tags": ["episodic-memory", "retrieval", "long-context", "transformer"], "source_url": "https://arxiv.org/abs/2403.14231", "safety_notes": "No identified risks. Benchmarked on public academic datasets only.", "provenance": "ICLR 2024 workshop submission, peer-reviewed. Code and data available under MIT license.", "benchmark_scores": {"multi-hop-f1": 0.72, "retrieval-recall@10": 0.89, "latency-ms": 120}}
{"id": "mem-002", "title": "Reflexion: Self-Reflective Agents with Verbal Reinforcement", "category": "Self-Improvement", "summary": "Introduces a loop where an LLM agent reflects on task failure in natural language, writes the reflection to memory, and uses it to improve future trials. Demonstrates pass@1 gains of 17% on programming benchmarks by accumulating trial-and-error experience.", "tags": ["reflection", "self-improvement", "verbal-reinforcement", "programming"], "source_url": "https://arxiv.org/abs/2303.11366", "safety_notes": "Reflections may encode spurious correlations from small sample sizes. Authors recommend minimum 10 trials before trusting memory.", "provenance": "NeurIPS 2023. Reproduced by 4 independent groups (see citations).", "benchmark_scores": {"pass@1": 0.91, "trial-count": 12, "reflection-accuracy": 0.84}}
{"id": "mem-003", "title": "Generative Agents: Interactive Simulacra of Human Behavior", "category": "Simulation", "summary": "Architects agent societies where each agent maintains a memory stream of observations, reflections, and plans. Memory retrieval uses a composite score of recency, importance, and relevance. Enables emergent social dynamics in sandbox environments.", "tags": ["simulation", "society", "retrieval", "emergent-behavior"], "source_url": "https://arxiv.org/abs/2304.03442", "safety_notes": "Simulated agents can exhibit stereotyping based on training data bias. Authors provide demographic audit in appendix.", "provenance": "UIST 2023 Best Paper. Open-source reimplementation available (not official).", "benchmark_scores": {"believability-likert": 4.2, "social-density": 25, "runtime-hrs": 48}}
{"id": "mem-004", "title": "MemGPT: Towards LLMs as Operating Systems", "category": "Architecture", "summary": "Frames LLM context as virtual memory with OS-inspired paging between context window (RAM) and external storage (disk). Implements interrupt-driven retrieval and hierarchical storage tiers. Extends effective context by 10x on long-document QA.", "tags": ["virtual-memory", "os-abstraction", "paging", "context-management"], "source_url": "https://arxiv.org/abs/2310.08560", "safety_notes": "Interrupts can leak private context into retrieval queries if memory boundaries are misconfigured.", "provenance": "Preprint, under review at OSDI. Code released under Apache-2.0.", "benchmark_scores": {"long-doc-qa": 0.68, "context-extension-factor": 10, "interrupt-latency-ms": 45}}
{"id": "mem-005", "title": "Voyager: An Open-Ended Embodied Agent with Large Language Models", "category": "Embodied", "summary": "Minecraft agent that writes executable skill libraries to memory as code. Skills are indexed by embedding and retrieved by situational description. Achieves 15x more unique items discovered than prior RL agents through lifelong learning.", "tags": ["embodied", "skill-library", "code-generation", "lifelong-learning"], "source_url": "https://arxiv.org/abs/2305.16291", "safety_notes": "Generated code is sandboxed but may contain infinite loops or unsafe system calls if prompt engineering fails.", "provenance": "NeurIPS 2023. Official reproduction achieves similar results; unofficial varies widely.", "benchmark_scores": {"unique-items": 164, "skill-count": 87, "success-rate": 0.73}}
{"id": "mem-006", "title": "RETRO: Improving Language Models by Retrieving from Trillions of Tokens", "category": "Retrieval-Augmentation", "summary": "Decoder-only model that retrieves chunks from a 2 trillion token database to augment context during generation. Retrieval is performed at every layer via cross-attention, reducing perplexity by 10% with 25x fewer parameters than GPT-3.", "tags": ["retrieval-augmentation", "scale", "decoder", "cross-attention"], "source_url": "https://arxiv.org/abs/2112.04426", "safety_notes": "Database may retrieve copyrighted or toxic text. Authors implement DMCA-compliant filtering pipeline.", "provenance": "NeurIPS 2022. DeepMind internal reproduction confirmed. Open reimplementations lag by ~3% perplexity.", "benchmark_scores": {"perplexity": 10.2, "retrieval-db-tokens": 2e12, "parameter-reduction": 25}}
{"id": "mem-007", "title": "Synaptic Intelligence: Continual Learning in Neural Networks", "category": "Continual Learning", "summary": "Prevents catastrophic forgetting by tracking parameter importance during training and penalizing changes to critical weights. Enables sequential task learning without replay buffers, achieving 95% backward transfer accuracy on split CIFAR-100.", "tags": ["continual-learning", "catastrophic-forgetting", "parameter-importance", "regularization"], "source_url": "https://arxiv.org/abs/1703.04200", "safety_notes": "Importance estimates can be brittle under adversarial task orderings. Recommended to validate on held-out task sequences.", "provenance": "ICML 2017. Widely reproduced; 200+ citations. Reference implementation in PyTorch available.", "benchmark_scores": {"backward-transfer": 0.95, "forward-transfer": 0.12, "task-count": 20}}
{"id": "mem-008", "title": "MemoryBank: Enhancing LLM Memory with Long-Term Storage", "category": "Architecture", "summary": "External key-value memory module for LLMs that stores factual associations beyond parametric knowledge. Updates are batched and consistency is maintained via a write-ahead log. Improves factual recall by 22% on TriviaQA over 6 months of simulated time.", "tags": ["external-memory", "key-value-store", "factual-recall", "consistency"], "source_url": "https://arxiv.org/abs/2401.04723", "safety_notes": "Write-ahead log may retain sensitive facts if not purged. GDPR deletion requires explicit tombstoning.", "provenance": "ACL 2024 Findings. Dataset includes synthetic PII for testing; handle with care.", "benchmark_scores": {"triviaqa-em": 0.78, "recall-decay-6mo": 0.04, "update-throughput": 1500}}
{"id": "mem-009", "title": "Chain of Hindsight: Learning from Suboptimal Demonstrations", "category": "Self-Improvement", "summary": "Agent learns by contrasting good and bad trajectories stored in hindsight memory. Each memory entry includes the full episode plus a scalar hindsight score. Outperforms PPO on decision-making tasks with 30% fewer environment interactions.", "tags": ["hindsight", "rlhf", "trajectory-memory", "sample-efficiency"], "source_url": "https://arxiv.org/abs/2402.12138", "safety_notes": "Bad trajectories may contain unsafe actions; filtering heuristics are task-dependent.", "provenance": "Preprint. Reproduced on 3 RL benchmarks by independent lab; variance ±2%.", "benchmark_scores": {"ppo-baseline": 0.62, "coh-score": 0.81, "interaction-reduction": 0.30}}
{"id": "mem-010", "title": "HippoCampus: A Biologically Inspired Memory Model for Agents", "category": "Biological Inspiration", "summary": "Models pattern separation and completion after the dentate gyrus and CA3 regions. Agents show improved one-shot association learning and graceful degradation under memory corruption. Tested on navigation and object-relation tasks.", "tags": ["biological-inspiration", "pattern-separation", "one-shot", "graceful-degradation"], "source_url": "https://arxiv.org/abs/2311.08952", "safety_notes": "Biological fidelity trades off against computational cost (5x slower inference than MLP baseline).", "provenance": "COSYNE 2024. Limited reproducibility package; contact authors for full code.", "benchmark_scores": {"one-shot-accuracy": 0.88, "corruption-robustness": 0.72, "inference-slowdown": 5.0}}
{"id": "mem-011", "title": "Toolformer: Language Models Can Teach Themselves to Use Tools", "category": "Tool Use", "summary": "LLM learns to decide when and how to call external APIs by sampling candidate tool calls, executing them, and keeping only those that reduce loss. Tool-use decisions are stored as soft memory patterns in the model weights.", "tags": ["tool-use", "api-calling", "self-supervised", "efficiency"], "source_url": "https://arxiv.org/abs/2302.04761", "safety_notes": "Unsupervised tool selection may call live APIs during training, incurring costs and side effects. Sandbox execution mandatory.", "provenance": "Meta AI research, not peer-reviewed at time of release. Multiple open reimplementations exist.", "benchmark_scores": {"math-accuracy": 0.74, "api-call-efficiency": 0.63, "training-cost-usd": 450}}
{"id": "mem-012", "title": "LangChain Memory Modules: A Practitioner Survey", "category": "Survey", "summary": "Comprehensive survey of memory abstractions in the LangChain ecosystem. Categorizes implementations into buffer, vector, entity, and summary memory. Finds that 68% of production agents use simple buffer memory despite theoretical advantages of structured stores.", "tags": ["survey", "practitioner", "langchain", "buffer-memory"], "source_url": "https://huggingface.co/papers/2405.08711", "safety_notes": "Survey relies on self-reported data; actual production usage may differ from GitHub telemetry.", "provenance": "HF Papers, 2024. Community-driven; no formal peer review.", "benchmark_scores": {"buffer-usage-pct": 68, "vector-usage-pct": 22, "entity-usage-pct": 10}}
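The records above are one JSON object per line (JSONL), each with `id`, `category`, `tags`, and `benchmark_scores` fields. A minimal sketch of loading and filtering them in Python; the filename `memory_papers.jsonl` is an assumption, not part of the dataset:

```python
import json

def load_records(path):
    """Parse a JSONL file: one JSON object per line, blank lines skipped."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def by_category(records, category):
    """Return records whose 'category' field matches exactly (e.g. 'Architecture')."""
    return [r for r in records if r.get("category") == category]

def with_tag(records, tag):
    """Return records whose 'tags' list contains the given tag (e.g. 'retrieval')."""
    return [r for r in records if tag in r.get("tags", [])]

# Example usage (assuming the records are saved to memory_papers.jsonl):
#   records = load_records("memory_papers.jsonl")
#   arch = by_category(records, "Architecture")   # mem-001, mem-004, mem-008
```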