According to Me: Long-Term Personalized Referential Memory QA
Abstract
ATM-Bench presents the first multimodal, multi-source personalized referential memory benchmark with privacy-preserving data and human-annotated QA pairs requiring complex reasoning across different memory types.
Personalized AI assistants must recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing long-term memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential memory QA. ATM-Bench contains approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs with ground-truth memory evidence, including queries that require resolving personal references, reasoning over multiple pieces of evidence drawn from multiple sources, and handling conflicting evidence. We propose Schema-Guided Memory (SGM) to structurally represent memory items originating from different sources. In experiments, we implement five state-of-the-art memory systems along with a standard RAG baseline and evaluate variants with different memory ingestion, retrieval, and answer generation techniques. We find poor performance (under 20% accuracy) on the ATM-Bench-Hard set, and that SGM improves performance over the descriptive memory representation commonly adopted in prior work. Code available at: https://github.com/JingbiaoMei/ATM-Bench
Community
ATM-Bench: The First Benchmark for Multimodal, Multi-Source Personalized Referential Memory QA
Current AI assistants excel at short conversations but fail at long-term personalization: they cannot resolve queries like "Show me the gift I bought for mum during our trip to Japan", which require connecting implicit, user-specific context across years of scattered memories. Existing benchmarks test only dialogue history, leaving a critical gap in evaluating real-world personal memory capabilities.
We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential memory QA. ATM-Bench captures ~4 years of real-life personal memory spanning images, videos, and emails, with 1,038 human-annotated QA pairs and ground-truth evidence for rigorous evaluation.
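To make the task concrete, a referential QA item pairs a question with both its answer and the ground-truth memories that support it. The sketch below is purely illustrative: the field names, identifiers, and answer text are our own invention, not the released dataset schema.

```python
# Hypothetical sketch of an ATM-Bench-style QA item. A personal reference
# ("mum", "our trip to Japan") must be resolved against evidence scattered
# across multiple sources. All field names and values are illustrative only.
qa_item = {
    "question": "Show me the gift I bought for mum during our trip to Japan",
    "answer": "A ceramic tea set from a shop in Kyoto",
    "difficulty": "hard",
    # Ground-truth evidence annotations span modalities and sources.
    "evidence": [
        {"source": "image", "id": "IMG_2041", "note": "photo of the tea set"},
        {"source": "email", "id": "receipt-8812", "note": "purchase receipt"},
    ],
}

# Because the gold evidence is annotated, a system can be scored on both
# retrieval (did it find these memories?) and reasoning (did it answer?).
sources = {e["source"] for e in qa_item["evidence"]}
print(sorted(sources))  # ['email', 'image']
```

Having per-item gold evidence is what lets the paper separate retrieval failures from reasoning failures, rather than reporting a single end-to-end accuracy.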
Key Contributions
1. Schema-Guided Memory (SGM)
We propose a structured memory representation that consistently outperforms the descriptive memory paradigm used in prior work, achieving ~20% improvement on the hard set. SGM organizes memories with typed schemas rather than flat descriptions, enabling better cross-source reasoning.

2. Exposing the Real Bottleneck
Five state-of-the-art memory systems all score under 20% on ATM-Bench-Hard. Even frontier models (GPT-5, Claude Opus 4.5) with oracle retrieval show significant drops on the hard set. This reveals that the bottleneck is reasoning and aggregation, not merely retrieval, challenging the prevailing assumption that better retrieval alone will solve long-term memory.
3. Coding Agents as Strong Baselines
We benchmark general-purpose coding agents (Claude Code, Codex, OpenCode) against specialized memory systems. While they outperform the specialized systems, they still struggle (best: Codex at 39.7%), confirming that ATM-Bench remains a wide-open challenge requiring novel approaches.
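The contrast between flat descriptive memory and schema-guided memory can be sketched roughly as follows. The schema fields here are our own illustrative guess at what "typed schemas rather than flat descriptions" might look like, not the paper's exact SGM definition.

```python
from dataclasses import dataclass, field

# Descriptive memory (prior work): each item is a flat free-text caption,
# so cross-source reasoning must happen over unstructured strings.
descriptive_item = "Photo taken 2023-04-12 in Kyoto showing a ceramic tea set"

# Schema-guided memory (SGM-style; fields are illustrative): items from any
# source are normalized into one typed record, so entities, times, and
# sources can be matched directly across modalities.
@dataclass
class MemoryItem:
    source: str                                   # "image", "video", or "email"
    timestamp: str                                # date of the underlying event
    entities: list = field(default_factory=list)  # people, places, objects
    summary: str = ""                             # short description

photo = MemoryItem("image", "2023-04-12",
                   ["mum", "Kyoto", "ceramic tea set"],
                   "Mum holding a ceramic tea set in Kyoto")
receipt = MemoryItem("email", "2023-04-12",
                     ["ceramic tea set", "Kyoto shop"],
                     "Receipt for a ceramic tea set")

# Cross-source aggregation becomes a structured join on shared entities,
# instead of fuzzy matching over two free-text captions.
shared = set(photo.entities) & set(receipt.entities)
print(shared)  # {'ceramic tea set'}
```

The design point is that once image, video, and email memories share one typed record, multi-evidence questions reduce to joins and filters over those fields, which is plausibly why a structured representation helps on the hard set.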
Dataset Highlights
| Feature | Details |
|---|---|
| Memory span | ~4 years of real personal data |
| Modalities | Images, videos, emails |
| QA pairs | 1,069 human-annotated |
| Evidence | Ground-truth retrieval annotations |
| Difficulty | Easy / Hard splits for progressive evaluation |
Resources
• 📄 Paper: arXiv:2603.01990 (https://arxiv.org/abs/2603.01990)
• 🌐 Project Page: https://atmbench.github.io/
• 💻 GitHub: https://github.com/JingbiaoMei/ATM-Bench
• 🤗 Dataset: https://huggingface.co/datasets/Jingbiao/ATM-Bench
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- M2A: Multimodal Memory Agent with Dual-Layer Hybrid Memory for Long-Term Personalized Interactions (2026)
- MemWeaver: Weaving Hybrid Memories for Traceable Long-Horizon Agentic Reasoning (2026)
- Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents (2026)
- EMemBench: Interactive Benchmarking of Episodic Memory for VLM Agents (2026)
- HyMem: Hybrid Memory Architecture with Dynamic Retrieval Scheduling (2026)
- ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents (2026)
- OP-Bench: Benchmarking Over-Personalization for Memory-Augmented Personalized Conversational Agents (2026)