arxiv:2603.01990

According to Me: Long-Term Personalized Referential Memory QA

Published on Mar 2 · Submitted by Mei on Mar 12

Abstract

ATM-Bench presents the first multimodal, multi-source personalized referential memory benchmark with privacy-preserving data and human-annotated QA pairs requiring complex reasoning across different memory types.

AI-generated summary

Personalized AI assistants must recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing long-term memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential memory QA. ATM-Bench contains approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs with ground-truth memory evidence, including queries that require resolving personal references, performing multi-evidence reasoning across multiple sources, and handling conflicting evidence. We propose Schema-Guided Memory (SGM) to structurally represent memory items originating from different sources. In experiments, we implement five state-of-the-art memory systems along with a standard RAG baseline and evaluate variants with different memory ingestion, retrieval, and answer generation techniques. We find poor performance (under 20% accuracy) on the ATM-Bench-Hard set, and that SGM improves performance over the descriptive memory representation commonly adopted in prior work. Code available at: https://github.com/JingbiaoMei/ATM-Bench

Community

ATM-Bench: The First Benchmark for Multimodal, Multi-Source Personalized Referential Memory QA

Current AI assistants excel at short conversations but fail at long-term personalization—they cannot resolve queries like "Show me the gift I bought for mum during our trip to Japan" that require connecting implicit, user-specific context across years of scattered memories. Existing benchmarks only test dialogue history, leaving a critical gap in evaluating real-world personal memory capabilities.

We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential memory QA. ATM-Bench captures ~4 years of real-life personal memory spanning images, videos, and emails, with 1,038 human-annotated QA pairs and ground-truth evidence for rigorous evaluation.
[Figure: ATM-Bench task example]

Key Contributions

1. Schema-Guided Memory (SGM)
   We propose a structured memory representation that consistently outperforms the descriptive memory paradigm used in prior work, achieving a ~20% improvement on the hard set. SGM organizes memories with typed schemas rather than flat descriptions, enabling better cross-source reasoning.

   [Figure: Schema-Guided Memory method]

2. Exposing the Real Bottleneck
   Five state-of-the-art memory systems all score under 20% on ATM-Bench-Hard. Even frontier models (GPT-5, Claude Opus 4.5) with oracle retrieval show significant drops on the hard set. This reveals that the bottleneck is reasoning and aggregation, not merely retrieval, which challenges the prevailing assumption that better retrieval alone will solve long-term memory.

3. Coding Agents as Strong Baselines
   We benchmark general-purpose coding agents (Claude Code, Codex, OpenCode) against specialized memory systems. While they outperform the specialized systems, they still struggle (the best, Codex, reaches 39.7%), confirming that ATM-Bench is a wide-open challenge requiring novel approaches.
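To make the contrast between the two memory representations concrete, here is a minimal sketch of what "typed schemas rather than flat descriptions" could look like. The field names (`source`, `timestamp`, `entities`, `event`) and the entity-linking helper are illustrative assumptions, not the schema actually used in the paper.

```python
from dataclasses import dataclass, field

# Hypothetical illustration only: the paper's real schema fields are not
# specified in this post, so these types are assumptions for exposition.

@dataclass
class DescriptiveMemory:
    """Flat paradigm from prior work: one free-text description per item."""
    description: str

@dataclass
class SchemaGuidedMemory:
    """Typed schema: structured fields allow cross-source joins,
    e.g. matching an email receipt to a photo by shared entities."""
    source: str                 # e.g. "image", "video", or "email"
    timestamp: str              # ISO 8601 string
    entities: list[str] = field(default_factory=list)  # people, places, objects
    event: str = ""             # short event label
    content: str = ""           # residual free text

def link_by_entity(items: list[SchemaGuidedMemory],
                   entity: str) -> list[SchemaGuidedMemory]:
    """Toy cross-source aggregation: collect all memories naming an entity."""
    return [m for m in items if entity in m.entities]

# A photo and an email from the same day, linked through a shared entity.
photo = SchemaGuidedMemory("image", "2022-04-03T10:12:00", ["mum", "Kyoto"], "trip")
email = SchemaGuidedMemory("email", "2022-04-03T18:40:00", ["mum", "gift shop"], "receipt")
hits = link_by_entity([photo, email], "mum")
print(len(hits))  # 2
```

With flat descriptions, the same join would require fuzzy text matching over two unstructured strings, which is exactly the kind of cross-source reasoning the hard set stresses.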

Dataset Highlights

| Feature     | Details                                       |
|-------------|-----------------------------------------------|
| Memory span | ~4 years of real personal data                |
| Modalities  | Images, videos, emails                        |
| QA pairs    | 1,069 human-annotated                         |
| Evidence    | Ground-truth retrieval annotations            |
| Difficulty  | Easy / Hard splits for progressive evaluation |
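Because each QA pair ships with ground-truth evidence annotations, evaluation can score retrieval and answering separately. The sketch below shows one plausible way to do that; the field names (`pred_answer`, `gold_evidence`, etc.) are assumptions for illustration, not the benchmark's official evaluation code.

```python
# Hedged sketch: scoring answer accuracy and evidence recall against the
# ground-truth annotations. Record field names here are assumptions.

def evidence_recall(predicted_ids: set[str], gold_ids: set[str]) -> float:
    """Fraction of annotated ground-truth memory items the system retrieved."""
    if not gold_ids:
        return 1.0
    return len(predicted_ids & gold_ids) / len(gold_ids)

def evaluate(examples: list[dict]) -> dict:
    """Average exact-match accuracy and evidence recall over QA examples."""
    acc = sum(e["pred_answer"] == e["gold_answer"] for e in examples) / len(examples)
    rec = sum(evidence_recall(set(e["pred_evidence"]), set(e["gold_evidence"]))
              for e in examples) / len(examples)
    return {"accuracy": acc, "evidence_recall": rec}

example = {
    "pred_answer": "a ceramic tea set",
    "gold_answer": "a ceramic tea set",
    "pred_evidence": {"img_0123", "email_0456"},
    "gold_evidence": {"img_0123", "email_0456", "vid_0789"},
}
result = evaluate([example])  # accuracy 1.0, evidence recall 2/3
```

Separating the two scores is what lets an analysis like the one above distinguish a retrieval bottleneck from a reasoning-and-aggregation bottleneck.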

Resources

• 📄 Paper: arXiv:2603.01990 (https://arxiv.org/abs/2603.01990)
• 🌐 Project Page: https://atmbench.github.io/
• 💻 GitHub: https://github.com/JingbiaoMei/ATM-Bench
• 🤗 Dataset: https://huggingface.co/datasets/Jingbiao/ATM-Bench

