arxiv:2605.14906

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Published on May 14 · Submitted by Zhaowei Wang on May 15
#3 Paper of the day

Abstract

A new benchmark evaluates memory capabilities in vision-language models through multi-session conversations, revealing limitations of both long-context and memory-augmented approaches.

AI-generated summary

Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at https://github.com/xrenaf/MEMLENS.
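As a rough illustration of how results on a benchmark like this might be tallied, the sketch below groups accuracy by memory ability and context length and separately scores the subset of questions whose evidence includes images. The record fields (`ability`, `context_len`, `needs_image`, `correct`) and the sample rows are hypothetical and are not taken from the MEMLENS repository; only the 789-question total and the 80.4% share come from the abstract.

```python
from collections import defaultdict

# Hypothetical result records; field names are illustrative,
# not the actual MEMLENS output schema.
results = [
    {"ability": "information extraction", "context_len": "32K", "needs_image": True, "correct": True},
    {"ability": "information extraction", "context_len": "256K", "needs_image": True, "correct": False},
    {"ability": "multi-session reasoning", "context_len": "32K", "needs_image": True, "correct": False},
    {"ability": "answer refusal", "context_len": "32K", "needs_image": False, "correct": True},
]


def accuracy_by(records, *keys):
    """Return {key tuple: fraction correct} for records grouped by the given fields."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        k = tuple(r[key] for key in keys)
        totals[k] += 1
        hits[k] += int(r["correct"])
    return {k: hits[k] / totals[k] for k in totals}


# Accuracy per (ability, context length) cell.
per_cell = accuracy_by(results, "ability", "context_len")

# Accuracy restricted to questions whose evidence includes images.
image_subset = [r for r in results if r["needs_image"]]
image_acc = sum(r["correct"] for r in image_subset) / len(image_subset)

# Approximate size of the image-dependent subset implied by the abstract:
# 80.4% of 789 questions.
approx_image_questions = round(0.804 * 789)
```

Reporting the image-dependent subset on its own mirrors the shape of the ablation described above: a model that answers from text alone should score near zero there once the evidence images are removed.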

Community

We have open-sourced our code and dataset. Please check out our GitHub repository:
https://github.com/xrenaf/MEMLENS

Paper submitter

We would greatly appreciate it if you could upvote.

This is a timely and valuable benchmark. I really like that MEMLENS focuses on multimodal memory across multi-session conversations, rather than only long text context. The comparison between long-context LVLMs and memory-augmented agents is also very meaningful, and the image-ablation results clearly show that visual evidence is truly necessary.

I’m very interested in this work. One small question: How are the key evidence images distributed across sessions—are they concentrated in a few sessions or intentionally scattered throughout the conversation?

Thank you for your question. We maintain a uniform image-to-text token ratio to prevent images from being overly concentrated in a small number of sessions. More details about this design are provided in our paper. This procedure helps avoid potential shortcuts caused by image concentration and contributes to a more coherent and balanced dataset.
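For readers who want to check this kind of balance on their own data, here is a minimal sketch that computes an image-to-text token ratio per session and flags sessions that are unusually image-heavy. It assumes a fixed token cost per image (`tokens_per_image`) and a simple session record with `text_tokens` and `num_images`; both are hypothetical, and the actual cross-modal token-counting scheme is the one described in the paper.

```python
from statistics import mean


def image_text_ratio(session, tokens_per_image):
    """Image tokens divided by text tokens for one session (assumed fixed cost per image)."""
    return session["num_images"] * tokens_per_image / session["text_tokens"]


# Hypothetical sessions; the numbers are made up for illustration.
sessions = [
    {"text_tokens": 1000, "num_images": 2},
    {"text_tokens": 2000, "num_images": 4},
    {"text_tokens": 500, "num_images": 5},
]

ratios = [image_text_ratio(s, tokens_per_image=100) for s in sessions]

# Flag sessions whose ratio is more than twice the mean ratio
# (the 2x threshold is an arbitrary choice for this sketch).
threshold = 2 * mean(ratios)
flagged = [i for i, r in enumerate(ratios) if r > threshold]
```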
