MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism
Abstract
MemDreamer addresses long-video understanding challenges by decoupling perception and reasoning through hierarchical graph memory and agentic exploration, achieving state-of-the-art performance with reduced computational overhead.
Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logical reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
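To make the retrieval idea concrete, here is a minimal Python sketch of a tiered graph memory explored through an Observation-Reason-Action loop. It is an illustration under stated assumptions, not the paper's implementation: the names `Node`, `GraphMemory`, `explore`, the three tools (`expand`, `search`, `follow`), the video/scene/event tiers, and the `causes` relation are all hypothetical, and a scripted policy stands in for the reasoning model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One memory entry. Tier 0 is the most abstract, tier 2 the most detailed
    (the tier meanings here are an assumption for illustration)."""
    node_id: str
    tier: int
    text: str
    children: list = field(default_factory=list)  # ids of nodes one tier down
    edges: dict = field(default_factory=dict)     # relation name -> list of node ids


class GraphMemory:
    """Hypothetical tiered memory with relation edges between nodes."""

    def __init__(self):
        self.nodes = {}

    def add(self, node_id, tier, text, parent=None):
        self.nodes[node_id] = Node(node_id, tier, text)
        if parent is not None:
            self.nodes[parent].children.append(node_id)

    def link(self, src, relation, dst):
        self.nodes[src].edges.setdefault(relation, []).append(dst)

    # The three methods below play the role of tools offered to the reasoner.
    def expand(self, node_id):
        """Navigate the hierarchy: list the children of a node."""
        return list(self.nodes[node_id].children)

    def search(self, keyword):
        """Search nodes: ids whose text contains the keyword (case-insensitive)."""
        kw = keyword.lower()
        return [n.node_id for n in self.nodes.values() if kw in n.text.lower()]

    def follow(self, node_id, relation):
        """Traverse a logical edge from a node."""
        return list(self.nodes[node_id].edges.get(relation, []))


def explore(memory, policy, start, max_steps=8):
    """Run the loop: the policy reads the latest observation (Reason),
    picks a tool call (Action), and receives the result (Observation)."""
    tools = {"expand": memory.expand, "search": memory.search, "follow": memory.follow}
    observation = [start]
    trace = []
    for _ in range(max_steps):
        action = policy(memory, observation)
        if action[0] == "answer":
            return action[1], trace
        observation = tools[action[0]](*action[1:])
        trace.append((action, observation))
    return None, trace  # step budget exhausted without an answer


def scripted_policy(memory, observation):
    """Stand-in for the reasoning model: a fixed plan for one question,
    'What happened because the chef dropped the knife?'"""
    if not observation:
        return ("answer", None)
    if observation == ["video"]:
        return ("search", "drops the knife")
    if observation == ["e1"]:
        return ("follow", "e1", "causes")
    return ("answer", memory.nodes[observation[0]].text)


memory = GraphMemory()
memory.add("video", 0, "Cooking show episode")
memory.add("scene1", 1, "Chef prepares ingredients", parent="video")
memory.add("scene2", 1, "Chef plates the dish", parent="video")
memory.add("e1", 2, "Chef drops the knife", parent="scene1")
memory.add("e2", 2, "Assistant picks up the knife", parent="scene1")
memory.link("e1", "causes", "e2")

answer, trace = explore(memory, scripted_policy, start="video")
print(answer)  # Assistant picks up the knife
```

In this sketch the policy only ever sees short lists of node ids and the text of the nodes it chooses to read, which is the sense in which the reasoning context can stay small relative to the full video. A real system would replace `scripted_policy` with a model call and the substring search with a proper retriever.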
Community
MemDreamer decouples perception and reasoning for long-video understanding via a Hierarchical Graph Memory and an agentic retrieval mechanism. This paradigm bypasses context limits and mitigates attention dilution, offering a promising scaling direction for future multimodal comprehension.
We warmly welcome feedback, comments, and constructive criticism from the community.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios (2026)
- MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models (2026)
- VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis (2026)
- GOPAgen: Motion-Aware and Efficient Agentic Long-Video Understanding with Structural Memory and Hierarchical Reasoning (2026)
- OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning (2026)
- PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning (2026)
- HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration (2026)