metadata
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
tags:
  - multimodal-llm
  - video-understanding
  - long-term-memory

M3-Agent: Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

This model was presented in the paper Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory (arXiv:2508.09736).

Abstract

We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code, and data are publicly available.

Model Details

This repository contains the M3-Agent-Memorization model, the memorization component of the M3-Agent framework: it processes the visual and auditory content of video and generates the memory-graph entries that build and update the agent's long-term memory.
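The paper describes this memory as an entity-centric, multimodal graph mixing episodic entries (concrete observations) with semantic entries (distilled world knowledge). Purely as an illustration of the idea, a minimal sketch of such a structure might look like the following; the class and field names are hypothetical and do not reflect the repository's actual memory-graph schema:

from dataclasses import dataclass, field

# Hypothetical sketch of an entity-centric memory graph; the real format used by
# M3-Agent is defined in the GitHub repository and may differ substantially.
@dataclass
class MemoryEntry:
    kind: str        # "episodic" (a concrete observation) or "semantic" (distilled knowledge)
    content: str     # natural-language memory text produced by the memorization model
    clip_id: int     # index of the video clip the entry was generated from
    modality: str    # "visual", "audio", or "cross-modal"

@dataclass
class EntityNode:
    entity_id: str                                   # e.g. a recognized person or object
    entries: list[MemoryEntry] = field(default_factory=list)

# The memorization model emits new entries clip by clip; the agent attaches each
# entry to the entity it is about, so later retrieval can be entity-centric.
memory_graph: dict[str, EntityNode] = {}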

Usage

You can use this model with the transformers library. First, ensure you have the necessary dependencies installed. Note that qwen-omni-utils is the Qwen team's utility package for preprocessing the multimodal (image, video, audio) inputs this model consumes.

# Install the Hugging Face Transformers library (see the GitHub README for the exact pinned version)
pip install transformers==4.51.0
pip install qwen-omni-utils==0.0.4
pip install numpy==1.26.4
# vLLM is optional; it is used for faster inference during the control phase
# pip install vllm==0.8.4
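Optionally, a quick sanity check (generic, not specific to M3-Agent) that the packages import and a GPU is visible before running the example below:

import torch
import transformers

print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())  # the example also runs on CPU, just more slowly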

Here is a quick example of using the model to process a video and generate text (the memorization phase):

import torch
from transformers import AutoProcessor, Qwen2_5OmniThinkerForConditionalGeneration
from qwen_omni_utils import process_mm_info  # multimodal preprocessing helper from qwen-omni-utils

# Load model and processor
# The model is built on the Qwen2.5-Omni thinker architecture; make sure your
# installed transformers version includes Qwen2.5-Omni support (see the GitHub README)
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    "ByteDance-Seed/M3-Agent-Memorization",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
).eval()
processor = AutoProcessor.from_pretrained("ByteDance-Seed/M3-Agent-Memorization", trust_remote_code=True)

# Example: Process a video input to generate descriptive text
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "./path/to/your/video.mp4", # Replace with the actual path to your video file
            },
            {"type": "text", "text": "Describe this video in detail."},
        ],
    }
]

# Prepare inputs for the model
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# process_mm_info extracts the image/video (and optionally audio) inputs referenced in the messages;
# audio handling is omitted in this minimal example
_, image_inputs, video_inputs = process_mm_info(messages, use_audio_in_video=False)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # keep the inputs on the same device as the model

# Generate output
generated_ids = model.generate(**inputs, max_new_tokens=512) 

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)

Note: The full M3-Agent framework involves additional data preparation (e.g., intermediate outputs from face detection and speaker diarization) and memory-graph generation. For comprehensive instructions on setting up and running the complete system, please refer to the official GitHub repository.
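As a rough, hypothetical sketch of how the memorization model can be driven over a long video, reusing the model and processor loaded above: the video is split into short clips and each clip is turned into memory text. The clip paths, prompt, and segmentation are placeholders; the actual pipeline, prompts, and memory-graph output format are defined in the GitHub repository.

from qwen_omni_utils import process_mm_info

# Hypothetical clip-by-clip loop; the real pipeline also supplies face/voice
# identities and writes structured memory graphs instead of raw text.
def memorize_clip(clip_path: str) -> str:
    messages = [{
        "role": "user",
        "content": [
            {"type": "video", "video": clip_path},
            {"type": "text", "text": "Generate memory for this clip."},  # placeholder prompt
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    _, images, videos = process_mm_info(messages, use_audio_in_video=False)
    inputs = processor(text=[text], images=images, videos=videos, padding=True, return_tensors="pt").to(model.device)
    out_ids = model.generate(**inputs, max_new_tokens=512)
    out_ids = out_ids[:, inputs.input_ids.shape[1]:]  # drop the prompt tokens
    return processor.batch_decode(out_ids, skip_special_tokens=True)[0]

# Paths below are placeholders for clips produced by segmenting the source video
memories = [memorize_clip(p) for p in ["./clips/clip_000.mp4", "./clips/clip_001.mp4"]]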

Citation

@misc{long2025seeing,
      title={Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory}, 
      author={Lin Long and Yichen He and Wentao Ye and Yiyuan Pan and Yuan Lin and Hang Li and Junbo Zhao and Wei Li},
      year={2025},
      eprint={2508.09736},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}