Enhance model card with comprehensive details, links, and usage

#2
by nielsr HF Staff - opened
Files changed (1)
README.md +111 -1
README.md CHANGED
@@ -1,4 +1,114 @@
  ---
  license: apache-2.0
+ pipeline_tag: video-text-to-text
+ library_name: transformers
+ tags:
+ - multimodal-llm
+ - video-understanding
+ - long-term-memory
  ---
- paper link: https://arxiv.org/abs/2508.09736
+
+ # M3-Agent: Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
+
+ <div align="left">
+ <img src="https://github.com/user-attachments/assets/c42e675e-497c-4508-8bb9-093ad4d1f216" width="40%">
+ </div>
+
+ The model was presented in the paper [Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory](https://arxiv.org/abs/2508.09736).
+
+ - [🌐 Project Page](https://m3-agent.github.io)
+ - [💻 GitHub Repository](https://github.com/ByteDance-Seed/M3-Agent)
+ - [🎥 Demo Video](https://www.youtube.com/watch?v=XUx31cBanfo)
+ - [📚 M3-Bench Dataset](https://huggingface.co/datasets/ByteDance-Seed/M3-Bench)
+
+ ## Abstract
+ We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code, and data are available at the links above.
+
+ ## Model Details
+ This repository contains the **M3-Agent-Memorization** model, a component of the M3-Agent framework. It processes streaming visual and auditory inputs and builds and updates long-term memory by generating entity-centric memory graphs from video input.
+
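To make the idea of an entity-centric, multimodal memory concrete, here is a minimal sketch of what such a structure could hold. All names (`EntityNode`, `MemoryGraph`, `add_episodic`, `add_semantic`) are hypothetical and for illustration only; the repository's actual memory-graph format is documented on GitHub.

```python
# Hypothetical sketch of an entity-centric memory graph: one node per entity,
# holding time-stamped episodic observations plus deduplicated semantic facts.
# Illustrative only -- not the repository's actual schema.
from dataclasses import dataclass, field


@dataclass
class EntityNode:
    entity_id: str
    episodic: list = field(default_factory=list)   # (clip_id, observation) pairs
    semantic: list = field(default_factory=list)   # accumulated world knowledge


class MemoryGraph:
    def __init__(self):
        self.nodes: dict[str, EntityNode] = {}

    def _node(self, entity_id: str) -> EntityNode:
        # Create the entity node on first reference
        return self.nodes.setdefault(entity_id, EntityNode(entity_id))

    def add_episodic(self, entity_id: str, clip_id: int, observation: str) -> None:
        self._node(entity_id).episodic.append((clip_id, observation))

    def add_semantic(self, entity_id: str, fact: str) -> None:
        node = self._node(entity_id)
        if fact not in node.semantic:              # semantic memory deduplicates
            node.semantic.append(fact)


graph = MemoryGraph()
graph.add_episodic("person_1", clip_id=3, observation="waves at the robot")
graph.add_semantic("person_1", "person_1 is named Alice")
```

Keeping episodic and semantic memory on the same entity node is what lets later retrieval connect a raw observation to distilled knowledge about the same person or object.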
+ ## Usage
+
+ You can use this model with the `transformers` library. First install the dependencies below; `qwen-omni-utils` is a helper library used to preprocess this model's multimodal inputs.
+
+ ```bash
+ # Versions below follow the GitHub README; check the repo for updates
+ pip install transformers==4.51.0
+ pip install qwen-omni-utils==0.0.4
+ pip install numpy==1.26.4
+ # vLLM is optional, used for faster inference during the control phase
+ # pip install vllm==0.8.4
+ ```
+
+ Here's a quick example using the model to process a video and generate text (the memorization phase):
+
+ ```python
+ from transformers import AutoProcessor, Qwen2_5OmniThinkerForConditionalGeneration
+ from qwen_omni_utils import process_mm_info  # multimodal preprocessing helper
+
+ # Load model and processor. The model is based on the Qwen2.5-Omni thinker;
+ # if your transformers version lacks this class, see the GitHub README for
+ # the exact version to install. trust_remote_code covers any custom code
+ # shipped with the checkpoint.
+ model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
+     "ByteDance-Seed/M3-Agent-Memorization",
+     torch_dtype="auto",
+     device_map="auto",
+     trust_remote_code=True,
+ ).eval()
+ processor = AutoProcessor.from_pretrained("ByteDance-Seed/M3-Agent-Memorization", trust_remote_code=True)
+
+ # Example: process a video input to generate descriptive text
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "video", "video": "./path/to/your/video.mp4"},  # replace with your video file
+             {"type": "text", "text": "Describe this video in detail."},
+         ],
+     }
+ ]
+
+ # Prepare inputs for the model; process_mm_info handles audio/image/video inputs
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
+ inputs = processor(
+     text=[text],
+     audio=audios,
+     images=images,
+     videos=videos,
+     padding=True,
+     return_tensors="pt",
+ )
+ inputs = inputs.to(model.device)  # keep inputs on the same device as the model
+
+ # Generate, then decode only the newly generated tokens
+ generated_ids = model.generate(**inputs, max_new_tokens=512)
+ generated_ids_trimmed = [
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )[0]
+ print(output_text)
+ ```
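In the control phase described in the abstract, the agent answers an instruction through multi-turn, iterative reasoning, repeatedly retrieving from memory until it has enough context. The toy loop below sketches that control flow only; simple keyword matching stands in for the RL-trained model's retrieval and reasoning, and all names and memory entries are invented for illustration.

```python
# Toy sketch of iterative memory retrieval: each turn searches memory with the
# current query, folds new hits into the context, and refines the query.
# Keyword matching is a stand-in for the agent's learned retrieval/reasoning.
MEMORY = [
    "clip 3: person_1 waves at the robot",
    "semantic: person_1 is named Alice",
    "clip 7: Alice puts the mug on the desk",
]


def search(query: str) -> list[str]:
    """Return memory entries sharing at least one token with the query."""
    terms = query.lower().split()
    return [m for m in MEMORY if any(t in m.lower() for t in terms)]


def answer(question: str, max_turns: int = 3) -> str:
    context: list[str] = []
    query = question
    for _ in range(max_turns):
        hits = [h for h in search(query) if h not in context]
        if not hits:           # nothing new to retrieve: stop iterating
            break
        context.extend(hits)
        query = " ".join(context)  # refine the query with what was learned
    return " | ".join(context)


print(answer("Who is person_1?"))
```

The first turn only finds the semantic fact about `person_1`; the refined second-turn query then pulls in the episodic clips mentioning Alice, which is the kind of cross-referencing an entity-centric memory enables.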
+
+ *Note*: The full M3-Agent pipeline involves additional data preparation (e.g., intermediate outputs from face detection and speaker diarization) and memory graph generation. For comprehensive instructions on setting up and running the complete system, please refer to the [official GitHub repository](https://github.com/ByteDance-Seed/M3-Agent).
+
+ ## Citation
+
+ ```bibtex
+ @misc{long2025seeing,
+       title={Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory},
+       author={Lin Long and Yichen He and Wentao Ye and Yiyuan Pan and Yuan Lin and Hang Li and Junbo Zhao and Wei Li},
+       year={2025},
+       eprint={2508.09736},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV}
+ }
+ ```