
Improve model card: Add metadata, links, usage, and visuals

#2
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +83 -1
README.md CHANGED
@@ -1,4 +1,86 @@
  ---
  license: apache-2.0
+ pipeline_tag: video-text-to-text
+ library_name: transformers
  ---
- paper link: https://arxiv.org/abs/2508.09736
+
+ # Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
+
+ This repository contains a model checkpoint from the M3-Agent framework, which was introduced in the paper [Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory](https://arxiv.org/abs/2508.09736).
+
+ - 🌐 [**Project Page**](https://m3-agent.github.io)
+ - 💻 [**GitHub Repository**](https://github.com/hyc2026/M3-Agent-Training)
+
+ <div align="left">
+ <img src="https://github.com/user-attachments/assets/c42e675e-497c-4508-8bb9-093ad4d1f216" width="40%">
+ </div>
+
+ ## Abstract
+ We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code, and data are available via the project links above.
+
+ ## M3-Agent Demo Video
+ Explore M3-Agent's capabilities as a personal assistant in this demo video:
+
+ [![Watch the video](https://github.com/ByteDance-Seed/M3-Agent/raw/main/figs/demo.png)](https://www.youtube.com/watch?v=XUx31cBanfo)
+
+ ## Usage
+
+ This model is designed to be a component of the larger M3-Agent framework, typically used for memory generation or control tasks. It can be loaded with the Hugging Face `transformers` library.
+
+ For detailed usage within the full M3-Agent pipeline, including processing video and audio inputs to generate memory graphs and performing reasoning tasks, please refer to the [official GitHub repository](https://github.com/hyc2026/M3-Agent-Training).
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Assuming this repository contains the 'M3-Agent-Memorization' or a similar Qwen-based checkpoint.
+ # If your repository has a different ID, replace it below.
+ model_id = "ByteDance-Seed/M3-Agent-Memorization"  # This might need to be adjusted to the actual repo ID
+
+ # Load the tokenizer and model
+ # trust_remote_code=True may be needed if the checkpoint ships custom modeling code
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,  # Use torch.float16 if bfloat16 is not supported by your hardware
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ # Set the model to evaluation mode
+ model.eval()
+
+ print(f"Model {model_id} loaded successfully.")
+ print("This model is a component of the M3-Agent framework. For full agent pipeline usage,")
+ print("including video processing for memory generation and control tasks,")
+ print("please refer to the official GitHub repository: https://github.com/hyc2026/M3-Agent-Training")
+
+ # Example of basic text generation (demonstrates that the model loads and runs).
+ # The actual use case for this model is within the M3-Agent pipeline,
+ # involving multimodal inputs and structured outputs.
+ # messages = [
+ #     {"role": "user", "content": "Hello! How are you today?"},
+ # ]
+ # text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ # inputs = tokenizer(text, return_tensors="pt").to(model.device)
+ #
+ # with torch.inference_mode():
+ #     outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
+ #     generated_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+ #     print(f"\nGenerated response (general text generation): {generated_text}")
+ ```
+
+ ## Citation
+ If you find this model or the M3-Agent project helpful, please cite the following paper:
+
+ ```bibtex
+ @misc{long2025seeing,
+       title={Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory},
+       author={Lin Long and Yichen He and Wentao Ye and Yiyuan Pan and Yuan Lin and Hang Li and Junbo Zhao and Wei Li},
+       year={2025},
+       eprint={2508.09736},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV}
+ }
+ ```