Enhance model card with comprehensive details, links, and usage
#2 by nielsr (HF Staff) - opened

README.md
---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- multimodal-llm
- video-understanding
- long-term-memory
---

# M3-Agent: Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

<div align=left>
<img src="https://github.com/user-attachments/assets/c42e675e-497c-4508-8bb9-093ad4d1f216" width=40%>
</div>

The model was presented in the paper [Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory](https://arxiv.org/abs/2508.09736).

- [🌐 Project Page](https://m3-agent.github.io)
- [💻 GitHub Repository](https://github.com/ByteDance-Seed/M3-Agent)
- [🎥 Demo Video](https://www.youtube.com/watch?v=XUx31cBanfo)
- [📚 M3-Bench Dataset](https://huggingface.co/datasets/ByteDance-Seed/M3-Bench)

## Abstract

We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code, and data are available at the [GitHub repository](https://github.com/ByteDance-Seed/M3-Agent).

## Model Details

This repository contains the **M3-Agent-Memorization** model, a component of the M3-Agent framework. It processes visual and auditory inputs to build and update long-term memory, generating entity-centric memory graphs from video input.
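To make the "memory graph" idea concrete, here is a minimal, purely illustrative sketch of an entity-centric store that separates episodic entries (what happened) from semantic entries (accumulated knowledge). The class and field names are hypothetical; the actual schema is defined in the GitHub repository.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    kind: str      # "episodic" (observed event) or "semantic" (world knowledge)
    text: str
    clip_id: int   # video clip the entry was extracted from

@dataclass
class EntityNode:
    name: str
    entries: list = field(default_factory=list)

class MemoryGraph:
    """Toy entity-centric memory: one node per entity, typed entries per node."""

    def __init__(self):
        self.entities = {}

    def add(self, entity: str, kind: str, text: str, clip_id: int):
        node = self.entities.setdefault(entity, EntityNode(entity))
        node.entries.append(MemoryEntry(kind, text, clip_id))

    def retrieve(self, entity: str, kind=None):
        node = self.entities.get(entity)
        if node is None:
            return []
        return [e.text for e in node.entries if kind is None or e.kind == kind]

graph = MemoryGraph()
graph.add("Alice", "episodic", "Alice picked up the red mug.", clip_id=3)
graph.add("Alice", "semantic", "Alice prefers coffee in the morning.", clip_id=3)
print(graph.retrieve("Alice", kind="semantic"))
```

Keying memory by entity rather than by timestamp is what lets retrieval answer questions like "what do we know about Alice?" across many clips.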
| 29 |
+
|
| 30 |
+
## Usage
|
| 31 |
+
|
| 32 |
+
You can use this model with the `transformers` library. First, ensure you have the necessary dependencies installed. Note that `qwen-omni-utils` is a custom library used by this model for processing multimodal inputs.
|
| 33 |
+
|
| 34 |
+
```bash
|
| 35 |
+
# Install Hugging Face Transformers library (recommended version from GitHub README or compatible)
|
| 36 |
+
pip install transformers==4.51.0 # Or refer to the GitHub repo for the exact version
|
| 37 |
+
pip install qwen-omni-utils==0.0.4
|
| 38 |
+
pip install numpy==1.26.4
|
| 39 |
+
# vLLM is optional, used for faster inference during control phase
|
| 40 |
+
# pip install vllm==0.8.4
|
| 41 |
+
```
|

Here's a quick example that processes a video input and generates text (the memorization phase):

```python
from transformers import AutoProcessor, Qwen2_5OmniThinkerForConditionalGeneration
# process_mm_info (from the qwen-omni-utils package) prepares multimodal inputs
from qwen_omni_utils import process_mm_info

# Load model and processor; the model uses the Qwen2_5OmniThinker architecture,
# so trust_remote_code=True is required on transformers versions without native support
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    "ByteDance-Seed/M3-Agent-Memorization",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
).eval()
processor = AutoProcessor.from_pretrained(
    "ByteDance-Seed/M3-Agent-Memorization", trust_remote_code=True
)

# Example: process a video input to generate descriptive text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "./path/to/your/video.mp4"},  # replace with your video file
            {"type": "text", "text": "Describe this video in detail."},
        ],
    }
]

# Prepare inputs for the model
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# process_mm_info extracts audio/image/video data from the messages
audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
inputs = processor(
    text=[text],
    audio=audios,
    images=images,
    videos=videos,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # keep inputs on the same device as the model

# Generate, then decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
```

*Note*: The full M3-Agent framework involves additional data preparation (e.g., intermediate outputs from face detection and speaker diarization) and memory graph generation. For complete instructions on setting up and running the system, please refer to the [official GitHub repository](https://github.com/ByteDance-Seed/M3-Agent).
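Because memory is built incrementally, the memorization phase runs over successive video segments rather than a whole file at once. As a hedged illustration (the helper below is hypothetical; the real pipeline in the GitHub repo also runs face detection and speaker diarization per clip), the clip-segmentation step might look like:

```python
def segment_video(duration_s: float, clip_len_s: float = 30.0):
    """Split a video of duration_s seconds into (start, end) clip boundaries."""
    clips = []
    start = 0.0
    while start < duration_s:
        end = min(start + clip_len_s, duration_s)
        clips.append((start, end))
        start = end
    return clips

# A 95-second video becomes four clips; the final clip is shorter.
print(segment_video(95.0))
```

Each clip would then be passed through the model above, with the resulting output merged into the long-term memory graph.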

## Citation

```bibtex
@misc{long2025seeing,
      title={Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory},
      author={Lin Long and Yichen He and Wentao Ye and Yiyuan Pan and Yuan Lin and Hang Li and Junbo Zhao and Wei Li},
      year={2025},
      eprint={2508.09736},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```