Files changed (2)
  1. README.md +12 -15
  2. data_summary_card.md +0 -146
README.md CHANGED
@@ -1,7 +1,7 @@
 ---
 library_name: transformers
+pipeline_tag: image-text-to-text
 license: mit
-pipeline_tag: robotics
 ---
 
 # Model Card for Magma-8B
@@ -180,8 +180,7 @@ image = image.convert("RGB")
 
 convs = [
     {"role": "system", "content": "You are agent that can see, talk and act."},
-    {"role": "user", "content": "<image_start><image><image_end>
-What is in this image?"},
+    {"role": "user", "content": "<image_start><image><image_end>\nWhat is in this image?"},
 ]
 prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
 inputs = processor(images=[image], texts=prompt, return_tensors="pt")
@@ -223,7 +222,7 @@ Our training data consists of:
 
 * Robotics Manipulation Data: [Open-X-Embodiment](https://robotics-transformer-x.github.io/).
 
-* UI Grounding Data: [SeeClick](https://github.com/njucckevin/SeeClick).\
+* UI Grounding Data: [SeeClick](https://github.com/njucckevin/SeeClick).
 
 * UI Navigation Data: [Mind2web](https://osu-nlp-group.github.io/Mind2Web/) and [AITW](https://github.com/google-research/google-research/tree/master/android_in_the_wild).
 
@@ -474,16 +473,14 @@ For the robotic manipulation task, some mitigation strategies to use for human s
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
 ```bibtex
-@misc{yang2025magmafoundationmodelmultimodal,\
-      title={Magma: A Foundation Model for Multimodal AI Agents}, \
-      author={Jianwei Yang and Reuben Tan and Qianhui Wu and Ruijie Zheng and Baolin Peng and Yongyuan Liang and Yu Gu and Mu Cai and Seonghyeon Ye and Joel Jang and Yuquan Deng and Lars Liden and Jianfeng Gao},\
-      year={2025},\
-      eprint={2502.13130},\
-      archivePrefix={arXiv},\
-      url={https://arxiv.org/abs/2502.13130}, \
+@misc{yang2025magmafoundationmodelmultimodal,
+      title={Magma: A Foundation Model for Multimodal AI Agents},
+      author={Jianwei Yang and Reuben Tan and Qianhui Wu and Ruijie Zheng and Baolin Peng and Yongyuan Liang and Yu Gu and Mu Cai and Seonghyeon Ye and Joel Jang and Yuquan Deng and Lars Liden and Jianfeng Gao},
+      year={2025},
+      eprint={2502.13130},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2502.13130},
 }
 ```
-<!-- {{ citation_bibtex | default("[More Information Needed]", true)}} -->
-
-## Data Summary
-https://huggingface.co/microsoft/Magma-8B/blob/main/data_summary_card.md
+<!-- {{ citation_bibtex | default("[More Information Needed]", true)}} -->
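The fix in the second hunk matters because the original user turn contained a raw line break inside a plain Python string literal, which is a `SyntaxError`; escaping it as `\n` keeps the newline in the prompt while keeping the snippet valid. For context, below is a minimal end-to-end sketch built around the context lines of that hunk. Only `convs`, `apply_chat_template`, and the `processor(...)` call come from the diff itself; the loading code, dtype handling, the `unsqueeze` calls, the generation arguments, and the image URL are assumptions modeled on the card's usage section and should be checked against the full README.

```python
import requests
import torch
from io import BytesIO
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Magma ships custom model/processor classes, so remote code must be trusted.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Magma-8B", trust_remote_code=True, torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
model.to("cuda")

# Placeholder image URL (assumption); any RGB image works.
url = "https://example.com/sample.png"
image = Image.open(BytesIO(requests.get(url).content)).convert("RGB")

convs = [
    {"role": "system", "content": "You are agent that can see, talk and act."},
    # The corrected turn: image tokens and question in one string, "\n" escaped.
    {"role": "user", "content": "<image_start><image><image_end>\nWhat is in this image?"},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image], texts=prompt, return_tensors="pt")
# The card's usage section adds a batch dimension before generation (assumption).
inputs["pixel_values"] = inputs["pixel_values"].unsqueeze(0)
inputs["image_sizes"] = inputs["image_sizes"].unsqueeze(0)
inputs = inputs.to("cuda").to(torch.bfloat16)  # dtype cast touches float tensors only

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, dropping the prompt.
generate_ids = generate_ids[:, inputs["input_ids"].shape[-1]:]
print(processor.decode(generate_ids[0], skip_special_tokens=True).strip())
```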
 
 
 
data_summary_card.md DELETED
@@ -1,146 +0,0 @@
-
-
-# Data Summary for Magma 8B
-
-
-
-
-
-## 1. General information
-
-**1.0.1 Version of the Summary:** 1.0
-
-
-
-**1.0.2 Last update:** 24-Nov-2025
-
-
-
-## 1.1 Model Developer Identification
-
-**1.1.1 Model Developer name and contact details:** Microsoft Corporation at One Microsoft Way, Redmond, WA 98052. Tel: 425-882-8080
-
-
-
-## 1.2 Model Identification
-
-**1.2.1 Versioned model name(s):** Magma-8B
-
-
-
-**1.2.2 Model release date:** 19-Feb-2025
-
-
-
-## 1.3 Overall training data size and characteristics
-
-### 1.3.1 Size of dataset and characteristics
-
-**1.3.1.A Text training data size:** Less than 1 billion tokens
-
-
-
-**1.3.1.B Text training data content:** Image captions, Conversational Dialogs, Text instructions for tasks.
-
-
-
-**1.3.1.C Image training data size:** 1 billion to 10 trillion tokens
-
-
-
-**1.3.1.D Image training data content:** Training included multimodal image datasets and UI screenshots for grounding and navigation such as ShareGPT4V, LLaVA-1.5 instruction data, InfoGraphicVQA, ChartQA, FigureQA, TQA, ScienceQA, SeeClick and Vision2UI; images cover photography, charts, figures, documents, infographics, and interface elements
-
-
-
-**1.3.1.E Audio training data size:** Not applicable. Audio data is not part of the training data
-
-
-**1.3.1.F Audio training data content:** Not applicable
-
-
-
-**1.3.1.G Video training data size:** Less than 1 billion tokens
-
-
-
-**1.3.1.H Video training data content:** Instructional and egocentric videos used for agentic pretraining and temporal grounding, including Epic-Kitchens, Ego4D, Something-Something v2 and other instructional clips; videos were segmented and filtered, and used to derive Trace-of-Mark trajectories for action planning
-
-
-
-**1.3.1.I Other training data size:** Robotics data comprising approximately 9.4 million image-language-action triplets from around 326,000 trajectories within Open-X-Embodiment mixtures
-
-
-
-**1.3.1.J Other training data content:** Robotics manipulation datasets from Open-X-Embodiment used for vision-language-action learning, including 7-DoF gripper states and visual traces to support action prediction
-
-
-
-**1.3.2 Latest date of data acquisition/collection for model training:** 11-Jan-2024
-
-
-
-**1.3.3 Is data collection ongoing to update the model with new data collection after deployment?** No
-
-
-
-**1.3.4 Date the training dataset was first used to train the model:** 8-Jan-2024
-
-
-
-**1.3.5 Rationale or purpose of data selection:** Datasets were selected to cover multimodal understanding and agentic capabilities across digital and physical environments. UI datasets provide actionable elements for grounding and navigation; instructional videos supply rich temporal dynamics for action planning; robotics datasets provide action trajectories for manipulation; and multimodal image instruction data maintains general visual-language competence. This mix supports spatial-temporal reasoning, grounding, and planning
-
-
-
-## 2. List of data sources
-
-### 2.1 Publicly available datasets
-
-**2.1.1 Have you used publicly available datasets to train the model?** Yes
-
-
-
-## 2.2 Private non-publicly available datasets obtained from third parties
-
-### 2.2.1 Datasets commercially licensed by rights holders or their representatives
-
-**2.2.1.A Have you concluded transactional commercial licensing agreement(s) with rights holder(s) or with their representatives?** No
-
-
-
-### 2.2.2 Private datasets obtained from other third-parties
-
-**2.2.2.A Have you obtained private datasets from third parties that are not licensed as described in Section 2.2.1, such as data obtained from providers of private databases, or data intermediaries?** No
-
-
-
-## 2.3 Personal Information
-
-**2.3.1 Was personal data used to train the model?** Microsoft follows all relevant laws and regulations pertaining to personal information.
-
-
-
-## 2.4 Synthetic data
-
-**2.4.1 Was any synthetic AI-generated data used to train the model?** Yes
-
-
-
-## 3. Data processing aspects
-
-### 3.1 Respect of reservation of rights from text and data mining exception or limitation
-
-**3.1.1 Does this dataset include any data protected by copyright, trademark, or patent?** Microsoft follows all required regulations and laws for processing data protected by copyright, trademark, or patent.
-
-
-
-## 3.2 Other information
-
-**3.2.1 Does the dataset include information about consumer groups without revealing individual consumer identities?** Microsoft follows all required regulations and laws for protecting consumer identities.
-
-
-
-**3.2.2 Was the dataset cleaned or modified before model training?** Yes
-
-
-
-