binarykoder youliangt committed
Commit fd0ed6b · 0 Parent(s)

Duplicate from nvidia/GR00T-N1.7-3B

Co-authored-by: You Liang Tan <youliangt@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,35 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
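The patterns above route matching files through Git LFS. As a rough illustration of which repository files they would capture, the sketch below checks file names against a few of the listed patterns using Python's `fnmatch`. This is only an approximation: `fnmatch` does not implement gitattributes matching rules exactly (for example, its `*` also matches `/`), and the helper name `is_lfs_tracked` and the sample file names are hypothetical.

```python
from fnmatch import fnmatchcase

# A subset of the patterns listed in .gitattributes above.
LFS_PATTERNS = [
    "*.bin",
    "*.safetensors",
    "*.tar.*",
    "saved_model/**/*",
    "*tfevents*",
]


def is_lfs_tracked(path: str) -> bool:
    """Return True if the path matches any pattern (approximate glob semantics)."""
    name = path.rsplit("/", 1)[-1]
    # Assumption: patterns without a slash are compared against the file name,
    # patterns with a slash against the whole path.
    return any(
        fnmatchcase(path if "/" in pattern else name, pattern)
        for pattern in LFS_PATTERNS
    )


for candidate in ["model.safetensors", "config.json", "runs/events.out.tfevents.1"]:
    print(candidate, is_lfs_tracked(candidate))
```

For an authoritative answer on a real checkout, `git check-attr filter -- <path>` reports the attribute Git itself resolves.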
EXPLAINABILITY.md ADDED
@@ -0,0 +1,14 @@
1
+ # **Explainability**
2
+
3
+ |Field:|Response:|
4
+ |:---:|:---:|
5
+ |Intended Domain:| Open foundation model for generalized humanoid robot reasoning and skills.|
6
+ |Model Type: |Robot VLA model|
7
+ |Intended Users:|This model is intended for developers and community members who build and fine-tune robot foundation models.|
8
+ |Output:|The model outputs actions as floating-point values. This is referred to as a "robot action policy." Actions consist of continuous-value vectors that correspond to different motor controls on a robot.|
9
+ |Describe how the model works:|Accepts vision, language and proprioception, outputs robot action policy.|
10
+ |Technical Limitations & Mitigation:| This model is not tested or intended for use in mission critical applications that require functional safety. The use of the model in those applications is at the user's own risk and sole responsibility, including taking the necessary steps to add needed guardrails or safety mechanisms prior to deployment.<br><br>Risk: Model underperformance in highly dynamic environments with varying robot surroundings (e.g. furniture, objects, etc) and lighting conditions.<br>Mitigation: Enhance dataset with dynamic obstacle scenarios and fine-tune models accordingly.<br><br>Risk: Integration challenges in specific customer environments with varying robot surroundings (e.g. furniture, objects, etc) and lighting conditions.<br>Mitigation: Provide detailed integration guides and support, leveraging NVIDIA's ecosystem.<br><br>Risk: Limited initial support for certain robot embodiments.<br>Mitigation: Expand testing and validation across a wider range of robot platforms.|
11
+ |Verified to have met prescribed quality standards?|Yes|
12
+ |Performance Metrics:|Success rate, as well as the following:<br>1) if the trajectory is smooth and does not jitter<br>2) if the robot does not hit any other objects<br>3) if the trajectory is natural|
13
+ |Potential Known Risks:|This model is not tested or intended for use in mission critical applications that require functional safety. The use of the model in those applications is at the user's own risk and sole responsibility, including taking the necessary steps to add needed guardrails or safety mechanisms prior to deployment.|
14
+ |End User License Agreement:| Your use of this model is governed by the [NSCL V1 License](https://developer.download.nvidia.com/licenses/NVIDIA-OneWay-Noncommercial-License-22Mar2022.pdf?t=eyJscyI6ImdzZW8iLCJsc2QiOiJodHRwczovL3d3dy5nb29nbGUuY29tLyIsIm5jaWQiOiJzby15b3V0LTg3MTcwMS12dDQ4In0=).|
LICENSE ADDED
@@ -0,0 +1,61 @@
1
+ NVIDIA License
2
+ 1. Definitions
3
+ “Licensor” means any person or entity that distributes its Work.
4
+ “Work” means (a) the original work of authorship made available under this license,
5
+ which may include software, documentation, or other files, and (b) any additions to or
6
+ derivative works thereof that are made available under this license.
7
+ The terms “reproduce,” “reproduction,” “derivative works,” and “distribution” have the
8
+ meaning as provided under U.S. copyright law; provided, however, that for the purposes
9
+ of this license, derivative works shall not include works that remain separable from, or
10
+ merely link (or bind by name) to the interfaces of, the Work.
11
+ Works are “made available” under this license by including in or with the Work either (a)
12
+ a copyright notice referencing the applicability of this license to the Work, or (b) a copy
13
+ of this license.
14
+ 2. License Grant
15
+ 2.1 Copyright Grant. Subject to the terms and conditions of this license, each
16
+ Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free,
17
+ copyright license to use, reproduce, prepare derivative works of, publicly display,
18
+ publicly perform, sublicense and distribute its Work and any resulting derivative
19
+ works in any form.
20
+ 3. Limitations
21
+ 3.1 Redistribution. You may reproduce or distribute the Work only if (a) you do so
22
+ under this license, (b) you include a complete copy of this license with your
23
+ distribution, and (c) you retain without modification any copyright, patent,
24
+ trademark, or attribution notices that are present in the Work.
25
+ 3.2 Derivative Works. You may specify that additional or different terms apply to
26
+ the use, reproduction, and distribution of your derivative works of the Work (“Your
27
+ Terms”) only if (a) Your Terms provide that the use limitation in Section 3.3
28
+ applies to your derivative works, and (b) you identify the specific derivative works
29
+ that are subject to Your Terms. Notwithstanding Your Terms, this license (including
30
+ the redistribution requirements in Section 3.1) will continue to apply to the Work
31
+ itself.
32
+ 3.3 Use Limitation. The Work and any derivative works thereof only may be used
33
+ or intended for use non-commercially. Notwithstanding the foregoing, NVIDIA
34
+ Corporation and its affiliates may use the Work and any derivative works
35
+ commercially. As used herein, “non-commercially” means for research or
36
+ evaluation purposes only.
37
+ 3.4 Patent Claims. If you bring or threaten to bring a patent claim against any
38
+ Licensor (including any claim, cross-claim or counterclaim in a lawsuit) to enforce
39
+ any patents that you allege are infringed by any Work, then your rights under this
40
+ license from such Licensor (including the grant in Section 2.1) will terminate
41
+ immediately.
42
+ 3.5 Trademarks. This license does not grant any rights to use any Licensor’s or its
43
+ affiliates’ names, logos, or trademarks, except as necessary to reproduce the
44
+ notices described in this license.
45
+ 3.6 Termination. If you violate any term of this license, then your rights under this
46
+ license (including the grant in Section 2.1) will terminate immediately.
47
+ 4. Disclaimer of Warranty.
48
+ THE WORK IS PROVIDED “AS IS” WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
49
+ EITHER EXPRESS OR IMPLIED, INCLUDING WARRANTIES OR CONDITIONS OF
50
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-
51
+ INFRINGEMENT. YOU BEAR THE RISK OF UNDERTAKING ANY ACTIVITIES UNDER THIS
52
+ LICENSE.
53
+ 5. Limitation of Liability.
54
+ EXCEPT AS PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL
55
+ THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE
56
+ SHALL ANY LICENSOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT,
57
+ INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR
58
+ RELATED TO THIS LICENSE, THE USE OR INABILITY TO USE THE WORK (INCLUDING BUT
59
+ NOT LIMITED TO LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR
60
+ DATA, COMPUTER FAILURE OR MALFUNCTION, OR ANY OTHER DAMAGES OR LOSSES),
61
+ EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
PRIVACY.md ADDED
@@ -0,0 +1,11 @@
1
+ # **Privacy**
2
+
3
+ |Field:|Response:|
4
+ |:---:|:---:|
5
+ |Generatable or reverse engineerable personal data?|None|
6
+ |Personal data used to create this model?|No|
7
+ |How often is dataset reviewed?|Before Release|
8
+ |Is there provenance for all datasets used in training?|Yes|
9
+ |Does data labeling (annotation, metadata) comply with privacy laws?|Yes|
10
+ |Is data compliant with data subject requests for data correction or removal, if such a request was made?|Yes|
11
+ |Applicable NVIDIA Privacy Policy|https://www.nvidia.com/en-us/about-nvidia/privacy-policy/|
README.md ADDED
@@ -0,0 +1,232 @@
1
+ ---
2
+ tags:
3
+ - robotics
4
+ ---
5
+
6
+ <div align="center">
7
+ <a href="https://github.com/NVIDIA/Isaac-GR00T">
8
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/67b8da81d01134f89899b4a7/8bFQa2ZIGCsOQQ2ho2N_U.png">
9
+ </a>
10
+ <div align="center">
11
+ <a href="https://github.com/NVIDIA/Isaac-GR00T">
12
+ <img src="https://img.shields.io/badge/GitHub-grey?logo=GitHub" alt="GitHub Badge">
13
+ </a>
14
+ <a href="https://developer.nvidia.com/isaac/gr00t">
15
+ <img src="https://img.shields.io/badge/Website-green" alt="Website Badge">
16
+ </a>
17
+ <!-- <a href=""">
18
+ <img src="https://img.shields.io/badge/Project%20Page-blue?style=plastic" alt="Project Page Badge">
19
+ </a>
20
+ <a href="">
21
+ <img src="https://img.shields.io/badge/Research_Blog-black?style=flat" alt="Research Blog Badge">
22
+ </a>
23
+ <a href="">
24
+ <img src="https://img.shields.io/badge/Dataset-Overview-brightgreen?logo=googleforms" alt="Research Blog Badge">
25
+ </a>
26
+ -->
27
+ </div>
28
+ </div>
29
+
30
+ # Model Overview
31
+
32
+ <p align="center">
33
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/67b8da81d01134f89899b4a7/ZCLLXZk2LQBG0YH_BmiIN.gif"
34
+ style="width:100%; max-width:1000px; height:auto;">
35
+ </p>
36
+
37
+ ## Description:
38
+ NVIDIA Isaac GR00T N1.7 is an open foundation model for generalized humanoid robot reasoning and skills. This cross-embodiment model takes multimodal input, including language and images, to perform manipulation tasks in diverse environments. Developers and researchers can post-train GR00T N1.7 with real or synthetic data for their specific humanoid robot or task.
39
+
40
+ Isaac GR00T N1.7 is the medium-sized version of our model. It is built on pre-trained vision and language encoders and uses a flow matching action transformer to model a chunk of actions conditioned on vision, language, and proprioception.
41
+
42
+ A detailed description of the Isaac GR00T N1.X architecture is provided in the GR00T N1 White Paper (https://arxiv.org/abs/2503.14734).
43
+
44
+ This model is ready for commercial/non-commercial use.
45
+
46
+ **Model Developer**: NVIDIA
47
+
48
+ ## Model Versions
49
+ The Isaac GR00T N1.7 model family includes the following 4 models:
50
+
51
+ ### GR00T N1.7 – SimplerEnv Bridge
52
+
53
+ **Description**
54
+ N1.7 post-trained model using the **Bridge Dataset** in SimplerEnv.
55
+
56
+ **Post-Training Data**
57
+ https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot
58
+
59
+ **Dataset Summary**
60
+ A LeRobot-format conversion of **BridgeData V2**, originally containing **60,096 trajectories** of robot manipulation across **24 environments**.
61
+
62
+ ### GR00T N1.7 – SimplerEnv Fractal
63
+
64
+ **Description**
65
+ N1.7 post-trained model using the **Fractal Dataset** in SimplerEnv.
66
+
67
+ **Post-Training Data**
68
+ https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot
69
+
70
+ **Dataset Summary**
71
+ A LeRobot-format conversion of **BridgeData V2**, originally containing **60,096 trajectories** of robot manipulation across **24 environments**.
72
+
73
+ ### GR00T N1.7 – DROID
74
+ **Description**
75
+ N1.7 post-trained model using the **DROID Dataset**.
76
+
77
+ **Post-Training Data**
78
+ https://droid-dataset.github.io/
79
+
80
+ **Dataset Summary**
81
+ A large-scale **“in-the-wild” robot manipulation dataset** with approximately **76,000 demonstration trajectories (~350 hours)** of interaction data, collected across **564 distinct scenes in 52 buildings**, covering **86 manipulation tasks** from natural-language instructions.
82
+
83
+ ### GR00T N1.7 – LIBERO
84
+ **Description**
85
+ N1.7 post-trained model using the **LIBERO Dataset**.
86
+
87
+ **Post-Training Data**
88
+ https://github.com/Lifelong-Robot-Learning/LIBERO
89
+
90
+ **Dataset Summary**
91
+ A benchmark for **lifelong robot learning**, providing **130 language-conditioned manipulation tasks** grouped into multiple task suites.
92
+ Includes **human-teleoperated demonstrations** designed to evaluate **knowledge transfer and continual learning** in robotic agents.
93
+
94
+ ## License
95
+ This model is released under the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
96
+
97
+
98
+ ### Deployment Geography:
99
+ Global
100
+
101
+ ### Use Case:
102
+ Researchers, Academics, Open-Source Community: AI-driven robotics research and algorithm development.
103
+ Developers: Integrate and customize AI for various robotic applications.
104
+ Startups & Companies: Accelerate robotics development and reduce training costs.
105
+
106
+ ### Release Date:
107
+ * GitHub via https://github.com/NVIDIA/Isaac-GR00T
108
+ * Hugging Face via https://huggingface.co/collections/nvidia/gr00t-n17
109
+
110
+ ## Computational Load (Internal Only: For NVIDIA Models Only)
111
+ Cumulative Compute: Follow Instructions
112
+ Estimated Energy and Emissions for Model Training: Follow Instructions
113
+ Total kWh:
114
+ 64 GB200 nodes x 4 GPUs per node x 1200 W x 0.001 x 0.8 x 120 hours x 1.4 = 41288 kWh
115
+ Total Emission:
116
+ 410.5 x 41288 x 0.000001 = 16.949 tCO2e
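The energy and emission figures above can be reproduced with a few lines of arithmetic. The sketch below only re-multiplies the numbers printed in the model card; the variable names are illustrative, and the card does not state what the 0.8 and 1.4 multipliers or the 410.5 factor represent, so they are left unnamed here.

```python
# All numeric values are copied from the model card; names are illustrative.
nodes = 64
gpus_per_node = 4
watts_per_gpu = 1200
kw_per_watt = 0.001
first_multiplier = 0.8   # meaning not stated in the card
hours = 120
second_multiplier = 1.4  # meaning not stated in the card

total_kwh = (
    nodes
    * gpus_per_node
    * watts_per_gpu
    * kw_per_watt
    * first_multiplier
    * hours
    * second_multiplier
)

emission_factor = 410.5  # value from the card; 1e-6 scales the product to tCO2e
total_tco2e = emission_factor * round(total_kwh) * 1e-6

print(f"Total energy: {round(total_kwh)} kWh")
print(f"Total emissions: {total_tco2e:.3f} tCO2e")
```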
117
+
118
+ ## Model Architecture:
119
+
120
+ **GR00T-N1.7 VLM backbone is now [Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B)**
121
+
122
+ **Network Architecture:**
123
+
124
+ The schematic diagram is shown in the illustration above.
125
+ Red, Green, Blue (RGB) camera frames are processed through a pre-trained vision transformer (SigLip2).
126
+ Text is encoded by a pre-trained transformer (T5).
127
+ Robot proprioception is encoded using a multi-layer perceptron (MLP) indexed by the embodiment ID. To handle variable-dimension proprioception, inputs are padded to a configurable max length before feeding into the MLP.
128
+ Actions are encoded and velocity predictions decoded by an MLP, one per unique embodiment.
129
+ The flow matching transformer is implemented as a diffusion transformer (DiT), in which the diffusion step conditioning is implemented using adaptive layernorm (AdaLN).
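To make the flow matching description concrete, here is a toy sketch of how an action vector could be produced by integrating a predicted velocity field with a few Euler steps. It is not the GR00T implementation: `toy_velocity` stands in for the DiT's velocity prediction and uses the closed-form field for a single known target, and the default of 4 steps is chosen only to mirror `num_inference_timesteps` in `config.json`. All function names are hypothetical.

```python
import random


def toy_velocity(x, t, target):
    """Velocity field that transports any sample to `target` at t = 1.

    For a single target point, straight-line flow matching has the
    closed-form velocity (target - x) / (1 - t).
    """
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]


def euler_sample(target, num_steps=4, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t takes the values 0, dt, ..., 1 - dt (never 1)
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target_action = [0.1, -0.3, 0.5]
action = euler_sample(target_action)
print(action)
```

In a real model the velocity would come from the network conditioned on vision, language, and state features rather than from a known target; the toy field is used here only so the integration loop can be checked end to end.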
130
+
131
+ ![Model Architecture](model-architecture.png)
132
+
133
+ **Number of Model Parameters:** 3,000,000,000
134
+
135
+ ## Input:
136
+ **Input Type(s):**
137
+ - Vision: Image Frames
138
+ - State: Robot Proprioception
139
+ - Language Instruction: Text
140
+ - Embodiment ID: Integer
141
+
142
+ **Input Format:**
143
+ - Vision: Variable number of uint8 image frames, coming from robot cameras
144
+ - State: Floating Point
145
+ - Language Instruction: String
146
+ - Embodiment ID: Integer indicating which of the training embodiments is observed
147
+
148
+ **Input Parameters:**
149
+ - Vision: Two-Dimensional (2D) - Red, Green, Blue (RGB)
150
+ - State: One-Dimensional (1D) - Floating number vector
151
+ - Language Instruction: One-Dimensional (1D) - String
152
+ - Embodiment ID: One-Dimensional (1D) - Integer
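The state input is a 1D float vector whose length varies by embodiment, and the architecture notes say such inputs are padded to a configurable max length before the MLP. A minimal sketch of that padding step follows. Zero padding and the validity mask are assumptions made for illustration, `pad_state` is a hypothetical helper, and the default of 132 mirrors `max_state_dim` in `config.json`.

```python
def pad_state(state, max_state_dim=132):
    """Pad a variable-length state vector to a fixed length.

    Returns the padded vector and a mask marking which entries are real.
    Zero padding and the mask are illustrative assumptions.
    """
    if len(state) > max_state_dim:
        raise ValueError(
            f"state has {len(state)} dims, exceeds max_state_dim={max_state_dim}"
        )
    pad = max_state_dim - len(state)
    padded = [float(v) for v in state] + [0.0] * pad
    mask = [True] * len(state) + [False] * pad
    return padded, mask


# Example: a hypothetical 7-DoF arm state.
padded, mask = pad_state([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
print(len(padded), sum(mask))
```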
153
+
154
+ ## Output:
155
+ **Output Type(s):** Actions
156
+
157
+ **Output Format:** Continuous-value vectors
158
+
159
+ **Output Parameters:** [Two-Dimensional (2D)] <br>
160
+
161
+ **Other Properties Related to Output:** Continuous-value vectors correspond to different motor controls on a robot; their dimensionality depends on the degrees of freedom of the robot embodiment.
162
+
163
+ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
164
+
165
+ ## Software Integration:
166
+
167
+ **Runtime Engine(s):** PyTorch
168
+
169
+ **Supported Hardware Microarchitecture Compatibility:**
170
+ All of the below:
171
+ * NVIDIA Ampere
172
+ * NVIDIA Blackwell
173
+ * NVIDIA Jetson
174
+ * NVIDIA Hopper
175
+ * NVIDIA Lovelace
176
+
177
+ **[Preferred/Supported] Operating System(s):**
178
+ * Linux
179
+
180
+ The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
181
+
182
+ # Model Version
183
+ GR00T N1.7 EA
184
+
185
+ # Training and Evaluation Datasets:
186
+ The total size (in number of data points): 21.6 million <br>
187
+ Total number of datasets: 13 <br>
188
+
189
+
190
+ ## Training Dataset:
191
+ GR00T Pretraining Data
192
+
193
+ **Data Collection Method by dataset:** Hybrid: Human, Robot, Simulated.
194
+
195
+ **Labeling Method by dataset:** Hybrid: Human, Automated.
196
+
197
+ **Properties:**
198
+ * Cross-embodiment: Data collected on various robot embodiments
199
+ * Sensor types: RGB camera, robot proprioception, robot actuator data
200
+
201
+
202
+ ## Evaluation:
203
+ We evaluate in both simulation and real robot benchmarks, as defined in the White Paper (https://arxiv.org/abs/2503.14734).
204
+
205
+ **Data Collection Method by dataset:** Hybrid: Human, Robot, Simulated.
206
+
207
+ **Labeling Method by dataset:** Hybrid: Human, Automated.
208
+
209
+ * Sim evaluation benchmarks for upper body control
210
+ * 9 DexMG Whitepaper tasks
211
+ * 24 RoboCasa simulated mobile manipulator tasks
212
+ * 24 Digital Cousin simulated GR-1 humanoid manipulation tasks
213
+ * For sim, we automatically measure the success rate in each manipulation behavior.
214
+ * For real robot
215
+ * Grocery packing task
216
+ * Novel objects (unseen from training data)
217
+ * Industrial multi-robot coordination with handoffs
218
+ * Evaluated by human observers in the lab
219
+
220
+
221
+ ## Inference:
222
+ **Engine:** PyTorch
223
+ **Test Hardware:** A6000
224
+
225
+ ## Ethical Considerations:
226
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
227
+
228
+ Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
229
+
230
+ For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
231
+
232
+ Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
SAFETY_and_SECURITY.md ADDED
@@ -0,0 +1,8 @@
1
+ # **Safety & Security**
2
+
3
+ |Field:|Response:|
4
+ |:---:|:---:|
5
+ |Model Application(s):|Machinery and Robotics<br>Robot VLA - single-arm manipulation, bimanual grippers, bi-manual dex hands manipulation and humanoid dexterous manipulation|
6
+ |Describe life critical application (if present):|This model is not tested or intended for use in mission critical applications that require functional safety. The use of the model in those applications is at the user's own risk and sole responsibility, including taking the necessary steps to add needed guardrails or safety mechanisms prior to deployment.|
7
+ |Use Case Restrictions:|Abide by the [NSCL V1 License](https://developer.download.nvidia.com/licenses/NVIDIA-OneWay-Noncommercial-License-22Mar2022.pdf?t=eyJscyI6ImdzZW8iLCJsc2QiOiJodHRwczovL3d3dy5nb29nbGUuY29tLyIsIm5jaWQiOiJzby15b3V0LTg3MTcwMS12dDQ4In0=)|
8
+ |Model and Dataset Restrictions:|The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions are enforced on dataset access during training, and dataset license constraints are adhered to.|
SUCCESS ADDED
File without changes
config.json ADDED
@@ -0,0 +1,86 @@
1
+ {
2
+ "action_horizon": 40,
3
+ "add_pos_embed": true,
4
+ "apply_sincos_state_encoding": false,
5
+ "architectures": [
6
+ "Gr00tN1d7"
7
+ ],
8
+ "attn_dropout": 0.2,
9
+ "backbone_embedding_dim": 2048,
10
+ "color_jitter_params": {
11
+ "brightness": 0.3,
12
+ "contrast": 0.4,
13
+ "hue": 0.08,
14
+ "saturation": 0.5
15
+ },
16
+ "crop_fraction": 0.95,
17
+ "diffusion_model_cfg": {
18
+ "attention_head_dim": 48,
19
+ "dropout": 0.2,
20
+ "final_dropout": true,
21
+ "interleave_self_attention": true,
22
+ "norm_type": "ada_norm",
23
+ "num_attention_heads": 32,
24
+ "num_layers": 32,
25
+ "output_dim": 1024,
26
+ "positional_embeddings": null
27
+ },
28
+ "dtype": "bfloat16",
29
+ "exclude_state": false,
30
+ "formalize_language": true,
31
+ "hidden_size": 1024,
32
+ "image_crop_size": [
33
+ 230,
34
+ 230
35
+ ],
36
+ "image_target_size": [
37
+ 256,
38
+ 256
39
+ ],
40
+ "letter_box_transform": false,
41
+ "load_bf16": true,
42
+ "max_action_dim": 132,
43
+ "max_num_embodiments": 32,
44
+ "max_seq_len": 1024,
45
+ "max_state_dim": 132,
46
+ "model_dtype": "bfloat16",
47
+ "model_type": "Gr00tN1d7",
48
+ "noise_beta_alpha": 1.5,
49
+ "noise_beta_beta": 1.0,
50
+ "noise_s": 0.999,
51
+ "num_inference_timesteps": 4,
52
+ "num_timestep_buckets": 1000,
53
+ "random_history_crop": true,
54
+ "random_rotation_angle": 0,
55
+ "reproject_vision": false,
56
+ "rtc_ramp_rate": 6.0,
57
+ "select_layer": 16,
58
+ "shortest_image_edge": 256,
59
+ "state_dropout_prob": 0.2,
60
+ "state_gaussian_noise_std": 0.0,
61
+ "transformers_version": "4.57.1",
62
+ "tune_diffusion_model": true,
63
+ "tune_linear": true,
64
+ "tune_llm": true,
65
+ "tune_projector": true,
66
+ "tune_top_llm_layers": 0,
67
+ "tune_visual": true,
68
+ "tune_vlln": true,
69
+ "use_albumentations": true,
70
+ "use_alternate_vl_dit": true,
71
+ "use_flash_attention": true,
72
+ "use_future_tokens": false,
73
+ "use_mean_std": false,
74
+ "use_percentiles": true,
75
+ "use_vl_self_attention": true,
76
+ "use_vlln": true,
77
+ "vl_self_attention_cfg": {
78
+ "attention_head_dim": 64,
79
+ "dropout": 0.2,
80
+ "final_dropout": true,
81
+ "num_attention_heads": 32,
82
+ "num_layers": 4,
83
+ "positional_embeddings": null
84
+ },
85
+ "model_name": "nvidia/Cosmos-Reason2-2B"
86
+ }
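A few of the values in this config can be cross-checked against each other. The sketch below copies a subset of the fields and derives attention widths as heads times head dimension, which is a common transformer convention and an assumption here, not something the config states. Under that assumption the DiT width comes out to 1536, which coincides with `dit_latent_dim` in `experiment_cfg/conf.yaml`, and the VL self-attention width equals `backbone_embedding_dim`. Reading `(action_horizon, max_action_dim)` as the shape of a padded action chunk is likewise an assumption.

```python
# Subset of config.json, copied by hand for illustration.
config = {
    "action_horizon": 40,
    "backbone_embedding_dim": 2048,
    "max_action_dim": 132,
    "diffusion_model_cfg": {
        "attention_head_dim": 48,
        "num_attention_heads": 32,
    },
    "vl_self_attention_cfg": {
        "attention_head_dim": 64,
        "num_attention_heads": 32,
    },
}


def attention_width(block_cfg):
    """Assumed convention: model width = number of heads x per-head dimension."""
    return block_cfg["num_attention_heads"] * block_cfg["attention_head_dim"]


dit_width = attention_width(config["diffusion_model_cfg"])
vl_width = attention_width(config["vl_self_attention_cfg"])
padded_chunk_shape = (config["action_horizon"], config["max_action_dim"])

print("DiT width:", dit_width)
print("VL self-attention width:", vl_width)
print("Padded action chunk shape (assumed):", padded_chunk_shape)
```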
embodiment_id.json ADDED
@@ -0,0 +1,54 @@
1
+ {
2
+ "robocasa_panda_omron": 13,
3
+ "oxe_droid": 17,
4
+ "oxe_fractal": 18,
5
+ "oxe_language_table": 19,
6
+ "oxe_bridge": 20,
7
+ "unknown": 22,
8
+ "gr1_unified": 20,
9
+ "agibot": 26,
10
+ "sim_behavior_r1_pro": 23,
11
+ "xdof": 24,
12
+ "xdof_oss_data": 25,
13
+ "unitree_g1_full_body_with_waist_height_nav_cmd": 25,
14
+ "real_r1_pro_sharpa": 27,
15
+ "real_r1_pro_sharpa_add_view": 27,
16
+ "real_r1_pro_sharpa_relative_arm_joint": 26,
17
+ "real_r1_pro_sharpa_delta_eef": 26,
18
+ "real_r1_pro_sharpa_absolute_eef": 26,
19
+ "real_r1_pro_sharpa_meanstd": 26,
20
+ "real_r1_pro_sharpa_relative_eef": 26,
21
+ "real_r1_pro_sharpa_relative_eef_add_view": 26,
22
+ "real_r1_pro_sharpa_relative_eef_relative_hand": 26,
23
+ "real_r1_pro_sharpa_relative_eef_human": 26,
24
+ "real_r1_pro_sharpa_relative_eef_human_add_view": 26,
25
+ "real_r1_pro_sharpa_relative_eef_human_relative_hand": 26,
26
+ "real_r1_pro_sharpa_relative_eef_egodex": 26,
27
+ "real_r1_pro_sharpa_relative_eef_egodex_relative_hand": 26,
28
+ "real_r1_pro_sharpa_relative_eef_egodex_wrist_only": 26,
29
+ "real_r1_pro_sharpa_relative_eef_maxinsights": 26,
30
+ "real_r1_pro_sharpa_relative_eef_maxinsights_relative_hand": 26,
31
+ "real_r1_pro_sharpa_relative_eef_mecka": 26,
32
+ "real_r1_pro_sharpa_relative_eef_mecka_relative_hand": 26,
33
+ "real_g1_relative_eef_absolute_joints": 25,
34
+ "real_g1_relative_eef_absolute_joints_wrist_cam": 25,
35
+ "real_g1_relative_eef_relative_joints": 25,
36
+ "real_r1_pro_sharpa_relative_eef_relative_hand_relative_joint": 26,
37
+ "real_r1_pro_sharpa_relative_joint": 29,
38
+ "oxe_droid_relative_eef_relative_joint": 24,
39
+ "oxe_droid_relative_eef_relative_joint_swapped": 24,
40
+ "oxe_droid_relative_eef_relative_joint_upweight_z": 24,
41
+ "oxe_droid_relative_eef_relative_joint_upweight_z_swapped": 24,
42
+ "oxe_droid_relative_eef_relative_joint_3view": 24,
43
+ "oxe_droid_relative_eef_relative_joint_3view_swapped": 24,
44
+ "oxe_droid_relative_eef": 24,
45
+ "oxe_droid_joint_position_relative": 24,
46
+ "xdof_relative_eef_relative_joint": 27,
47
+ "xdof_relative_eef_relative_joint_subtask": 27,
48
+ "xdof_relative_eef": 27,
49
+ "xdof_relative_joint": 28,
50
+ "simpler_env_google": 0,
51
+ "simpler_env_widowx": 1,
52
+ "libero_sim": 2,
53
+ "droid_sim": 3
54
+ }
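Several embodiment tags in this mapping share the same integer ID. The sketch below groups a hand-copied subset of the entries by ID and checks that each ID fits under `max_num_embodiments` from `config.json`. Treating that value as an exclusive upper bound on valid IDs is an assumption, and `group_by_id` is a hypothetical helper.

```python
from collections import defaultdict

# Subset of embodiment_id.json, copied by hand for illustration.
embodiment_ids = {
    "oxe_droid": 17,
    "oxe_bridge": 20,
    "gr1_unified": 20,
    "simpler_env_google": 0,
    "simpler_env_widowx": 1,
    "libero_sim": 2,
    "droid_sim": 3,
}

MAX_NUM_EMBODIMENTS = 32  # from config.json; assumed to bound valid IDs


def group_by_id(mapping):
    """Return {id: [tags...]} and reject IDs outside the assumed range."""
    groups = defaultdict(list)
    for tag, idx in mapping.items():
        if not 0 <= idx < MAX_NUM_EMBODIMENTS:
            raise ValueError(f"{tag}: id {idx} is outside [0, {MAX_NUM_EMBODIMENTS})")
        groups[idx].append(tag)
    return dict(groups)


groups = group_by_id(embodiment_ids)
shared = {idx: tags for idx, tags in groups.items() if len(tags) > 1}
print(shared)
```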
experiment_cfg/conf.yaml ADDED
@@ -0,0 +1,1324 @@
1
+ load_config_path: groot/vla/omni/configs/experiments/r1_pro/sharpa/n17_pretrain/n17_pretrain_human_robot_cross_embodiment_fix_yam_absolute_hand_2step.yaml
2
+ model:
3
+ return_dict: true
4
+ output_hidden_states: false
5
+ torchscript: false
6
+ dtype: null
7
+ pruned_heads: {}
8
+ tie_word_embeddings: true
9
+ chunk_size_feed_forward: 0
10
+ is_encoder_decoder: false
11
+ is_decoder: false
12
+ cross_attention_hidden_size: null
13
+ add_cross_attention: false
14
+ tie_encoder_decoder: false
15
+ architectures: null
16
+ finetuning_task: null
17
+ id2label:
18
+ 0: LABEL_0
19
+ 1: LABEL_1
20
+ label2id:
21
+ LABEL_0: 0
22
+ LABEL_1: 1
23
+ task_specific_params: null
24
+ problem_type: null
25
+ tokenizer_class: null
26
+ prefix: null
27
+ bos_token_id: null
28
+ pad_token_id: null
29
+ eos_token_id: null
30
+ sep_token_id: null
31
+ decoder_start_token_id: null
32
+ max_length: 20
33
+ min_length: 0
34
+ do_sample: false
35
+ early_stopping: false
36
+ num_beams: 1
37
+ temperature: 1.0
38
+ top_k: 50
39
+ top_p: 1.0
40
+ typical_p: 1.0
41
+ repetition_penalty: 1.0
42
+ length_penalty: 1.0
43
+ no_repeat_ngram_size: 0
44
+ encoder_no_repeat_ngram_size: 0
45
+ bad_words_ids: null
46
+ num_return_sequences: 1
47
+ output_scores: false
48
+ return_dict_in_generate: false
49
+ forced_bos_token_id: null
50
+ forced_eos_token_id: null
51
+ remove_invalid_values: false
52
+ exponential_decay_length_penalty: null
53
+ suppress_tokens: null
54
+ begin_suppress_tokens: null
55
+ num_beam_groups: 1
56
+ diversity_penalty: 0.0
57
+ transformers_version: null
58
+ model_type: GrootN1d5Qwen
59
+ model_dtype: bfloat16
60
+ vlm_backend: qwen3
61
+ vlm_model_path: nvidia/Cosmos-Reason2-2B
62
+ backbone_embedding_dim: 2048
63
+ tune_llm: false
64
+ tune_top_llm_layers: 0
65
+ tune_visual: false
66
+ tune_linear: true
67
+ select_layer: 16
68
+ reproject_vision: false
69
+ use_flash_attention: true
70
+ load_bf16: true
71
+ exclude_state: false
72
+ image_crop_size:
73
+ - 230
74
+ - 230
75
+ image_target_size:
76
+ - 256
77
+ - 256
78
+ random_rotation_angle: 0
79
+ color_jitter_params:
80
+ brightness: 0.3
81
+ contrast: 0.4
82
+ saturation: 0.5
83
+ hue: 0.08
84
+ formalize_language: true
85
+ action_space_prompt: false
86
+ apply_sincos_state_encoding: false
87
+ letter_box_transform: false
88
+ use_percentiles: true
89
+ use_mean_std: false
90
+ use_albumentations: true
91
+ shortest_image_edge: 256
92
+ crop_fraction: 0.95
93
+ random_history_crop: true
94
+ state_gaussian_noise_std: 0.0
95
+ do_human_interpolation: false
96
+ interpolation_steps: 20
97
+ human_embodiment_tags: null
98
+ max_state_dim: 132
99
+ max_action_dim: 132
100
+ action_horizon: 40
101
+ hidden_size: 1024
102
+ dit_latent_dim: 1536
103
+ state_dropout_prob: 0.2
104
+ language_dropout_prob: 0.0
105
+ add_pos_embed: true
106
+ attn_dropout: 0.2
107
+ use_vlln: true
108
+ use_vl_self_attention: true
109
+ max_seq_len: 1024
110
+ use_future_tokens: false
111
+ use_alternate_vl_dit: true
112
+ vl_self_attention_cfg:
113
+ positional_embeddings: null
114
+ num_layers: 4
115
+ num_attention_heads: 32
116
+ attention_head_dim: 64
117
+ dropout: 0.2
118
+ final_dropout: true
119
+ diffusion_model_cfg:
120
+ positional_embeddings: null
121
+ num_layers: 32
122
+ num_attention_heads: 32
123
+ attention_head_dim: 48
124
+ norm_type: ada_norm
125
+ dropout: 0.2
126
+ final_dropout: true
127
+ output_dim: 1024
128
+ interleave_self_attention: true
129
+ cross_attention_dim: 2048
130
+ num_inference_timesteps: 4
131
+ noise_beta_alpha: 1.5
132
+ noise_beta_beta: 1.0
133
+ noise_s: 0.999
134
+ num_timestep_buckets: 1000
135
+ tune_projector: true
136
+ tune_diffusion_model: true
137
+ tune_vlln: true
138
+ max_num_embodiments: 32
139
+ rtc_ramp_rate: 6.0
140
+ tf_legacy_loss: false
141
+ use_bfloat16: false
142
+ data:
143
+ datasets:
144
+ - dataset_paths:
145
+ - /mnt/aws-lfs-02/shared/datasets/xdof.yam_v7_all_merged_global_task_exclude_bad_subtasks
146
+ embodiment_tag: xdof_relative_eef_relative_joint
147
+ mix_ratio: 0.1
148
+ dataset_type: physical_embodiment
149
+ - dataset_paths:
150
+ - /mnt/aws-lfs-02/shared/datasets/xdof.yam_v7_subtask_only_merged_global_task
151
+ embodiment_tag: xdof_relative_eef_relative_joint_subtask
152
+ mix_ratio: 0.2
153
+ dataset_type: physical_embodiment
154
+ - dataset_paths:
155
+ - /mnt/aws-lfs-02/shared/datasets/droid_101_success_idlefiltered_n17
156
+ - /mnt/aws-lfs-02/shared/datasets/droid_101_success_idlefiltered_n17_swapped
157
+ embodiment_tag: oxe_droid_relative_eef_relative_joint
158
+ mix_ratio: 0.1
159
+ dataset_type: physical_embodiment
160
+ - dataset_paths:
161
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_g1.g1-in-the-wild-merged
162
+ embodiment_tag: real_g1_relative_eef_relative_joints
163
+ mix_ratio: 0.05
164
+ dataset_type: physical_embodiment
165
+ - dataset_paths:
166
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.inlab_play_real_robot_batch_1
167
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.inlab_play_real_robot_batch_2
168
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.miscellaneous_1k_trajectories
169
+ embodiment_tag: real_r1_pro_sharpa_relative_eef
170
+ mix_ratio: 0.05
171
+ dataset_type: physical_embodiment
172
+ - dataset_paths:
173
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch1-2025-12-10-merged
174
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch3_2026-01-04-merged_backup
175
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch4_2026-01-05-merged_backup
176
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch5_2026-01-05-merged_backup
177
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch6_2026-01-05-merged_backup
178
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch10_2026-01-10-merged_backup
179
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch11_2026-01-10-merged_backup
180
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch12_2026-01-10-merged_backup
181
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch8_2026-01-10-merged_backup
182
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch9_2026-01-10-merged_backup
183
+ embodiment_tag: real_r1_pro_sharpa_relative_eef_mecka
184
+ mix_ratio: 0.25
185
+ dataset_type: physical_embodiment
186
+ - dataset_paths:
187
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/maxinsights_lerobot_updated/1530hrs/real_r1_pro_sharpa.maxinsights_1530hrs_updated_train_set_merged
188
+ embodiment_tag: real_r1_pro_sharpa_relative_eef_maxinsights
189
+ mix_ratio: 0.2
190
+ dataset_type: physical_embodiment
191
+ - dataset_paths:
192
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.inlab_play_human_batch1
193
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.inlab_play_human_batch2
194
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.shirt_rolling_task24_2000_human_video_filter_n6_keep1619_demo_stats
195
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.shirt_rolling_task15_2000_human_video_filter_n6_keep572_demo_stats
196
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.sort_cards_human_filter_n6_keep523_demo_stats_overwrite_left_side_stats
197
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.tong_task38_2000_human_video_overwrite_left_side_stats
198
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.syringe_task30i_2000_human_video_filtered
199
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.unscrew_bottle_task43_2000_human_video_fixed-duration
200
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.unscrew_Jim_bottle_task47_600_human_video
201
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.fold_shirt_task30b_500_human_video_halfdone
202
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.fold_towel_task30c_500_human_video_halfdone
203
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.sort_cards_task32e_1000_human_video
204
+ embodiment_tag: real_r1_pro_sharpa_relative_eef_human
205
+ mix_ratio: 0.05
206
+ dataset_type: physical_embodiment
207
+ modality_configs:
208
+ real_g1_relative_eef_relative_joints:
209
+ video:
210
+ delta_indices:
211
+ - -20
212
+ - 0
213
+ modality_keys:
214
+ - ego_view
215
+ normalization_mode: null
216
+ action_representation: null
217
+ exclude_state: false
218
+ action_type: null
219
+ action_format: null
220
+ normalize_rotation: true
221
+ wrist_keys: null
222
+ hand_keys: null
223
+ extra_keys: null
224
+ loss_weights: null
225
+ state:
226
+ delta_indices:
227
+ - 0
228
+ modality_keys:
229
+ - left_wrist_eef_9d
230
+ - right_wrist_eef_9d
231
+ - left_hand
232
+ - right_hand
233
+ - left_arm
234
+ - right_arm
235
+ - waist
236
+ normalization_mode: null
237
+ action_representation: null
238
+ exclude_state: false
239
+ action_type: null
240
+ action_format: null
241
+ normalize_rotation: true
242
+ wrist_keys: null
243
+ hand_keys: null
244
+ extra_keys: null
245
+ loss_weights: null
246
+ action:
247
+ delta_indices:
248
+ - 0
249
+ - 1
250
+ - 2
251
+ - 3
252
+ - 4
253
+ - 5
254
+ - 6
255
+ - 7
256
+ - 8
257
+ - 9
258
+ - 10
259
+ - 11
260
+ - 12
261
+ - 13
262
+ - 14
263
+ - 15
264
+ - 16
265
+ - 17
266
+ - 18
267
+ - 19
268
+ - 20
269
+ - 21
270
+ - 22
271
+ - 23
272
+ - 24
273
+ - 25
274
+ - 26
275
+ - 27
276
+ - 28
277
+ - 29
278
+ - 30
279
+ - 31
280
+ - 32
281
+ - 33
282
+ - 34
283
+ - 35
284
+ - 36
285
+ - 37
286
+ - 38
287
+ - 39
288
+ modality_keys:
289
+ - left_wrist_eef_9d
290
+ - right_wrist_eef_9d
291
+ - left_hand
292
+ - right_hand
293
+ - left_arm
294
+ - right_arm
295
+ - waist
296
+ - base_height_command
297
+ - navigate_command
298
+ normalization_mode: null
299
+ action_representation:
300
+ - {}
301
+ - {}
302
+ - {}
303
+ - {}
304
+ - {}
305
+ - {}
306
+ - {}
307
+ - {}
308
+ - {}
309
+ exclude_state: false
310
+ action_type:
311
+ - {}
312
+ - {}
313
+ - {}
314
+ - {}
315
+ - {}
316
+ - {}
317
+ - {}
318
+ - {}
319
+ - {}
320
+ action_format:
321
+ - {}
322
+ - {}
323
+ - {}
324
+ - {}
325
+ - {}
326
+ - {}
327
+ - {}
328
+ - {}
329
+ - {}
330
+ normalize_rotation: true
331
+ wrist_keys:
332
+ - left_wrist_eef_9d
333
+ - right_wrist_eef_9d
334
+ hand_keys:
335
+ - left_hand
336
+ - right_hand
337
+ extra_keys:
338
+ - left_arm
339
+ - right_arm
340
+ - waist
341
+ - base_height_command
342
+ - navigate_command
343
+ loss_weights: null
344
+ language:
345
+ delta_indices:
346
+ - 0
347
+ modality_keys:
348
+ - annotation.human.task_description
349
+ normalization_mode: null
350
+ action_representation: null
351
+ exclude_state: false
352
+ action_type: null
353
+ action_format: null
354
+ normalize_rotation: true
355
+ wrist_keys: null
356
+ hand_keys: null
357
+ extra_keys: null
358
+ loss_weights: null
359
+ real_r1_pro_sharpa_relative_eef_mecka:
360
+ video:
361
+ delta_indices:
362
+ - -30
363
+ - 0
364
+ modality_keys:
365
+ - ego_view_cropratio_res320x240_freq30
366
+ normalization_mode: null
367
+ action_representation: null
368
+ exclude_state: false
369
+ action_type: null
370
+ action_format: null
371
+ normalize_rotation: true
372
+ wrist_keys: null
373
+ hand_keys: null
374
+ extra_keys: null
375
+ loss_weights: null
376
+ state:
377
+ delta_indices:
378
+ - 0
379
+ modality_keys:
380
+ - left_wrist_eef
381
+ - right_wrist_eef
382
+ - left_hand_joints
383
+ - right_hand_joints
384
+ normalization_mode: null
385
+ action_representation: null
386
+ exclude_state: true
387
+ action_type: null
388
+ action_format: null
389
+ normalize_rotation: true
390
+ wrist_keys: null
391
+ hand_keys: null
392
+ extra_keys: null
393
+ loss_weights: null
394
+ action:
395
+ delta_indices:
396
+ - 0
397
+ - 1
398
+ - 2
399
+ - 3
400
+ - 4
401
+ - 5
402
+ - 6
403
+ - 7
404
+ - 8
405
+ - 9
406
+ - 10
407
+ - 11
408
+ - 12
409
+ - 13
410
+ - 14
411
+ - 15
412
+ - 16
413
+ - 17
414
+ - 18
415
+ - 19
416
+ - 20
417
+ - 21
418
+ - 22
419
+ - 23
420
+ - 24
421
+ - 25
422
+ - 26
423
+ - 27
424
+ - 28
425
+ - 29
426
+ - 30
427
+ - 31
428
+ - 32
429
+ - 33
430
+ - 34
431
+ - 35
432
+ - 36
433
+ - 37
434
+ - 38
435
+ - 39
436
+ modality_keys:
437
+ - left_wrist_eef
438
+ - right_wrist_eef
439
+ - left_hand_joints
440
+ - right_hand_joints
441
+ normalization_mode: null
442
+ action_representation:
443
+ - {}
444
+ - {}
445
+ - {}
446
+ - {}
447
+ exclude_state: false
448
+ action_type:
449
+ - {}
450
+ - {}
451
+ - {}
452
+ - {}
453
+ action_format:
454
+ - {}
455
+ - {}
456
+ - {}
457
+ - {}
458
+ normalize_rotation: true
459
+ wrist_keys:
460
+ - left_wrist_eef
461
+ - right_wrist_eef
462
+ hand_keys:
463
+ - left_hand_joints
464
+ - right_hand_joints
465
+ extra_keys: []
466
+ loss_weights: null
467
+ language:
468
+ delta_indices:
469
+ - 0
470
+ modality_keys:
471
+ - annotation.human.coarse_action
472
+ normalization_mode: null
473
+ action_representation: null
474
+ exclude_state: false
475
+ action_type: null
476
+ action_format: null
477
+ normalize_rotation: true
478
+ wrist_keys: null
479
+ hand_keys: null
480
+ extra_keys: null
481
+ loss_weights: null
482
+ oxe_droid_relative_eef_relative_joint:
483
+ video:
484
+ delta_indices:
485
+ - -15
486
+ - 0
487
+ modality_keys:
488
+ - exterior_image_1_left
489
+ - wrist_image_left
490
+ normalization_mode: null
491
+ action_representation: null
492
+ exclude_state: false
493
+ action_type: null
494
+ action_format: null
495
+ normalize_rotation: true
496
+ wrist_keys: null
497
+ hand_keys: null
498
+ extra_keys: null
499
+ loss_weights: null
500
+ state:
501
+ delta_indices:
502
+ - 0
503
+ modality_keys:
504
+ - eef_9d
505
+ - gripper_position
506
+ - joint_position
507
+ normalization_mode: null
508
+ action_representation: null
509
+ exclude_state: false
510
+ action_type: null
511
+ action_format: null
512
+ normalize_rotation: true
513
+ wrist_keys: null
514
+ hand_keys: null
515
+ extra_keys: null
516
+ loss_weights: null
517
+ action:
518
+ delta_indices:
519
+ - 0
520
+ - 1
521
+ - 2
522
+ - 3
523
+ - 4
524
+ - 5
525
+ - 6
526
+ - 7
527
+ - 8
528
+ - 9
529
+ - 10
530
+ - 11
531
+ - 12
532
+ - 13
533
+ - 14
534
+ - 15
535
+ - 16
536
+ - 17
537
+ - 18
538
+ - 19
539
+ - 20
540
+ - 21
541
+ - 22
542
+ - 23
543
+ - 24
544
+ - 25
545
+ - 26
546
+ - 27
547
+ - 28
548
+ - 29
549
+ - 30
550
+ - 31
551
+ - 32
552
+ - 33
553
+ - 34
554
+ - 35
555
+ - 36
556
+ - 37
557
+ - 38
558
+ - 39
559
+ modality_keys:
560
+ - eef_9d
561
+ - gripper_position
562
+ - joint_position
563
+ normalization_mode: null
564
+ action_representation:
565
+ - {}
566
+ - {}
567
+ - {}
568
+ exclude_state: false
569
+ action_type:
570
+ - {}
571
+ - {}
572
+ - {}
573
+ action_format:
574
+ - {}
575
+ - {}
576
+ - {}
577
+ normalize_rotation: true
578
+ wrist_keys:
579
+ - eef_9d
580
+ hand_keys:
581
+ - gripper_position
582
+ extra_keys:
583
+ - joint_position
584
+ loss_weights: null
585
+ language:
586
+ delta_indices:
587
+ - 0
588
+ modality_keys:
589
+ - annotation.language.language_instruction
590
+ - annotation.language.language_instruction_2
591
+ - annotation.language.language_instruction_3
592
+ normalization_mode: null
593
+ action_representation: null
594
+ exclude_state: false
595
+ action_type: null
596
+ action_format: null
597
+ normalize_rotation: true
598
+ wrist_keys: null
599
+ hand_keys: null
600
+ extra_keys: null
601
+ loss_weights: null
602
+ real_r1_pro_sharpa_relative_eef_human:
603
+ video:
604
+ delta_indices:
605
+ - -20
606
+ - 0
607
+ modality_keys:
608
+ - ego_view_res320x240_freq20
609
+ - left_wrist_view_res320x240_freq20
610
+ - right_wrist_view_res320x240_freq20
611
+ normalization_mode: null
612
+ action_representation: null
613
+ exclude_state: false
614
+ action_type: null
615
+ action_format: null
616
+ normalize_rotation: true
617
+ wrist_keys: null
618
+ hand_keys: null
619
+ extra_keys: null
620
+ loss_weights: null
621
+ state:
622
+ delta_indices:
623
+ - 0
624
+ modality_keys:
625
+ - left_wrist_eef
626
+ - right_wrist_eef
627
+ - left_hand_joints
628
+ - right_hand_joints
629
+ normalization_mode: null
630
+ action_representation: null
631
+ exclude_state: true
632
+ action_type: null
633
+ action_format: null
634
+ normalize_rotation: true
635
+ wrist_keys: null
636
+ hand_keys: null
637
+ extra_keys: null
638
+ loss_weights: null
639
+ action:
640
+ delta_indices:
641
+ - 0
642
+ - 1
643
+ - 2
644
+ - 3
645
+ - 4
646
+ - 5
647
+ - 6
648
+ - 7
649
+ - 8
650
+ - 9
651
+ - 10
652
+ - 11
653
+ - 12
654
+ - 13
655
+ - 14
656
+ - 15
657
+ - 16
658
+ - 17
659
+ - 18
660
+ - 19
661
+ - 20
662
+ - 21
663
+ - 22
664
+ - 23
665
+ - 24
666
+ - 25
667
+ - 26
668
+ - 27
669
+ - 28
670
+ - 29
671
+ - 30
672
+ - 31
673
+ - 32
674
+ - 33
675
+ - 34
676
+ - 35
677
+ - 36
678
+ - 37
679
+ - 38
680
+ - 39
681
+ modality_keys:
682
+ - left_wrist_eef
683
+ - right_wrist_eef
684
+ - left_hand_joints
685
+ - right_hand_joints
686
+ normalization_mode: null
687
+ action_representation:
688
+ - {}
689
+ - {}
690
+ - {}
691
+ - {}
692
+ exclude_state: false
693
+ action_type:
694
+ - {}
695
+ - {}
696
+ - {}
697
+ - {}
698
+ action_format:
699
+ - {}
700
+ - {}
701
+ - {}
702
+ - {}
703
+ normalize_rotation: true
704
+ wrist_keys:
705
+ - left_wrist_eef
706
+ - right_wrist_eef
707
+ hand_keys:
708
+ - left_hand_joints
709
+ - right_hand_joints
710
+ extra_keys: []
711
+ loss_weights: null
712
+ language:
713
+ delta_indices:
714
+ - 0
715
+ modality_keys:
716
+ - annotation.human.coarse_action
717
+ normalization_mode: null
718
+ action_representation: null
719
+ exclude_state: false
720
+ action_type: null
721
+ action_format: null
722
+ normalize_rotation: true
723
+ wrist_keys: null
724
+ hand_keys: null
725
+ extra_keys: null
726
+ loss_weights: null
727
+ xdof_relative_eef_relative_joint:
728
+ video:
729
+ delta_indices:
730
+ - -30
731
+ - 0
732
+ modality_keys:
733
+ - top_camera-images-rgb_320_240
734
+ - left_camera-images-rgb_320_240
735
+ - right_camera-images-rgb_320_240
736
+ normalization_mode: null
737
+ action_representation: null
738
+ exclude_state: false
739
+ action_type: null
740
+ action_format: null
741
+ normalize_rotation: true
742
+ wrist_keys: null
743
+ hand_keys: null
744
+ extra_keys: null
745
+ loss_weights: null
746
+ state:
747
+ delta_indices:
748
+ - 0
749
+ modality_keys:
750
+ - left_wrist_eef
751
+ - right_wrist_eef
752
+ - left_gripper_pos
753
+ - right_gripper_pos
754
+ - left_joint_pos
755
+ - right_joint_pos
756
+ normalization_mode: null
757
+ action_representation: null
758
+ exclude_state: false
759
+ action_type: null
760
+ action_format: null
761
+ normalize_rotation: true
762
+ wrist_keys: null
763
+ hand_keys: null
764
+ extra_keys: null
765
+ loss_weights: null
766
+ action:
767
+ delta_indices:
768
+ - 0
769
+ - 1
770
+ - 2
771
+ - 3
772
+ - 4
773
+ - 5
774
+ - 6
775
+ - 7
776
+ - 8
777
+ - 9
778
+ - 10
779
+ - 11
780
+ - 12
781
+ - 13
782
+ - 14
783
+ - 15
784
+ - 16
785
+ - 17
786
+ - 18
787
+ - 19
788
+ - 20
789
+ - 21
790
+ - 22
791
+ - 23
792
+ - 24
793
+ - 25
794
+ - 26
795
+ - 27
796
+ - 28
797
+ - 29
798
+ - 30
799
+ - 31
800
+ - 32
801
+ - 33
802
+ - 34
803
+ - 35
804
+ - 36
805
+ - 37
806
+ - 38
807
+ - 39
808
+ modality_keys:
809
+ - left_wrist_eef
810
+ - right_wrist_eef
811
+ - left_gripper_pos
812
+ - right_gripper_pos
813
+ - left_joint_pos
814
+ - right_joint_pos
815
+ normalization_mode: null
816
+ action_representation:
817
+ - {}
818
+ - {}
819
+ - {}
820
+ - {}
821
+ - {}
822
+ - {}
823
+ exclude_state: false
824
+ action_type:
825
+ - {}
826
+ - {}
827
+ - {}
828
+ - {}
829
+ - {}
830
+ - {}
831
+ action_format:
832
+ - {}
833
+ - {}
834
+ - {}
835
+ - {}
836
+ - {}
837
+ - {}
838
+ normalize_rotation: true
839
+ wrist_keys:
840
+ - left_wrist_eef
841
+ - right_wrist_eef
842
+ hand_keys:
843
+ - left_gripper_pos
844
+ - right_gripper_pos
845
+ extra_keys:
846
+ - left_joint_pos
847
+ - right_joint_pos
848
+ loss_weights: null
849
+ language:
850
+ delta_indices:
851
+ - 0
852
+ modality_keys:
853
+ - annotation.task
854
+ normalization_mode: null
855
+ action_representation: null
856
+ exclude_state: false
857
+ action_type: null
858
+ action_format: null
859
+ normalize_rotation: true
860
+ wrist_keys: null
861
+ hand_keys: null
862
+ extra_keys: null
863
+ loss_weights: null
864
+ xdof_relative_eef_relative_joint_subtask:
865
+ video:
866
+ delta_indices:
867
+ - -30
868
+ - 0
869
+ modality_keys:
870
+ - top_camera-images-rgb_320_240
871
+ - left_camera-images-rgb_320_240
872
+ - right_camera-images-rgb_320_240
873
+ normalization_mode: null
874
+ action_representation: null
875
+ exclude_state: false
876
+ action_type: null
877
+ action_format: null
878
+ normalize_rotation: true
879
+ wrist_keys: null
880
+ hand_keys: null
881
+ extra_keys: null
882
+ loss_weights: null
883
+ state:
884
+ delta_indices:
885
+ - 0
886
+ modality_keys:
887
+ - left_wrist_eef
888
+ - right_wrist_eef
889
+ - left_gripper_pos
890
+ - right_gripper_pos
891
+ - left_joint_pos
892
+ - right_joint_pos
893
+ normalization_mode: null
894
+ action_representation: null
895
+ exclude_state: false
896
+ action_type: null
897
+ action_format: null
898
+ normalize_rotation: true
899
+ wrist_keys: null
900
+ hand_keys: null
901
+ extra_keys: null
902
+ loss_weights: null
903
+ action:
904
+ delta_indices:
905
+ - 0
906
+ - 1
907
+ - 2
908
+ - 3
909
+ - 4
910
+ - 5
911
+ - 6
912
+ - 7
913
+ - 8
914
+ - 9
915
+ - 10
916
+ - 11
917
+ - 12
918
+ - 13
919
+ - 14
920
+ - 15
921
+ - 16
922
+ - 17
923
+ - 18
924
+ - 19
925
+ - 20
926
+ - 21
927
+ - 22
928
+ - 23
929
+ - 24
930
+ - 25
931
+ - 26
932
+ - 27
933
+ - 28
934
+ - 29
935
+ - 30
936
+ - 31
937
+ - 32
938
+ - 33
939
+ - 34
940
+ - 35
941
+ - 36
942
+ - 37
943
+ - 38
944
+ - 39
945
+ modality_keys:
946
+ - left_wrist_eef
947
+ - right_wrist_eef
948
+ - left_gripper_pos
949
+ - right_gripper_pos
950
+ - left_joint_pos
951
+ - right_joint_pos
952
+ normalization_mode: null
953
+ action_representation:
954
+ - {}
955
+ - {}
956
+ - {}
957
+ - {}
958
+ - {}
959
+ - {}
960
+ exclude_state: false
961
+ action_type:
962
+ - {}
963
+ - {}
964
+ - {}
965
+ - {}
966
+ - {}
967
+ - {}
968
+ action_format:
969
+ - {}
970
+ - {}
971
+ - {}
972
+ - {}
973
+ - {}
974
+ - {}
975
+ normalize_rotation: true
976
+ wrist_keys:
977
+ - left_wrist_eef
978
+ - right_wrist_eef
979
+ hand_keys:
980
+ - left_gripper_pos
981
+ - right_gripper_pos
982
+ extra_keys:
983
+ - left_joint_pos
984
+ - right_joint_pos
985
+ loss_weights: null
986
+ language:
987
+ delta_indices:
988
+ - 0
989
+ modality_keys:
990
+ - annotation.sub_task
991
+ normalization_mode: null
992
+ action_representation: null
993
+ exclude_state: false
994
+ action_type: null
995
+ action_format: null
996
+ normalize_rotation: true
997
+ wrist_keys: null
998
+ hand_keys: null
999
+ extra_keys: null
1000
+ loss_weights: null
1001
+ real_r1_pro_sharpa_relative_eef:
1002
+ video:
1003
+ delta_indices:
1004
+ - -20
1005
+ - 0
1006
+ modality_keys:
1007
+ - ego_view_res320x240_freq20
1008
+ - left_wrist_view_res320x240_freq20
1009
+ - right_wrist_view_res320x240_freq20
1010
+ normalization_mode: null
1011
+ action_representation: null
1012
+ exclude_state: false
1013
+ action_type: null
1014
+ action_format: null
1015
+ normalize_rotation: true
1016
+ wrist_keys: null
1017
+ hand_keys: null
1018
+ extra_keys: null
1019
+ loss_weights: null
1020
+ state:
1021
+ delta_indices:
1022
+ - 0
1023
+ modality_keys:
1024
+ - left_wrist_eef
1025
+ - right_wrist_eef
1026
+ - left_hand_joints
1027
+ - right_hand_joints
1028
+ normalization_mode: null
1029
+ action_representation: null
1030
+ exclude_state: false
1031
+ action_type: null
1032
+ action_format: null
1033
+ normalize_rotation: true
1034
+ wrist_keys: null
1035
+ hand_keys: null
1036
+ extra_keys: null
1037
+ loss_weights: null
1038
+ action:
1039
+ delta_indices:
1040
+ - 0
1041
+ - 1
1042
+ - 2
1043
+ - 3
1044
+ - 4
1045
+ - 5
1046
+ - 6
1047
+ - 7
1048
+ - 8
1049
+ - 9
1050
+ - 10
1051
+ - 11
1052
+ - 12
1053
+ - 13
1054
+ - 14
1055
+ - 15
1056
+ - 16
1057
+ - 17
1058
+ - 18
1059
+ - 19
1060
+ - 20
1061
+ - 21
1062
+ - 22
1063
+ - 23
1064
+ - 24
1065
+ - 25
1066
+ - 26
1067
+ - 27
1068
+ - 28
1069
+ - 29
1070
+ - 30
1071
+ - 31
1072
+ - 32
1073
+ - 33
1074
+ - 34
1075
+ - 35
1076
+ - 36
1077
+ - 37
1078
+ - 38
1079
+ - 39
1080
+ modality_keys:
1081
+ - left_wrist_eef
1082
+ - right_wrist_eef
1083
+ - left_hand_joints
1084
+ - right_hand_joints
1085
+ normalization_mode: null
1086
+ action_representation:
1087
+ - {}
1088
+ - {}
1089
+ - {}
1090
+ - {}
1091
+ exclude_state: false
1092
+ action_type:
1093
+ - {}
1094
+ - {}
1095
+ - {}
1096
+ - {}
1097
+ action_format:
1098
+ - {}
1099
+ - {}
1100
+ - {}
1101
+ - {}
1102
+ normalize_rotation: true
1103
+ wrist_keys:
1104
+ - left_wrist_eef
1105
+ - right_wrist_eef
1106
+ hand_keys:
1107
+ - left_hand_joints
1108
+ - right_hand_joints
1109
+ extra_keys: []
1110
+ loss_weights: null
1111
+ language:
1112
+ delta_indices:
1113
+ - 0
1114
+ modality_keys:
1115
+ - annotation.human.coarse_action
1116
+ normalization_mode: null
1117
+ action_representation: null
1118
+ exclude_state: false
1119
+ action_type: null
1120
+ action_format: null
1121
+ normalize_rotation: true
1122
+ wrist_keys: null
1123
+ hand_keys: null
1124
+ extra_keys: null
1125
+ loss_weights: null
1126
+ real_r1_pro_sharpa_relative_eef_maxinsights:
1127
+ video:
1128
+ delta_indices:
1129
+ - -30
1130
+ - 0
1131
+ modality_keys:
1132
+ - ego_view_cropratio_res320x240_freq30
1133
+ normalization_mode: null
1134
+ action_representation: null
1135
+ exclude_state: false
1136
+ action_type: null
1137
+ action_format: null
1138
+ normalize_rotation: true
1139
+ wrist_keys: null
1140
+ hand_keys: null
1141
+ extra_keys: null
1142
+ loss_weights: null
1143
+ state:
1144
+ delta_indices:
1145
+ - 0
1146
+ modality_keys:
1147
+ - left_wrist_eef
1148
+ - right_wrist_eef
1149
+ - left_hand_joints
1150
+ - right_hand_joints
1151
+ normalization_mode: null
1152
+ action_representation: null
1153
+ exclude_state: true
1154
+ action_type: null
1155
+ action_format: null
1156
+ normalize_rotation: true
1157
+ wrist_keys: null
1158
+ hand_keys: null
1159
+ extra_keys: null
1160
+ loss_weights: null
1161
+ action:
1162
+ delta_indices:
1163
+ - 0
1164
+ - 1
1165
+ - 2
1166
+ - 3
1167
+ - 4
1168
+ - 5
1169
+ - 6
1170
+ - 7
1171
+ - 8
1172
+ - 9
1173
+ - 10
1174
+ - 11
1175
+ - 12
1176
+ - 13
1177
+ - 14
1178
+ - 15
1179
+ - 16
1180
+ - 17
1181
+ - 18
1182
+ - 19
1183
+ - 20
1184
+ - 21
1185
+ - 22
1186
+ - 23
1187
+ - 24
1188
+ - 25
1189
+ - 26
1190
+ - 27
1191
+ - 28
1192
+ - 29
1193
+ - 30
1194
+ - 31
1195
+ - 32
1196
+ - 33
1197
+ - 34
1198
+ - 35
1199
+ - 36
1200
+ - 37
1201
+ - 38
1202
+ - 39
1203
+ modality_keys:
1204
+ - left_wrist_eef
1205
+ - right_wrist_eef
1206
+ - left_hand_joints
1207
+ - right_hand_joints
1208
+ normalization_mode: null
1209
+ action_representation:
1210
+ - {}
1211
+ - {}
1212
+ - {}
1213
+ - {}
1214
+ exclude_state: false
1215
+ action_type:
1216
+ - {}
1217
+ - {}
1218
+ - {}
1219
+ - {}
1220
+ action_format:
1221
+ - {}
1222
+ - {}
1223
+ - {}
1224
+ - {}
1225
+ normalize_rotation: true
1226
+ wrist_keys:
1227
+ - left_wrist_eef
1228
+ - right_wrist_eef
1229
+ hand_keys:
1230
+ - left_hand_joints
1231
+ - right_hand_joints
1232
+ extra_keys: []
1233
+ loss_weights: null
1234
+ language:
1235
+ delta_indices:
1236
+ - 0
1237
+ modality_keys:
1238
+ - annotation.human.coarse_action
1239
+ normalization_mode: null
1240
+ action_representation: null
1241
+ exclude_state: false
1242
+ action_type: null
1243
+ action_format: null
1244
+ normalize_rotation: true
1245
+ wrist_keys: null
1246
+ hand_keys: null
1247
+ extra_keys: null
1248
+ loss_weights: null
1249
+ download_cache: false
1250
+ shard_size: 1024
1251
+ episode_sampling_rate: 0.1
1252
+ num_shards_per_epoch: 100000
1253
+ override_pretraining_statistics: false
1254
+ mode: single_turn
1255
+ random_chop: 0.0
1256
+ mock_dataset_mode: false
1257
+ num_prompt_trajectories: 2
1258
+ variable_num_demos: false
1259
+ max_prompt_trajectories: 5
1260
+ shuffle: true
1261
+ seed: 24
1262
+ subsample_ratio: 1.0
1263
+ image_crop_size:
1264
+ - 244
1265
+ - 244
1266
+ image_target_size:
1267
+ - 224
1268
+ - 224
1269
+ video_backend: torchcodec
1270
+ training:
1271
+ output_dir: nvidia/Cosmos-Reason2-2B
1272
+ experiment_name: null
1273
+ max_steps: 200000
1274
+ global_batch_size: 1024
1275
+ batch_size: 32
1276
+ gradient_accumulation_steps: 1
1277
+ use_muon: false
1278
+ muon_lr: 0.005
1279
+ use_legacy_wd_application: false
1280
+ learning_rate: 5.0e-05
1281
+ lr_scheduler_type: cosine
1282
+ weight_decay: 1.0e-05
1283
+ warmup_ratio: 0.05
1284
+ warmup_steps: 0
1285
+ max_grad_norm: 1.0
1286
+ wsd_stable_ratio: 0.8
1287
+ wsd_decay_type: cosine
1288
+ optim: adamw_torch_fused
1289
+ start_from_checkpoint: null
1290
+ tf32: true
1291
+ fp16: false
1292
+ bf16: true
1293
+ eval_bf16: true
1294
+ logging_steps: 10
1295
+ save_steps: 1000
1296
+ save_total_limit: 5
1297
+ save_vl_model: false
1298
+ upload_checkpoints: true
1299
+ upload_every: 1000
1300
+ upload_last_n_checkpoints: 5
1301
+ max_concurrent_uploads: 2
1302
+ eval_strategy: 'no'
1303
+ eval_steps: 500
1304
+ eval_set_split_ratio: 0.1
1305
+ eval_batch_size: 2
1306
+ save_best_eval_metric_name: ''
1307
+ save_best_eval_metric_greater_is_better: true
1308
+ deepspeed_stage: 2
1309
+ gradient_checkpointing: false
1310
+ use_ddp: false
1311
+ num_gpus: 256
1312
+ dataloader_num_workers: 4
1313
+ remove_unused_columns: false
1314
+ use_wandb: true
1315
+ wandb_project: human_pretraining_n15_galaxea_sharpa
1316
+ enable_profiling: false
1317
+ max_retries: 3
1318
+ skip_spike: true
1319
+ skip_spike_threshold: 5.0
1320
+ skip_spike_ema_alpha: 0.99
1321
+ skip_spike_max_consecutive: 10
1322
+ assert_loss_less_than: null
1323
+ max_steps: 200000
1324
+ save_steps: 1000
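The eight `mix_ratio` values in the `data.datasets` list above (0.1, 0.2, 0.1, 0.05, 0.05, 0.25, 0.2, 0.05) add up to 1.0. The config does not say how the loader consumes them, so the following is only a minimal sketch, assuming they act as proportional sampling weights over embodiment tags. The helper names `normalized_weights` and `sample_tags` are hypothetical and not part of the GR00T codebase; the seed value is borrowed from `seed: 24` purely for illustration.

```python
import math
import random

# Mix ratios copied from the data.datasets entries, keyed by embodiment_tag.
MIX_RATIOS = {
    "xdof_relative_eef_relative_joint": 0.1,
    "xdof_relative_eef_relative_joint_subtask": 0.2,
    "oxe_droid_relative_eef_relative_joint": 0.1,
    "real_g1_relative_eef_relative_joints": 0.05,
    "real_r1_pro_sharpa_relative_eef": 0.05,
    "real_r1_pro_sharpa_relative_eef_mecka": 0.25,
    "real_r1_pro_sharpa_relative_eef_maxinsights": 0.2,
    "real_r1_pro_sharpa_relative_eef_human": 0.05,
}


def normalized_weights(ratios):
    """Rescale ratios so they sum to 1 (hypothetical helper)."""
    total = sum(ratios.values())
    if total <= 0:
        raise ValueError("mix ratios must sum to a positive value")
    return {tag: ratio / total for tag, ratio in ratios.items()}


def sample_tags(ratios, n, seed=24):
    """Draw n embodiment tags with probability proportional to mix_ratio.

    ASSUMPTION: proportional sampling is one plausible reading of mix_ratio;
    the real data loader may weight datasets differently.
    """
    rng = random.Random(seed)
    weights = normalized_weights(ratios)
    tags = list(weights)
    return rng.choices(tags, weights=[weights[t] for t in tags], k=n)


if __name__ == "__main__":
    print("ratios sum to 1:", math.isclose(sum(MIX_RATIOS.values()), 1.0))
    print(sample_tags(MIX_RATIOS, 5))
```

Under this reading, the two `xdof` tags together account for 30% of draws and the `mecka` data for 25%, while the human-video tag is sampled about 5% of the time.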
experiment_cfg/config.yaml ADDED
@@ -0,0 +1,1341 @@
1
+ !!python/object:groot.vla.omni.configs.base_config.Config
2
+ data: !!python/object:groot.vla.omni.configs.data.data_config.DataConfig
3
+ datasets:
4
+ - !!python/object:groot.vla.omni.configs.data.data_config.SingleDatasetConfig
5
+ dataset_paths:
6
+ - /mnt/aws-lfs-02/shared/datasets/xdof.yam_v7_all_merged_global_task_exclude_bad_subtasks
7
+ dataset_type: physical_embodiment
8
+ embodiment_tag: xdof_relative_eef_relative_joint
9
+ mix_ratio: 0.1
10
+ - !!python/object:groot.vla.omni.configs.data.data_config.SingleDatasetConfig
11
+ dataset_paths:
12
+ - /mnt/aws-lfs-02/shared/datasets/xdof.yam_v7_subtask_only_merged_global_task
13
+ dataset_type: physical_embodiment
14
+ embodiment_tag: xdof_relative_eef_relative_joint_subtask
15
+ mix_ratio: 0.2
16
+ - !!python/object:groot.vla.omni.configs.data.data_config.SingleDatasetConfig
17
+ dataset_paths:
18
+ - /mnt/aws-lfs-02/shared/datasets/droid_101_success_idlefiltered_n17
19
+ - /mnt/aws-lfs-02/shared/datasets/droid_101_success_idlefiltered_n17_swapped
20
+ dataset_type: physical_embodiment
21
+ embodiment_tag: oxe_droid_relative_eef_relative_joint
22
+ mix_ratio: 0.1
23
+ - !!python/object:groot.vla.omni.configs.data.data_config.SingleDatasetConfig
24
+ dataset_paths:
25
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_g1.g1-in-the-wild-merged
26
+ dataset_type: physical_embodiment
27
+ embodiment_tag: real_g1_relative_eef_relative_joints
28
+ mix_ratio: 0.05
29
+ - !!python/object:groot.vla.omni.configs.data.data_config.SingleDatasetConfig
30
+ dataset_paths:
31
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.inlab_play_real_robot_batch_1
32
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.inlab_play_real_robot_batch_2
33
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.miscellaneous_1k_trajectories
34
+ dataset_type: physical_embodiment
35
+ embodiment_tag: real_r1_pro_sharpa_relative_eef
36
+ mix_ratio: 0.05
37
+ - !!python/object:groot.vla.omni.configs.data.data_config.SingleDatasetConfig
38
+ dataset_paths:
39
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch1-2025-12-10-merged
40
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch3_2026-01-04-merged_backup
41
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch4_2026-01-05-merged_backup
42
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch5_2026-01-05-merged_backup
43
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch6_2026-01-05-merged_backup
44
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch10_2026-01-10-merged_backup
45
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch11_2026-01-10-merged_backup
46
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch12_2026-01-10-merged_backup
47
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch8_2026-01-10-merged_backup
48
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/mecka_lerobot/real_r1_pro_sharpa.mecka_batch9_2026-01-10-merged_backup
49
+ dataset_type: physical_embodiment
50
+ embodiment_tag: real_r1_pro_sharpa_relative_eef_mecka
51
+ mix_ratio: 0.25
52
+ - !!python/object:groot.vla.omni.configs.data.data_config.SingleDatasetConfig
53
+ dataset_paths:
54
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/maxinsights_lerobot_updated/1530hrs/real_r1_pro_sharpa.maxinsights_1530hrs_updated_train_set_merged
55
+ dataset_type: physical_embodiment
56
+ embodiment_tag: real_r1_pro_sharpa_relative_eef_maxinsights
57
+ mix_ratio: 0.2
58
+ - !!python/object:groot.vla.omni.configs.data.data_config.SingleDatasetConfig
59
+ dataset_paths:
60
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.inlab_play_human_batch1
61
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.inlab_play_human_batch2
62
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.shirt_rolling_task24_2000_human_video_filter_n6_keep1619_demo_stats
63
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.shirt_rolling_task15_2000_human_video_filter_n6_keep572_demo_stats
64
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.sort_cards_human_filter_n6_keep523_demo_stats_overwrite_left_side_stats
65
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.tong_task38_2000_human_video_overwrite_left_side_stats
66
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.syringe_task30i_2000_human_video_filtered
67
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.unscrew_bottle_task43_2000_human_video_fixed-duration
68
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.unscrew_Jim_bottle_task47_600_human_video
69
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.fold_shirt_task30b_500_human_video_halfdone
70
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.fold_towel_task30c_500_human_video_halfdone
71
+ - /mnt/aws-lfs-02/shared/datasets/galaxea_sharpa/real_r1_pro_sharpa.sort_cards_task32e_1000_human_video
72
+ dataset_type: physical_embodiment
73
+ embodiment_tag: real_r1_pro_sharpa_relative_eef_human
74
+ mix_ratio: 0.05
75
+ download_cache: false
76
+ episode_sampling_rate: 0.1
77
+ image_crop_size:
78
+ - 244
79
+ - 244
80
+ image_target_size:
81
+ - 224
82
+ - 224
83
+ max_prompt_trajectories: 5
84
+ mock_dataset_mode: false
85
+ modality_configs:
86
+ oxe_droid_relative_eef_relative_joint:
87
+ action: !!python/object:groot.vla.omni.data.types.ModalityConfig
88
+ action_format:
89
+ - &id004 !!python/object/apply:groot.vla.omni.data.types.ActionFormat
90
+ - xyz+rot6d
91
+ - &id001 !!python/object/apply:groot.vla.omni.data.types.ActionFormat
92
+ - default
93
+ - *id001
94
+ action_representation:
95
+ - &id002 !!python/object/apply:groot.vla.omni.data.types.ActionRepresentation
96
+ - relative
97
+ - &id005 !!python/object/apply:groot.vla.omni.data.types.ActionRepresentation
98
+ - absolute
99
+ - *id002
100
+ action_type:
101
+ - &id006 !!python/object/apply:groot.vla.omni.data.types.ActionType
102
+ - eef
103
+ - &id003 !!python/object/apply:groot.vla.omni.data.types.ActionType
104
+ - non_eef
105
+ - *id003
106
+ delta_indices:
107
+ - 0
108
+ - 1
109
+ - 2
110
+ - 3
111
+ - 4
112
+ - 5
113
+ - 6
114
+ - 7
115
+ - 8
116
+ - 9
117
+ - 10
118
+ - 11
119
+ - 12
120
+ - 13
121
+ - 14
122
+ - 15
123
+ - 16
124
+ - 17
125
+ - 18
126
+ - 19
127
+ - 20
128
+ - 21
129
+ - 22
130
+ - 23
131
+ - 24
132
+ - 25
133
+ - 26
134
+ - 27
135
+ - 28
136
+ - 29
137
+ - 30
138
+ - 31
139
+ - 32
140
+ - 33
141
+ - 34
142
+ - 35
143
+ - 36
144
+ - 37
145
+ - 38
146
+ - 39
147
+ exclude_state: false
148
+ extra_keys:
149
+ - joint_position
150
+ hand_keys:
151
+ - gripper_position
152
+ loss_weights: null
153
+ modality_keys:
154
+ - eef_9d
155
+ - gripper_position
156
+ - joint_position
157
+ normalization_mode: null
158
+ normalize_rotation: true
159
+ wrist_keys:
160
+ - eef_9d
161
+ language: !!python/object:groot.vla.omni.data.types.ModalityConfig
162
+ action_format: null
163
+ action_representation: null
164
+ action_type: null
165
+ delta_indices:
166
+ - 0
167
+ exclude_state: false
168
+ extra_keys: null
169
+ hand_keys: null
170
+ loss_weights: null
171
+ modality_keys:
172
+ - annotation.language.language_instruction
173
+ - annotation.language.language_instruction_2
174
+ - annotation.language.language_instruction_3
175
+ normalization_mode: null
176
+ normalize_rotation: true
177
+ wrist_keys: null
178
+ state: !!python/object:groot.vla.omni.data.types.ModalityConfig
179
+ action_format: null
180
+ action_representation: null
181
+ action_type: null
182
+ delta_indices:
183
+ - 0
184
+ exclude_state: false
185
+ extra_keys: null
186
+ hand_keys: null
187
+ loss_weights: null
188
+ modality_keys:
189
+ - eef_9d
190
+ - gripper_position
191
+ - joint_position
192
+ normalization_mode: null
193
+ normalize_rotation: true
194
+ wrist_keys: null
195
+ video: !!python/object:groot.vla.omni.data.types.ModalityConfig
196
+ action_format: null
197
+ action_representation: null
198
+ action_type: null
199
+ delta_indices:
200
+ - -15
201
+ - 0
202
+ exclude_state: false
203
+ extra_keys: null
204
+ hand_keys: null
205
+ loss_weights: null
206
+ modality_keys:
207
+ - exterior_image_1_left
208
+ - wrist_image_left
209
+ normalization_mode: null
210
+ normalize_rotation: true
211
+ wrist_keys: null
212
+ real_g1_relative_eef_relative_joints:
213
+ action: !!python/object:groot.vla.omni.data.types.ModalityConfig
214
+ action_format:
215
+ - *id004
216
+ - *id004
217
+ - *id001
218
+ - *id001
219
+ - *id001
220
+ - *id001
221
+ - *id001
222
+ - *id001
223
+ - *id001
224
+ action_representation:
225
+ - *id002
226
+ - *id002
227
+ - *id005
228
+ - *id005
229
+ - *id002
230
+ - *id002
231
+ - *id005
232
+ - *id005
233
+ - *id005
234
+ action_type:
235
+ - *id006
236
+ - *id006
237
+ - *id003
238
+ - *id003
239
+ - *id003
240
+ - *id003
241
+ - *id003
242
+ - *id003
243
+ - *id003
244
+ delta_indices:
245
+ - 0
246
+ - 1
247
+ - 2
248
+ - 3
249
+ - 4
250
+ - 5
251
+ - 6
252
+ - 7
253
+ - 8
254
+ - 9
255
+ - 10
256
+ - 11
257
+ - 12
258
+ - 13
259
+ - 14
260
+ - 15
261
+ - 16
262
+ - 17
263
+ - 18
264
+ - 19
265
+ - 20
266
+ - 21
267
+ - 22
268
+ - 23
269
+ - 24
270
+ - 25
271
+ - 26
272
+ - 27
273
+ - 28
274
+ - 29
275
+ - 30
276
+ - 31
277
+ - 32
278
+ - 33
279
+ - 34
280
+ - 35
281
+ - 36
282
+ - 37
283
+ - 38
284
+ - 39
285
+ exclude_state: false
286
+ extra_keys:
287
+ - left_arm
288
+ - right_arm
289
+ - waist
290
+ - base_height_command
291
+ - navigate_command
292
+ hand_keys:
293
+ - left_hand
294
+ - right_hand
295
+ loss_weights: null
296
+ modality_keys:
297
+ - left_wrist_eef_9d
298
+ - right_wrist_eef_9d
299
+ - left_hand
300
+ - right_hand
301
+ - left_arm
302
+ - right_arm
303
+ - waist
304
+ - base_height_command
305
+ - navigate_command
306
+ normalization_mode: null
307
+ normalize_rotation: true
308
+ wrist_keys:
309
+ - left_wrist_eef_9d
310
+ - right_wrist_eef_9d
311
+ language: !!python/object:groot.vla.omni.data.types.ModalityConfig
312
+ action_format: null
313
+ action_representation: null
314
+ action_type: null
315
+ delta_indices:
316
+ - 0
317
+ exclude_state: false
318
+ extra_keys: null
319
+ hand_keys: null
320
+ loss_weights: null
321
+ modality_keys:
322
+ - annotation.human.task_description
323
+ normalization_mode: null
324
+ normalize_rotation: true
325
+ wrist_keys: null
326
+ state: !!python/object:groot.vla.omni.data.types.ModalityConfig
327
+ action_format: null
328
+ action_representation: null
329
+ action_type: null
330
+ delta_indices:
331
+ - 0
332
+ exclude_state: false
333
+ extra_keys: null
334
+ hand_keys: null
335
+ loss_weights: null
336
+ modality_keys:
337
+ - left_wrist_eef_9d
338
+ - right_wrist_eef_9d
339
+ - left_hand
340
+ - right_hand
341
+ - left_arm
342
+ - right_arm
343
+ - waist
344
+ normalization_mode: null
345
+ normalize_rotation: true
346
+ wrist_keys: null
347
+ video: !!python/object:groot.vla.omni.data.types.ModalityConfig
348
+ action_format: null
349
+ action_representation: null
350
+ action_type: null
351
+ delta_indices:
352
+ - -20
353
+ - 0
354
+ exclude_state: false
355
+ extra_keys: null
356
+ hand_keys: null
357
+ loss_weights: null
358
+ modality_keys:
359
+ - ego_view
360
+ normalization_mode: null
361
+ normalize_rotation: true
362
+ wrist_keys: null
363
+ real_r1_pro_sharpa_relative_eef:
364
+ action: !!python/object:groot.vla.omni.data.types.ModalityConfig
365
+ action_format:
366
+ - *id004
367
+ - *id004
368
+ - *id001
369
+ - *id001
370
+ action_representation:
371
+ - *id002
372
+ - *id002
373
+ - *id005
374
+ - *id005
375
+ action_type:
376
+ - *id006
377
+ - *id006
378
+ - *id003
379
+ - *id003
380
+ delta_indices:
381
+ - 0
382
+ - 1
383
+ - 2
384
+ - 3
385
+ - 4
386
+ - 5
387
+ - 6
388
+ - 7
389
+ - 8
390
+ - 9
391
+ - 10
392
+ - 11
393
+ - 12
394
+ - 13
395
+ - 14
396
+ - 15
397
+ - 16
398
+ - 17
399
+ - 18
400
+ - 19
401
+ - 20
402
+ - 21
403
+ - 22
404
+ - 23
405
+ - 24
406
+ - 25
407
+ - 26
408
+ - 27
409
+ - 28
410
+ - 29
411
+ - 30
412
+ - 31
413
+ - 32
414
+ - 33
415
+ - 34
416
+ - 35
417
+ - 36
418
+ - 37
419
+ - 38
420
+ - 39
421
+ exclude_state: false
422
+ extra_keys: []
423
+ hand_keys:
424
+ - left_hand_joints
425
+ - right_hand_joints
426
+ loss_weights: null
427
+ modality_keys:
428
+ - left_wrist_eef
429
+ - right_wrist_eef
430
+ - left_hand_joints
431
+ - right_hand_joints
432
+ normalization_mode: null
433
+ normalize_rotation: true
434
+ wrist_keys:
435
+ - left_wrist_eef
436
+ - right_wrist_eef
437
+ language: !!python/object:groot.vla.omni.data.types.ModalityConfig
438
+ action_format: null
439
+ action_representation: null
440
+ action_type: null
441
+ delta_indices:
442
+ - 0
443
+ exclude_state: false
444
+ extra_keys: null
445
+ hand_keys: null
446
+ loss_weights: null
447
+ modality_keys:
448
+ - annotation.human.coarse_action
449
+ normalization_mode: null
450
+ normalize_rotation: true
451
+ wrist_keys: null
452
+ state: !!python/object:groot.vla.omni.data.types.ModalityConfig
453
+ action_format: null
454
+ action_representation: null
455
+ action_type: null
456
+ delta_indices:
457
+ - 0
458
+ exclude_state: false
459
+ extra_keys: null
460
+ hand_keys: null
461
+ loss_weights: null
462
+ modality_keys:
463
+ - left_wrist_eef
464
+ - right_wrist_eef
465
+ - left_hand_joints
466
+ - right_hand_joints
467
+ normalization_mode: null
468
+ normalize_rotation: true
469
+ wrist_keys: null
470
+ video: !!python/object:groot.vla.omni.data.types.ModalityConfig
471
+ action_format: null
472
+ action_representation: null
473
+ action_type: null
474
+ delta_indices:
475
+ - -20
476
+ - 0
477
+ exclude_state: false
478
+ extra_keys: null
479
+ hand_keys: null
480
+ loss_weights: null
481
+ modality_keys:
482
+ - ego_view_res320x240_freq20
483
+ - left_wrist_view_res320x240_freq20
484
+ - right_wrist_view_res320x240_freq20
485
+ normalization_mode: null
486
+ normalize_rotation: true
487
+ wrist_keys: null
488
+ real_r1_pro_sharpa_relative_eef_human:
489
+ action: !!python/object:groot.vla.omni.data.types.ModalityConfig
490
+ action_format:
491
+ - *id004
492
+ - *id004
493
+ - *id001
494
+ - *id001
495
+ action_representation:
496
+ - *id002
497
+ - *id002
498
+ - *id005
499
+ - *id005
500
+ action_type:
501
+ - *id006
502
+ - *id006
503
+ - *id003
504
+ - *id003
505
+ delta_indices:
506
+ - 0
507
+ - 1
508
+ - 2
509
+ - 3
510
+ - 4
511
+ - 5
512
+ - 6
513
+ - 7
514
+ - 8
515
+ - 9
516
+ - 10
517
+ - 11
518
+ - 12
519
+ - 13
520
+ - 14
521
+ - 15
522
+ - 16
523
+ - 17
524
+ - 18
525
+ - 19
526
+ - 20
527
+ - 21
528
+ - 22
529
+ - 23
530
+ - 24
531
+ - 25
532
+ - 26
533
+ - 27
534
+ - 28
535
+ - 29
536
+ - 30
537
+ - 31
538
+ - 32
539
+ - 33
540
+ - 34
541
+ - 35
542
+ - 36
543
+ - 37
544
+ - 38
545
+ - 39
546
+ exclude_state: false
547
+ extra_keys: []
548
+ hand_keys:
549
+ - left_hand_joints
550
+ - right_hand_joints
551
+ loss_weights: null
552
+ modality_keys:
553
+ - left_wrist_eef
554
+ - right_wrist_eef
555
+ - left_hand_joints
556
+ - right_hand_joints
557
+ normalization_mode: null
558
+ normalize_rotation: true
559
+ wrist_keys:
560
+ - left_wrist_eef
561
+ - right_wrist_eef
562
+ language: !!python/object:groot.vla.omni.data.types.ModalityConfig
563
+ action_format: null
564
+ action_representation: null
565
+ action_type: null
566
+ delta_indices:
567
+ - 0
568
+ exclude_state: false
569
+ extra_keys: null
570
+ hand_keys: null
571
+ loss_weights: null
572
+ modality_keys:
573
+ - annotation.human.coarse_action
574
+ normalization_mode: null
575
+ normalize_rotation: true
576
+ wrist_keys: null
577
+ state: !!python/object:groot.vla.omni.data.types.ModalityConfig
578
+ action_format: null
579
+ action_representation: null
580
+ action_type: null
581
+ delta_indices:
582
+ - 0
583
+ exclude_state: true
584
+ extra_keys: null
585
+ hand_keys: null
586
+ loss_weights: null
587
+ modality_keys:
588
+ - left_wrist_eef
589
+ - right_wrist_eef
590
+ - left_hand_joints
591
+ - right_hand_joints
592
+ normalization_mode: null
593
+ normalize_rotation: true
594
+ wrist_keys: null
595
+ video: !!python/object:groot.vla.omni.data.types.ModalityConfig
596
+ action_format: null
597
+ action_representation: null
598
+ action_type: null
599
+ delta_indices:
600
+ - -20
601
+ - 0
602
+ exclude_state: false
603
+ extra_keys: null
604
+ hand_keys: null
605
+ loss_weights: null
606
+ modality_keys:
607
+ - ego_view_res320x240_freq20
608
+ - left_wrist_view_res320x240_freq20
609
+ - right_wrist_view_res320x240_freq20
610
+ normalization_mode: null
611
+ normalize_rotation: true
612
+ wrist_keys: null
613
+ real_r1_pro_sharpa_relative_eef_maxinsights:
614
+ action: !!python/object:groot.vla.omni.data.types.ModalityConfig
615
+ action_format:
616
+ - *id004
617
+ - *id004
618
+ - *id001
619
+ - *id001
620
+ action_representation:
621
+ - *id002
622
+ - *id002
623
+ - *id005
624
+ - *id005
625
+ action_type:
626
+ - *id006
627
+ - *id006
628
+ - *id003
629
+ - *id003
630
+ delta_indices:
631
+ - 0
632
+ - 1
633
+ - 2
634
+ - 3
635
+ - 4
636
+ - 5
637
+ - 6
638
+ - 7
639
+ - 8
640
+ - 9
641
+ - 10
642
+ - 11
643
+ - 12
644
+ - 13
645
+ - 14
646
+ - 15
647
+ - 16
648
+ - 17
649
+ - 18
650
+ - 19
651
+ - 20
652
+ - 21
653
+ - 22
654
+ - 23
655
+ - 24
656
+ - 25
657
+ - 26
658
+ - 27
659
+ - 28
660
+ - 29
661
+ - 30
662
+ - 31
663
+ - 32
664
+ - 33
665
+ - 34
666
+ - 35
667
+ - 36
668
+ - 37
669
+ - 38
670
+ - 39
671
+ exclude_state: false
672
+ extra_keys: []
673
+ hand_keys:
674
+ - left_hand_joints
675
+ - right_hand_joints
676
+ loss_weights: null
677
+ modality_keys:
678
+ - left_wrist_eef
679
+ - right_wrist_eef
680
+ - left_hand_joints
681
+ - right_hand_joints
682
+ normalization_mode: null
683
+ normalize_rotation: true
684
+ wrist_keys:
685
+ - left_wrist_eef
686
+ - right_wrist_eef
687
+ language: !!python/object:groot.vla.omni.data.types.ModalityConfig
688
+ action_format: null
689
+ action_representation: null
690
+ action_type: null
691
+ delta_indices:
692
+ - 0
693
+ exclude_state: false
694
+ extra_keys: null
695
+ hand_keys: null
696
+ loss_weights: null
697
+ modality_keys:
698
+ - annotation.human.coarse_action
699
+ normalization_mode: null
700
+ normalize_rotation: true
701
+ wrist_keys: null
702
+ state: !!python/object:groot.vla.omni.data.types.ModalityConfig
703
+ action_format: null
704
+ action_representation: null
705
+ action_type: null
706
+ delta_indices:
707
+ - 0
708
+ exclude_state: true
709
+ extra_keys: null
710
+ hand_keys: null
711
+ loss_weights: null
712
+ modality_keys:
713
+ - left_wrist_eef
714
+ - right_wrist_eef
715
+ - left_hand_joints
716
+ - right_hand_joints
717
+ normalization_mode: null
718
+ normalize_rotation: true
719
+ wrist_keys: null
720
+ video: !!python/object:groot.vla.omni.data.types.ModalityConfig
721
+ action_format: null
722
+ action_representation: null
723
+ action_type: null
724
+ delta_indices:
725
+ - -30
726
+ - 0
727
+ exclude_state: false
728
+ extra_keys: null
729
+ hand_keys: null
730
+ loss_weights: null
731
+ modality_keys:
732
+ - ego_view_cropratio_res320x240_freq30
733
+ normalization_mode: null
734
+ normalize_rotation: true
735
+ wrist_keys: null
736
+ real_r1_pro_sharpa_relative_eef_mecka:
737
+ action: !!python/object:groot.vla.omni.data.types.ModalityConfig
738
+ action_format:
739
+ - *id004
740
+ - *id004
741
+ - *id001
742
+ - *id001
743
+ action_representation:
744
+ - *id002
745
+ - *id002
746
+ - *id005
747
+ - *id005
748
+ action_type:
749
+ - *id006
750
+ - *id006
751
+ - *id003
752
+ - *id003
753
+ delta_indices:
754
+ - 0
755
+ - 1
756
+ - 2
757
+ - 3
758
+ - 4
759
+ - 5
760
+ - 6
761
+ - 7
762
+ - 8
763
+ - 9
764
+ - 10
765
+ - 11
766
+ - 12
767
+ - 13
768
+ - 14
769
+ - 15
770
+ - 16
771
+ - 17
772
+ - 18
773
+ - 19
774
+ - 20
775
+ - 21
776
+ - 22
777
+ - 23
778
+ - 24
779
+ - 25
780
+ - 26
781
+ - 27
782
+ - 28
783
+ - 29
784
+ - 30
785
+ - 31
786
+ - 32
787
+ - 33
788
+ - 34
789
+ - 35
790
+ - 36
791
+ - 37
792
+ - 38
793
+ - 39
794
+ exclude_state: false
795
+ extra_keys: []
796
+ hand_keys:
797
+ - left_hand_joints
798
+ - right_hand_joints
799
+ loss_weights: null
800
+ modality_keys:
801
+ - left_wrist_eef
802
+ - right_wrist_eef
803
+ - left_hand_joints
804
+ - right_hand_joints
805
+ normalization_mode: null
806
+ normalize_rotation: true
807
+ wrist_keys:
808
+ - left_wrist_eef
809
+ - right_wrist_eef
810
+ language: !!python/object:groot.vla.omni.data.types.ModalityConfig
811
+ action_format: null
812
+ action_representation: null
813
+ action_type: null
814
+ delta_indices:
815
+ - 0
816
+ exclude_state: false
817
+ extra_keys: null
818
+ hand_keys: null
819
+ loss_weights: null
820
+ modality_keys:
821
+ - annotation.human.coarse_action
822
+ normalization_mode: null
823
+ normalize_rotation: true
824
+ wrist_keys: null
825
+ state: !!python/object:groot.vla.omni.data.types.ModalityConfig
826
+ action_format: null
827
+ action_representation: null
828
+ action_type: null
829
+ delta_indices:
830
+ - 0
831
+ exclude_state: true
832
+ extra_keys: null
833
+ hand_keys: null
834
+ loss_weights: null
835
+ modality_keys:
836
+ - left_wrist_eef
837
+ - right_wrist_eef
838
+ - left_hand_joints
839
+ - right_hand_joints
840
+ normalization_mode: null
841
+ normalize_rotation: true
842
+ wrist_keys: null
843
+ video: !!python/object:groot.vla.omni.data.types.ModalityConfig
844
+ action_format: null
845
+ action_representation: null
846
+ action_type: null
847
+ delta_indices:
848
+ - -30
849
+ - 0
850
+ exclude_state: false
851
+ extra_keys: null
852
+ hand_keys: null
853
+ loss_weights: null
854
+ modality_keys:
855
+ - ego_view_cropratio_res320x240_freq30
856
+ normalization_mode: null
857
+ normalize_rotation: true
858
+ wrist_keys: null
859
+ xdof_relative_eef_relative_joint:
860
+ action: !!python/object:groot.vla.omni.data.types.ModalityConfig
861
+ action_format:
862
+ - *id004
863
+ - *id004
864
+ - *id001
865
+ - *id001
866
+ - *id001
867
+ - *id001
868
+ action_representation:
869
+ - *id002
870
+ - *id002
871
+ - *id005
872
+ - *id005
873
+ - *id002
874
+ - *id002
875
+ action_type:
876
+ - *id006
877
+ - *id006
878
+ - *id003
879
+ - *id003
880
+ - *id003
881
+ - *id003
882
+ delta_indices:
883
+ - 0
884
+ - 1
885
+ - 2
886
+ - 3
887
+ - 4
888
+ - 5
889
+ - 6
890
+ - 7
891
+ - 8
892
+ - 9
893
+ - 10
894
+ - 11
895
+ - 12
896
+ - 13
897
+ - 14
898
+ - 15
899
+ - 16
900
+ - 17
901
+ - 18
902
+ - 19
903
+ - 20
904
+ - 21
905
+ - 22
906
+ - 23
907
+ - 24
908
+ - 25
909
+ - 26
910
+ - 27
911
+ - 28
912
+ - 29
913
+ - 30
914
+ - 31
915
+ - 32
916
+ - 33
917
+ - 34
918
+ - 35
919
+ - 36
920
+ - 37
921
+ - 38
922
+ - 39
923
+ exclude_state: false
924
+ extra_keys:
925
+ - left_joint_pos
926
+ - right_joint_pos
927
+ hand_keys:
928
+ - left_gripper_pos
929
+ - right_gripper_pos
930
+ loss_weights: null
931
+ modality_keys:
932
+ - left_wrist_eef
933
+ - right_wrist_eef
934
+ - left_gripper_pos
935
+ - right_gripper_pos
936
+ - left_joint_pos
937
+ - right_joint_pos
938
+ normalization_mode: null
939
+ normalize_rotation: true
940
+ wrist_keys:
941
+ - left_wrist_eef
942
+ - right_wrist_eef
943
+ language: !!python/object:groot.vla.omni.data.types.ModalityConfig
944
+ action_format: null
945
+ action_representation: null
946
+ action_type: null
947
+ delta_indices:
948
+ - 0
949
+ exclude_state: false
950
+ extra_keys: null
951
+ hand_keys: null
952
+ loss_weights: null
953
+ modality_keys:
954
+ - annotation.task
955
+ normalization_mode: null
956
+ normalize_rotation: true
957
+ wrist_keys: null
958
+ state: !!python/object:groot.vla.omni.data.types.ModalityConfig
959
+ action_format: null
960
+ action_representation: null
961
+ action_type: null
962
+ delta_indices:
963
+ - 0
964
+ exclude_state: false
965
+ extra_keys: null
966
+ hand_keys: null
967
+ loss_weights: null
968
+ modality_keys:
969
+ - left_wrist_eef
970
+ - right_wrist_eef
971
+ - left_gripper_pos
972
+ - right_gripper_pos
973
+ - left_joint_pos
974
+ - right_joint_pos
975
+ normalization_mode: null
976
+ normalize_rotation: true
977
+ wrist_keys: null
978
+ video: !!python/object:groot.vla.omni.data.types.ModalityConfig
979
+ action_format: null
980
+ action_representation: null
981
+ action_type: null
982
+ delta_indices:
983
+ - -30
984
+ - 0
985
+ exclude_state: false
986
+ extra_keys: null
987
+ hand_keys: null
988
+ loss_weights: null
989
+ modality_keys:
990
+ - top_camera-images-rgb_320_240
991
+ - left_camera-images-rgb_320_240
992
+ - right_camera-images-rgb_320_240
993
+ normalization_mode: null
994
+ normalize_rotation: true
995
+ wrist_keys: null
996
+ xdof_relative_eef_relative_joint_subtask:
997
+ action: !!python/object:groot.vla.omni.data.types.ModalityConfig
998
+ action_format:
999
+ - *id004
1000
+ - *id004
1001
+ - *id001
1002
+ - *id001
1003
+ - *id001
1004
+ - *id001
1005
+ action_representation:
1006
+ - *id002
1007
+ - *id002
1008
+ - *id005
1009
+ - *id005
1010
+ - *id002
1011
+ - *id002
1012
+ action_type:
1013
+ - *id006
1014
+ - *id006
1015
+ - *id003
1016
+ - *id003
1017
+ - *id003
1018
+ - *id003
1019
+ delta_indices:
1020
+ - 0
1021
+ - 1
1022
+ - 2
1023
+ - 3
1024
+ - 4
1025
+ - 5
1026
+ - 6
1027
+ - 7
1028
+ - 8
1029
+ - 9
1030
+ - 10
1031
+ - 11
1032
+ - 12
1033
+ - 13
1034
+ - 14
1035
+ - 15
1036
+ - 16
1037
+ - 17
1038
+ - 18
1039
+ - 19
1040
+ - 20
1041
+ - 21
1042
+ - 22
1043
+ - 23
1044
+ - 24
1045
+ - 25
1046
+ - 26
1047
+ - 27
1048
+ - 28
1049
+ - 29
1050
+ - 30
1051
+ - 31
1052
+ - 32
1053
+ - 33
1054
+ - 34
1055
+ - 35
1056
+ - 36
1057
+ - 37
1058
+ - 38
1059
+ - 39
1060
+ exclude_state: false
1061
+ extra_keys:
1062
+ - left_joint_pos
1063
+ - right_joint_pos
1064
+ hand_keys:
1065
+ - left_gripper_pos
1066
+ - right_gripper_pos
1067
+ loss_weights: null
1068
+ modality_keys:
1069
+ - left_wrist_eef
1070
+ - right_wrist_eef
1071
+ - left_gripper_pos
1072
+ - right_gripper_pos
1073
+ - left_joint_pos
1074
+ - right_joint_pos
1075
+ normalization_mode: null
1076
+ normalize_rotation: true
1077
+ wrist_keys:
1078
+ - left_wrist_eef
1079
+ - right_wrist_eef
1080
+ language: !!python/object:groot.vla.omni.data.types.ModalityConfig
1081
+ action_format: null
1082
+ action_representation: null
1083
+ action_type: null
1084
+ delta_indices:
1085
+ - 0
1086
+ exclude_state: false
1087
+ extra_keys: null
1088
+ hand_keys: null
1089
+ loss_weights: null
1090
+ modality_keys:
1091
+ - annotation.sub_task
1092
+ normalization_mode: null
1093
+ normalize_rotation: true
1094
+ wrist_keys: null
1095
+ state: !!python/object:groot.vla.omni.data.types.ModalityConfig
1096
+ action_format: null
1097
+ action_representation: null
1098
+ action_type: null
1099
+ delta_indices:
1100
+ - 0
1101
+ exclude_state: false
1102
+ extra_keys: null
1103
+ hand_keys: null
1104
+ loss_weights: null
1105
+ modality_keys:
1106
+ - left_wrist_eef
1107
+ - right_wrist_eef
1108
+ - left_gripper_pos
1109
+ - right_gripper_pos
1110
+ - left_joint_pos
1111
+ - right_joint_pos
1112
+ normalization_mode: null
1113
+ normalize_rotation: true
1114
+ wrist_keys: null
1115
+ video: !!python/object:groot.vla.omni.data.types.ModalityConfig
1116
+ action_format: null
1117
+ action_representation: null
1118
+ action_type: null
1119
+ delta_indices:
1120
+ - -30
1121
+ - 0
1122
+ exclude_state: false
1123
+ extra_keys: null
1124
+ hand_keys: null
1125
+ loss_weights: null
1126
+ modality_keys:
1127
+ - top_camera-images-rgb_320_240
1128
+ - left_camera-images-rgb_320_240
1129
+ - right_camera-images-rgb_320_240
1130
+ normalization_mode: null
1131
+ normalize_rotation: true
1132
+ wrist_keys: null
1133
+ mode: single_turn
1134
+ num_prompt_trajectories: 2
1135
+ num_shards_per_epoch: 100000
1136
+ override_pretraining_statistics: false
1137
+ random_chop: 0.0
1138
+ seed: 24
1139
+ shard_size: 1024
1140
+ shuffle: true
1141
+ subsample_ratio: 1.0
1142
+ variable_num_demos: false
1143
+ video_backend: torchcodec
1144
+ load_config_path: groot/vla/omni/configs/experiments/r1_pro/sharpa/n17_pretrain/n17_pretrain_human_robot_cross_embodiment_fix_yam_absolute_hand_2step.yaml
1145
+ model: !!python/object:groot.vla.omni.configs.model.groot_n1d5_qwen.GrootN1d5QwenConfig
1146
+ _attn_implementation_internal: null
1147
+ _commit_hash: null
1148
+ _name_or_path: ''
1149
+ _output_attentions: false
1150
+ action_horizon: 40
1151
+ action_space_prompt: false
1152
+ add_cross_attention: false
1153
+ add_pos_embed: true
1154
+ apply_sincos_state_encoding: false
1155
+ architectures: null
1156
+ attn_dropout: 0.2
1157
+ backbone_embedding_dim: 2048
1158
+ bad_words_ids: null
1159
+ begin_suppress_tokens: null
1160
+ bos_token_id: null
1161
+ chunk_size_feed_forward: 0
1162
+ color_jitter_params:
1163
+ brightness: 0.3
1164
+ contrast: 0.4
1165
+ hue: 0.08
1166
+ saturation: 0.5
1167
+ crop_fraction: 0.95
1168
+ cross_attention_hidden_size: null
1169
+ decoder_start_token_id: null
1170
+ diffusion_model_cfg:
1171
+ attention_head_dim: 48
1172
+ cross_attention_dim: 2048
1173
+ dropout: 0.2
1174
+ final_dropout: true
1175
+ interleave_self_attention: true
1176
+ norm_type: ada_norm
1177
+ num_attention_heads: 32
1178
+ num_layers: 32
1179
+ output_dim: 1024
1180
+ positional_embeddings: null
1181
+ dit_latent_dim: 1536
1182
+ diversity_penalty: 0.0
1183
+ do_human_interpolation: false
1184
+ do_sample: false
1185
+ dtype: null
1186
+ early_stopping: false
1187
+ encoder_no_repeat_ngram_size: 0
1188
+ eos_token_id: null
1189
+ exclude_state: false
1190
+ exponential_decay_length_penalty: null
1191
+ finetuning_task: null
1192
+ forced_bos_token_id: null
1193
+ forced_eos_token_id: null
1194
+ formalize_language: true
1195
+ hidden_size: 1024
1196
+ human_embodiment_tags: null
1197
+ id2label:
1198
+ 0: LABEL_0
1199
+ 1: LABEL_1
1200
+ image_crop_size: !!python/tuple
1201
+ - 230
1202
+ - 230
1203
+ image_target_size: !!python/tuple
1204
+ - 256
1205
+ - 256
1206
+ interpolation_steps: 20
1207
+ is_decoder: false
1208
+ is_encoder_decoder: false
1209
+ label2id:
1210
+ LABEL_0: 0
1211
+ LABEL_1: 1
1212
+ language_dropout_prob: 0.0
1213
+ length_penalty: 1.0
1214
+ letter_box_transform: false
1215
+ load_bf16: true
1216
+ max_action_dim: 132
1217
+ max_length: 20
1218
+ max_num_embodiments: 32
1219
+ max_seq_len: 1024
1220
+ max_state_dim: 132
1221
+ min_length: 0
1222
+ model_dtype: bfloat16
1223
+ model_type: GrootN1d5Qwen
1224
+ no_repeat_ngram_size: 0
1225
+ noise_beta_alpha: 1.5
1226
+ noise_beta_beta: 1.0
1227
+ noise_s: 0.999
1228
+ num_beam_groups: 1
1229
+ num_beams: 1
1230
+ num_inference_timesteps: 4
1231
+ num_return_sequences: 1
1232
+ num_timestep_buckets: 1000
1233
+ output_hidden_states: false
1234
+ output_scores: false
1235
+ pad_token_id: null
1236
+ prefix: null
1237
+ problem_type: null
1238
+ pruned_heads: {}
1239
+ random_history_crop: true
1240
+ random_rotation_angle: 0
1241
+ remove_invalid_values: false
1242
+ repetition_penalty: 1.0
1243
+ reproject_vision: false
1244
+ return_dict: true
1245
+ return_dict_in_generate: false
1246
+ rtc_ramp_rate: 6.0
1247
+ select_layer: 16
1248
+ sep_token_id: null
1249
+ shortest_image_edge: 256
1250
+ state_dropout_prob: 0.2
1251
+ state_gaussian_noise_std: 0.0
1252
+ suppress_tokens: null
1253
+ task_specific_params: null
1254
+ temperature: 1.0
1255
+ tf_legacy_loss: false
1256
+ tie_encoder_decoder: false
1257
+ tie_word_embeddings: true
1258
+ tokenizer_class: null
1259
+ top_k: 50
1260
+ top_p: 1.0
1261
+ torchscript: false
1262
+ transformers_version: null
1263
+ tune_diffusion_model: true
1264
+ tune_linear: true
1265
+ tune_llm: false
1266
+ tune_projector: true
1267
+ tune_top_llm_layers: 0
1268
+ tune_visual: false
1269
+ tune_vlln: true
1270
+ typical_p: 1.0
1271
+ use_albumentations: true
1272
+ use_alternate_vl_dit: true
1273
+ use_bfloat16: false
1274
+ use_flash_attention: true
1275
+ use_future_tokens: false
1276
+ use_mean_std: false
1277
+ use_percentiles: true
1278
+ use_vl_self_attention: true
1279
+ use_vlln: true
1280
+ vl_self_attention_cfg:
1281
+ attention_head_dim: 64
1282
+ dropout: 0.2
1283
+ final_dropout: true
1284
+ num_attention_heads: 32
1285
+ num_layers: 4
1286
+ positional_embeddings: null
1287
+ vlm_backend: qwen3
1288
+ vlm_model_path: nvidia/Cosmos-Reason2-2B
+ training: !!python/object:groot.vla.omni.configs.training.training_config.TrainingConfig
+   assert_loss_less_than: null
+   batch_size: 32
+   bf16: true
+   dataloader_num_workers: 4
+   deepspeed_stage: 2
+   enable_profiling: false
+   eval_batch_size: 2
+   eval_bf16: true
+   eval_set_split_ratio: 0.1
+   eval_steps: 500
+   eval_strategy: 'no'
+   experiment_name: null
+   fp16: false
+   global_batch_size: 1024
+   gradient_accumulation_steps: 1
+   gradient_checkpointing: false
+   learning_rate: 5.0e-05
+   logging_steps: 10
+   lr_scheduler_type: cosine
+   max_concurrent_uploads: 2
+   max_grad_norm: 1.0
+   max_retries: 3
+   max_steps: 200000
+   muon_lr: 0.005
+   num_gpus: 256
+   optim: adamw_torch_fused
+   output_dir: nvidia/Cosmos-Reason2-2B
+   remove_unused_columns: false
+   save_best_eval_metric_greater_is_better: true
+   save_best_eval_metric_name: ''
+   save_steps: 1000
+   save_total_limit: 5
+   save_vl_model: false
+   skip_spike: true
+   skip_spike_ema_alpha: 0.99
+   skip_spike_max_consecutive: 10
+   skip_spike_threshold: 5.0
+   start_from_checkpoint: null
+   tf32: true
+   upload_checkpoints: true
+   upload_every: 1000
+   upload_last_n_checkpoints: 5
+   use_ddp: false
+   use_legacy_wd_application: false
+   use_muon: false
+   use_wandb: true
+   wandb_project: human_pretraining_n15_galaxea_sharpa
+   warmup_ratio: 0.05
+   warmup_steps: 0
+   weight_decay: 1.0e-05
+   wsd_decay_type: cosine
+   wsd_stable_ratio: 0.8
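The `training` block above sets `warmup_steps: 0` alongside `warmup_ratio: 0.05`, `max_steps: 200000`, and `lr_scheduler_type: cosine`. The sketch below shows one common way such settings resolve into a schedule. It assumes the Hugging Face Trainer convention, where a zero `warmup_steps` falls back to `ceil(max_steps * warmup_ratio)`. The diff does not confirm that `TrainingConfig` follows that convention, and `resolve_warmup_steps` and `cosine_lr` are hypothetical helper names introduced here for illustration.

```python
import math

# Values copied from the `training` block in the diff above.
training = {
    "max_steps": 200_000,
    "warmup_ratio": 0.05,
    "warmup_steps": 0,
    "learning_rate": 5.0e-05,
}


def resolve_warmup_steps(cfg):
    """Assumed convention: an explicit warmup_steps wins, else use the ratio."""
    if cfg["warmup_steps"] > 0:
        return cfg["warmup_steps"]
    return math.ceil(cfg["max_steps"] * cfg["warmup_ratio"])


def cosine_lr(step, cfg):
    """Linear warmup followed by a half-cosine decay to zero (illustrative only)."""
    warmup = resolve_warmup_steps(cfg)
    if step < warmup:
        return cfg["learning_rate"] * step / max(1, warmup)
    progress = (step - warmup) / max(1, cfg["max_steps"] - warmup)
    return cfg["learning_rate"] * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("warmup steps:", resolve_warmup_steps(training))
    for step in (0, 5_000, 10_000, 100_000, 200_000):
        print(step, cosine_lr(step, training))
```

Under that assumption the warmup would span the first 5% of the run, with the peak learning rate reached when warmup ends.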
experiment_cfg/dataset_statistics.json ADDED
The diff for this file is too large to render. See raw diff
 
experiment_cfg/final_model_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "model_type": "Gr00tN1d7",
+   "model_dtype": "bfloat16",
+   "model_name": "nvidia/Cosmos-Reason2-2B",
+   "backbone_model_type": "qwen",
+   "model_revision": null,
+   "tune_top_llm_layers": 4,
+   "backbone_embedding_dim": 2048,
+   "tune_llm": true,
+   "tune_visual": true,
+   "select_layer": 16,
+   "reproject_vision": false,
+   "use_flash_attention": true,
+   "load_bf16": true,
+   "collator_overwrite_image_inputs": false,
+   "eagle_collator": false,
+   "backbone_trainable_params_fp32": true,
+   "gemma_collator": false,
+   "apply_sincos_state_encoding": true,
+   "use_percentiles": false,
+   "use_relative_action": true,
+   "max_state_dim": 128,
+   "max_action_dim": 128,
+   "action_horizon": 50,
+   "hidden_size": 1024,
+   "input_embedding_dim": 1536,
+   "state_history_length": 1,
+   "add_pos_embed": true,
+   "attn_dropout": 0.2,
+   "use_vlln": true,
+   "max_seq_len": 1024,
+   "use_alternate_vl_dit": true,
+   "attend_text_every_n_blocks": 2,
+   "diffusion_model_cfg": {
+     "positional_embeddings": null,
+     "num_layers": 32,
+     "num_attention_heads": 32,
+     "attention_head_dim": 48,
+     "norm_type": "ada_norm",
+     "dropout": 0.2,
+     "final_dropout": true,
+     "output_dim": 1024,
+     "interleave_self_attention": true
+   },
+   "num_inference_timesteps": 4,
+   "noise_beta_alpha": 1.5,
+   "noise_beta_beta": 1.0,
+   "noise_s": 0.999,
+   "num_timestep_buckets": 1000,
+   "tune_projector": true,
+   "tune_diffusion_model": true,
+   "tune_vlln": true,
+   "state_dropout_prob": 0.0,
+   "state_additive_noise_scale": 0.0,
+   "max_num_embodiments": 32
+ }
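A quick consistency check on the attention settings above can be sketched as follows. It assumes the usual transformer convention that a block's inner width equals `num_attention_heads * attention_head_dim`. That the result matches `input_embedding_dim` (for the diffusion model) and `backbone_embedding_dim` (for `vl_self_attention_cfg` in the YAML config) is an observation from the numbers, not something the diff states about how the model is wired. The helper name `inner_dim` is hypothetical.

```python
import json

# Excerpt of experiment_cfg/final_model_config.json, copied from the diff above.
model_cfg = json.loads("""
{
  "backbone_embedding_dim": 2048,
  "input_embedding_dim": 1536,
  "diffusion_model_cfg": {
    "num_layers": 32,
    "num_attention_heads": 32,
    "attention_head_dim": 48
  }
}
""")

# Values of vl_self_attention_cfg as listed in the YAML training config.
vl_self_attention_cfg = {
    "num_layers": 4,
    "num_attention_heads": 32,
    "attention_head_dim": 64,
}


def inner_dim(attn_cfg):
    """Assumed convention: inner width = heads * per-head dimension."""
    return attn_cfg["num_attention_heads"] * attn_cfg["attention_head_dim"]


if __name__ == "__main__":
    dit_width = inner_dim(model_cfg["diffusion_model_cfg"])
    vl_width = inner_dim(vl_self_attention_cfg)
    print("diffusion inner dim:", dit_width,
          "| input_embedding_dim:", model_cfg["input_embedding_dim"])
    print("vl self-attention inner dim:", vl_width,
          "| backbone_embedding_dim:", model_cfg["backbone_embedding_dim"])
```

A check like this is useful mainly as a sanity test when editing either config by hand, since a mismatched head count or head dimension would otherwise only surface when weights are loaded.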
experiment_cfg/final_processor_config.json ADDED
@@ -0,0 +1,27 @@
+ {
+ "modality_configs": "{'xdof': {'video': ModalityConfig(delta_indices=[0], modality_keys=['left_camera-images-rgb_320_240', 'top_camera-images-rgb_320_240', 'right_camera-images-rgb_320_240'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['gripper_pos_obs_left', 'gripper_pos_obs_right', 'joint_pos_obs_left', 'joint_pos_obs_right'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], modality_keys=['gripper_pos_action_left', 'gripper_pos_action_right', 'joint_pos_action_left', 'joint_pos_action_right'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=[ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key='gripper_pos_obs_left'), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key='gripper_pos_obs_right'), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key='joint_pos_obs_left'), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key='joint_pos_obs_right')]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['annotation.task'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}, 'unitree_g1_full_body_with_waist_height_nav_cmd': {'video': ModalityConfig(delta_indices=[0], modality_keys=['ego_view'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, 
action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['left_leg', 'right_leg', 'waist', 'left_arm', 'right_arm', 'left_hand', 'right_hand'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], modality_keys=['left_arm', 'right_arm', 'left_hand', 'right_hand', 'waist', 'base_height_command', 'navigate_command'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=[ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None)]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['annotation.human.task_description'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}, 
'oxe_droid_joint_position_relative': {'video': ModalityConfig(delta_indices=[0], modality_keys=['exterior_image_1_left', 'wrist_image_left'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['joint_position', 'gripper_position'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], modality_keys=['joint_position', 'gripper_position'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=[ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None)]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['annotation.language.language_instruction', 'annotation.language.language_instruction_2', 'annotation.language.language_instruction_3'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}, 'simpler_env_google': {'video': ModalityConfig(delta_indices=[0], modality_keys=['image'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['x', 'y', 'z', 'rx', 'ry', 'rz', 'rw', 'gripper'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7], modality_keys=['x', 'y', 'z', 'roll', 'pitch', 'yaw', 'gripper'], sin_cos_embedding_keys=None, mean_std_embedding_keys=['x', 'y', 'z', 'roll', 'pitch', 'yaw'], action_configs=[ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, 
format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None)]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['annotation.human.action.task_description'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}, 'robocasa_panda_omron': {'video': ModalityConfig(delta_indices=[0], modality_keys=['res256_image_side_0', 'res256_image_side_1', 'res256_image_wrist_0'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['end_effector_position_relative', 'end_effector_rotation_relative', 'gripper_qpos', 'base_position', 'base_rotation'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], modality_keys=['end_effector_position', 'end_effector_rotation', 'gripper_close', 'base_motion', 'control_mode'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, 
action_configs=[ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None)]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['annotation.human.action.task_description'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}, 'gr1_unified': {'video': ModalityConfig(delta_indices=[0], modality_keys=['ego_view_bg_crop_pad_res256_freq20'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['left_arm', 'right_arm', 'left_hand', 'right_hand', 'waist'], sin_cos_embedding_keys=['left_arm', 'right_arm', 'left_hand', 'right_hand', 'waist'], mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], modality_keys=['left_arm', 'right_arm', 'left_hand', 'right_hand', 'waist'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=[ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 
'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None)]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['task'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'rl_info': ModalityConfig(delta_indices=[0], modality_keys=[], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}, 'sim_behavior_r1_pro': {'video': ModalityConfig(delta_indices=[0], modality_keys=['observation.images.rgb.head_256_256', 'observation.images.rgb.left_wrist_256_256', 'observation.images.rgb.right_wrist_256_256'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['robot_pos', 'robot_ori_cos', 'robot_ori_sin', 'robot_2d_ori', 'robot_2d_ori_cos', 'robot_2d_ori_sin', 'robot_lin_vel', 'robot_ang_vel', 'arm_left_qpos', 'arm_left_qpos_sin', 'arm_left_qpos_cos', 'eef_left_pos', 'eef_left_quat', 'gripper_left_qpos', 'arm_right_qpos', 'arm_right_qpos_sin', 'arm_right_qpos_cos', 'eef_right_pos', 'eef_right_quat', 'gripper_right_qpos', 'trunk_qpos'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], modality_keys=['base', 'torso', 'left_arm', 'left_gripper', 'right_arm', 'right_gripper'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=[ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 
'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key='trunk_qpos'), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key='arm_left_qpos'), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key='arm_right_qpos'), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None)]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['annotation.human.coarse_action'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}, 'simpler_env_widowx': {'video': ModalityConfig(delta_indices=[0], modality_keys=['image_0'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['x', 'y', 'z', 'roll', 'pitch', 'yaw', 'pad', 'gripper'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7], modality_keys=['x', 'y', 'z', 'roll', 'pitch', 'yaw', 'gripper'], sin_cos_embedding_keys=None, mean_std_embedding_keys=['x', 'y', 'z', 'roll', 'pitch', 'yaw'], action_configs=[ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, 
format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None)]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['annotation.human.action.task_description'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}, 'agibot': {'video': ModalityConfig(delta_indices=[0], modality_keys=['top_head_pad_res256_freq10', 'hand_left_pad_res256_freq10', 'hand_right_pad_res256_freq10'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['left_arm_joint_position', 'right_arm_joint_position', 'left_effector_position', 'right_effector_position', 'head_position', 'waist_pitch', 'waist_lift'], sin_cos_embedding_keys=['left_arm_joint_position', 'right_arm_joint_position', 'head_position', 'waist_pitch'], mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], modality_keys=['left_arm_joint_position', 'right_arm_joint_position', 'left_effector_position', 
'right_effector_position', 'head_position', 'waist_pitch', 'waist_lift', 'robot_velocity'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=[ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None)]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['annotation.language.action_text'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}}",
+ "state_action_processor": "StateActionProcessor(modality_configs={'xdof': {'video': ModalityConfig(delta_indices=[0], modality_keys=['left_camera-images-rgb_320_240', 'top_camera-images-rgb_320_240', 'right_camera-images-rgb_320_240'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['gripper_pos_obs_left', 'gripper_pos_obs_right', 'joint_pos_obs_left', 'joint_pos_obs_right'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], modality_keys=['gripper_pos_action_left', 'gripper_pos_action_right', 'joint_pos_action_left', 'joint_pos_action_right'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=[ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key='gripper_pos_obs_left'), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key='gripper_pos_obs_right'), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key='joint_pos_obs_left'), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key='joint_pos_obs_right')]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['annotation.task'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}, 'unitree_g1_full_body_with_waist_height_nav_cmd': {'video': ModalityConfig(delta_indices=[0], modality_keys=['ego_view'], 
sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['left_leg', 'right_leg', 'waist', 'left_arm', 'right_arm', 'left_hand', 'right_hand'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], modality_keys=['left_arm', 'right_arm', 'left_hand', 'right_hand', 'waist', 'base_height_command', 'navigate_command'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=[ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None)]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['annotation.human.task_description'], sin_cos_embedding_keys=None, 
mean_std_embedding_keys=None, action_configs=None)}, 'oxe_droid_joint_position_relative': {'video': ModalityConfig(delta_indices=[0], modality_keys=['exterior_image_1_left', 'wrist_image_left'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['joint_position', 'gripper_position'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], modality_keys=['joint_position', 'gripper_position'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=[ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None)]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['annotation.language.language_instruction', 'annotation.language.language_instruction_2', 'annotation.language.language_instruction_3'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}, 'simpler_env_google': {'video': ModalityConfig(delta_indices=[0], modality_keys=['image'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['x', 'y', 'z', 'rx', 'ry', 'rz', 'rw', 'gripper'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7], modality_keys=['x', 'y', 'z', 'roll', 'pitch', 'yaw', 'gripper'], sin_cos_embedding_keys=None, mean_std_embedding_keys=['x', 'y', 'z', 'roll', 'pitch', 'yaw'], action_configs=[ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 
'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None)]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['annotation.human.action.task_description'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}, 'robocasa_panda_omron': {'video': ModalityConfig(delta_indices=[0], modality_keys=['res256_image_side_0', 'res256_image_side_1', 'res256_image_wrist_0'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['end_effector_position_relative', 'end_effector_rotation_relative', 'gripper_qpos', 'base_position', 'base_rotation'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], modality_keys=['end_effector_position', 'end_effector_rotation', 'gripper_close', 'base_motion', 'control_mode'], sin_cos_embedding_keys=None, 
mean_std_embedding_keys=None, action_configs=[ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None)]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['annotation.human.action.task_description'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}, 'gr1_unified': {'video': ModalityConfig(delta_indices=[0], modality_keys=['ego_view_bg_crop_pad_res256_freq20'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['left_arm', 'right_arm', 'left_hand', 'right_hand', 'waist'], sin_cos_embedding_keys=['left_arm', 'right_arm', 'left_hand', 'right_hand', 'waist'], mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], modality_keys=['left_arm', 'right_arm', 'left_hand', 'right_hand', 'waist'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=[ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, 
format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None)]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['task'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'rl_info': ModalityConfig(delta_indices=[0], modality_keys=[], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}, 'sim_behavior_r1_pro': {'video': ModalityConfig(delta_indices=[0], modality_keys=['observation.images.rgb.head_256_256', 'observation.images.rgb.left_wrist_256_256', 'observation.images.rgb.right_wrist_256_256'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['robot_pos', 'robot_ori_cos', 'robot_ori_sin', 'robot_2d_ori', 'robot_2d_ori_cos', 'robot_2d_ori_sin', 'robot_lin_vel', 'robot_ang_vel', 'arm_left_qpos', 'arm_left_qpos_sin', 'arm_left_qpos_cos', 'eef_left_pos', 'eef_left_quat', 'gripper_left_qpos', 'arm_right_qpos', 'arm_right_qpos_sin', 'arm_right_qpos_cos', 'eef_right_pos', 'eef_right_quat', 'gripper_right_qpos', 'trunk_qpos'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], modality_keys=['base', 'torso', 'left_arm', 'left_gripper', 'right_arm', 'right_gripper'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, 
action_configs=[ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key='trunk_qpos'), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key='arm_left_qpos'), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key='arm_right_qpos'), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None)]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['annotation.human.coarse_action'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}, 'simpler_env_widowx': {'video': ModalityConfig(delta_indices=[0], modality_keys=['image_0'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['x', 'y', 'z', 'roll', 'pitch', 'yaw', 'pad', 'gripper'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7], modality_keys=['x', 'y', 'z', 'roll', 'pitch', 'yaw', 'gripper'], sin_cos_embedding_keys=None, mean_std_embedding_keys=['x', 'y', 'z', 'roll', 'pitch', 'yaw'], action_configs=[ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), 
ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None)]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['annotation.human.action.task_description'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}, 'agibot': {'video': ModalityConfig(delta_indices=[0], modality_keys=['top_head_pad_res256_freq10', 'hand_left_pad_res256_freq10', 'hand_right_pad_res256_freq10'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None), 'state': ModalityConfig(delta_indices=[0], modality_keys=['left_arm_joint_position', 'right_arm_joint_position', 'left_effector_position', 'right_effector_position', 'head_position', 'waist_pitch', 'waist_lift'], sin_cos_embedding_keys=['left_arm_joint_position', 'right_arm_joint_position', 'head_position', 'waist_pitch'], mean_std_embedding_keys=None, action_configs=None), 'action': ModalityConfig(delta_indices=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], 
modality_keys=['left_arm_joint_position', 'right_arm_joint_position', 'left_effector_position', 'right_effector_position', 'head_position', 'waist_pitch', 'waist_lift', 'robot_velocity'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=[ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.RELATIVE: 'relative'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None), ActionConfig(rep=<ActionRepresentation.ABSOLUTE: 'absolute'>, type=<ActionType.NON_EEF: 'non_eef'>, format=<ActionFormat.DEFAULT: 'default'>, state_key=None)]), 'language': ModalityConfig(delta_indices=[0], modality_keys=['annotation.language.action_text'], sin_cos_embedding_keys=None, mean_std_embedding_keys=None, action_configs=None)}}, statistics={}, use_percentiles=False, clip_outliers=True, apply_sincos_state_encoding=True, use_relative_action=True)",
+ "use_percentiles": "False",
+ "clip_outliers": "True",
+ "apply_sincos_state_encoding": "True",
+ "use_relative_action": "True",
+ "formalize_language": "True",
+ "model_name": "nvidia/Cosmos-Reason2-2B",
+ "model_type": "qwen",
+ "processor": "Qwen3VLProcessor:\n- image_processor: Qwen2VLImageProcessorFast {\n \"crop_size\": null,\n \"data_format\": \"channels_first\",\n \"default_to_square\": true,\n \"device\": null,\n \"disable_grouping\": null,\n \"do_center_crop\": null,\n \"do_convert_rgb\": true,\n \"do_normalize\": true,\n \"do_pad\": null,\n \"do_rescale\": true,\n \"do_resize\": true,\n \"image_mean\": [\n 0.5,\n 0.5,\n 0.5\n ],\n \"image_processor_type\": \"Qwen2VLImageProcessorFast\",\n \"image_std\": [\n 0.5,\n 0.5,\n 0.5\n ],\n \"input_data_format\": null,\n \"max_pixels\": null,\n \"merge_size\": 2,\n \"min_pixels\": null,\n \"pad_size\": null,\n \"patch_size\": 16,\n \"processor_class\": \"Qwen3VLProcessor\",\n \"resample\": 3,\n \"rescale_factor\": 0.00392156862745098,\n \"return_tensors\": null,\n \"size\": {\n \"longest_edge\": 16777216,\n \"shortest_edge\": 65536\n },\n \"temporal_patch_size\": 2\n}\n\n- tokenizer: Qwen2TokenizerFast(name_or_path='nvidia/Cosmos-Reason2-2B', vocab_size=151643, model_max_length=262144, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={\n\t151643: AddedToken(\"<|endoftext|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t151644: AddedToken(\"<|im_start|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t151645: AddedToken(\"<|im_end|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t151646: AddedToken(\"<|object_ref_start|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t151647: 
AddedToken(\"<|object_ref_end|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t151648: AddedToken(\"<|box_start|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t151649: AddedToken(\"<|box_end|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t151650: AddedToken(\"<|quad_start|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t151651: AddedToken(\"<|quad_end|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t151652: AddedToken(\"<|vision_start|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t151653: AddedToken(\"<|vision_end|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t151654: AddedToken(\"<|vision_pad|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t151655: AddedToken(\"<|image_pad|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t151656: AddedToken(\"<|video_pad|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t151657: AddedToken(\"<tool_call>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),\n\t151658: AddedToken(\"</tool_call>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),\n\t151659: AddedToken(\"<|fim_prefix|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),\n\t151660: AddedToken(\"<|fim_middle|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),\n\t151661: AddedToken(\"<|fim_suffix|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),\n\t151662: AddedToken(\"<|fim_pad|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),\n\t151663: AddedToken(\"<|repo_name|>\", rstrip=False, 
lstrip=False, single_word=False, normalized=False, special=False),\n\t151664: AddedToken(\"<|file_sep|>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),\n\t151665: AddedToken(\"<tool_response>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),\n\t151666: AddedToken(\"</tool_response>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),\n\t151667: AddedToken(\"<think>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),\n\t151668: AddedToken(\"</think>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),\n}\n)\n- video_processor: Qwen3VLVideoProcessor {\n \"crop_size\": null,\n \"data_format\": \"channels_first\",\n \"default_to_square\": true,\n \"device\": null,\n \"do_center_crop\": null,\n \"do_convert_rgb\": true,\n \"do_normalize\": true,\n \"do_rescale\": true,\n \"do_resize\": true,\n \"do_sample_frames\": true,\n \"fps\": 2,\n \"image_mean\": [\n 0.5,\n 0.5,\n 0.5\n ],\n \"image_std\": [\n 0.5,\n 0.5,\n 0.5\n ],\n \"input_data_format\": null,\n \"max_frames\": 768,\n \"merge_size\": 2,\n \"min_frames\": 4,\n \"num_frames\": null,\n \"pad_size\": null,\n \"patch_size\": 16,\n \"processor_class\": \"Qwen3VLProcessor\",\n \"resample\": 3,\n \"rescale_factor\": 0.00392156862745098,\n \"return_metadata\": false,\n \"size\": {\n \"longest_edge\": 25165824,\n \"shortest_edge\": 4096\n },\n \"temporal_patch_size\": 2,\n \"video_metadata\": null,\n \"video_processor_type\": \"Qwen3VLVideoProcessor\"\n}\n\n\n{\n \"processor_class\": \"Qwen3VLProcessor\"\n}\n",
+ "max_state_dim": "128",
+ "max_action_dim": "128",
+ "max_action_horizon": "50",
+ "image_crop_size": "None",
+ "image_target_size": "None",
+ "random_rotation_angle": "None",
+ "color_jitter_params": "{'brightness': 0.1, 'contrast': 0.1, 'saturation': 0.1, 'hue': 0.1}",
+ "embodiment_id_mapping": "{'robocasa_panda_omron': 13, 'oxe_droid': 17, 'oxe_droid_joint_position_relative': 17, 'oxe_fractal': 18, 'oxe_language_table': 19, 'oxe_bridge': 20, 'unknown': 22, 'gr1_unified': 20, 'agibot': 26, 'oxe_mutex': 28, 'oxe_roboset': 29, 'oxe_plex': 30, 'dream': 31, 'language_table_sim': 7, 'gr1_isaac': 0, 'xdof': 23, 'xdof_oss_data': 27, 'xdof_h16': 23, 'sim_behavior_r1_pro': 24, 'unitree_g1_full_body_with_waist_height_nav_cmd': 25, 'unitree_g1_full_body_with_waist_height_nav_cmd_sim': 8, 'unitree_g1_whole_body_teleop_latent': 9, 'unitree_g1_whole_body_teleop_smpl': 16, 'simpler_env_google': 0, 'simpler_env_widowx': 1, 'libero_sim': 2, 'droid_sim': 3, 'real_r1_pro_sharpa': 8, 'r1_pro': 27, 'r1_pro_single-view': 27, 'new_embodiment': 10, 'so100_2rgb': 6, 'so100_3rgb': 6, 'robomind_agilex_3rgb': 4, 'robomind_franka_1rgb': 5, 'robomind_franka_3rgb': 5, 'robomind_tienkung_gello_1rgb': 11, 'robomind_ur_1rgb': 12, 'robomind_tienkung_xsens_1rgb': 13, 'molmoact_franka_3rgb': 14, 'galaxea_r1_4rgb': 15}",
+ "shortest_image_edge": "256",
+ "crop_fraction": "0.95",
+ "use_albumentations": "True",
+ "train_image_transform": "ReplayCompose([\n SmallestMaxSize(p=1.0, max_size=[256], interpolation=3),\n FractionalRandomCrop(p=1.0, crop_fraction=0.95),\n SmallestMaxSize(p=1.0, max_size=[256], interpolation=3),\n ColorJitter(p=1.0, brightness=(0.9, 1.1), contrast=(0.9, 1.1), saturation=(0.9, 1.1), hue=(-0.1, 0.1)),\n], p=1.0, bbox_params=None, keypoint_params=None, additional_targets={}, is_check_shapes=True, save_key='replay')",
+ "eval_image_transform": "Compose([\n SmallestMaxSize(p=1.0, max_size=[256], interpolation=3),\n FractionalCenterCrop(p=1.0, crop_fraction=0.95),\n SmallestMaxSize(p=1.0, max_size=[256], interpolation=3),\n], p=1.0, bbox_params=None, keypoint_params=None, additional_targets={}, is_check_shapes=True)",
+ "_collator": "Gr00tN1d7DataCollator(model_name=nvidia/Cosmos-Reason2-2B, model_type=qwen)",
+ "training": "True"
+ }
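Note on the values above: the `color_jitter_params` strengths line up with the ranges printed in `train_image_transform` (0.1 appears as (0.9, 1.1) for brightness, contrast, and saturation, and as (-0.1, 0.1) for hue). The following is a minimal sketch of that mapping, assuming the common symmetric-range convention; `jitter_ranges` is a hypothetical helper written for illustration and is not taken from the GR00T codebase.

```python
def jitter_ranges(params):
    """Map jitter strengths to (low, high) sampling ranges.

    Assumption: multiplicative factors (brightness, contrast, saturation)
    are centred on 1.0, while hue is an additive shift centred on 0.0.
    """
    ranges = {}
    for name, strength in params.items():
        if name == "hue":
            ranges[name] = (-strength, strength)
        else:
            ranges[name] = (1.0 - strength, 1.0 + strength)
    return ranges


# Strengths copied from the "color_jitter_params" entry in this config dump.
params = {"brightness": 0.1, "contrast": 0.1, "saturation": 0.1, "hue": 0.1}
ranges = jitter_ranges(params)

for name, (low, high) in ranges.items():
    print(f"{name}: ({low:.2f}, {high:.2f})")
```

Under this assumption, the helper produces the same intervals that the `ColorJitter(...)` entry in the dump reports.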
experiment_cfg/initial_actions.npz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c34d9e522cf105de4e5c8b2c3d6024cb1e40cdbfcf7c64c7ff9b0fc738c6f932
+ size 1564431
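Note: the three added lines above are a Git LFS pointer, not the `.npz` payload itself. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper, and it assumes the simple `key value` layout shown in this diff.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer into its version, hash, and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", split on the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer text copied from experiment_cfg/initial_actions.npz in this commit.
pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:c34d9e522cf105de4e5c8b2c3d6024cb1e40cdbfcf7c64c7ff9b0fc738c6f932
size 1564431
"""

pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algo"], pointer["size_bytes"])
```

The same helper applies to the other pointer files in this commit (the `.safetensors` shards, `scheduler.pt`, and `training_args.bin`).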
latest ADDED
@@ -0,0 +1 @@
+ global_step196000
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8a1a1d8a33c99103c7c80c136073c5bb8bfe9ca8f7a970c93c033ea89742906d
+ size 4990519232
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c3f61940deb2007ba1ad7743013b57f0f8462356151db9655175d7aca2d40661
+ size 1919980184
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
processor_config.json ADDED
@@ -0,0 +1,955 @@
+ {
+ "processor_class": "Gr00tN1d7Processor",
+ "processor_kwargs": {
+ "modality_configs": {
+ "real_g1_relative_eef_relative_joints": {
+ "video": {
+ "delta_indices": [
+ -20,
+ 0
+ ],
+ "modality_keys": [
+ "ego_view"
+ ]
+ },
+ "state": {
+ "delta_indices": [
+ 0
+ ],
+ "modality_keys": [
+ "left_wrist_eef_9d",
+ "right_wrist_eef_9d",
+ "left_hand",
+ "right_hand",
+ "left_arm",
+ "right_arm",
+ "waist"
+ ]
+ },
+ "action": {
+ "delta_indices": [
+ 0,
+ 1,
+ 2,
+ 3,
+ 4,
+ 5,
+ 6,
+ 7,
+ 8,
+ 9,
+ 10,
+ 11,
+ 12,
+ 13,
+ 14,
+ 15,
+ 16,
+ 17,
+ 18,
+ 19,
+ 20,
+ 21,
+ 22,
+ 23,
+ 24,
+ 25,
+ 26,
+ 27,
+ 28,
+ 29,
+ 30,
+ 31,
+ 32,
+ 33,
+ 34,
+ 35,
+ 36,
+ 37,
+ 38,
+ 39
+ ],
+ "modality_keys": [
+ "left_wrist_eef_9d",
+ "right_wrist_eef_9d",
+ "left_hand",
+ "right_hand",
+ "left_arm",
+ "right_arm",
+ "waist",
+ "base_height_command",
+ "navigate_command"
+ ],
+ "action_configs": [
+ {
+ "rep": "RELATIVE",
+ "type": "EEF",
+ "format": "XYZ_ROT6D",
+ "state_key": "left_wrist_eef_9d"
+ },
+ {
+ "rep": "RELATIVE",
+ "type": "EEF",
+ "format": "XYZ_ROT6D",
+ "state_key": "right_wrist_eef_9d"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "left_hand"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "right_hand"
+ },
+ {
+ "rep": "RELATIVE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "left_arm"
+ },
+ {
+ "rep": "RELATIVE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "right_arm"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "waist"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "base_height_command"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "navigate_command"
+ }
+ ]
+ },
+ "language": {
+ "delta_indices": [
+ 0
+ ],
+ "modality_keys": [
+ "annotation.human.task_description"
+ ]
+ }
+ },
+ "real_r1_pro_sharpa_relative_eef_mecka": {
+ "video": {
+ "delta_indices": [
+ -30,
+ 0
+ ],
+ "modality_keys": [
+ "ego_view_cropratio_res320x240_freq30"
+ ]
+ },
+ "state": {
+ "delta_indices": [
+ 0
+ ],
+ "modality_keys": [
+ "left_wrist_eef",
+ "right_wrist_eef",
+ "left_hand_joints",
+ "right_hand_joints"
+ ]
+ },
+ "action": {
+ "delta_indices": [
+ 0,
+ 1,
+ 2,
+ 3,
+ 4,
+ 5,
+ 6,
+ 7,
+ 8,
+ 9,
+ 10,
+ 11,
+ 12,
+ 13,
+ 14,
+ 15,
+ 16,
+ 17,
+ 18,
+ 19,
+ 20,
+ 21,
+ 22,
+ 23,
+ 24,
+ 25,
+ 26,
+ 27,
+ 28,
+ 29,
+ 30,
+ 31,
+ 32,
+ 33,
+ 34,
+ 35,
+ 36,
+ 37,
+ 38,
+ 39
+ ],
+ "modality_keys": [
+ "left_wrist_eef",
+ "right_wrist_eef",
+ "left_hand_joints",
+ "right_hand_joints"
+ ],
+ "action_configs": [
+ {
+ "rep": "RELATIVE",
+ "type": "EEF",
+ "format": "XYZ_ROT6D",
+ "state_key": "left_wrist_eef"
+ },
+ {
+ "rep": "RELATIVE",
+ "type": "EEF",
+ "format": "XYZ_ROT6D",
+ "state_key": "right_wrist_eef"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "left_hand_joints"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "right_hand_joints"
+ }
+ ]
+ },
+ "language": {
+ "delta_indices": [
+ 0
+ ],
+ "modality_keys": [
+ "annotation.human.coarse_action"
+ ]
+ }
+ },
+ "real_r1_pro_sharpa_relative_eef_human": {
+ "video": {
+ "delta_indices": [
+ -20,
+ 0
+ ],
+ "modality_keys": [
+ "ego_view_res320x240_freq20",
+ "left_wrist_view_res320x240_freq20",
+ "right_wrist_view_res320x240_freq20"
+ ]
+ },
+ "state": {
+ "delta_indices": [
+ 0
+ ],
+ "modality_keys": [
+ "left_wrist_eef",
+ "right_wrist_eef",
+ "left_hand_joints",
+ "right_hand_joints"
+ ]
+ },
+ "action": {
+ "delta_indices": [
+ 0,
+ 1,
+ 2,
+ 3,
+ 4,
+ 5,
+ 6,
+ 7,
+ 8,
+ 9,
+ 10,
+ 11,
+ 12,
+ 13,
+ 14,
+ 15,
+ 16,
+ 17,
+ 18,
+ 19,
+ 20,
+ 21,
+ 22,
+ 23,
+ 24,
+ 25,
+ 26,
+ 27,
+ 28,
+ 29,
+ 30,
+ 31,
+ 32,
+ 33,
+ 34,
+ 35,
+ 36,
+ 37,
+ 38,
+ 39
+ ],
+ "modality_keys": [
+ "left_wrist_eef",
+ "right_wrist_eef",
+ "left_hand_joints",
+ "right_hand_joints"
+ ],
+ "action_configs": [
+ {
+ "rep": "RELATIVE",
+ "type": "EEF",
+ "format": "XYZ_ROT6D",
+ "state_key": "left_wrist_eef"
+ },
+ {
+ "rep": "RELATIVE",
+ "type": "EEF",
+ "format": "XYZ_ROT6D",
+ "state_key": "right_wrist_eef"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "left_hand_joints"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "right_hand_joints"
+ }
+ ]
+ },
+ "language": {
+ "delta_indices": [
+ 0
+ ],
+ "modality_keys": [
+ "annotation.human.coarse_action"
+ ]
+ }
+ },
+ "real_r1_pro_sharpa_relative_eef": {
+ "video": {
+ "delta_indices": [
+ -20,
+ 0
+ ],
+ "modality_keys": [
+ "ego_view_res320x240_freq20",
+ "left_wrist_view_res320x240_freq20",
+ "right_wrist_view_res320x240_freq20"
+ ]
+ },
+ "state": {
+ "delta_indices": [
+ 0
+ ],
+ "modality_keys": [
+ "left_wrist_eef",
+ "right_wrist_eef",
+ "left_hand_joints",
+ "right_hand_joints"
+ ]
+ },
+ "action": {
+ "delta_indices": [
+ 0,
+ 1,
+ 2,
+ 3,
+ 4,
+ 5,
+ 6,
+ 7,
+ 8,
+ 9,
+ 10,
+ 11,
+ 12,
+ 13,
+ 14,
+ 15,
+ 16,
+ 17,
+ 18,
+ 19,
+ 20,
+ 21,
+ 22,
+ 23,
+ 24,
+ 25,
+ 26,
+ 27,
+ 28,
+ 29,
+ 30,
+ 31,
+ 32,
+ 33,
+ 34,
+ 35,
+ 36,
+ 37,
+ 38,
+ 39
+ ],
+ "modality_keys": [
+ "left_wrist_eef",
+ "right_wrist_eef",
+ "left_hand_joints",
+ "right_hand_joints"
+ ],
+ "action_configs": [
+ {
+ "rep": "RELATIVE",
+ "type": "EEF",
+ "format": "XYZ_ROT6D",
+ "state_key": "left_wrist_eef"
+ },
+ {
+ "rep": "RELATIVE",
+ "type": "EEF",
+ "format": "XYZ_ROT6D",
+ "state_key": "right_wrist_eef"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "left_hand_joints"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "right_hand_joints"
+ }
+ ]
+ },
+ "language": {
+ "delta_indices": [
+ 0
+ ],
+ "modality_keys": [
+ "annotation.human.coarse_action"
+ ]
+ }
+ },
+ "xdof_relative_eef_relative_joint": {
+ "video": {
+ "delta_indices": [
+ -30,
+ 0
+ ],
+ "modality_keys": [
+ "top_camera-images-rgb_320_240",
+ "left_camera-images-rgb_320_240",
+ "right_camera-images-rgb_320_240"
+ ]
+ },
+ "state": {
+ "delta_indices": [
+ 0
+ ],
+ "modality_keys": [
+ "left_wrist_eef",
+ "right_wrist_eef",
+ "left_gripper_pos",
+ "right_gripper_pos",
+ "left_joint_pos",
+ "right_joint_pos"
+ ]
+ },
+ "action": {
+ "delta_indices": [
+ 0,
+ 1,
+ 2,
+ 3,
+ 4,
+ 5,
+ 6,
+ 7,
+ 8,
+ 9,
+ 10,
+ 11,
+ 12,
+ 13,
+ 14,
+ 15,
+ 16,
+ 17,
+ 18,
+ 19,
+ 20,
+ 21,
+ 22,
+ 23,
+ 24,
+ 25,
+ 26,
+ 27,
+ 28,
+ 29,
+ 30,
+ 31,
+ 32,
+ 33,
+ 34,
+ 35,
+ 36,
+ 37,
+ 38,
+ 39
+ ],
+ "modality_keys": [
+ "left_wrist_eef",
+ "right_wrist_eef",
+ "left_gripper_pos",
+ "right_gripper_pos",
+ "left_joint_pos",
+ "right_joint_pos"
+ ],
+ "action_configs": [
+ {
+ "rep": "RELATIVE",
+ "type": "EEF",
+ "format": "XYZ_ROT6D",
+ "state_key": "left_wrist_eef"
+ },
+ {
+ "rep": "RELATIVE",
+ "type": "EEF",
+ "format": "XYZ_ROT6D",
+ "state_key": "right_wrist_eef"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "left_gripper_pos"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "right_gripper_pos"
+ },
+ {
+ "rep": "RELATIVE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "left_joint_pos"
+ },
+ {
+ "rep": "RELATIVE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "right_joint_pos"
+ }
+ ]
+ },
+ "language": {
+ "delta_indices": [
+ 0
+ ],
+ "modality_keys": [
+ "annotation.task"
+ ]
+ }
+ },
+ "real_r1_pro_sharpa_relative_eef_maxinsights": {
+ "video": {
+ "delta_indices": [
+ -30,
+ 0
+ ],
+ "modality_keys": [
+ "ego_view_cropratio_res320x240_freq30"
+ ]
+ },
+ "state": {
+ "delta_indices": [
+ 0
+ ],
+ "modality_keys": [
+ "left_wrist_eef",
+ "right_wrist_eef",
+ "left_hand_joints",
+ "right_hand_joints"
+ ]
+ },
+ "action": {
+ "delta_indices": [
+ 0,
+ 1,
+ 2,
+ 3,
+ 4,
+ 5,
+ 6,
+ 7,
+ 8,
+ 9,
+ 10,
+ 11,
+ 12,
+ 13,
+ 14,
+ 15,
+ 16,
+ 17,
+ 18,
+ 19,
+ 20,
+ 21,
+ 22,
+ 23,
+ 24,
+ 25,
+ 26,
+ 27,
+ 28,
+ 29,
+ 30,
+ 31,
+ 32,
+ 33,
+ 34,
+ 35,
+ 36,
+ 37,
+ 38,
+ 39
+ ],
+ "modality_keys": [
+ "left_wrist_eef",
+ "right_wrist_eef",
+ "left_hand_joints",
+ "right_hand_joints"
+ ],
+ "action_configs": [
+ {
+ "rep": "RELATIVE",
+ "type": "EEF",
+ "format": "XYZ_ROT6D",
+ "state_key": "left_wrist_eef"
+ },
+ {
+ "rep": "RELATIVE",
+ "type": "EEF",
+ "format": "XYZ_ROT6D",
+ "state_key": "right_wrist_eef"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "left_hand_joints"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "right_hand_joints"
+ }
+ ]
+ },
+ "language": {
+ "delta_indices": [
+ 0
+ ],
+ "modality_keys": [
+ "annotation.human.coarse_action"
+ ]
+ }
+ },
+ "xdof_relative_eef_relative_joint_subtask": {
+ "video": {
+ "delta_indices": [
+ -30,
+ 0
+ ],
+ "modality_keys": [
+ "top_camera-images-rgb_320_240",
+ "left_camera-images-rgb_320_240",
+ "right_camera-images-rgb_320_240"
+ ]
+ },
+ "state": {
+ "delta_indices": [
+ 0
+ ],
+ "modality_keys": [
+ "left_wrist_eef",
+ "right_wrist_eef",
+ "left_gripper_pos",
+ "right_gripper_pos",
+ "left_joint_pos",
+ "right_joint_pos"
+ ]
+ },
+ "action": {
+ "delta_indices": [
+ 0,
+ 1,
+ 2,
+ 3,
+ 4,
+ 5,
+ 6,
+ 7,
+ 8,
+ 9,
+ 10,
+ 11,
+ 12,
+ 13,
+ 14,
+ 15,
+ 16,
+ 17,
+ 18,
+ 19,
+ 20,
+ 21,
+ 22,
+ 23,
+ 24,
+ 25,
+ 26,
+ 27,
+ 28,
+ 29,
+ 30,
+ 31,
+ 32,
+ 33,
+ 34,
+ 35,
+ 36,
+ 37,
+ 38,
+ 39
+ ],
+ "modality_keys": [
+ "left_wrist_eef",
+ "right_wrist_eef",
+ "left_gripper_pos",
+ "right_gripper_pos",
+ "left_joint_pos",
+ "right_joint_pos"
+ ],
+ "action_configs": [
+ {
+ "rep": "RELATIVE",
+ "type": "EEF",
+ "format": "XYZ_ROT6D",
+ "state_key": "left_wrist_eef"
+ },
+ {
+ "rep": "RELATIVE",
+ "type": "EEF",
+ "format": "XYZ_ROT6D",
+ "state_key": "right_wrist_eef"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "left_gripper_pos"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "right_gripper_pos"
+ },
+ {
+ "rep": "RELATIVE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "left_joint_pos"
+ },
+ {
+ "rep": "RELATIVE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "right_joint_pos"
+ }
+ ]
+ },
+ "language": {
+ "delta_indices": [
+ 0
+ ],
+ "modality_keys": [
+ "annotation.sub_task"
+ ]
+ }
+ },
+ "oxe_droid_relative_eef_relative_joint": {
+ "video": {
+ "delta_indices": [
+ -15,
+ 0
+ ],
+ "modality_keys": [
+ "exterior_image_1_left",
+ "wrist_image_left"
+ ]
+ },
+ "state": {
+ "delta_indices": [
+ 0
+ ],
+ "modality_keys": [
+ "eef_9d",
+ "gripper_position",
+ "joint_position"
+ ]
+ },
+ "action": {
+ "delta_indices": [
+ 0,
+ 1,
+ 2,
+ 3,
+ 4,
+ 5,
+ 6,
+ 7,
+ 8,
+ 9,
+ 10,
+ 11,
+ 12,
+ 13,
+ 14,
+ 15,
+ 16,
+ 17,
+ 18,
+ 19,
+ 20,
+ 21,
+ 22,
+ 23,
+ 24,
+ 25,
+ 26,
+ 27,
+ 28,
+ 29,
+ 30,
+ 31,
+ 32,
+ 33,
+ 34,
+ 35,
+ 36,
+ 37,
+ 38,
+ 39
+ ],
+ "modality_keys": [
+ "eef_9d",
+ "gripper_position",
+ "joint_position"
+ ],
+ "action_configs": [
+ {
+ "rep": "RELATIVE",
+ "type": "EEF",
+ "format": "XYZ_ROT6D",
+ "state_key": "eef_9d"
+ },
+ {
+ "rep": "ABSOLUTE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "gripper_position"
+ },
+ {
+ "rep": "RELATIVE",
+ "type": "NON_EEF",
+ "format": "DEFAULT",
+ "state_key": "joint_position"
+ }
+ ]
+ },
+ "language": {
+ "delta_indices": [
+ 0
+ ],
+ "modality_keys": [
+ "annotation.language.language_instruction"
+ ]
+ }
+ }
924
+ },
925
+ "use_percentiles": true,
926
+ "use_mean_std": false,
927
+ "image_crop_size": [
928
+ 230,
929
+ 230
930
+ ],
931
+ "image_target_size": [
932
+ 256,
933
+ 256
934
+ ],
935
+ "formalize_language": true,
936
+ "max_state_dim": 132,
937
+ "max_action_dim": 132,
938
+ "apply_sincos_state_encoding": false,
939
+ "color_jitter_params": {
940
+ "brightness": 0.3,
941
+ "contrast": 0.4,
942
+ "saturation": 0.5,
943
+ "hue": 0.08
944
+ },
945
+ "random_rotation_angle": 0,
946
+ "letter_box_transform": false,
947
+ "exclude_state": false,
948
+ "state_dropout_prob": 0.2,
949
+ "use_albumentations": true,
950
+ "shortest_image_edge": 256,
951
+ "crop_fraction": 0.95,
952
+ "max_action_horizon": 40,
953
+ "use_relative_action": true
954
+ }
955
+ }
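The `delta_indices` lists in the config above read as offsets relative to the current timestep: video uses `[-15, 0]` (one past frame plus the current one) and action uses `0..39`, which matches `max_action_horizon: 40`. Below is a minimal sketch of that interpretation. The helper name `resolve_delta_indices` is hypothetical, and clamping to the episode boundaries is an assumption for illustration, not necessarily what the GR00T data loader does.

```python
from typing import List


def resolve_delta_indices(current_step: int, delta_indices: List[int], episode_length: int) -> List[int]:
    """Map relative offsets to absolute step indices.

    Assumption: out-of-range steps are clamped to the first/last step of the episode.
    """
    last = episode_length - 1
    return [min(max(current_step + d, 0), last) for d in delta_indices]


# Values copied from the config above.
video_deltas = [-15, 0]
action_deltas = list(range(40))  # 0..39, i.e. max_action_horizon = 40

video_steps = resolve_delta_indices(100, video_deltas, episode_length=500)
action_steps = resolve_delta_indices(100, action_deltas, episode_length=500)

print(video_steps)  # [85, 100]
print(len(action_steps), action_steps[0], action_steps[-1])  # 40 100 139
```

Under this reading, each sample pairs two video frames with a 40-step action chunk starting at the current step.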
scheduler.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5ae78127cbbf7977166b80c05e2b2a0f95f6a251fd8b8fbac5f86c0cfe018ffe
3
+ size 1263
statistics.json ADDED
The diff for this file is too large to render. See raw diff
 
trainer_state.json ADDED
The diff for this file is too large to render. See raw diff
 
training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3c45bcc6f4918ba6ffc8c1dd3aca5d223778a483fcede2c19948cc43f206fc0d
3
+ size 8259
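The three-line bodies of `scheduler.pt` and `training_args.bin` above are Git LFS pointer files rather than the binaries themselves: a spec version, a content hash, and a byte size. As a rough sketch, a pointer can be parsed as below. The function name `parse_lfs_pointer` is hypothetical, and the parser assumes the simple `key value` line layout shown in this diff.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, hash algorithm, digest and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>"; split on the first space only.
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from training_args.bin in this commit.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:3c45bcc6f4918ba6ffc8c1dd3aca5d223778a483fcede2c19948cc43f206fc0d
size 8259
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])  # sha256 8259
```

After downloading the real file, comparing its SHA-256 digest and byte size against these fields is one way to confirm the download is intact.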
wandb_config.json ADDED
@@ -0,0 +1 @@
1
+ {"project": "human_pretraining_n15_galaxea_sharpa", "run_id": "pretrain_n17_qwen3vl_2b_finetuned_sft_albumentations_2frame_abshand_frozen_vlm_double_dit_vl_interleaved_lr5e-5_200ksteps_batchsize8192"}
zero_to_fp32.py ADDED
@@ -0,0 +1,760 @@
1
+ #!/usr/bin/env python
2
+
3
+ # Copyright (c) Microsoft Corporation.
4
+ # SPDX-License-Identifier: Apache-2.0
5
+
6
+ # DeepSpeed Team
7
+
8
+ # This script extracts fp32 consolidated weights from ZeRO stage 1, 2 and 3 DeepSpeed checkpoints. It gets
9
+ # copied into the top level checkpoint dir, so the user can easily do the conversion at any point in
10
+ # the future. Once extracted, the weights don't require DeepSpeed and can be used in any
11
+ # application.
12
+ #
13
+ # example:
14
+ # python zero_to_fp32.py . output_dir/
15
+ # or
16
+ # python zero_to_fp32.py . output_dir/ --safe_serialization
17
+
18
+ import argparse
19
+ import torch
20
+ import glob
21
+ import math
22
+ import os
23
+ import re
24
+ import gc
25
+ import json
26
+ import numpy as np
27
+ from tqdm import tqdm
28
+ from collections import OrderedDict
29
+ from dataclasses import dataclass
30
+
31
+ # while this script doesn't use deepspeed to recover data, since the checkpoints are pickled with
32
+ # DeepSpeed data structures it has to be available in the current python environment.
33
+ from deepspeed.utils import logger
34
+ from deepspeed.checkpoint.constants import (DS_VERSION, OPTIMIZER_STATE_DICT, SINGLE_PARTITION_OF_FP32_GROUPS,
35
+ FP32_FLAT_GROUPS, ZERO_STAGE, PARTITION_COUNT, PARAM_SHAPES, BUFFER_NAMES,
36
+ FROZEN_PARAM_SHAPES, FROZEN_PARAM_FRAGMENTS)
37
+
38
+
39
+ @dataclass
40
+ class zero_model_state:
41
+ buffers: dict()
42
+ param_shapes: dict()
43
+ shared_params: list
44
+ ds_version: int
45
+ frozen_param_shapes: dict()
46
+ frozen_param_fragments: dict()
47
+
48
+
49
+ debug = 0
50
+
51
+ # load to cpu
52
+ device = torch.device('cpu')
53
+
54
+
55
+ def atoi(text):
56
+ return int(text) if text.isdigit() else text
57
+
58
+
59
+ def natural_keys(text):
60
+ '''
61
+ alist.sort(key=natural_keys) sorts in human order
62
+ http://nedbatchelder.com/blog/200712/human_sorting.html
63
+ (See Toothy's implementation in the comments)
64
+ '''
65
+ return [atoi(c) for c in re.split(r'(\d+)', text)]
66
+
67
+
68
+ def get_model_state_file(checkpoint_dir, zero_stage):
69
+ if not os.path.isdir(checkpoint_dir):
70
+ raise FileNotFoundError(f"Directory '{checkpoint_dir}' doesn't exist")
71
+
72
+ # there should be only one file
73
+ if zero_stage <= 2:
74
+ file = os.path.join(checkpoint_dir, "mp_rank_00_model_states.pt")
75
+ elif zero_stage == 3:
76
+ file = os.path.join(checkpoint_dir, "zero_pp_rank_0_mp_rank_00_model_states.pt")
77
+
78
+ if not os.path.exists(file):
79
+ raise FileNotFoundError(f"can't find model states file at '{file}'")
80
+
81
+ return file
82
+
83
+
84
+ def get_checkpoint_files(checkpoint_dir, glob_pattern):
85
+ # XXX: need to test that this simple glob rule works for multi-node setup too
86
+ ckpt_files = sorted(glob.glob(os.path.join(checkpoint_dir, glob_pattern)), key=natural_keys)
87
+
88
+ if len(ckpt_files) == 0:
89
+ raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
90
+
91
+ return ckpt_files
92
+
93
+
94
+ def get_optim_files(checkpoint_dir):
95
+ return get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
96
+
97
+
98
+ def get_model_state_files(checkpoint_dir):
99
+ return get_checkpoint_files(checkpoint_dir, "*_model_states.pt")
100
+
101
+
102
+ def parse_model_states(files):
103
+ zero_model_states = []
104
+ for file in files:
105
+ state_dict = torch.load(file, map_location=device, weights_only=False)
106
+
107
+ if BUFFER_NAMES not in state_dict:
108
+ raise ValueError(f"{file} is not a model state checkpoint")
109
+ buffer_names = state_dict[BUFFER_NAMES]
110
+ if debug:
111
+ print("Found buffers:", buffer_names)
112
+
113
+ # recover just the buffers while restoring them to fp32 if they were saved in fp16
114
+ buffers = {k: v.float() for k, v in state_dict["module"].items() if k in buffer_names}
115
+ param_shapes = state_dict[PARAM_SHAPES]
116
+
117
+ # collect parameters that are included in param_shapes
118
+ param_names = []
119
+ for s in param_shapes:
120
+ for name in s.keys():
121
+ param_names.append(name)
122
+
123
+ # update with frozen parameters
124
+ frozen_param_shapes = state_dict.get(FROZEN_PARAM_SHAPES, None)
125
+ if frozen_param_shapes is not None:
126
+ if debug:
127
+ print(f"Found frozen_param_shapes: {frozen_param_shapes}")
128
+ param_names += list(frozen_param_shapes.keys())
129
+
130
+ # handle shared params
131
+ shared_params = [[k, v] for k, v in state_dict["shared_params"].items()]
132
+
133
+ ds_version = state_dict.get(DS_VERSION, None)
134
+
135
+ frozen_param_fragments = state_dict.get(FROZEN_PARAM_FRAGMENTS, None)
136
+
137
+ z_model_state = zero_model_state(buffers=buffers,
138
+ param_shapes=param_shapes,
139
+ shared_params=shared_params,
140
+ ds_version=ds_version,
141
+ frozen_param_shapes=frozen_param_shapes,
142
+ frozen_param_fragments=frozen_param_fragments)
143
+ zero_model_states.append(z_model_state)
144
+
145
+ return zero_model_states
146
+
147
+
148
+ def parse_optim_states(files, ds_checkpoint_dir):
149
+ total_files = len(files)
150
+ state_dicts = []
151
+ for f in tqdm(files, desc='Loading checkpoint shards'):
152
+ state_dict = torch.load(f, map_location=device, mmap=True, weights_only=False)
153
+ # immediately discard the potentially huge 2 optimizer states as we only care for fp32 master weights
154
+ # and also handle the case where it was already removed by another helper script
155
+ state_dict["optimizer_state_dict"].pop("optimizer_state_dict", None)
156
+ state_dicts.append(state_dict)
157
+
158
+ if ZERO_STAGE not in state_dicts[0][OPTIMIZER_STATE_DICT]:
159
+ raise ValueError(f"{files[0]} is not a zero checkpoint")
160
+ zero_stage = state_dicts[0][OPTIMIZER_STATE_DICT][ZERO_STAGE]
161
+ world_size = state_dicts[0][OPTIMIZER_STATE_DICT][PARTITION_COUNT]
162
+
163
+ # For ZeRO-2 each param group can have different partition_count as data parallelism for expert
164
+ # parameters can be different from data parallelism for non-expert parameters. So we can just
165
+ # use the max of the partition_count to get the dp world_size.
166
+
167
+ if type(world_size) is list:
168
+ world_size = max(world_size)
169
+
170
+ if world_size != total_files:
171
+ raise ValueError(
172
+ f"Expected {world_size} of '*_optim_states.pt' under '{ds_checkpoint_dir}' but found {total_files} files. "
173
+ "Possibly due to an overwrite of an old checkpoint, or a checkpoint didn't get saved by one or more processes."
174
+ )
175
+
176
+ # the groups are named differently in each stage
177
+ if zero_stage <= 2:
178
+ fp32_groups_key = SINGLE_PARTITION_OF_FP32_GROUPS
179
+ elif zero_stage == 3:
180
+ fp32_groups_key = FP32_FLAT_GROUPS
181
+ else:
182
+ raise ValueError(f"unknown zero stage {zero_stage}")
183
+
184
+ fp32_flat_groups = [state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key] for i in range(len(state_dicts))]
185
+ return zero_stage, world_size, fp32_flat_groups
186
+
187
+
188
+ def _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters):
189
+ """
190
+ Returns fp32 state_dict reconstructed from ds checkpoint
191
+
192
+ Args:
193
+ - ``ds_checkpoint_dir``: path to the deepspeed checkpoint folder (where the optimizer files are)
194
+
195
+ """
196
+ print(f"Processing zero checkpoint '{ds_checkpoint_dir}'")
197
+
198
+ optim_files = get_optim_files(ds_checkpoint_dir)
199
+ zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
200
+ print(f"Detected checkpoint of type zero stage {zero_stage}, world_size: {world_size}")
201
+
202
+ model_files = get_model_state_files(ds_checkpoint_dir)
203
+
204
+ zero_model_states = parse_model_states(model_files)
205
+ print(f'Parsing checkpoint created by deepspeed=={zero_model_states[0].ds_version}')
206
+
207
+ if zero_stage <= 2:
208
+ return _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
209
+ exclude_frozen_parameters)
210
+ elif zero_stage == 3:
211
+ return _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
212
+ exclude_frozen_parameters)
213
+
214
+
215
+ def _zero2_merge_frozen_params(state_dict, zero_model_states):
216
+ if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
217
+ return
218
+
219
+ frozen_param_shapes = zero_model_states[0].frozen_param_shapes
220
+ frozen_param_fragments = zero_model_states[0].frozen_param_fragments
221
+
222
+ if debug:
223
+ num_elem = sum(s.numel() for s in frozen_param_shapes.values())
224
+ print(f'rank 0: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
225
+
226
+ wanted_params = len(frozen_param_shapes)
227
+ wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
228
+ avail_numel = sum([p.numel() for p in frozen_param_fragments.values()])
229
+ print(f'Frozen params: Have {avail_numel} numels to process.')
230
+ print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
231
+
232
+ total_params = 0
233
+ total_numel = 0
234
+ for name, shape in frozen_param_shapes.items():
235
+ total_params += 1
236
+ unpartitioned_numel = shape.numel()
237
+ total_numel += unpartitioned_numel
238
+
239
+ state_dict[name] = frozen_param_fragments[name]
240
+
241
+ if debug:
242
+ print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
243
+
244
+ print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
245
+
246
+
247
+ def _has_callable(obj, fn):
248
+ attr = getattr(obj, fn, None)
249
+ return callable(attr)
250
+
251
+
252
+ def _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
253
+ param_shapes = zero_model_states[0].param_shapes
254
+
255
+ # Reconstruction protocol:
256
+ #
257
+ # XXX: document this
258
+
259
+ if debug:
260
+ for i in range(world_size):
261
+ for j in range(len(fp32_flat_groups[0])):
262
+ print(f"{FP32_FLAT_GROUPS}[{i}][{j}].shape={fp32_flat_groups[i][j].shape}")
263
+
264
+ # XXX: memory usage doubles here (zero2)
265
+ num_param_groups = len(fp32_flat_groups[0])
266
+ merged_single_partition_of_fp32_groups = []
267
+ for i in range(num_param_groups):
268
+ merged_partitions = [sd[i] for sd in fp32_flat_groups]
269
+ full_single_fp32_vector = torch.cat(merged_partitions, 0)
270
+ merged_single_partition_of_fp32_groups.append(full_single_fp32_vector)
271
+ avail_numel = sum(
272
+ [full_single_fp32_vector.numel() for full_single_fp32_vector in merged_single_partition_of_fp32_groups])
273
+
274
+ if debug:
275
+ wanted_params = sum([len(shapes) for shapes in param_shapes])
276
+ wanted_numel = sum([sum(shape.numel() for shape in shapes.values()) for shapes in param_shapes])
277
+ # not asserting if there is a mismatch due to possible padding
278
+ print(f"Have {avail_numel} numels to process.")
279
+ print(f"Need {wanted_numel} numels in {wanted_params} params.")
280
+
281
+ # params
282
+ # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
283
+ # out-of-core computing solution
284
+ total_numel = 0
285
+ total_params = 0
286
+ for shapes, full_single_fp32_vector in zip(param_shapes, merged_single_partition_of_fp32_groups):
287
+ offset = 0
288
+ avail_numel = full_single_fp32_vector.numel()
289
+ for name, shape in shapes.items():
290
+
291
+ unpartitioned_numel = shape.numel() if _has_callable(shape, 'numel') else math.prod(shape)
292
+ total_numel += unpartitioned_numel
293
+ total_params += 1
294
+
295
+ if debug:
296
+ print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
297
+ state_dict[name] = full_single_fp32_vector.narrow(0, offset, unpartitioned_numel).view(shape)
298
+ offset += unpartitioned_numel
299
+
300
+ # Z2 started to align to 2*world_size to improve nccl performance. Therefore both offset and
301
+ # avail_numel can differ by anywhere between 0..2*world_size. Due to two unrelated complex
302
+ # paddings performed in the code it's almost impossible to predict the exact numbers w/o the
303
+ # live optimizer object, so we are checking that the numbers are within the right range
304
+ align_to = 2 * world_size
305
+
306
+ def zero2_align(x):
307
+ return align_to * math.ceil(x / align_to)
308
+
309
+ if debug:
310
+ print(f"original offset={offset}, avail_numel={avail_numel}")
311
+
312
+ offset = zero2_align(offset)
313
+ avail_numel = zero2_align(avail_numel)
314
+
315
+ if debug:
316
+ print(f"aligned offset={offset}, avail_numel={avail_numel}")
317
+
318
+ # Sanity check
319
+ if offset != avail_numel:
320
+ raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
321
+
322
+ print(f"Reconstructed fp32 state dict with {total_params} params {total_numel} elements")
323
+
324
+
325
+ def _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
326
+ exclude_frozen_parameters):
327
+ state_dict = OrderedDict()
328
+
329
+ # buffers
330
+ buffers = zero_model_states[0].buffers
331
+ state_dict.update(buffers)
332
+ if debug:
333
+ print(f"added {len(buffers)} buffers")
334
+
335
+ if not exclude_frozen_parameters:
336
+ _zero2_merge_frozen_params(state_dict, zero_model_states)
337
+
338
+ _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
339
+
340
+ # recover shared parameters
341
+ for pair in zero_model_states[0].shared_params:
342
+ if pair[1] in state_dict:
343
+ state_dict[pair[0]] = state_dict[pair[1]]
344
+
345
+ return state_dict
346
+
347
+
348
+ def zero3_partitioned_param_info(unpartitioned_numel, world_size):
349
+ remainder = unpartitioned_numel % world_size
350
+ padding_numel = (world_size - remainder) if remainder else 0
351
+ partitioned_numel = math.ceil(unpartitioned_numel / world_size)
352
+ return partitioned_numel, padding_numel
353
+
354
+
355
+ def _zero3_merge_frozen_params(state_dict, world_size, zero_model_states):
356
+ if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
357
+ return
358
+
359
+ if debug:
360
+ for i in range(world_size):
361
+ num_elem = sum(s.numel() for s in zero_model_states[i].frozen_param_fragments.values())
362
+ print(f'rank {i}: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
363
+
364
+ frozen_param_shapes = zero_model_states[0].frozen_param_shapes
365
+ wanted_params = len(frozen_param_shapes)
366
+ wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
367
+ avail_numel = sum([p.numel() for p in zero_model_states[0].frozen_param_fragments.values()]) * world_size
368
+ print(f'Frozen params: Have {avail_numel} numels to process.')
369
+ print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
370
+
371
+ total_params = 0
372
+ total_numel = 0
373
+ for name, shape in zero_model_states[0].frozen_param_shapes.items():
374
+ total_params += 1
375
+ unpartitioned_numel = shape.numel()
376
+ total_numel += unpartitioned_numel
377
+
378
+ param_frags = tuple(model_state.frozen_param_fragments[name] for model_state in zero_model_states)
379
+ state_dict[name] = torch.cat(param_frags, 0).narrow(0, 0, unpartitioned_numel).view(shape)
380
+
381
+ partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
382
+
383
+ if debug:
384
+ print(
385
+ f"Frozen params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
386
+ )
387
+
388
+ print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
389
+
390
+
391
+ class GatheredTensor:
392
+ """
393
+ A pseudo tensor that collects partitioned weights.
394
+ It is more memory efficient when there are multiple groups.
395
+ """
396
+
397
+ def __init__(self, flat_groups, flat_groups_offset, offset, partitioned_numel, shape):
398
+ self.flat_groups = flat_groups
399
+ self.flat_groups_offset = flat_groups_offset
400
+ self.offset = offset
401
+ self.partitioned_numel = partitioned_numel
402
+ self.shape = shape
403
+ self.dtype = self.flat_groups[0][0].dtype
404
+
405
+ def contiguous(self):
406
+ """
407
+ Merge partitioned weights from flat_groups into a single tensor.
408
+ """
409
+ end_idx = self.offset + self.partitioned_numel
410
+ world_size = len(self.flat_groups)
411
+ pad_flat_param_chunks = []
412
+
413
+ for rank_i in range(world_size):
414
+ # for each rank, we need to collect weights from related group/groups
415
+ flat_groups_at_rank_i = self.flat_groups[rank_i]
416
+ start_group_id = None
417
+ end_group_id = None
418
+ for group_id in range(len(self.flat_groups_offset)):
419
+ if self.flat_groups_offset[group_id] <= self.offset < self.flat_groups_offset[group_id + 1]:
420
+ start_group_id = group_id
421
+ if self.flat_groups_offset[group_id] < end_idx <= self.flat_groups_offset[group_id + 1]:
422
+ end_group_id = group_id
423
+ break
424
+ # collect weights from related group/groups
425
+ for group_id in range(start_group_id, end_group_id + 1):
426
+ flat_tensor = flat_groups_at_rank_i[group_id]
427
+ start_offset = self.offset - self.flat_groups_offset[group_id]
428
+ end_offset = min(end_idx, self.flat_groups_offset[group_id + 1]) - self.flat_groups_offset[group_id]
429
+ pad_flat_param_chunks.append(flat_tensor[start_offset:end_offset])
430
+
431
+ # collect weights from all ranks
432
+ pad_flat_param = torch.cat(pad_flat_param_chunks, dim=0)
433
+ param = pad_flat_param[:self.shape.numel()].view(self.shape).contiguous()
434
+ return param
435
+
436
+
437
+ def _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
438
+ param_shapes = zero_model_states[0].param_shapes
439
+ avail_numel = sum([flat_group.numel() for flat_group in fp32_flat_groups[0]]) * world_size
440
+
441
+ # Reconstruction protocol: For zero3 we need to zip the partitions together at boundary of each
442
+ # param, re-consolidating each param, while dealing with padding if any
443
+
444
+ # merge list of dicts, preserving order
445
+ param_shapes = {k: v for d in param_shapes for k, v in d.items()}
446
+
447
+ if debug:
448
+ for i in range(world_size):
449
+ print(f"{FP32_FLAT_GROUPS}[{i}].shape={fp32_flat_groups[i].shape}")
450
+
451
+ wanted_params = len(param_shapes)
452
+ wanted_numel = sum(shape.numel() for shape in param_shapes.values())
453
+ # not asserting if there is a mismatch due to possible padding
454
+ avail_numel = fp32_flat_groups[0].numel() * world_size
455
+ print(f"Trainable params: Have {avail_numel} numels to process.")
456
+ print(f"Trainable params: Need {wanted_numel} numels in {wanted_params} params.")
457
+
458
+ # params
459
+ # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
460
+ # out-of-core computing solution
461
+ offset = 0
462
+ total_numel = 0
463
+ total_params = 0
464
+ flat_groups_offset = [0] + list(np.cumsum([flat_tensor.numel() for flat_tensor in fp32_flat_groups[0]]))
465
+ for name, shape in tqdm(param_shapes.items(), desc='Gathering sharded weights'):
466
+ unpartitioned_numel = shape.numel()
467
+ total_numel += unpartitioned_numel
468
+ total_params += 1
469
+ partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
470
+
471
+ if debug:
472
+ print(
473
+ f"Trainable params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
474
+ )
475
+
476
+ # memory efficient tensor
477
+ tensor = GatheredTensor(fp32_flat_groups, flat_groups_offset, offset, partitioned_numel, shape)
478
+ state_dict[name] = tensor
479
+ offset += partitioned_numel
480
+
481
+ offset *= world_size
482
+
483
+ # Sanity check
484
+ if offset != avail_numel:
485
+ raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
486
+
487
+ print(f"Reconstructed Trainable fp32 state dict with {total_params} params {total_numel} elements")
488
+
489
+
490
+ def _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
491
+ exclude_frozen_parameters):
492
+ state_dict = OrderedDict()
493
+
494
+ # buffers
495
+ buffers = zero_model_states[0].buffers
496
+ state_dict.update(buffers)
497
+ if debug:
498
+ print(f"added {len(buffers)} buffers")
499
+
500
+ if not exclude_frozen_parameters:
501
+ _zero3_merge_frozen_params(state_dict, world_size, zero_model_states)
502
+
503
+ _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
504
+
505
+ # recover shared parameters
506
+ for pair in zero_model_states[0].shared_params:
507
+ if pair[1] in state_dict:
508
+ state_dict[pair[0]] = state_dict[pair[1]]
509
+
510
+ return state_dict
511
+
512
+
513
+ def to_torch_tensor(state_dict, return_empty_tensor=False):
514
+ """
515
+ Convert state_dict of GatheredTensor to torch tensor
516
+ """
517
+ torch_state_dict = {}
518
+ converted_tensors = {}
519
+ for name, tensor in state_dict.items():
520
+ tensor_id = id(tensor)
521
+ if tensor_id in converted_tensors: # shared tensors
522
+ shared_tensor = torch_state_dict[converted_tensors[tensor_id]]
523
+ torch_state_dict[name] = shared_tensor
524
+ else:
525
+ converted_tensors[tensor_id] = name
526
+ if return_empty_tensor:
527
+ torch_state_dict[name] = torch.empty(tensor.shape, dtype=tensor.dtype)
528
+ else:
529
+ torch_state_dict[name] = tensor.contiguous()
530
+ return torch_state_dict
531
+
532
+
533
+ def get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir,
534
+ tag=None,
535
+ exclude_frozen_parameters=False,
536
+ lazy_mode=False):
537
+ """
538
+ Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated state_dict that can be loaded with
539
+ ``load_state_dict()`` and used for training without DeepSpeed or shared with others, for example
540
+ via a model hub.
541
+
542
+ Args:
543
+ - ``checkpoint_dir``: path to the desired checkpoint folder
544
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in 'latest' file. e.g., ``global_step14``
545
+ - ``exclude_frozen_parameters``: exclude frozen parameters
546
+ - ``lazy_mode``: get state_dict in lazy mode. It returns a dict of pseudo tensors instead of torch tensors, which is more memory efficient.
547
+ Convert a pseudo tensor to a torch tensor with ``.contiguous()``
548
+
549
+ Returns:
550
+ - pytorch ``state_dict``
551
+
552
+ A typical usage might be ::
553
+
554
+ from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
555
+ # do the training and checkpoint saving
556
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
557
+ model = model.cpu() # move to cpu
558
+ model.load_state_dict(state_dict)
559
+ # submit to model hub or save the model to share with others
560
+
561
+ In this example the ``model`` will no longer be usable in the deepspeed context of the same
562
+ application. i.e. you will need to re-initialize the deepspeed engine, since
563
+ ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
564
+
565
+ If you want it all done for you, use ``load_state_dict_from_zero_checkpoint`` instead.
566
+
567
+ Note: the above usage may not work if your application doesn't have sufficient free CPU memory.
568
+ You may need to use the offline approach using the ``zero_to_fp32.py`` script that is saved with
569
+ the checkpoint. Or you can load state_dict in lazy mode ::
570
+
571
+ from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
572
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, lazy_mode=True) # not on cpu
573
+ for name, lazy_tensor in state_dict.items():
574
+ tensor = lazy_tensor.contiguous() # to cpu
575
+ print(name, tensor)
576
+ # del tensor to release memory if it is no longer in use
577
+ """
578
+ if tag is None:
579
+ latest_path = os.path.join(checkpoint_dir, 'latest')
580
+ if os.path.isfile(latest_path):
581
+ with open(latest_path, 'r') as fd:
582
+ tag = fd.read().strip()
583
+ else:
584
+ raise ValueError(f"Unable to find 'latest' file at {latest_path}")
585
+
586
+ ds_checkpoint_dir = os.path.join(checkpoint_dir, tag)
587
+
588
+ if not os.path.isdir(ds_checkpoint_dir):
589
+ raise FileNotFoundError(f"Directory '{ds_checkpoint_dir}' doesn't exist")
590
+
591
+ state_dict = _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
592
+ if lazy_mode:
593
+ return state_dict
594
+ else:
595
+ return to_torch_tensor(state_dict)
596
+
597
+
598
+ def convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir,
599
+ output_dir,
600
+ max_shard_size="5GB",
601
+ safe_serialization=False,
602
+ tag=None,
603
+ exclude_frozen_parameters=False):
604
+ """
605
+ Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict`` file that can be
606
+ loaded with ``torch.load(file)`` + ``load_state_dict()`` and used for training without DeepSpeed.
607
+
608
+ Args:
609
+ - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
610
+ - ``output_dir``: directory to the pytorch fp32 state_dict output files
611
+ - ``max_shard_size``: the maximum size for a checkpoint before being sharded, default value is 5GB
612
+ - ``safe_serialization``: whether to save the model using `safetensors` or the traditional PyTorch way (that uses `pickle`).
613
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
614
+ - ``exclude_frozen_parameters``: exclude frozen parameters
615
+ """
616
+
617
+ # Dependency pre-check
618
+ if safe_serialization:
619
+ try:
620
+ from safetensors.torch import save_file
621
+ except ImportError:
622
+ print('If you want to use `safe_serialization`, please `pip install safetensors`')
623
+ raise
624
+ if max_shard_size is not None:
625
+ try:
626
+ from huggingface_hub import split_torch_state_dict_into_shards
627
+ except ImportError:
628
+ print('If you want to use `max_shard_size`, please `pip install huggingface_hub`')
629
+ raise
630
+
631
+ # Convert zero checkpoint to state_dict
632
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir,
633
+ tag,
634
+ exclude_frozen_parameters,
635
+ lazy_mode=True)
636
+
637
+ # Shard the model if it is too big.
638
+ weights_name = "model.safetensors" if safe_serialization else "pytorch_model.bin"
639
+ if max_shard_size is not None:
640
+ filename_pattern = weights_name.replace(".bin", "{suffix}.bin").replace(".safetensors", "{suffix}.safetensors")
641
+ # a memory-efficient approach for sharding
642
+ empty_state_dict = to_torch_tensor(state_dict, return_empty_tensor=True)
643
+ state_dict_split = split_torch_state_dict_into_shards(empty_state_dict,
644
+ filename_pattern=filename_pattern,
645
+ max_shard_size=max_shard_size)
646
+ else:
647
+ from collections import namedtuple
648
+ StateDictSplit = namedtuple("StateDictSplit", ["is_sharded", "filename_to_tensors"])
649
+ state_dict_split = StateDictSplit(is_sharded=False,
650
+ filename_to_tensors={weights_name: list(state_dict.keys())})
651
+
652
+ # Save the model by shard
653
+ os.makedirs(output_dir, exist_ok=True)
654
+ filename_to_tensors = state_dict_split.filename_to_tensors.items()
655
+ for shard_file, tensors in tqdm(filename_to_tensors, desc="Saving checkpoint shards"):
656
+ shard_state_dict = {tensor_name: state_dict[tensor_name] for tensor_name in tensors}
657
+ shard_state_dict = to_torch_tensor(shard_state_dict)
658
+ output_path = os.path.join(output_dir, shard_file)
659
+ if safe_serialization:
660
+ save_file(shard_state_dict, output_path, metadata={"format": "pt"})
661
+ else:
662
+ torch.save(shard_state_dict, output_path)
663
+ # release the memory of current shard
664
+ for tensor_name in list(shard_state_dict.keys()):
665
+ del state_dict[tensor_name]
666
+ del shard_state_dict[tensor_name]
667
+ del shard_state_dict
668
+ gc.collect()
669
+
670
+ # Save index if sharded
671
+ if state_dict_split.is_sharded:
672
+ index = {
673
+ "metadata": state_dict_split.metadata,
674
+ "weight_map": state_dict_split.tensor_to_filename,
675
+ }
676
+ save_index_file = "model.safetensors.index.json" if safe_serialization else "pytorch_model.bin.index.json"
677
+ save_index_file = os.path.join(output_dir, save_index_file)
678
+ with open(save_index_file, "w", encoding="utf-8") as f:
679
+ content = json.dumps(index, indent=2, sort_keys=True) + "\n"
680
+ f.write(content)
681
+
682
+
683
+ def load_state_dict_from_zero_checkpoint(model, checkpoint_dir, tag=None):
684
+ """
685
+ 1. Put the provided model to cpu
686
+ 2. Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict``
687
+ 3. Load it into the provided model
688
+
689
+ Args:
690
+ - ``model``: the model object to update
691
+ - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
692
+ - ``tag``: checkpoint tag used as a unique identifier for the checkpoint. If not provided, the tag is read from the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
693
+
694
+ Returns:
695
+ - ``model``: modified model
696
+
697
+ Make sure you have plenty of CPU memory available before you call this function. If you don't
698
+ have enough, use the ``zero_to_fp32.py`` utility to do the conversion. You will find it
699
+ conveniently placed for you in the checkpoint folder.
700
+
701
+ A typical usage might be ::
702
+
703
+ from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
704
+ model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
705
+ # submit to model hub or save the model to share with others
706
+
707
+ Note that once this has been run, the ``model`` will no longer be usable in the deepspeed context
708
+ of the same application, i.e., you will need to re-initialize the deepspeed engine, since
709
+ ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
710
+
711
+ """
712
+ logger.info("Extracting fp32 weights")
713
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
714
+
715
+ logger.info("Overwriting model with fp32 weights")
716
+ model = model.cpu()
717
+ model.load_state_dict(state_dict, strict=False)
718
+
719
+ return model
720
+
721
+
722
+ if __name__ == "__main__":
723
+ parser = argparse.ArgumentParser()
724
+ parser.add_argument("checkpoint_dir",
725
+ type=str,
726
+ help="path to the desired checkpoint folder, e.g., path/checkpoint-12")
727
+ parser.add_argument("output_dir",
728
+ type=str,
729
+ help="directory to the pytorch fp32 state_dict output files "
730
+ "(e.g. path/checkpoint-12-output/)")
731
+ parser.add_argument(
732
+ "--max_shard_size",
733
+ type=str,
734
+ default="5GB",
735
+ help="The maximum size for a checkpoint before being sharded. Each checkpoint shard will then be of a size "
736
+ "lower than this size. If expressed as a string, needs to be digits followed by a unit (like `5MB`). "
737
+ "We default it to 5GB in order for models to be able to run easily on free-tier Google Colab instances "
738
+ "without CPU OOM issues.")
739
+ parser.add_argument(
740
+ "--safe_serialization",
741
+ default=False,
742
+ action='store_true',
743
+ help="Whether to save the model using `safetensors` or the traditional PyTorch way (that uses `pickle`).")
744
+ parser.add_argument("-t",
745
+ "--tag",
746
+ type=str,
747
+ default=None,
748
+ help="checkpoint tag used as a unique identifier for the checkpoint, e.g., global_step1")
749
+ parser.add_argument("--exclude_frozen_parameters", action='store_true', help="exclude frozen parameters")
750
+ parser.add_argument("-d", "--debug", action='store_true', help="enable debug")
751
+ args = parser.parse_args()
752
+
753
+ debug = args.debug
754
+
755
+ convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir,
756
+ args.output_dir,
757
+ max_shard_size=args.max_shard_size,
758
+ safe_serialization=args.safe_serialization,
759
+ tag=args.tag,
760
+ exclude_frozen_parameters=args.exclude_frozen_parameters)