+=== Starting Training ===
+The following values were not passed to `accelerate launch` and had defaults used instead:
+	`--num_processes` was set to a value of `1`
+	`--num_machines` was set to a value of `1`
+	`--mixed_precision` was set to a value of `'no'`
+	`--dynamo_backend` was set to a value of `'no'`
+To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
+WARNING:lerobot.configs.policies:Device 'mps' is not available. Switching to 'cuda'.
+WARNING:lerobot.configs.policies:Device 'mps' is not available. Switching to 'cuda'.
+INFO 2026-03-24 02:43:56 ot_train.py:198 {'batch_size': 32,
+ 'checkpoint_path': None,
+ 'cudnn_deterministic': False,
+ 'dataset': {'episodes': None,
+             'image_transforms': {'enable': False,
+                                  'max_num_transforms': 3,
+                                  'random_order': False,
+                                  'tfs': {'affine': {'kwargs': {'degrees': [-5.0,
+                                                                            5.0],
+                                                                'translate': [0.05,
+                                                                              0.05]},
+                                                     'type': 'RandomAffine',
+                                                     'weight': 1.0},
+                                          'brightness': {'kwargs': {'brightness': [0.8,
+                                                                                   1.2]},
+                                                         'type': 'ColorJitter',
+                                                         'weight': 1.0},
+                                          'contrast': {'kwargs': {'contrast': [0.8,
+                                                                               1.2]},
+                                                       'type': 'ColorJitter',
+                                                       'weight': 1.0},
+                                          'hue': {'kwargs': {'hue': [-0.05,
+                                                                     0.05]},
+                                                  'type': 'ColorJitter',
+                                                  'weight': 1.0},
+                                          'saturation': {'kwargs': {'saturation': [0.5,
+                                                                                   1.5]},
+                                                         'type': 'ColorJitter',
+                                                         'weight': 1.0},
+                                          'sharpness': {'kwargs': {'sharpness': [0.5,
+                                                                                 1.5]},
+                                                        'type': 'SharpnessJitter',
+                                                        'weight': 1.0}}},
+             'repo_id': 'so100:/ephemeral/community_dataset_v3:/workspace/pi05-so100-diverse/filtered_index.json:/workspace/pi05-so100-diverse/norm_stats.json',
+             'revision': None,
+             'root': None,
+             'streaming': False,
+             'use_imagenet_stats': True,
+             'video_backend': 'torchcodec'},
+ 'env': None,
+ 'eval': {'batch_size': 50, 'n_episodes': 50, 'use_async_envs': False},
+ 'eval_freq': 20000,
+ 'job_name': 'pi05',
+ 'log_freq': 50,
+ 'num_workers': 4,
+ 'optimizer': {'betas': [0.9, 0.95],
+               'eps': 1e-08,
+               'grad_clip_norm': 1.0,
+               'lr': 2.5e-05,
+               'type': 'adamw',
+               'weight_decay': 0.01},
+ 'output_dir': '/ephemeral/production_run',
+ 'peft': None,
+ 'policy': {'action_expert_variant': 'gemma_300m',
+            'chunk_size': 50,
+            'compile_mode': 'max-autotune',
+            'compile_model': False,
+            'device': 'cuda',
+            'dtype': 'bfloat16',
+            'empty_cameras': 0,
+            'freeze_vision_encoder': False,
+            'gradient_checkpointing': False,
+            'image_resolution': [224, 224],
+            'input_features': {'observation.images.base_0_rgb': {'shape': [3,
+                                                                           224,
+                                                                           224],
+                                                                 'type': <FeatureType.VISUAL: 'VISUAL'>},
+                               'observation.images.left_wrist_0_rgb': {'shape': [3,
+                                                                                 224,
+                                                                                 224],
+                                                                       'type': <FeatureType.VISUAL: 'VISUAL'>},
+                               'observation.images.right_wrist_0_rgb': {'shape': [3,
+                                                                                  224,
+                                                                                  224],
+                                                                        'type': <FeatureType.VISUAL: 'VISUAL'>},
+                               'observation.state': {'shape': [32],
+                                                     'type': <FeatureType.STATE: 'STATE'>}},
+            'license': None,
+            'max_action_dim': 32,
+            'max_period': 4.0,
+            'max_state_dim': 32,
+            'min_period': 0.004,
+            'n_action_steps': 50,
+            'n_obs_steps': 1,
+            'normalization_mapping': {'ACTION': <NormalizationMode.MEAN_STD: 'MEAN_STD'>,
+                                      'STATE': <NormalizationMode.MEAN_STD: 'MEAN_STD'>,
+                                      'VISUAL': <NormalizationMode.IDENTITY: 'IDENTITY'>},
+            'num_inference_steps': 10,
+            'optimizer_betas': [0.9, 0.95],
+            'optimizer_eps': 1e-08,
+            'optimizer_grad_clip_norm': 1.0,
+            'optimizer_lr': 2.5e-05,
+            'optimizer_weight_decay': 0.01,
+            'output_features': {'action': {'shape': [32],
+                                           'type': <FeatureType.ACTION: 'ACTION'>}},
+            'paligemma_variant': 'gemma_2b',
+            'pretrained_path': 'lerobot/pi05_base',
+            'private': None,
+            'push_to_hub': True,
+            'repo_id': 'StrongRoboticsLab/pi05-so100-diverse',
+            'rtc_config': None,
+            'scheduler_decay_lr': 2.5e-06,
+            'scheduler_decay_steps': 170000,
+            'scheduler_warmup_steps': 1000,
+            'tags': None,
+            'time_sampling_beta_alpha': 1.5,
+            'time_sampling_beta_beta': 1.0,
+            'time_sampling_offset': 0.001,
+            'time_sampling_scale': 0.999,
+            'tokenizer_max_length': 200,
+            'train_expert_only': True,
+            'type': 'pi05',
+            'use_amp': False,
+            'use_peft': False},
+ 'rabc_epsilon': 1e-06,
+ 'rabc_head_mode': 'sparse',
+ 'rabc_kappa': 0.01,
+ 'rabc_progress_path': None,
+ 'rename_map': {'observation.images.image': 'observation.images.base_0_rgb',
+                'observation.images.image2': 'observation.images.left_wrist_0_rgb'},
+ 'resume': False,
+ 'save_checkpoint': True,
+ 'save_freq': 500,
+ 'scheduler': {'decay_lr': 2.5e-06,
+               'num_decay_steps': 170000,
+               'num_warmup_steps': 1000,
+               'peak_lr': 2.5e-05,
+               'type': 'cosine_decay_with_warmup'},
+ 'seed': 1000,
+ 'steps': 170000,
+ 'tolerance_s': 0.0001,
+ 'use_policy_training_preset': True,
+ 'use_rabc': False,
+ 'wandb': {'add_tags': True,
+           'disable_artifact': False,
+           'enable': True,
+           'entity': None,
+           'mode': None,
+           'notes': None,
+           'project': 'pi05-so100-diverse',
+           'run_id': None}}
+INFO 2026-03-24 02:43:57 db_utils.py:117 Logs will be synced with wandb.
+INFO 2026-03-24 02:43:57 db_utils.py:118 Track this run --> https://wandb.ai/gptjustin-strong-robotics-lab/pi05-so100-diverse/runs/pajke257
+INFO 2026-03-24 02:43:57 ot_train.py:222 Creating dataset
+INFO 2026-03-24 02:44:00 ot_train.py:240 Creating policy
+INFO 2026-03-24 02:46:22 _client.py:1025 HTTP Request: HEAD https://huggingface.co/lerobot/pi05_base/resolve/main/model.safetensors "HTTP/1.1 302 Found"
+INFO 2026-03-24 02:46:22 _client.py:1025 HTTP Request: GET https://huggingface.co/api/models/lerobot/pi05_base/xet-read-token/9e55186ad36e66b95cda57bc47818d9e6237ae30 "HTTP/1.1 200 OK"
+WARNING 2026-03-24 02:46:40 ng_pi05.py:1102 Vision embedding key might need handling: paligemma_with_expert.paligemma.model.vision_tower.vision_model.embeddings.patch_embedding.bias
+WARNING 2026-03-24 02:46:40 ng_pi05.py:1102 Vision embedding key might need handling: paligemma_with_expert.paligemma.model.vision_tower.vision_model.embeddings.patch_embedding.weight
+INFO 2026-03-24 02:46:46 _client.py:1025 HTTP Request: HEAD https://huggingface.co/lerobot/pi05_base/resolve/main/policy_preprocessor.json "HTTP/1.1 307 Temporary Redirect"
+INFO 2026-03-24 02:46:47 _client.py:1025 HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/lerobot/pi05_base/9e55186ad36e66b95cda57bc47818d9e6237ae30/policy_preprocessor.json "HTTP/1.1 200 OK"
+INFO 2026-03-24 02:46:47 _client.py:1025 HTTP Request: GET https://huggingface.co/api/resolve-cache/models/lerobot/pi05_base/9e55186ad36e66b95cda57bc47818d9e6237ae30/policy_preprocessor.json "HTTP/1.1 200 OK"
+INFO 2026-03-24 02:46:47 _client.py:1025 HTTP Request: HEAD https://huggingface.co/google/paligemma-3b-pt-224/resolve/main/config.json "HTTP/1.1 200 OK"
+INFO 2026-03-24 02:46:47 _client.py:1025 HTTP Request: GET https://huggingface.co/google/paligemma-3b-pt-224/resolve/main/config.json "HTTP/1.1 200 OK"
+INFO 2026-03-24 02:46:47 _client.py:1025 HTTP Request: HEAD https://huggingface.co/google/paligemma-3b-pt-224/resolve/main/tokenizer_config.json "HTTP/1.1 200 OK"
+INFO 2026-03-24 02:46:47 _client.py:1025 HTTP Request: GET https://huggingface.co/google/paligemma-3b-pt-224/resolve/main/tokenizer_config.json "HTTP/1.1 200 OK"
+INFO 2026-03-24 02:46:47 _client.py:1025 HTTP Request: GET https://huggingface.co/api/models/google/paligemma-3b-pt-224/tree/main/additional_chat_templates?recursive=false&expand=false "HTTP/1.1 404 Not Found"
+INFO 2026-03-24 02:46:47 _client.py:1025 HTTP Request: GET https://huggingface.co/api/models/google/paligemma-3b-pt-224/tree/main?recursive=true&expand=false "HTTP/1.1 200 OK"
+INFO 2026-03-24 02:46:47 _client.py:1025 HTTP Request: HEAD https://huggingface.co/google/paligemma-3b-pt-224/resolve/main/tokenizer.json "HTTP/1.1 302 Found"
+INFO 2026-03-24 02:46:47 _client.py:1025 HTTP Request: GET https://huggingface.co/api/models/google/paligemma-3b-pt-224/xet-read-token/35e4f46485b4d07967e7e9935bc3786aad50687c "HTTP/1.1 200 OK"
+INFO 2026-03-24 02:46:48 _client.py:1025 HTTP Request: HEAD https://huggingface.co/google/paligemma-3b-pt-224/resolve/main/added_tokens.json "HTTP/1.1 200 OK"
+INFO 2026-03-24 02:46:48 _client.py:1025 HTTP Request: GET https://huggingface.co/google/paligemma-3b-pt-224/resolve/main/added_tokens.json "HTTP/1.1 200 OK"
+INFO 2026-03-24 02:46:48 _client.py:1025 HTTP Request: HEAD https://huggingface.co/google/paligemma-3b-pt-224/resolve/main/special_tokens_map.json "HTTP/1.1 200 OK"
+INFO 2026-03-24 02:46:48 _client.py:1025 HTTP Request: GET https://huggingface.co/google/paligemma-3b-pt-224/resolve/main/special_tokens_map.json "HTTP/1.1 200 OK"
+INFO 2026-03-24 02:46:48 _client.py:1025 HTTP Request: HEAD https://huggingface.co/google/paligemma-3b-pt-224/resolve/main/chat_template.jinja "HTTP/1.1 404 Not Found"
+INFO 2026-03-24 02:46:49 _client.py:1025 HTTP Request: GET https://huggingface.co/api/models/google/paligemma-3b-pt-224 "HTTP/1.1 200 OK"
+INFO 2026-03-24 02:46:49 _client.py:1025 HTTP Request: HEAD https://huggingface.co/lerobot/pi05_base/resolve/main/policy_postprocessor.json "HTTP/1.1 307 Temporary Redirect"
+INFO 2026-03-24 02:46:49 _client.py:1025 HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/lerobot/pi05_base/9e55186ad36e66b95cda57bc47818d9e6237ae30/policy_postprocessor.json "HTTP/1.1 200 OK"
+INFO 2026-03-24 02:46:49 _client.py:1025 HTTP Request: GET https://huggingface.co/api/resolve-cache/models/lerobot/pi05_base/9e55186ad36e66b95cda57bc47818d9e6237ae30/policy_postprocessor.json "HTTP/1.1 200 OK"
+INFO 2026-03-24 02:46:49 ot_train.py:295 Creating optimizer and scheduler
+INFO 2026-03-24 02:46:49 ot_train.py:330 Output dir: /ephemeral/production_run
+INFO 2026-03-24 02:46:49 ot_train.py:337 cfg.steps=170000 (170K)
+INFO 2026-03-24 02:46:49 ot_train.py:338 dataset.num_frames=4923840 (5M)
+INFO 2026-03-24 02:46:49 ot_train.py:339 dataset.num_episodes=10155
+INFO 2026-03-24 02:46:49 ot_train.py:342 Effective batch size: 32 x 1 = 32
+INFO 2026-03-24 02:46:49 ot_train.py:343 num_learnable_params=693422112 (693M)
+INFO 2026-03-24 02:46:49 ot_train.py:344 num_total_params=4143404816 (4B)
+The PI05 model is a direct port of the OpenPI implementation.
+This implementation follows the original OpenPI structure for compatibility.
+Original implementation: https://github.com/Physical-Intelligence/openpi
+Loading model from: lerobot/pi05_base
+✓ Loaded state dict from model.safetensors
+Remapped 812 state dict keys
+All keys loaded successfully!
+Traceback (most recent call last):
+  File "<frozen runpy>", line 198, in _run_module_as_main
+  File "<frozen runpy>", line 88, in _run_code
+  File "/workspace/pi05-so100-diverse/lerobot/src/lerobot/scripts/lerobot_train.py", line 575, in <module>
+    main()
+  File "/workspace/pi05-so100-diverse/lerobot/src/lerobot/scripts/lerobot_train.py", line 571, in main
+    train()
+  File "/workspace/pi05-so100-diverse/lerobot/src/lerobot/configs/parser.py", line 233, in wrapper_inner
+    response = fn(cfg, *args, **kwargs)
+               ^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/workspace/pi05-so100-diverse/lerobot/src/lerobot/scripts/lerobot_train.py", line 419, in train
+    train_tracker, output_dict = update_policy(
+                                 ^^^^^^^^^^^^^^
+  File "/workspace/pi05-so100-diverse/lerobot/src/lerobot/scripts/lerobot_train.py", line 118, in update_policy
+    loss, output_dict = policy.forward(batch)
+                        ^^^^^^^^^^^^^^^^^^^^^
+  File "/workspace/pi05-so100-diverse/lerobot/src/lerobot/policies/pi05/modeling_pi05.py", line 1264, in forward
+    losses = self.model.forward(images, img_masks, tokens, masks, actions)
+             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/workspace/pi05-so100-diverse/lerobot/src/lerobot/policies/pi05/modeling_pi05.py", line 771, in forward
+    suffix_out = self._apply_checkpoint(
+                 ^^^^^^^^^^^^^^^^^^^^^^^
+  File "/workspace/pi05-so100-diverse/lerobot/src/lerobot/policies/pi05/modeling_pi05.py", line 617, in _apply_checkpoint
+    return func(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^
+  File "/workspace/pi05-so100-diverse/lerobot/src/lerobot/policies/pi05/modeling_pi05.py", line 761, in forward_func
+    (_, suffix_out), _ = self.paligemma_with_expert.forward(
+                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/workspace/pi05-so100-diverse/lerobot/src/lerobot/policies/pi05/modeling_pi05.py", line 513, in forward
+    inputs_embeds = compute_layer_complete(
+                    ^^^^^^^^^^^^^^^^^^^^^^^
+  File "/workspace/pi05-so100-diverse/lerobot/src/lerobot/policies/pi05/modeling_pi05.py", line 264, in compute_layer_complete
+    att_output, _ = modeling_gemma.eager_attention_forward(
+                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.12/dist-packages/transformers/models/gemma/modeling_gemma.py", line 210, in eager_attention_forward
+    attn_weights = attn_weights + attention_mask
+                   ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~
+torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1014.00 MiB. GPU 0 has a total capacity of 79.19 GiB of which 222.50 MiB is free. Including non-PyTorch memory, this process has 78.96 GiB memory in use. Of the allocated memory 76.51 GiB is allocated by PyTorch, and 1.86 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
+Traceback (most recent call last):
+  File "<frozen runpy>", line 198, in _run_module_as_main
+  File "<frozen runpy>", line 88, in _run_code
+  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 1415, in <module>
+    main()
+  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 1411, in main
+    launch_command(args)
+  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 1405, in launch_command
+    simple_launcher(args)
+  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 993, in simple_launcher
+    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
+subprocess.CalledProcessError: Command '['/usr/bin/python3.12', '-m', 'lerobot.scripts.lerobot_train', '--dataset.repo_id=so100:/ephemeral/community_dataset_v3:/workspace/pi05-so100-diverse/filtered_index.json:/workspace/pi05-so100-diverse/norm_stats.json', '--policy.path=lerobot/pi05_base', '--policy.train_expert_only=true', '--policy.dtype=bfloat16', '--policy.gradient_checkpointing=false', '--policy.push_to_hub=true', '--policy.repo_id=StrongRoboticsLab/pi05-so100-diverse', '--policy.normalization_mapping={"VISUAL": "IDENTITY", "STATE": "MEAN_STD", "ACTION": "MEAN_STD"}', '--policy.scheduler_warmup_steps=1000', '--policy.scheduler_decay_steps=170000', '--rename_map={"observation.images.image": "observation.images.base_0_rgb", "observation.images.image2": "observation.images.left_wrist_0_rgb"}', '--batch_size=32', '--steps=170000', '--save_freq=500', '--log_freq=50', '--num_workers=4', '--wandb.enable=true', '--wandb.project=pi05-so100-diverse', '--output_dir=/ephemeral/production_run']' returned non-zero exit status 1.
+=== Training Complete (exit: 1) ===
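
The failing allocation above (`attn_weights + attention_mask`, 1014.00 MiB) can be roughly sanity-checked against the logged config. The sketch below is a back-of-the-envelope estimate, not something the log confirms. It assumes 256 patch tokens per 224x224 image, 8 attention heads in the `gemma_2b` backbone, a float32 result for the addition, and that the CUDA caching allocator rounds large requests up to a multiple of 2 MiB. The helper names are hypothetical. Only `batch_size`, `tokenizer_max_length`, `chunk_size`, and the three camera inputs are taken from the log.

```python
import math

MIB = 2**20


def attn_weights_bytes(batch, heads, q_len, kv_len, bytes_per_elem):
    """Size of one dense attention-weight tensor of shape [batch, heads, q_len, kv_len]."""
    return batch * heads * q_len * kv_len * bytes_per_elem


def round_up(nbytes, block=2 * MIB):
    """Round a request up to a multiple of `block` (assumed allocator granularity)."""
    return math.ceil(nbytes / block) * block


# Values read from the logged config.
batch = 32            # 'batch_size': 32
num_cameras = 3       # three observation.images.* input features
text_tokens = 200     # 'tokenizer_max_length': 200
action_tokens = 50    # 'chunk_size': 50

# Assumptions, not stated in the log.
tokens_per_image = 256  # assumed: (224 / 14) ** 2 patches per image
heads = 8               # assumed head count for the gemma_2b backbone
bytes_per_elem = 4      # assumed: the sum with the mask is materialized in float32

seq_len = num_cameras * tokens_per_image + text_tokens + action_tokens
raw = attn_weights_bytes(batch, heads, seq_len, seq_len, bytes_per_elem)
rounded_mib = round_up(raw) // MIB

print(f"seq_len={seq_len}, raw={raw / MIB:.1f} MiB, rounded={rounded_mib} MiB")
```

Under these assumptions the estimate comes to 1014 MiB, matching the size in the error message, and it scales linearly with `batch_size`. The run was launched with `--batch_size=32` and `--policy.gradient_checkpointing=false`, so those two settings are the ones visible in this log that bear on activation memory.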