Commit ea849c4 (verified), committed by pravsels
1 parent: 1b5470d

Upload folder using huggingface_hub

stage2_front_cam_step20000/HF_MODEL_CARD.md ADDED
@@ -0,0 +1,33 @@
+ ---
+ tags:
+ - pytorch
+ - world-model
+ - robotics
+ license: mit
+ ---
+
+ # Interactive World Sim Checkpoints
+
+ This repo hosts released checkpoint artifacts.
+
+ Latest uploaded artifact:
+
+ - `stage2_front_cam_step20000`
+ - job: `2875089`
+ - W&B: [Run `7skk0qh6`](https://wandb.ai/pravsels/interactive_world_sim/runs/7skk0qh6)
+ - note: run ended early after NaN-gradient event; checkpoint is kept as the current stage-2 baseline.
+
+ ## Files in `stage2_front_cam_step20000/`
+
+ - `checkpoints/epoch=0-step=20000.ckpt`
+ - `training_config_snapshot.yaml` (exact Hydra snapshot used for this run)
+ - `dataset_source.txt` (maps in-run mounted dataset paths to source paths)
+ - `checkpoint_metadata.json`
+ - `SHA256SUMS`
+ - `README.md`
+
+ ## Previous artifact
+
+ - `stage1_front_cam_step64000` ([Run `7gximny3`](https://wandb.ai/pravsels/interactive_world_sim/runs/7gximny3))
+
+ Use `SHA256SUMS` to verify artifact integrity after download.
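The model card asks readers to verify the download against `SHA256SUMS` but does not show how. A minimal sketch in Python follows, assuming each line of `SHA256SUMS` is `<hex digest> <relative path>` as in this commit; the helper names `sha256_of` and `verify_sha256sums` are illustrative and not part of the repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so a large checkpoint need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256sums(artifact_dir: Path) -> dict[str, bool]:
    """Return {relative_path: digest_matches} for every entry in SHA256SUMS."""
    results = {}
    for line in (artifact_dir / "SHA256SUMS").read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.strip().lstrip("*")  # tolerate a binary-mode marker
        target = artifact_dir / rel_path
        # A missing file counts as a failed check rather than raising.
        results[rel_path] = target.is_file() and sha256_of(target) == expected.lower()
    return results
```

Calling `verify_sha256sums(Path("stage2_front_cam_step20000"))` after download should report `True` for all three listed files if nothing was corrupted in transit.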
stage2_front_cam_step20000/README.md ADDED
@@ -0,0 +1,37 @@
+ # Stage 2 Front-Cam Checkpoint (Isambard)
+
+ This folder packages the stage-2 latent dynamics checkpoint from the first high-throughput run segment.
+
+ ## Checkpoint
+
+ - file: `checkpoints/epoch=0-step=20000.ckpt`
+ - size_bytes: `232088523`
+ - sha256: `9f972dc4a805248d47c03f48a3cc1e4dbb3c85783e6ddbb58d2cdbfc5d4045e2`
+
+ ## Training Context
+
+ - project: `interactive_world_sim`
+ - cluster: `Isambard (GH200, arm64)`
+ - dataset: `WAN H5` (`camera_1_color` front cam)
+ - training_stage: `2` (latent dynamics)
+ - training precision: `16-mixed`
+ - training batch size: `32`
+ - train/val dataloader workers: `8/8`
+ - wandb mode during training: `offline` (then synced)
+
+ ## Provenance
+
+ - source run block: `job 2875089`
+ - synced W&B run:
+ [Run `7skk0qh6`](https://wandb.ai/pravsels/interactive_world_sim/runs/7skk0qh6)
+ - key loss trend:
+ `training/loss: 0.014964337 -> 3.7573023e-05` (min `2.3631907e-05`, global_step `99 -> 22999`)
+ - exact run config snapshot:
+ `training_config_snapshot.yaml`
+ - dataset source mapping:
+ `dataset_source.txt`
+
+ ## Notes
+
+ - Run ended before the configured max-steps after `NaN in gradient of module.out.1.weight` was emitted in Slurm logs.
+ - This checkpoint is the recommended stage-2 baseline for downstream evaluation and planning.
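The README reports loss logged up to global step 22999, yet the packaged checkpoint is from step 20000. The arithmetic below is one plausible reading: with `every_n_train_steps: 10000` in the config snapshot, the last checkpoint written before the run stopped would be the largest multiple of the interval at or below the last logged step. This is a sketch under that assumption; the function name is illustrative, and the actual stopping step is not stated in the commit.

```python
def last_checkpoint_step(last_logged_step: int, every_n_train_steps: int) -> int:
    """Largest multiple of the checkpoint interval not exceeding the last logged step."""
    return (last_logged_step // every_n_train_steps) * every_n_train_steps


# Values taken from the README and training_config_snapshot.yaml in this commit.
step = last_checkpoint_step(last_logged_step=22999, every_n_train_steps=10000)
print(step)  # 20000

# Ratio between the first and last logged training loss.
loss_reduction = 0.014964337 / 3.7573023e-05
print(f"loss fell by a factor of about {loss_reduction:.0f}")
```

Under this reading, roughly 3000 steps of training after the checkpoint are not captured in the released weights.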
stage2_front_cam_step20000/SHA256SUMS ADDED
@@ -0,0 +1,3 @@
+ 9f972dc4a805248d47c03f48a3cc1e4dbb3c85783e6ddbb58d2cdbfc5d4045e2 checkpoints/epoch=0-step=20000.ckpt
+ 99d3ff7fcd5abb5740beefb604edfd9344389ad854d1d0172ca75bb3b0a87f3c training_config_snapshot.yaml
+ a9f0d2b0440888863b90678211088d2d632f902ea404b49272337e4e337a33c1 dataset_source.txt
stage2_front_cam_step20000/checkpoint_metadata.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "project": "interactive_world_sim",
+   "artifact_name": "stage2_front_cam_step20000",
+   "file_name": "checkpoints/epoch=0-step=20000.ckpt",
+   "size_bytes": 232088523,
+   "sha256": "9f972dc4a805248d47c03f48a3cc1e4dbb3c85783e6ddbb58d2cdbfc5d4045e2",
+   "training_stage": 2,
+   "obs_keys": [
+     "camera_1_color"
+   ],
+   "training_config_snapshot": {
+     "file": "training_config_snapshot.yaml",
+     "size_bytes": 5118,
+     "sha256": "99d3ff7fcd5abb5740beefb604edfd9344389ad854d1d0172ca75bb3b0a87f3c"
+   },
+   "dataset_source_mapping_file": "dataset_source.txt",
+   "source_job_id": "2875089",
+   "wandb_run_url": "https://wandb.ai/pravsels/interactive_world_sim/runs/7skk0qh6",
+   "final_training_loss": 3.7573023e-05,
+   "min_training_loss": 2.3631907e-05,
+   "global_step_first_last": [
+     99,
+     22999
+   ],
+   "run_outcome_note": "Ended early after NaN gradient event in Slurm logs."
+ }
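`checkpoint_metadata.json` records a size and digest for both the checkpoint and the config snapshot, so it can serve as a second consistency check alongside `SHA256SUMS`. A minimal sketch, assuming the key layout shown in this file; `check_metadata` is a hypothetical helper, not something the repository ships.

```python
import hashlib
import json
from pathlib import Path


def check_metadata(artifact_dir: Path) -> list[str]:
    """Compare on-disk files with checkpoint_metadata.json; return a list of problems."""
    meta = json.loads((artifact_dir / "checkpoint_metadata.json").read_text())
    snapshot = meta["training_config_snapshot"]
    targets = [
        (meta["file_name"], meta["size_bytes"], meta["sha256"]),
        (snapshot["file"], snapshot["size_bytes"], snapshot["sha256"]),
    ]
    problems = []
    for rel_path, size_bytes, sha256 in targets:
        path = artifact_dir / rel_path
        if not path.is_file():
            problems.append(f"{rel_path}: missing")
            continue
        if path.stat().st_size != size_bytes:
            problems.append(f"{rel_path}: size mismatch")
        digest = hashlib.sha256()
        with path.open("rb") as fh:  # stream in 1 MiB chunks
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != sha256:
            problems.append(f"{rel_path}: sha256 mismatch")
    return problems
```

An empty list means both recorded files match their metadata entries.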
stage2_front_cam_step20000/checkpoints/epoch=0-step=20000.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9f972dc4a805248d47c03f48a3cc1e4dbb3c85783e6ddbb58d2cdbfc5d4045e2
+ size 232088523
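The diff shows a Git LFS pointer, not the checkpoint bytes themselves. Its `oid` and `size` fields should agree with the digest and size quoted in the README. The sketch below parses the pointer text with plain string handling; `parse_lfs_pointer` is an illustrative name, and the parsing assumes the simple `key value` line format visible here.

```python
def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Split each non-empty 'key value' line of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.strip().partition(" ")
            fields[key] = value
    return fields


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:9f972dc4a805248d47c03f48a3cc1e4dbb3c85783e6ddbb58d2cdbfc5d4045e2\n"
    "size 232088523\n"
)
algorithm, _, digest = pointer["oid"].partition(":")
size_bytes = int(pointer["size"])
print(algorithm, size_bytes)  # sha256 232088523
```

If a clone yields a file of a few hundred bytes with this content instead of a file of about 232 MB, the LFS object was not fetched.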
stage2_front_cam_step20000/dataset_source.txt ADDED
@@ -0,0 +1,2 @@
+ /mnt/wan_dataset.h5 -> /scratch/u6cr/pravsels.u6cr/latent_safety/arx5_datasets_6Feb_26_wan224.h5
+ /mnt/wan_dataset_stats.json -> /scratch/u6cr/pravsels.u6cr/latent_safety/arx5_datasets_6Feb_26_stats.json
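The config snapshot refers to datasets by their in-container mount paths (`/mnt/...`), and `dataset_source.txt` maps those back to the source files. A small sketch of resolving one to the other, assuming every line has the `mounted -> source` form shown above; `parse_dataset_sources` is a hypothetical helper.

```python
def parse_dataset_sources(text: str) -> dict[str, str]:
    """Map each mounted path to its source path from 'mounted -> source' lines."""
    mapping = {}
    for line in text.splitlines():
        if "->" not in line:
            continue  # skip blank or unrelated lines
        mounted, source = (part.strip() for part in line.split("->", 1))
        mapping[mounted] = source
    return mapping


sources = parse_dataset_sources(
    "/mnt/wan_dataset.h5 -> /scratch/u6cr/pravsels.u6cr/latent_safety/arx5_datasets_6Feb_26_wan224.h5\n"
    "/mnt/wan_dataset_stats.json -> /scratch/u6cr/pravsels.u6cr/latent_safety/arx5_datasets_6Feb_26_stats.json\n"
)
# Resolve the h5_path value used in training_config_snapshot.yaml.
print(sources["/mnt/wan_dataset.h5"])
```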
stage2_front_cam_step20000/training_config_snapshot.yaml ADDED
@@ -0,0 +1,207 @@
+ experiment:
+ debug: ${debug}
+ tasks:
+ - training
+ num_nodes: 1
+ num_devices: 1
+ training:
+ precision: 16-mixed
+ compile: false
+ lr: 8.0e-05
+ batch_size: 32
+ max_epochs: -1
+ max_steps: 1000005
+ max_time: null
+ data:
+ num_workers: 8
+ shuffle: true
+ optim:
+ accumulate_grad_batches: 1
+ gradient_clip_val: 1.0
+ checkpointing:
+ every_n_train_steps: 10000
+ every_n_epochs: null
+ train_time_interval: null
+ enable_version_counter: false
+ log_every_n_steps: 100
+ validation:
+ precision: 16-mixed
+ compile: false
+ batch_size: 2
+ val_every_n_step: 30000
+ val_every_n_epoch: null
+ limit_batch: 1.0
+ inference_mode: true
+ data:
+ num_workers: 8
+ shuffle: false
+ test:
+ precision: 16-mixed
+ compile: false
+ batch_size: 8
+ limit_batch: 1
+ data:
+ num_workers: 16
+ shuffle: false
+ logging:
+ metrics:
+ - fvd
+ dataset:
+ debug: ${debug}
+ h5_path: /mnt/wan_dataset.h5
+ dataset_dir: .
+ horizon: 10
+ val_horizon: 200
+ aug_mode: none
+ skip_frame: 1
+ pad_after: 7
+ pad_before: 1
+ seed: 42
+ val_ratio: 0.1
+ skip_idx: 1
+ resolution: 128
+ goal_sample: intermediate
+ stats_json_path: /mnt/wan_dataset_stats.json
+ action_key: actions_delta
+ state_key: states
+ camera_key_map:
+ camera_0_color: camera_0
+ camera_1_color: camera_1
+ lowdim_key_map:
+ joint_pos: states
+ obs_keys:
+ - camera_1_color
+ low_dim_keys: []
+ shape_meta:
+ action:
+ shape:
+ - 7
+ obs:
+ camera_0_color:
+ shape:
+ - 3
+ - 128
+ - 128
+ type: rgb
+ camera_1_color:
+ shape:
+ - 3
+ - 128
+ - 128
+ type: rgb
+ algorithm:
+ debug: ${debug}
+ lr: ${experiment.training.lr}
+ weight_decay: 0.0001
+ warmup_steps: 10000
+ lr_scheduler: linear
+ optimizer_beta:
+ - 0.9
+ - 0.999
+ latent_dim: 512
+ action_dim: 7
+ enc_dim: 64
+ num_components: 1
+ obs_keys: ${dataset.obs_keys}
+ x_shape:
+ - ${eval:'3 * len(${dataset.obs_keys})'}
+ - ${dataset.resolution}
+ - ${dataset.resolution}
+ norm_scale: 6.0
+ num_latent_downsample: 2
+ num_views: ${eval:'len(${dataset.obs_keys})'}
+ num_latent_channel: ${eval:'4 * ${algorithm.num_views}'}
+ latent_resolution: ${eval:'${dataset.resolution} // int(2 ** ${algorithm.num_latent_downsample})'}
+ training_stage: 2
+ load_ae: /workspace/outputs/2026-03-13/10-25-30/checkpoints/epoch=0-step=64000.ckpt
+ dtype: ${torch:float}
+ mask_prev_action: false
+ device: cuda
+ noise_level: log_normal
+ val_render: false
+ scheduling_matrix: autoregressive
+ uncertainty_scale: 1.0
+ guidance_scale: 1.0
+ n_frames: ${dataset.horizon}
+ dyn_infer_steps: 1
+ dec_infer_steps: 3
+ last_frame_loss_only: false
+ prev_frame_noise_scale: 0.1
+ robust_latent: false
+ delta: ${eval:'0.00054 * ${algorithm.num_latent_channel} * ${algorithm.latent_resolution}
+ * ${algorithm.latent_resolution}'}
+ sampling_strategy: terminal_only
+ sampling_strategy_params: []
+ dynamics:
+ _target_: interactive_world_sim.algorithms.latent_dynamics.models.cm_latent_dynamics.CMLatentDynamics
+ action_dim: ${algorithm.action_dim}
+ latent_dim: ${algorithm.num_latent_channel}
+ dim: 64
+ action_emb_dim: 512
+ resnet_block_groups: 8
+ dim_mults:
+ - 1
+ - 2
+ attn_dim_head: 128
+ attn_heads: 4
+ use_linear_attn: true
+ use_init_temporal_attn: true
+ init_kernel_size: 5
+ is_causal: true
+ time_emb_type: rotary
+ dtype: ${algorithm.dtype}
+ noise_scheduler:
+ _target_: interactive_world_sim.utils.cm_utils.DDPMScheduler
+ x_shape: ${algorithm.x_shape}
+ timesteps: ${algorithm.diffusion.timesteps}
+ sampling_timesteps: ${algorithm.diffusion.sampling_timesteps}
+ beta_schedule: ${algorithm.diffusion.beta_schedule}
+ schedule_fn_kwargs: ${algorithm.diffusion.schedule_fn_kwargs}
+ objective: ${algorithm.diffusion.objective}
+ loss_weighting: uniform
+ snr_clip: ${algorithm.diffusion.snr_clip}
+ cum_snr_decay: ${algorithm.diffusion.cum_snr_decay}
+ ddim_sampling_eta: ${algorithm.diffusion.ddim_sampling_eta}
+ clip_noise: ${algorithm.diffusion.clip_noise}
+ stabilization_level: ${algorithm.diffusion.stabilization_level}
+ dtype: ${algorithm.dtype}
+ diffusion:
+ beta_schedule: sigmoid
+ objective: pred_v
+ use_fused_snr: true
+ cum_snr_decay: 0.96
+ clip_noise: 6.0
+ schedule_fn_kwargs: {}
+ timesteps: 1000
+ sampling_timesteps: 50
+ ddim_sampling_eta: 0.0
+ snr_clip: 5.0
+ model_channels: ${algorithm.enc_dim}
+ num_latent_downsample: ${algorithm.num_latent_downsample}
+ num_latent_channel: ${algorithm.num_latent_channel}
+ num_res_blocks: 2
+ attention_resolutions:
+ - 2
+ - 4
+ - 8
+ dropout: 0.1
+ channel_mult:
+ - 1
+ - 2
+ - 3
+ num_head_channels: 64
+ resblock_updown: true
+ use_scale_shift_norm: true
+ num_components: ${algorithm.num_components}
+ image_size: ${dataset.resolution}
+ stabilization_level: 15
+ metrics:
+ - fvd
+ debug: false
+ wandb:
+ entity: pravsels
+ project: interactive_world_sim
+ mode: offline
+ resume: null
+ load: null
+ name: stage2_front_cam
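Several `algorithm` values in the snapshot are left as unresolved `${eval:...}` interpolations. The sketch below works them out by hand for this run, assuming the project's `eval` resolver evaluates its argument as an ordinary Python expression (the resolver itself is not shown in this commit). The inputs come from the snapshot: one observation key, `resolution: 128`, and `num_latent_downsample: 2`.

```python
# Inputs copied from training_config_snapshot.yaml.
obs_keys = ["camera_1_color"]
resolution = 128
num_latent_downsample = 2

# Each line mirrors one ${eval:...} expression in the algorithm section.
num_views = len(obs_keys)                                          # 1
x_shape = [3 * num_views, resolution, resolution]                  # [3, 128, 128]
num_latent_channel = 4 * num_views                                 # 4
latent_resolution = resolution // int(2 ** num_latent_downsample)  # 32
delta = 0.00054 * num_latent_channel * latent_resolution * latent_resolution

print(x_shape, num_latent_channel, latent_resolution)
print(f"delta ~= {delta:.5f}")  # about 2.21184
```

With a single front camera, the latent is therefore a 4-channel 32x32 map, and `dynamics.latent_dim` resolves to 4 through `${algorithm.num_latent_channel}`.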