calebescobedo committed ae56ac6 · verified · 1 Parent(s): 9775a19

Upload sensor diffusion model - 60 epochs completed

README.md ADDED
---
license: mit
tags:
- robotics
- imitation-learning
- diffusion-policy
- manipulation
- hiro-robot
- lerobot
- goal-conditioned
- sensor-diffusion
datasets:
- roboset_20260112_225816
- roboset_20260113_001336
---

# Proximity Sensor Goal-Conditioned Diffusion Policy

## Model Description

A goal-conditioned Diffusion Policy trained on proximity sensor datasets. The model predicts joint positions (the next positions along the trajectory) conditioned on the current observation (joint positions, top-down table camera image, encoded proximity sensor data) and a goal Cartesian position.

**Key Features:**
- Uses 37 proximity sensors (8×8 depth maps) encoded to a 128-dim latent via a pretrained autoencoder
- Visual input from a top-down table camera (480×640 RGB)
- Goal-conditioned for reaching target Cartesian positions
- Predicts a 16-step action horizon

## Model Architecture

- **Policy Type**: Diffusion Policy
- **Framework**: LeRobot
- **Horizon**: 16 steps
- **Observation Steps**: 1 step (single timestep)
- **Action Steps**: 8 steps (each covers 2 timesteps)
- **Total Parameters**: ~261M

## Inputs

- **`observation.state`**: Shape `(batch, 1, 7)` - Joint positions (7-DOF arm)
- **`observation.goal`**: Shape `(batch, 1, 3)` - Goal Cartesian position (X, Y, Z)
- **`observation.images.top_down_camera`**: Shape `(batch, 1, 3, 480, 640)` - Top-down table camera RGB images
- **`observation.proximity`**: Shape `(batch, 1, 128)` - Encoded proximity sensor latent (37 sensors → 128-dim via pretrained encoder)

## Outputs

- **`action`**: Shape `(batch, 16, 7)` - Joint positions (7 DOF) for the 16-step horizon (the next positions along the trajectory)

**Note**: The model outputs a full 16-step horizon. Use `select_action()` to get the first step `(batch, 7)`, or `predict_action_chunk()` to get the full horizon `(batch, 16, 7)`.
## Normalization

### Input Normalization

**Images** (`observation.images.top_down_camera`):
- Normalize from `[0, 255]` to `[0, 1]` by dividing by `255.0`
- Then apply mean-std normalization using dataset statistics (handled by the preprocessor)

**State** (`observation.state`):
- Apply min-max normalization: `(state - min) / (max - min)` using dataset statistics (handled by the preprocessor)

**Goal** (`observation.goal`):
- Apply min-max normalization: `(goal - min) / (max - min)` using dataset statistics (handled by the preprocessor)

**Proximity** (`observation.proximity`):
- Encoded via the pretrained ProximityAutoencoder (frozen encoder)
- 37 sensors × (8×8 depth maps) → 128-dim latent
- Apply min-max normalization using dataset statistics (handled by the preprocessor)

### Output Unnormalization

**Actions** (`action`):
- Apply inverse min-max normalization: `action * (max - min) + min` using dataset statistics (handled by the postprocessor)
- **Note**: Actions are joint positions (not velocities) - they are the next positions the robot should move to along the trajectory
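The min-max mapping above can be sketched as follows. This is illustrative NumPy only: the real per-dimension statistics live in the preprocessor/postprocessor state files, and the `stats_min`/`stats_max` arrays here are placeholders, not the dataset's actual values.

```python
import numpy as np

def minmax_normalize(x, stats_min, stats_max, eps=1e-08):
    # Maps raw values into [0, 1] using per-dimension dataset statistics,
    # mirroring the (x - min) / (max - min) formula above.
    return (x - stats_min) / (stats_max - stats_min + eps)

def minmax_unnormalize(x, stats_min, stats_max):
    # Inverse mapping: recover raw joint positions from normalized actions,
    # mirroring action * (max - min) + min above.
    return x * (stats_max - stats_min) + stats_min

# Placeholder statistics for a 7-DOF action vector (not the real stats).
stats_min = np.full(7, -3.0)
stats_max = np.full(7, 3.0)

raw = np.linspace(-3.0, 3.0, 7)
norm = minmax_normalize(raw, stats_min, stats_max)
recovered = minmax_unnormalize(norm, stats_min, stats_max)
```

In the repo both directions are applied for you by the preprocessor and postprocessor; the sketch only shows what those steps compute.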
## Usage

```python
import torch

from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
from lerobot.policies.factory import make_pre_post_processors

# Load model
policy = DiffusionPolicy.from_pretrained("calebescobedo/sensor-diffusion-policy-topdown-camera")

# Load preprocessor and postprocessor from the same repo
preprocessor, postprocessor = make_pre_post_processors(
    policy_cfg=policy.config,
    pretrained_path="calebescobedo/sensor-diffusion-policy-topdown-camera",
)

# Prepare inputs
batch = {
    'observation.state': state_tensor,                # (batch, 1, 7) - raw joint positions
    'observation.goal': goal_tensor,                  # (batch, 1, 3) - raw goal xyz
    'observation.images.top_down_camera': table_img,  # (batch, 1, 3, 480, 640) - uint8 [0, 255] or float [0, 1]
    'observation.proximity': proximity_latent,        # (batch, 1, 128) - encoded proximity sensor latent
}

# Inference
policy.eval()
with torch.no_grad():
    batch = preprocessor(batch)            # Normalizes inputs
    actions = policy.select_action(batch)  # Returns normalized actions
    actions = postprocessor(actions)       # Unnormalizes to raw joint positions
```
## Training Details

- **Training**: Epoch-based (ensures all trajectories are seen)
- **Epochs**: 60
- **Batch Size**: 64
- **Optimizer**: Adam (LeRobot preset)
- **Learning Rate**: From the LeRobot optimizer preset
- **Mixed Precision**: Enabled (AMP)
- **Data Loading**: Optimized with persistent file handles (4 workers, prefetch=2)
- **Data Augmentation**:
  - State noise: 30% probability, scale=0.005
  - Action noise: 30% probability, scale=0.0005
  - Goal noise: 30% probability, scale=[0.003, 0.005, 0.0005] (X, Y, Z)
- **Datasets**:
  - roboset_20260117_014645 (20 H5 files, ~500 trajectories, ~17,000 sequences)
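The augmentation scheme above can be sketched as a single helper: with 30% probability, add zero-mean Gaussian noise with the listed scale (a scalar for state/action, a per-axis array for the goal). The helper name and the use of NumPy here are illustrative, not the training code itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def maybe_add_noise(x, scale, prob=0.3, rng=rng):
    # With probability `prob`, add zero-mean Gaussian noise to `x`.
    # `scale` may be a scalar (state/action noise) or a per-axis
    # array (goal noise: X, Y, Z scales differ).
    if rng.random() < prob:
        return x + rng.normal(0.0, scale, size=np.shape(x))
    return x

goal = np.array([0.1, -0.2, 0.45])
noisy_goal = maybe_add_noise(goal, scale=np.array([0.003, 0.005, 0.0005]))
```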
## Proximity Sensor Encoding

The proximity sensors are encoded using a pretrained autoencoder:
- **Encoder**: 37 sensors × (8×8 depth maps) → 128-dim latent
- **Architecture**: Per-sensor CNN (8×8 → 4×4 → 2×2) + multi-head attention aggregation
- **Training**: Separate pretraining on depth reconstruction (MSE loss: ~0.118)
- **Status**: Encoder frozen during policy training (no gradients)
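The shape flow described above can be sketched in PyTorch. Only the 8×8 → 4×4 → 2×2 downsampling, the 37 sensor tokens, the attention aggregation, and the 128-dim latent come from this card; the channel widths, head count, and pooling are assumptions, not the actual pretrained architecture.

```python
import torch
import torch.nn as nn

class ProximityEncoderSketch(nn.Module):
    # Illustrative sketch: a shared per-sensor CNN downsamples each 8x8
    # depth map (8x8 -> 4x4 -> 2x2), multi-head attention mixes the 37
    # sensor tokens, and a projection yields the 128-dim latent.
    # Channel widths (16, 32), num_heads=4, and mean pooling are assumed.
    def __init__(self, n_sensors=37, latent_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 8x8 -> 4x4
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 4x4 -> 2x2
            nn.ReLU(),
            nn.Flatten(),                               # 32 * 2 * 2 = 128 per sensor
        )
        self.attn = nn.MultiheadAttention(128, num_heads=4, batch_first=True)
        self.proj = nn.Linear(128, latent_dim)

    def forward(self, depth):  # depth: (batch, 37, 8, 8)
        b, s, h, w = depth.shape
        tokens = self.cnn(depth.reshape(b * s, 1, h, w)).reshape(b, s, -1)
        agg, _ = self.attn(tokens, tokens, tokens)      # mix sensor tokens
        return self.proj(agg.mean(dim=1))               # (batch, 128)

enc = ProximityEncoderSketch()
latent = enc(torch.zeros(2, 37, 8, 8))
```

In the policy this encoder is frozen, so at inference you only need its 128-dim output as `observation.proximity`.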
## Dataset Notes

- **37 proximity sensors** per timestep (depth_sensor_link1_sensor_0 through depth_sensor_link6_sensor_7)
- Each sensor provides **8×8 depth maps** (`depth_to_camera`)
- **Table camera RGB images** (480×640×3)
- **7-DOF joint positions**
- **Goal-conditioned trajectories**: Each trajectory has a unique goal (final Cartesian position)
- **Goal distribution**:
  - X: [-0.239, 0.294] meters
  - Y: [-0.284, 0.317] meters
  - Z: [0.364, 0.579] meters
- **Total**: ~500 trajectories, ~17,000 sequences
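At inference time it can be useful to check that a commanded goal lies inside the goal distribution the policy was trained on; a minimal sketch using the bounds listed above (the helper name is illustrative):

```python
import numpy as np

# Goal bounds from the dataset notes (meters).
GOAL_MIN = np.array([-0.239, -0.284, 0.364])
GOAL_MAX = np.array([0.294, 0.317, 0.579])

def goal_in_distribution(goal):
    # True if the XYZ goal lies within the training goal distribution.
    goal = np.asarray(goal)
    return bool(np.all(goal >= GOAL_MIN) and np.all(goal <= GOAL_MAX))
```

Goals outside these ranges are out of distribution for the policy, so behavior there is untested.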
## License

MIT License
config.json ADDED
{
    "type": "diffusion",
    "n_obs_steps": 1,
    "input_features": {
        "observation.state": {
            "type": "STATE",
            "shape": [
                7
            ]
        },
        "observation.goal": {
            "type": "STATE",
            "shape": [
                3
            ]
        },
        "observation.images.top_down_camera": {
            "type": "VISUAL",
            "shape": [
                3,
                480,
                640
            ]
        },
        "observation.proximity": {
            "type": "STATE",
            "shape": [
                128
            ]
        }
    },
    "output_features": {
        "action": {
            "type": "ACTION",
            "shape": [
                7
            ]
        }
    },
    "device": "cuda",
    "use_amp": false,
    "push_to_hub": true,
    "repo_id": null,
    "private": null,
    "tags": null,
    "license": null,
    "pretrained_path": null,
    "horizon": 16,
    "n_action_steps": 8,
    "normalization_mapping": {
        "VISUAL": "MEAN_STD",
        "STATE": "MIN_MAX",
        "ACTION": "MIN_MAX"
    },
    "drop_n_last_frames": 7,
    "vision_backbone": "resnet18",
    "crop_shape": null,
    "crop_is_random": true,
    "pretrained_backbone_weights": null,
    "use_group_norm": true,
    "spatial_softmax_num_keypoints": 32,
    "use_separate_rgb_encoder_per_camera": false,
    "down_dims": [
        512,
        1024,
        2048
    ],
    "kernel_size": 5,
    "n_groups": 8,
    "diffusion_step_embed_dim": 128,
    "use_film_scale_modulation": true,
    "noise_scheduler_type": "DDPM",
    "num_train_timesteps": 100,
    "beta_schedule": "squaredcos_cap_v2",
    "beta_start": 0.0001,
    "beta_end": 0.02,
    "prediction_type": "epsilon",
    "clip_sample": true,
    "clip_sample_range": 1.0,
    "num_inference_steps": null,
    "do_mask_loss_for_padding": false,
    "optimizer_lr": 0.0001,
    "optimizer_betas": [
        0.95,
        0.999
    ],
    "optimizer_eps": 1e-08,
    "optimizer_weight_decay": 1e-06,
    "scheduler_name": "cosine",
    "scheduler_warmup_steps": 500
}
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:9e87ec522642979d34a6a3596d1c4c45b321f4c6febbcc7aee11f277371937d0
size 1043939492
policy_postprocessor.json ADDED
{
    "name": "policy_postprocessor",
    "steps": [
        {
            "registry_name": "unnormalizer_processor",
            "config": {
                "eps": 1e-08,
                "features": {
                    "action": {
                        "type": "ACTION",
                        "shape": [
                            7
                        ]
                    }
                },
                "norm_map": {
                    "VISUAL": "MEAN_STD",
                    "STATE": "MIN_MAX",
                    "ACTION": "MIN_MAX"
                }
            },
            "state_file": "policy_postprocessor_step_0_unnormalizer_processor.safetensors"
        },
        {
            "registry_name": "device_processor",
            "config": {
                "device": "cpu",
                "float_dtype": null
            }
        }
    ]
}
policy_postprocessor_step_0_unnormalizer_processor.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:28463b3b064269f62d7f2a8795a066930dbc0d7b31d994ad868ad609435453b1
size 3552
policy_preprocessor.json ADDED
{
    "name": "policy_preprocessor",
    "steps": [
        {
            "registry_name": "rename_observations_processor",
            "config": {
                "rename_map": {}
            }
        },
        {
            "registry_name": "to_batch_processor",
            "config": {}
        },
        {
            "registry_name": "device_processor",
            "config": {
                "device": "cuda",
                "float_dtype": null
            }
        },
        {
            "registry_name": "normalizer_processor",
            "config": {
                "eps": 1e-08,
                "features": {
                    "observation.state": {
                        "type": "STATE",
                        "shape": [
                            7
                        ]
                    },
                    "observation.goal": {
                        "type": "STATE",
                        "shape": [
                            3
                        ]
                    },
                    "observation.images.top_down_camera": {
                        "type": "VISUAL",
                        "shape": [
                            3,
                            480,
                            640
                        ]
                    },
                    "observation.proximity": {
                        "type": "STATE",
                        "shape": [
                            128
                        ]
                    },
                    "action": {
                        "type": "ACTION",
                        "shape": [
                            7
                        ]
                    }
                },
                "norm_map": {
                    "VISUAL": "MEAN_STD",
                    "STATE": "MIN_MAX",
                    "ACTION": "MIN_MAX"
                }
            },
            "state_file": "policy_preprocessor_step_3_normalizer_processor.safetensors"
        }
    ]
}
policy_preprocessor_step_3_normalizer_processor.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:28463b3b064269f62d7f2a8795a066930dbc0d7b31d994ad868ad609435453b1
size 3552