File size: 5,576 Bytes
ae56ac6
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
---
license: mit
tags:
- robotics
- imitation-learning
- diffusion-policy
- manipulation
- hiro-robot
- lerobot
- goal-conditioned
- sensor-diffusion
datasets:
- roboset_20260112_225816
- roboset_20260113_001336
---

# Proximity Sensor Goal-Conditioned Diffusion Policy

## Model Description

A goal-conditioned Diffusion Policy trained on proximity sensor datasets. The model predicts joint positions (next positions along trajectory) conditioned on the current observation (joint positions, table camera image, encoded proximity sensor data) and a goal cartesian position.

**Key Features:**
- Uses 37 proximity sensors (8x8 depth maps) encoded to 128-dim latent via pretrained autoencoder
- Visual input from table camera (480x640 RGB)
- Goal-conditioned for reaching target cartesian positions
- Predicts 16-step action horizon

## Model Architecture

- **Policy Type**: Diffusion Policy
- **Framework**: LeRobot
- **Horizon**: 16 steps
- **Observation Steps**: 1 step (single timestep)
- **Action Steps**: 8 steps (each covers 2 timesteps)
- **Total Parameters**: ~261M

## Inputs

- **`observation.state`**: Shape `(batch, 1, 7)` - Joint positions (7 DOF arm)
- **`observation.goal`**: Shape `(batch, 1, 3)` - Goal cartesian position (X, Y, Z)
- **`observation.images.table_camera`**: Shape `(batch, 1, 3, 480, 640)` - Table camera RGB images
- **`observation.proximity`**: Shape `(batch, 1, 128)` - Encoded proximity sensor latent (37 sensors → 128-dim via pretrained encoder)

## Outputs

- **`action`**: Shape `(batch, 16, 7)` - Joint positions (7 DOF) for 16-step horizon (next positions along trajectory)

**Note**: The model outputs a full 16-step horizon. Use `select_action()` to get the first step `(batch, 7)`, or `predict_action_chunk()` to get the full horizon `(batch, 16, 7)`.

## Normalization

### Input Normalization

**Images** (`observation.images.table_camera`):
- Normalize from `[0, 255]` to `[0, 1]` by dividing by `255.0`
- Then apply mean-std normalization using dataset statistics (handled by preprocessor)

**State** (`observation.state`):
- Apply min-max normalization: `(state - min) / (max - min)` using dataset statistics (handled by preprocessor)

**Goal** (`observation.goal`):
- Apply min-max normalization: `(goal - min) / (max - min)` using dataset statistics (handled by preprocessor)

**Proximity** (`observation.proximity`):
- Encoded via pretrained ProximityAutoencoder (frozen encoder)
- 37 sensors × (8×8 depth maps) → 128-dim latent
- Apply min-max normalization using dataset statistics (handled by preprocessor)

### Output Unnormalization

**Actions** (`action`):
- Apply inverse min-max normalization: `action * (max - min) + min` using dataset statistics (handled by postprocessor)
- **Note**: Actions are joint positions (not velocities) - these are the next positions the robot should move to along the trajectory

## Usage

```python
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
from lerobot.policies.factory import make_pre_post_processors

# Load model
policy = DiffusionPolicy.from_pretrained("calebescobedo/sensor-diffusion-policy-topdown-camera")

# Load preprocessor and postprocessor from the same repo
preprocessor, postprocessor = make_pre_post_processors(
    policy_cfg=policy.config,
    pretrained_path="calebescobedo/sensor-diffusion-policy-topdown-camera"
)

# Prepare inputs
batch = {
    'observation.state': state_tensor,  # (batch, 1, 7) - raw joint positions
    'observation.goal': goal_tensor,  # (batch, 1, 3) - raw goal xyz
    'observation.images.table_camera': table_img,  # (batch, 1, 3, 480, 640) - uint8 [0,255] or float [0,1]
    'observation.proximity': proximity_latent,  # (batch, 1, 128) - encoded proximity sensor latent
}

# Inference
policy.eval()
with torch.no_grad():
    batch = preprocessor(batch)  # Normalizes inputs
    actions = policy.select_action(batch)  # Returns normalized actions
    actions = postprocessor(actions)  # Unnormalizes to raw joint positions
```

## Training Details

- **Training**: Epoch-based (ensures all trajectories seen)
- **Epochs**: 60
- **Batch Size**: 64
- **Optimizer**: Adam (LeRobot preset)
- **Learning Rate**: From LeRobot optimizer preset
- **Mixed Precision**: Enabled (AMP)
- **Data Loading**: Optimized with persistent file handles (4 workers, prefetch=2)
- **Data Augmentation**: 
  - State noise: 30% probability, scale=0.005
  - Action noise: 30% probability, scale=0.0005
  - Goal noise: 30% probability, scale=[0.003, 0.005, 0.0005] (X, Y, Z)
- **Datasets**: 
  - roboset_20260117_014645 (20 H5 files, ~500 trajectories, ~17,000 sequences)

## Proximity Sensor Encoding

The proximity sensors are encoded using a pretrained autoencoder:
- **Encoder**: 37 sensors × (8×8 depth maps) → 128-dim latent
- **Architecture**: Per-sensor CNN (8×8 → 4×4 → 2×2) + Multi-head attention aggregation
- **Training**: Separate pretraining on depth reconstruction (MSE loss: ~0.118)
- **Status**: Encoder frozen during policy training (no gradients)

## Dataset Notes

- **37 proximity sensors** per timestep (depth_sensor_link1_sensor_0 through depth_sensor_link6_sensor_7)
- Each sensor provides **8×8 depth maps** (`depth_to_camera`)
- **Table camera RGB images** (480×640×3)
- **7-DOF joint positions**
- **Goal-conditioned trajectories**: Each trajectory has a unique goal (final cartesian position)
- **Goal distribution**: 
  - X: [-0.239, 0.294] meters
  - Y: [-0.284, 0.317] meters
  - Z: [0.364, 0.579] meters
- **Total**: ~500 trajectories, ~17,000 sequences

## License

MIT License