Proximity Sensor Goal-Conditioned Diffusion Policy
Model Description
A goal-conditioned Diffusion Policy trained on proximity sensor datasets. The model predicts joint positions (next positions along trajectory) conditioned on the current observation (joint positions, table camera image, encoded proximity sensor data) and a goal cartesian position.
Key Features:
- Uses 37 proximity sensors (8x8 depth maps) encoded to 128-dim latent via pretrained autoencoder
- Visual input from table camera (480x640 RGB)
- Goal-conditioned for reaching target cartesian positions
- Predicts 16-step action horizon
Model Architecture
- Policy Type: Diffusion Policy
- Framework: LeRobot
- Horizon: 16 steps
- Observation Steps: 1 step (single timestep)
- Action Steps: 8 steps (each covers 2 timesteps)
- Total Parameters: ~261M
Inputs
observation.state: Shape(batch, 1, 7)- Joint positions (7 DOF arm)observation.goal: Shape(batch, 1, 3)- Goal cartesian position (X, Y, Z)observation.images.table_camera: Shape(batch, 1, 3, 480, 640)- Table camera RGB imagesobservation.proximity: Shape(batch, 1, 128)- Encoded proximity sensor latent (37 sensors โ 128-dim via pretrained encoder)
Outputs
action: Shape(batch, 16, 7)- Joint positions (7 DOF) for 16-step horizon (next positions along trajectory)
Note: The model outputs a full 16-step horizon. Use select_action() to get the first step (batch, 7), or predict_action_chunk() to get the full horizon (batch, 16, 7).
Normalization
Input Normalization
Images (observation.images.table_camera):
- Normalize from
[0, 255]to[0, 1]by dividing by255.0 - Then apply mean-std normalization using dataset statistics (handled by preprocessor)
State (observation.state):
- Apply min-max normalization:
(state - min) / (max - min)using dataset statistics (handled by preprocessor)
Goal (observation.goal):
- Apply min-max normalization:
(goal - min) / (max - min)using dataset statistics (handled by preprocessor)
Proximity (observation.proximity):
- Encoded via pretrained ProximityAutoencoder (frozen encoder)
- 37 sensors ร (8ร8 depth maps) โ 128-dim latent
- Apply min-max normalization using dataset statistics (handled by preprocessor)
Output Unnormalization
Actions (action):
- Apply inverse min-max normalization:
action * (max - min) + minusing dataset statistics (handled by postprocessor) - Note: Actions are joint positions (not velocities) - these are the next positions the robot should move to along the trajectory
Usage
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
from lerobot.policies.factory import make_pre_post_processors
# Load model
policy = DiffusionPolicy.from_pretrained("calebescobedo/sensor-diffusion-policy-table-camera-300epoch")
# Load preprocessor and postprocessor from the same repo
preprocessor, postprocessor = make_pre_post_processors(
policy_cfg=policy.config,
pretrained_path="calebescobedo/sensor-diffusion-policy-table-camera-300epoch"
)
# Prepare inputs
batch = {
'observation.state': state_tensor, # (batch, 1, 7) - raw joint positions
'observation.goal': goal_tensor, # (batch, 1, 3) - raw goal xyz
'observation.images.table_camera': table_img, # (batch, 1, 3, 480, 640) - uint8 [0,255] or float [0,1]
'observation.proximity': proximity_latent, # (batch, 1, 128) - encoded proximity sensor latent
}
# Inference
policy.eval()
with torch.no_grad():
batch = preprocessor(batch) # Normalizes inputs
actions = policy.select_action(batch) # Returns normalized actions
actions = postprocessor(actions) # Unnormalizes to raw joint positions
Training Details
- Training: Epoch-based (ensures all trajectories seen)
- Epochs: 300
- Batch Size: 64
- Optimizer: Adam (LeRobot preset)
- Learning Rate: From LeRobot optimizer preset
- Mixed Precision: Enabled (AMP)
- Data Loading: Optimized with persistent file handles (4 workers, prefetch=2)
- Data Augmentation:
- State noise: 30% probability, scale=0.005
- Action noise: 30% probability, scale=0.0005
- Goal noise: 30% probability, scale=[0.003, 0.005, 0.0005] (X, Y, Z)
- Datasets:
- roboset_20260117_014645 (20 H5 files, ~500 trajectories, ~17,000 sequences)
Proximity Sensor Encoding
The proximity sensors are encoded using a pretrained autoencoder:
- Encoder: 37 sensors ร (8ร8 depth maps) โ 128-dim latent
- Architecture: Per-sensor CNN (8ร8 โ 4ร4 โ 2ร2) + Multi-head attention aggregation
- Training: Separate pretraining on depth reconstruction (MSE loss: ~0.118)
- Status: Encoder frozen during policy training (no gradients)
Dataset Notes
- 37 proximity sensors per timestep (depth_sensor_link1_sensor_0 through depth_sensor_link6_sensor_7)
- Each sensor provides 8ร8 depth maps (
depth_to_camera) - Table camera RGB images (480ร640ร3)
- 7-DOF joint positions
- Goal-conditioned trajectories: Each trajectory has a unique goal (final cartesian position)
- Goal distribution:
- X: [-0.239, 0.294] meters
- Y: [-0.284, 0.317] meters
- Z: [0.364, 0.579] meters
- Total: ~500 trajectories, ~17,000 sequences
License
MIT License
- Downloads last month
- 13