Proximity Sensor Goal-Conditioned Diffusion Policy

Model Description

A goal-conditioned Diffusion Policy trained on proximity sensor datasets. The model predicts joint positions (next positions along trajectory) conditioned on the current observation (joint positions, table camera image, encoded proximity sensor data) and a goal cartesian position.

Key Features:

Uses 37 proximity sensors (8x8 depth maps) encoded to 128-dim latent via pretrained autoencoder
Visual input from table camera (480x640 RGB)
Goal-conditioned for reaching target cartesian positions
Predicts 16-step action horizon

Model Architecture

Policy Type: Diffusion Policy
Framework: LeRobot
Horizon: 16 steps
Observation Steps: 1 step (single timestep)
Action Steps: 8 steps (each covers 2 timesteps)
Total Parameters: ~261M

Inputs

observation.state: Shape (batch, 1, 7) - Joint positions (7 DOF arm)
observation.goal: Shape (batch, 1, 3) - Goal cartesian position (X, Y, Z)
observation.images.table_camera: Shape (batch, 1, 3, 480, 640) - Table camera RGB images
observation.proximity: Shape (batch, 1, 128) - Encoded proximity sensor latent (37 sensors → 128-dim via pretrained encoder)

Outputs

action: Shape (batch, 16, 7) - Joint positions (7 DOF) for 16-step horizon (next positions along trajectory)

Note: The model outputs a full 16-step horizon. Use select_action() to get the first step (batch, 7), or predict_action_chunk() to get the full horizon (batch, 16, 7).

Normalization

Input Normalization

Images (observation.images.table_camera):

Normalize from [0, 255] to [0, 1] by dividing by 255.0
Then apply mean-std normalization using dataset statistics (handled by preprocessor)

State (observation.state):

Apply min-max normalization: (state - min) / (max - min) using dataset statistics (handled by preprocessor)

Goal (observation.goal):

Apply min-max normalization: (goal - min) / (max - min) using dataset statistics (handled by preprocessor)

Proximity (observation.proximity):

Encoded via pretrained ProximityAutoencoder (frozen encoder)
37 sensors × (8×8 depth maps) → 128-dim latent
Apply min-max normalization using dataset statistics (handled by preprocessor)

Output Unnormalization

Actions (action):

Apply inverse min-max normalization: action * (max - min) + min using dataset statistics (handled by postprocessor)
Note: Actions are joint positions (not velocities) - these are the next positions the robot should move to along the trajectory

Usage

from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
from lerobot.policies.factory import make_pre_post_processors

# Load model
policy = DiffusionPolicy.from_pretrained("calebescobedo/sensor-diffusion-policy-topdown-camera")

# Load preprocessor and postprocessor from the same repo
preprocessor, postprocessor = make_pre_post_processors(
    policy_cfg=policy.config,
    pretrained_path="calebescobedo/sensor-diffusion-policy-topdown-camera"
)

# Prepare inputs
batch = {
    'observation.state': state_tensor,  # (batch, 1, 7) - raw joint positions
    'observation.goal': goal_tensor,  # (batch, 1, 3) - raw goal xyz
    'observation.images.table_camera': table_img,  # (batch, 1, 3, 480, 640) - uint8 [0,255] or float [0,1]
    'observation.proximity': proximity_latent,  # (batch, 1, 128) - encoded proximity sensor latent
}

# Inference
policy.eval()
with torch.no_grad():
    batch = preprocessor(batch)  # Normalizes inputs
    actions = policy.select_action(batch)  # Returns normalized actions
    actions = postprocessor(actions)  # Unnormalizes to raw joint positions

Training Details

Training: Epoch-based (ensures all trajectories seen)
Epochs: 60
Batch Size: 64
Optimizer: Adam (LeRobot preset)
Learning Rate: From LeRobot optimizer preset
Mixed Precision: Enabled (AMP)
Data Loading: Optimized with persistent file handles (4 workers, prefetch=2)
Data Augmentation:
- State noise: 30% probability, scale=0.005
- Action noise: 30% probability, scale=0.0005
- Goal noise: 30% probability, scale=[0.003, 0.005, 0.0005] (X, Y, Z)
Datasets:
- roboset_20260117_014645 (20 H5 files, ~500 trajectories, ~17,000 sequences)

Proximity Sensor Encoding

The proximity sensors are encoded using a pretrained autoencoder:

Encoder: 37 sensors × (8×8 depth maps) → 128-dim latent
Architecture: Per-sensor CNN (8×8 → 4×4 → 2×2) + Multi-head attention aggregation
Training: Separate pretraining on depth reconstruction (MSE loss: ~0.118)
Status: Encoder frozen during policy training (no gradients)

Dataset Notes

37 proximity sensors per timestep (depth_sensor_link1_sensor_0 through depth_sensor_link6_sensor_7)
Each sensor provides 8×8 depth maps (depth_to_camera)
Table camera RGB images (480×640×3)
7-DOF joint positions
Goal-conditioned trajectories: Each trajectory has a unique goal (final cartesian position)
Goal distribution:
- X: [-0.239, 0.294] meters
- Y: [-0.284, 0.317] meters
- Z: [0.364, 0.579] meters
Total: ~500 trajectories, ~17,000 sequences

License

MIT License

Downloads last month: 5

Video Preview

Robotics