Proximity Sensor Goal-Conditioned Diffusion Policy

Model Description

A goal-conditioned Diffusion Policy trained on proximity sensor datasets. Given the current observation (joint positions, a table camera image, and encoded proximity sensor data) and a goal Cartesian position, the model predicts the next joint positions along the trajectory.

Key Features:

  • Uses 37 proximity sensors (8x8 depth maps) encoded to 128-dim latent via pretrained autoencoder
  • Visual input from table camera (480x640 RGB)
  • Goal-conditioned on a target Cartesian position to reach
  • Predicts a 16-step action horizon

Model Architecture

  • Policy Type: Diffusion Policy
  • Framework: LeRobot
  • Horizon: 16 steps
  • Observation Steps: 1 step (single timestep)
  • Action Steps: 8 steps (each covers 2 timesteps)
  • Total Parameters: ~261M

Inputs

  • observation.state: Shape (batch, 1, 7) - Joint positions (7 DOF arm)
  • observation.goal: Shape (batch, 1, 3) - Goal cartesian position (X, Y, Z)
  • observation.images.table_camera: Shape (batch, 1, 3, 480, 640) - Table camera RGB images
  • observation.proximity: Shape (batch, 1, 128) - Encoded proximity sensor latent (37 sensors → 128-dim via pretrained encoder)

Outputs

  • action: Shape (batch, 16, 7) - Joint positions (7 DOF) for 16-step horizon (next positions along trajectory)

Note: The model outputs a full 16-step horizon. Use select_action() to get the first step (batch, 7), or predict_action_chunk() to get the full horizon (batch, 16, 7).
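
The relationship between the 16-step horizon and the 8 action steps can be illustrated with a small receding-horizon loop. This is only a sketch of how action chunking commonly works in diffusion policies, not LeRobot's internal code: `fake_predict_chunk` is a hypothetical stand-in for the policy, and it is an assumption that the first 8 predicted steps are the ones executed before re-planning.

```python
from collections import deque

HORIZON = 16         # steps predicted per diffusion pass (from this model card)
N_ACTION_STEPS = 8   # steps executed before re-planning (from this model card)

def fake_predict_chunk(step):
    """Hypothetical stand-in for the policy: returns HORIZON labelled actions."""
    return [f"plan@{step}:a{i}" for i in range(HORIZON)]

def run_episode(num_steps):
    """Execute one action per step, re-planning whenever the queue runs empty."""
    queue = deque()
    executed, replans = [], 0
    for step in range(num_steps):
        if not queue:
            chunk = fake_predict_chunk(step)
            queue.extend(chunk[:N_ACTION_STEPS])  # assumption: keep the first 8 steps
            replans += 1
        executed.append(queue.popleft())
    return executed, replans

executed, replans = run_episode(20)
```

Under these assumptions, a 20-step episode triggers a new prediction at steps 0, 8, and 16, and the second half of each predicted chunk is discarded.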

Normalization

Input Normalization

Images (observation.images.table_camera):

  • Normalize from [0, 255] to [0, 1] by dividing by 255.0
  • Then apply mean-std normalization using dataset statistics (handled by preprocessor)
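
The two image steps can be sketched for a single RGB pixel as follows. The preprocessor normally handles this, so the snippet is purely illustrative; `normalize_pixel` is a hypothetical helper, and the mean and standard deviation are placeholder values, not the dataset statistics.

```python
def normalize_pixel(rgb_uint8, mean, std):
    """Scale one RGB pixel from [0, 255] to [0, 1], then apply per-channel mean-std."""
    scaled = [c / 255.0 for c in rgb_uint8]
    return [(s - m) / d for s, m, d in zip(scaled, mean, std)]

# Placeholder statistics for illustration only; the real values come from
# the dataset statistics stored with the preprocessor.
EXAMPLE_MEAN = [0.5, 0.5, 0.5]
EXAMPLE_STD = [0.25, 0.25, 0.25]

normalized = normalize_pixel([255, 0, 51], EXAMPLE_MEAN, EXAMPLE_STD)
```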

State (observation.state):

  • Apply min-max normalization: (state - min) / (max - min) using dataset statistics (handled by preprocessor)

Goal (observation.goal):

  • Apply min-max normalization: (goal - min) / (max - min) using dataset statistics (handled by preprocessor)

Proximity (observation.proximity):

  • Encoded via pretrained ProximityAutoencoder (frozen encoder)
  • 37 sensors × (8×8 depth maps) → 128-dim latent
  • Apply min-max normalization using dataset statistics (handled by preprocessor)

Output Unnormalization

Actions (action):

  • Apply inverse min-max normalization: action * (max - min) + min using dataset statistics (handled by postprocessor)
  • Note: Actions are joint positions (not velocities) - these are the next positions the robot should move to along the trajectory
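
The min-max formulas used for states, goals, and actions are inverses of one another, which the sketch below checks with a round trip. The helper names are hypothetical, and using the goal ranges listed under Dataset Notes as the min/max statistics is an assumption made for illustration; the actual statistics are stored with the preprocessor and postprocessor.

```python
def min_max_normalize(values, lo, hi):
    """(x - min) / (max - min), applied element-wise."""
    return [(v - l) / (h - l) for v, l, h in zip(values, lo, hi)]

def min_max_unnormalize(values, lo, hi):
    """x * (max - min) + min, the inverse of min_max_normalize."""
    return [v * (h - l) + l for v, l, h in zip(values, lo, hi)]

# Assumed statistics: the goal ranges reported in this model card (meters).
GOAL_MIN = [-0.239, -0.284, 0.364]
GOAL_MAX = [0.294, 0.317, 0.579]

goal = [0.0275, 0.0165, 0.4715]            # midpoint of each range
goal_norm = min_max_normalize(goal, GOAL_MIN, GOAL_MAX)
goal_back = min_max_unnormalize(goal_norm, GOAL_MIN, GOAL_MAX)
```

With these assumed statistics the midpoint of each range maps to roughly 0.5, and unnormalizing recovers the original goal up to floating-point error.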

Usage

import torch
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
from lerobot.policies.factory import make_pre_post_processors

# Load model
policy = DiffusionPolicy.from_pretrained("calebescobedo/sensor-diffusion-policy-topdown-camera")

# Load preprocessor and postprocessor from the same repo
preprocessor, postprocessor = make_pre_post_processors(
    policy_cfg=policy.config,
    pretrained_path="calebescobedo/sensor-diffusion-policy-topdown-camera"
)

# Prepare inputs
batch = {
    'observation.state': state_tensor,  # (batch, 1, 7) - raw joint positions
    'observation.goal': goal_tensor,  # (batch, 1, 3) - raw goal xyz
    'observation.images.table_camera': table_img,  # (batch, 1, 3, 480, 640) - uint8 [0,255] or float [0,1]
    'observation.proximity': proximity_latent,  # (batch, 1, 128) - encoded proximity sensor latent
}

# Inference
policy.eval()
with torch.no_grad():
    batch = preprocessor(batch)  # Normalizes inputs
    actions = policy.select_action(batch)  # Returns the next normalized action, shape (batch, 7)
    actions = postprocessor(actions)  # Unnormalizes to raw joint positions
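
Before calling the preprocessor, it can help to confirm that every input has the shape listed under Inputs. The following is a minimal sketch of such a check; `check_batch_shapes` is a hypothetical helper, not part of LeRobot, and it only relies on each value having a `.shape` attribute. The demo uses stand-in objects so that it runs without torch.

```python
from types import SimpleNamespace

# Expected shapes after the batch dimension, taken from the Inputs section.
EXPECTED_SHAPES = {
    "observation.state": (1, 7),
    "observation.goal": (1, 3),
    "observation.images.table_camera": (1, 3, 480, 640),
    "observation.proximity": (1, 128),
}

def check_batch_shapes(batch):
    """Return a list of problems; an empty list means the batch looks consistent."""
    problems = []
    batch_sizes = set()
    for key, expected in EXPECTED_SHAPES.items():
        if key not in batch:
            problems.append(f"missing key: {key}")
            continue
        shape = tuple(batch[key].shape)
        if shape[1:] != expected:
            want = ", ".join(map(str, expected))
            problems.append(f"{key}: expected (batch, {want}), got {shape}")
        if shape:
            batch_sizes.add(shape[0])
    if len(batch_sizes) > 1:
        problems.append(f"inconsistent batch sizes: {sorted(batch_sizes)}")
    return problems

# Stand-in objects with a .shape attribute; real inputs would be torch tensors.
demo = {k: SimpleNamespace(shape=(2, *v)) for k, v in EXPECTED_SHAPES.items()}
problems = check_batch_shapes(demo)
```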

Training Details

  • Training: Epoch-based (ensures every trajectory is seen)
  • Epochs: 60
  • Batch Size: 64
  • Optimizer: Adam (LeRobot preset)
  • Learning Rate: From LeRobot optimizer preset
  • Mixed Precision: Enabled (AMP)
  • Data Loading: Optimized with persistent file handles (4 workers, prefetch=2)
  • Data Augmentation:
    • State noise: 30% probability, scale=0.005
    • Action noise: 30% probability, scale=0.0005
    • Goal noise: 30% probability, scale=[0.003, 0.005, 0.0005] (X, Y, Z)
  • Datasets:
    • roboset_20260117_014645 (20 H5 files, ~500 trajectories, ~17,000 sequences)
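
The noise augmentation listed above could look roughly like the sketch below. Only the probabilities and scales come from this model card. Zero-mean Gaussian noise, the helper name `maybe_add_noise`, and applying the noise to all elements at once are assumptions, and the card does not say whether noise is added before or after normalization.

```python
import random

def maybe_add_noise(values, scale, prob=0.3, rng=random):
    """With probability `prob`, add zero-mean Gaussian noise to every element.

    `scale` is either one number or a per-element list (as for the goal).
    Gaussian noise is an assumption; the card only gives probability and scale.
    """
    if rng.random() >= prob:
        return list(values)  # unchanged copy
    scales = scale if isinstance(scale, (list, tuple)) else [scale] * len(values)
    return [v + rng.gauss(0.0, s) for v, s in zip(values, scales)]

rng = random.Random(0)  # seeded for reproducibility
state = [0.0] * 7
noisy_state = maybe_add_noise(state, scale=0.005, rng=rng)
noisy_goal = maybe_add_noise([0.1, 0.0, 0.45], scale=[0.003, 0.005, 0.0005], rng=rng)
```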

Proximity Sensor Encoding

The proximity sensors are encoded using a pretrained autoencoder:

  • Encoder: 37 sensors × (8×8 depth maps) → 128-dim latent
  • Architecture: Per-sensor CNN (8×8 → 4×4 → 2×2) + Multi-head attention aggregation
  • Training: Separate pretraining on depth reconstruction (MSE loss: ~0.118)
  • Status: Encoder frozen during policy training (no gradients)
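
The dimensions quoted above can be checked with some simple bookkeeping. The sketch below uses the standard convolution output-size formula. The kernel size 3, stride 2, and padding 1 are assumptions chosen because they reproduce the stated 8×8 → 4×4 → 2×2 reduction; the actual layer settings are not given in this model card.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Standard output-size formula for a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

NUM_SENSORS = 37   # from this model card
MAP_SIZE = 8       # each sensor yields an 8x8 depth map
LATENT_DIM = 128   # size of the encoded latent

sizes = [MAP_SIZE]
for _ in range(2):                       # two downsampling stages
    sizes.append(conv_out_size(sizes[-1]))

raw_values = NUM_SENSORS * MAP_SIZE * MAP_SIZE   # depth values per timestep
compression = raw_values / LATENT_DIM            # raw values per latent dimension
```

Each timestep therefore carries 2,368 raw depth values, which the encoder compresses by a factor of 18.5 into the 128-dim latent.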

Dataset Notes

  • 37 proximity sensors per timestep (depth_sensor_link1_sensor_0 through depth_sensor_link6_sensor_7)
  • Each sensor provides 8×8 depth maps (depth_to_camera)
  • Table camera RGB images (480×640×3)
  • 7-DOF joint positions
  • Goal-conditioned trajectories: Each trajectory has a unique goal (final cartesian position)
  • Goal distribution:
    • X: [-0.239, 0.294] meters
    • Y: [-0.284, 0.317] meters
    • Z: [0.364, 0.579] meters
  • Total: ~500 trajectories, ~17,000 sequences
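
Because the goals seen in training span a limited range, it may be worth checking a requested goal against that range before running the policy. The helper below is a hypothetical sketch built from the ranges listed above; how the policy behaves for goals outside this range is not documented here.

```python
# Goal ranges in meters, copied from the goal distribution above.
GOAL_RANGE = {"x": (-0.239, 0.294), "y": (-0.284, 0.317), "z": (0.364, 0.579)}

def goal_in_training_range(goal_xyz, margin=0.0):
    """True if every coordinate lies in its training range, widened by `margin` meters."""
    return all(
        lo - margin <= value <= hi + margin
        for value, (lo, hi) in zip(goal_xyz, GOAL_RANGE.values())
    )

inside = goal_in_training_range((0.0, 0.0, 0.45))
outside = goal_in_training_range((0.5, 0.0, 0.45))
```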

License

MIT License
