| --- |
| library_name: lerobot |
| license: apache-2.0 |
| pipeline_tag: robotics |
| tags: |
| - act |
| - diffusion |
| - robotics |
| - imitation-learning |
| - behavior-cloning |
| - aloha |
| - pytorch_model_hub_mixin |
| - model_hub_mixin |
| datasets: |
| - JHeisler/aloha_solo_left_4_6_26 |
| --- |
| |
| # Hybrid ACT+Diffusion: ALOHA Single-Arm (Left), 40k steps |
|
|
| Custom **HybridACTDiffusion** policy: ACT visual encoder (ResNet18 + 4-layer Transformer, mean-pooled) feeds a Diffusion U-Net decoder (FiLM conditioning, DDPM training, DDIM 10-step inference). No VAE is used; diffusion handles multimodal action distributions directly. |
|
|
| This is the **40k-step retrain (workstream S004)** matching S003's step count for direct architectural comparison vs the shipped ACT-40k baseline. For the initial 13.4k baseline, see [JHeisler/aloha_solo_left_act_diffusion](https://huggingface.co/JHeisler/aloha_solo_left_act_diffusion). |
|
|
| ## Architecture |
|
|
| ``` |
| Images (cam_high, cam_left_wrist) + State (dim=9) |
| │ |
| ▼ |
| ACT Encoder (ResNet18 → 4-layer Transformer) → mean-pool → (B, 512) global cond vector |
| │ |
| ▼ |
| Diffusion U-Net (DiffusionConditionalUnet1d, FiLM modulation, down_dims=(256,512)) |
| │ DDPM training (100 timesteps) / DDIM 10-step inference |
| ▼ |
| Action chunks (chunk_size=100, action_dim=9) |
| ``` |
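
The "DDPM training (100 timesteps) / DDIM 10-step inference" stage in the diagram can be illustrated with a small standard-library sketch. It assumes the scheduler follows the conventions of the `diffusers` library, where `squaredcos_cap_v2` (the schedule named in the training config) is a cosine schedule with betas capped at 0.999, and where DDIM visits an evenly strided subset of the training timesteps. The function names below are hypothetical and are not taken from this repo's code.

```python
import math


def cosine_alpha_bar(t: float) -> float:
    """Cumulative signal level alpha_bar at normalized time t in [0, 1]."""
    return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2


def cosine_betas(num_train_timesteps: int = 100, max_beta: float = 0.999) -> list:
    """Per-step noise variances derived from the cosine alpha_bar curve."""
    betas = []
    for i in range(num_train_timesteps):
        t1 = i / num_train_timesteps
        t2 = (i + 1) / num_train_timesteps
        # beta_i = 1 - alpha_bar(t2) / alpha_bar(t1), capped to avoid beta = 1
        betas.append(min(1.0 - cosine_alpha_bar(t2) / cosine_alpha_bar(t1), max_beta))
    return betas


def ddim_timesteps(num_train_timesteps: int = 100, num_inference_steps: int = 10) -> list:
    """Evenly strided subset of training timesteps, ordered from noisy to clean."""
    stride = num_train_timesteps // num_inference_steps
    return [i * stride for i in range(num_inference_steps)][::-1]


betas = cosine_betas()        # 100 training-time noise levels
timesteps = ddim_timesteps()  # the 10 timesteps denoised at inference
print(timesteps)              # [90, 80, 70, 60, 50, 40, 30, 20, 10, 0]
```

Under these assumptions, inference runs the U-Net 10 times per action chunk instead of 100, which is the main reason to sample with DDIM after training with the full DDPM schedule.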
|
|
| ## Training Config |
|
|
| | Field | Value | |
| |---|---| |
| | Architecture | HybridACTDiffusion (ACT encoder + Diffusion U-Net); see `lerobot/common/policies/hybrid_act_diffusion/` | |
| | Dataset | [JHeisler/aloha_solo_left_4_6_26](https://huggingface.co/datasets/JHeisler/aloha_solo_left_4_6_26): 50 episodes, 29,785 samples, 30 fps | |
| | State / action dim | 9 / 9 | |
| | Cameras | `cam_high`, `cam_left_wrist` (3×480×640 each) | |
| | Steps | 40,000 | |
| | Batch size | 28 (adaptive DOE winner; 6.8% higher throughput than bs=24, at 91.3 samples/s) | |
| | Learning rate | 3.5e-5 (linear-scaled from bs=24's 3e-5) | |
| | Total samples seen | ~1.12M (~37.6 epochs over the dataset) | |
| | AMP | enabled | |
| | torch.compile | enabled | |
| | Save freq | every 10,000 steps (10k / 20k / 30k / 40k checkpoints) | |
| | Diffusion scheduler | DDPM training (100 timesteps, squaredcos_cap_v2), DDIM at inference (10 steps) | |
| | Final loss (DDPM noise-pred MSE) | 0.003–0.007 | |
| | Final grad norm | ~0.10–0.18 | |
| | Wall clock | ~3h 53min on RTX A4500 | |
| | LeRobot pin | `96c7052777aca85d4e55dfba8f81586103ba8f61` (with custom hybrid_act_diffusion policy added) | |
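
The derived figures in the table (total samples seen, epoch count, and the linearly scaled learning rate) follow from the listed steps, batch size, and dataset size. A quick check of that arithmetic, using only values stated in the table:

```python
steps = 40_000
batch_size = 28
dataset_samples = 29_785

# Samples processed over the whole run, and how many dataset passes that is
samples_seen = steps * batch_size          # 1,120,000
epochs = samples_seen / dataset_samples    # about 37.6

# Linear scaling rule: learning rate grows in proportion to batch size
base_lr, base_batch_size = 3e-5, 24
scaled_lr = base_lr * batch_size / base_batch_size  # 3.5e-5

print(f"{samples_seen:,} samples, {epochs:.1f} epochs, lr={scaled_lr:.2e}")
# 1,120,000 samples, 37.6 epochs, lr=3.50e-05
```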
|
|
| ## Project Lineage |
|
|
| | Workstream | Model | Steps | Samples | HF | |
| |---|---|---|---|---| |
| | S001 | ACT | 13,400 | 640K | [act_left](https://huggingface.co/JHeisler/aloha_solo_left_4_6_26_act_left) | |
| | S002 | Hybrid ACT+Diffusion | 13,400 | 321K | [act_diffusion](https://huggingface.co/JHeisler/aloha_solo_left_act_diffusion) | |
| | S003 | ACT (shipped) | 40,000 | 1.92M | [act_left_40k](https://huggingface.co/JHeisler/aloha_solo_left_4_6_26_act_left_40k) | |
| | **S004** | **Hybrid ACT+Diffusion** | **40,000** | **1.12M** | **this repo** | |
|
|
| S003 vs S004 is the apples-to-apples architectural comparison: same dataset, same step count, ACT-VAE vs ACT-Diffusion decoder. |
|
|
| ## Notes on loss comparability |
|
|
| DDPM noise-prediction MSE (this model) and ACT's combined L1+KL loss (S001/S003) are different objectives, so absolute loss values are NOT directly comparable across architectures. The right comparison is offline action L1 on held-out episodes or real-robot rollout success rate. |
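
As a rough illustration of the offline action L1 metric mentioned above, the sketch below averages the absolute error between a predicted action chunk and the recorded actions over every timestep and action dimension. The function name `action_l1` and the toy inputs are hypothetical; a real evaluation would run on un-normalized policy outputs for held-out episodes, and this card does not specify that pipeline.

```python
from statistics import fmean


def action_l1(pred, target):
    """Mean absolute error over all timesteps and action dimensions of one chunk.

    Both arguments are sequences of per-timestep action vectors with equal shapes.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of timesteps")
    errors = []
    for pred_step, target_step in zip(pred, target):
        if len(pred_step) != len(target_step):
            raise ValueError("action dimensions must match at every timestep")
        errors.extend(abs(p - t) for p, t in zip(pred_step, target_step))
    return fmean(errors)


# Toy example: 2 timesteps, 2 action dimensions
pred = [[0.0, 1.0], [2.0, 3.0]]
target = [[0.5, 1.0], [2.0, 2.0]]
print(action_l1(pred, target))  # 0.375
```

Because this metric is computed in action space rather than in each model's training-loss space, the same function can score both the ACT and the hybrid policy on identical held-out chunks.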
|
|
| ## Usage |
|
|
| ```python |
| # Requires lerobot pinned to 96c7052 with hybrid_act_diffusion policy package added |
| from lerobot.common.policies.hybrid_act_diffusion.modeling_hybrid_act_diffusion import HybridACTDiffusionPolicy |
| policy = HybridACTDiffusionPolicy.from_pretrained("JHeisler/aloha_solo_left_act_diffusion_40k") |
| ``` |
|
|
| ## Citation / Course |
|
|
| EN.525.681 school project, JHU Whiting School of Engineering. Team: Jake Heisler, Laura Kroening, Purushottam Shukla. |
|
|
| Code reference: [HuggingFace LeRobot](https://github.com/huggingface/lerobot) at commit `96c7052` with custom hybrid policy package. |
|
|