---
library_name: lerobot
license: apache-2.0
pipeline_tag: robotics
tags:
- act
- diffusion
- robotics
- imitation-learning
- behavior-cloning
- aloha
- pytorch_model_hub_mixin
- model_hub_mixin
datasets:
- JHeisler/aloha_solo_left_4_6_26
---

# Hybrid ACT+Diffusion: ALOHA Single-Arm (Left), 13.4k steps

Custom **HybridACTDiffusion** policy: an ACT visual encoder (ResNet18 + 4-layer Transformer, mean-pooled) feeds a Diffusion U-Net decoder (FiLM conditioning, DDPM training, 10-step DDIM inference). There is no VAE; the diffusion decoder models multimodal action distributions directly.

This is the **initial 13.4k-step Hybrid baseline (S002)**. For the longer 40k retrain, see [JHeisler/aloha_solo_left_act_diffusion_40k](https://huggingface.co/JHeisler/aloha_solo_left_act_diffusion_40k).

## Architecture

```
Images (cam_high, cam_left_wrist) + State (dim=9)
     │
     ▼
ACT Encoder (ResNet18 → 4-layer Transformer) → mean-pool → (B, 512) global cond vector
     │
     ▼
Diffusion U-Net (DiffusionConditionalUnet1d, FiLM modulation, down_dims=(256,512))
     │  DDPM training / DDIM 10-step inference
     ▼
Action chunks (chunk_size=100, action_dim=9)
```
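
The FiLM modulation step in the diagram above can be illustrated with a small, dependency-free sketch. It assumes the common FiLM form, in which a linear projection of the conditioning vector yields one scale and one shift per feature channel. The helper names (`linear`, `film`), the toy sizes, and the weights are hypothetical and are not taken from the trained model or from the policy code.

```python
def linear(weights, bias, vec):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def film(features, cond, w_scale, b_scale, w_shift, b_shift):
    """Apply FiLM: out[c][t] = scale[c] * features[c][t] + shift[c].

    features: list of channels, each a list over the action-chunk time axis.
    cond:     global conditioning vector (stand-in for the mean-pooled encoder output).
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    return [[scale[c] * x + shift[c] for x in channel] for c, channel in enumerate(features)]


# Toy sizes (hypothetical): 2 feature channels, 3 time steps, conditioning dim 2.
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
cond = [1.0, -1.0]
w_scale, b_scale = [[2.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w_shift, b_shift = [[0.5, 0.5], [1.0, 0.0]], [0.0, 0.0]

modulated = film(features, cond, w_scale, b_scale, w_shift, b_shift)
```

In the real model the same idea would apply with a 512-dimensional conditioning vector and the U-Net's channel widths; the conditioning vector is shared across all time steps of the chunk, which is why the scale and shift are indexed by channel only.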

## Training Config

| Field | Value |
|---|---|
| Architecture | HybridACTDiffusion (ACT encoder + Diffusion U-Net); see `lerobot/common/policies/hybrid_act_diffusion/` |
| Dataset | [JHeisler/aloha_solo_left_4_6_26](https://huggingface.co/datasets/JHeisler/aloha_solo_left_4_6_26): 50 episodes, 29,785 samples, 30 fps |
| State / action dim | 9 / 9 |
| Cameras | `cam_high`, `cam_left_wrist` (3×480×640 each) |
| Steps | 13,400 |
| Batch size | 24 (DOE winner) |
| Learning rate | 3e-5 |
| Total samples seen | ~321K (~10.8 epochs) |
| AMP | enabled |
| torch.compile | enabled |
| Diffusion scheduler | DDPM training (100 timesteps, squaredcos_cap_v2), DDIM at inference (10 steps) |
| Final loss (DDPM noise-pred MSE) | 0.011–0.020 |
| Final grad norm | 0.2–0.7 |
| Wall clock | ~1h 16min on RTX A4500 |
| LeRobot pin | `96c7052777aca85d4e55dfba8f81586103ba8f61` (with custom hybrid_act_diffusion policy added) |
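
As a rough illustration of the scheduler row above, the sketch below computes a squared-cosine beta schedule for 100 training timesteps and picks 10 evenly strided timesteps for DDIM sampling. It assumes the squared-cosine formula commonly associated with the `squaredcos_cap_v2` name (offset 0.008, betas capped at 0.999) and a simple stride-based timestep selection; the function names are hypothetical, and the scheduler configuration actually used in training may differ.

```python
import math

NUM_TRAIN_TIMESTEPS = 100  # from the table above
NUM_INFERENCE_STEPS = 10   # from the table above


def alpha_bar(t):
    """Cumulative signal level at normalized time t in [0, 1] (squared cosine)."""
    return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2


def squaredcos_cap_v2_betas(num_steps, max_beta=0.999):
    """Per-step noise variances derived from alpha_bar, capped at max_beta."""
    betas = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas


def ddim_timesteps(num_train, num_inference):
    """Evenly strided subset of training timesteps, ordered from noisy to clean."""
    stride = num_train // num_inference
    return [i * stride for i in range(num_inference)][::-1]


betas = squaredcos_cap_v2_betas(NUM_TRAIN_TIMESTEPS)
timesteps = ddim_timesteps(NUM_TRAIN_TIMESTEPS, NUM_INFERENCE_STEPS)
```

With these settings the sampler visits one timestep in ten, so inference needs 10 U-Net forward passes per action chunk instead of 100.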

## Project Lineage

| Workstream | Model | Steps | Samples | HF |
|---|---|---|---|---|
| S001 | ACT | 13,400 | 640K | [act_left](https://huggingface.co/JHeisler/aloha_solo_left_4_6_26_act_left) |
| **S002** | **Hybrid ACT+Diffusion** | **13,400** | **321K** | **this repo** |
| S003 | ACT (shipped) | 40,000 | 1.92M | [act_left_40k](https://huggingface.co/JHeisler/aloha_solo_left_4_6_26_act_left_40k) |
| S004 | Hybrid ACT+Diffusion | 40,000 | 1.12M | [act_diffusion_40k](https://huggingface.co/JHeisler/aloha_solo_left_act_diffusion_40k) |

## Notes on loss comparability

DDPM noise-prediction MSE (this model) and ACT's L1+KL loss (S001/S003) are different objectives, so absolute loss values are NOT directly comparable across architectures. The right comparison is offline action L1 on held-out episodes or real-robot rollout success rate.
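
A minimal sketch of the offline action-L1 comparison is shown below. It assumes predicted and ground-truth action chunks are available as nested lists of shape (time steps, action dims); the function names and the toy data are hypothetical, and this is not the project's evaluation script.

```python
def action_l1(predicted, target):
    """Mean absolute error over all time steps and action dimensions of one chunk."""
    total, count = 0.0, 0
    for pred_step, target_step in zip(predicted, target):
        for p, t in zip(pred_step, target_step):
            total += abs(p - t)
            count += 1
    return total / count


def mean_action_l1(pairs):
    """Average the per-chunk L1 over (predicted, target) pairs from held-out episodes."""
    scores = [action_l1(pred, tgt) for pred, tgt in pairs]
    return sum(scores) / len(scores)


# Toy data (hypothetical): 2 chunks, 2 time steps, 3 action dims.
pairs = [
    ([[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]], [[0.0, 1.0, 1.0], [1.0, 2.0, 1.0]]),
    ([[0.5, 0.5, 0.5], [0.0, 0.0, 0.0]], [[0.5, 0.5, 0.5], [0.0, 0.0, 0.0]]),
]
held_out_l1 = mean_action_l1(pairs)
```

Because the metric is computed in action space rather than on each model's training objective, the same function can score both the ACT and the hybrid policies on identical held-out episodes.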

## Usage

The custom policy class lives in this project's LeRobot fork. To use:

```python
# Requires lerobot pinned to 96c7052 with the hybrid_act_diffusion policy package added
from lerobot.common.policies.hybrid_act_diffusion.modeling_hybrid_act_diffusion import HybridACTDiffusionPolicy

# Download the weights from the Hub, then switch to inference mode
policy = HybridACTDiffusionPolicy.from_pretrained("JHeisler/aloha_solo_left_act_diffusion")
policy.eval()
```

## Citation / Course

EN.525.681 school project, JHU Whiting School of Engineering. Team: Jake Heisler, Laura Kroening, Purushottam Shukla.

Code reference: [HuggingFace LeRobot](https://github.com/huggingface/lerobot) at commit `96c7052` with custom hybrid policy package.