|
|
--- |
|
|
|
|
|
datasets: |
|
|
- lerobot/pusht |
|
|
library_name: lerobot |
|
|
license: apache-2.0 |
|
|
model_name: act |
|
|
pipeline_tag: robotics |
|
|
tags: |
|
|
- lerobot |
|
|
- robotics |
|
|
- act |
|
|
- pusht |
|
|
- imitation-learning |
|
|
- baseline |
|
|
|
|
|
--- |
|
|
|
|
|
# 🤖 ACT for Push-T (Baseline Benchmark)
|
|
|
|
|
[Code: LeRobot](https://github.com/huggingface/lerobot) · [Dataset: lerobot/pusht](https://huggingface.co/datasets/lerobot/pusht) · [Affiliation: UESTC](https://www.uestc.edu.cn/) · [License: Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
|
|
|
|
|
## 🎯 Research Purpose
|
|
|
|
|
**Important Note:** This model was trained primarily for **academic comparison**: evaluating the performance difference between **ACT** and **Diffusion Policy** under identical training conditions (using the `lerobot/pusht` dataset). It is a benchmark experiment designed to analyze how different algorithms learn this specific manipulation task, **not an attempt to produce a highly successful practical model**.
|
|
|
|
|
> **Summary:** This model represents the **ACT (Action Chunking with Transformers)** baseline trained on the Push-T task. It serves as a comparative benchmark for our research on Diffusion Policies. Despite 200k steps of training, ACT struggled to model the multimodal action distribution required for high-precision alignment in this task. |
|
|
|
|
|
- **🧩 Task**: Push-T (Simulated)
|
|
- **🧠 Algorithm**: [ACT](https://arxiv.org/abs/2304.13705) (Action Chunking with Transformers)
|
|
- **📈 Training Steps**: 200,000
|
|
- **🎓 Author**: Graduate Student, **UESTC** (University of Electronic Science and Technology of China)
|
|
|
|
|
--- |
|
|
|
|
|
## 🔬 Benchmark Results (Baseline)
|
|
|
|
|
This model establishes the baseline performance. Unlike Diffusion Policy, ACT tends to average out multimodal action possibilities, leading to "stiff" behavior or failure to perform fine-grained adjustments at the boundaries. |
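To make the mode-averaging failure concrete, here is a minimal, self-contained sketch (illustrative only, not taken from the training code): when demonstrations are bimodal, the action that minimizes an L2 objective is the mean of the modes, which may itself be an invalid action.

```python
# Illustrative sketch (not from the training code): with bimodal
# demonstrations, the constant action minimizing mean squared error
# is the average of the two modes -- an action no demonstrator took.
import numpy as np

rng = np.random.default_rng(0)

# From the same state, half the demos push left (-1), half push
# right (+1); both succeed, but their mean (~0) does not.
demo_actions = rng.choice([-1.0, 1.0], size=10_000)
demo_actions += 0.05 * rng.standard_normal(demo_actions.shape)

print(f"L2-optimal action: {demo_actions.mean():+.3f}")  # ~ +0.000
```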
|
|
|
|
|
### 📊 Evaluation Metrics (50 Episodes)
|
|
|
|
|
| Metric | Value | Interpretation | Status | |
|
|
| :--- | :---: | :--- | :---: | |
|
|
| **Success Rate** | **0.0%** | Failed to meet the strict >95% overlap criterion. | ❌ |
|
|
| **Avg Max Reward** | **0.51** | Partially covers the target (~50%), but lacks precision. | 🟧 |
|
|
| **Avg Sum Reward** | **55.48** | Trajectories are valid but often stall or drift. | 📉 |
|
|
|
|
|
> **Analysis:** While the model learned the general reaching and pushing motion (Reward > 0.5), it consistently failed the final stage of the task. This highlights ACT's limitation in handling tasks requiring high-precision correction from multimodal demonstrations compared to Generative Policies. |
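For clarity, the sketch below shows how these three numbers aggregate per-episode reward traces. The coverage-based reward and the >0.95 success threshold follow the Push-T convention; the helper function itself is hypothetical.

```python
# Hypothetical helper showing how the table's metrics aggregate
# per-episode reward traces (reward ~= fraction of target coverage).
import numpy as np

def summarize(episodes: list[np.ndarray]) -> dict[str, float]:
    max_rewards = [float(ep.max()) for ep in episodes]
    return {
        # An episode counts as a success if coverage ever exceeds 95%.
        "success_rate": float(np.mean([m > 0.95 for m in max_rewards])),
        "avg_max_reward": float(np.mean(max_rewards)),
        "avg_sum_reward": float(np.mean([ep.sum() for ep in episodes])),
    }

# Example: an episode stalling at ~0.5 coverage never counts as success.
print(summarize([np.linspace(0.0, 0.51, 300)]))
```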
|
|
|
|
|
--- |
|
|
|
|
|
## ⚙️ Model Details
|
|
|
|
|
| Parameter | Description | |
|
|
| :--- | :--- | |
|
|
| **Architecture** | ResNet18 (Backbone) + Transformer Encoder-Decoder | |
|
|
| **Action Chunking** | 100 steps | |
|
|
| **VAE Enabled** | Yes (Latent Dim: 32) | |
|
|
| **Input** | Single Camera (84x84) + Agent Position | |
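The same settings can also be expressed in code. The sketch below uses lerobot's `ACTConfig`; the import path matches older lerobot releases and may have moved in newer versions, so treat it as an assumption.

```python
# Sketch of the table above as a lerobot ACTConfig. The import path
# is an assumption (it has moved between lerobot releases).
from lerobot.common.policies.act.configuration_act import ACTConfig

config = ACTConfig(
    vision_backbone="resnet18",  # ResNet18 image backbone
    use_vae=True,                # CVAE-style latent encoder
    latent_dim=32,               # VAE latent dimension
    chunk_size=100,              # predict 100 actions per forward pass
    n_action_steps=100,          # execute the whole chunk before replanning
    n_obs_steps=1,               # condition on a single observation
)
```

Note that with `n_action_steps` equal to `chunk_size`, the policy runs open-loop for 100 environment steps between replans, which compounds any early imprecision.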
|
|
|
|
|
--- |
|
|
|
|
|
## 🔧 Training Configuration
|
|
|
|
|
For reproducibility, the key parameters from the training run are listed below; a plain-PyTorch sketch of the equivalent optimizer and AMP setup follows the list.
|
|
|
|
|
- **Batch Size**: 64 |
|
|
- **Optimizer**: AdamW (`lr=2e-5`) |
|
|
- **Scheduler**: Constant |
|
|
- **Vision**: ResNet18 (Pretrained ImageNet) |
|
|
- **Precision**: Mixed Precision (AMP) enabled |
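The following is a hedged, plain-PyTorch sketch of the optimizer and mixed-precision setup listed above; `policy`, `batch`, and the loss are stand-ins, not lerobot objects.

```python
# Plain-PyTorch sketch of the optimizer/AMP setup listed above;
# `policy` is a placeholder module, not the actual ACT network.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
policy = torch.nn.Linear(10, 2).to(device)  # placeholder for ACT
optimizer = torch.optim.AdamW(policy.parameters(), lr=2e-5, weight_decay=2e-4)
scaler = torch.cuda.amp.GradScaler(enabled=device == "cuda")

def training_step(batch: torch.Tensor) -> float:
    optimizer.zero_grad()
    # Mixed-precision forward pass (the `use_amp: true` flag below).
    with torch.autocast(device_type=device, enabled=device == "cuda"):
        loss = policy(batch).pow(2).mean()  # placeholder loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

print(training_step(torch.randn(64, 10, device=device)))
```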
|
|
|
|
|
### Original Training Command
|
|
|
|
|
```bash |
|
|
python -m lerobot.scripts.lerobot_train \
    --config_path act_pusht.yaml \
    --dataset.repo_id lerobot/pusht \
    --job_name aloha_sim_insertion_human_ACT_PushT \
    --wandb.enable true \
    --policy.repo_id Lemon-03/ACT_PushT_test
|
|
``` |
|
|
|
|
|
### act_pusht.yaml |
|
|
|
|
|
<details> |
|
|
<summary>📄 <strong>Click to view the full <code>act_pusht.yaml</code> configuration</strong></summary>
|
|
|
|
|
```yaml |
|
|
# @package _global_ |
|
|
|
|
|
# Basic Settings |
|
|
seed: 100000 |
|
|
job_name: ACT-PushT |
|
|
steps: 200000 |
|
|
eval_freq: 10000 |
|
|
save_freq: 50000 |
|
|
log_freq: 250 |
|
|
batch_size: 64 |
|
|
|
|
|
# Dataset |
|
|
dataset:
  repo_id: lerobot/pusht
|
|
|
|
|
# Evaluation |
|
|
eval:
  n_episodes: 50
  batch_size: 8
|
|
|
|
|
# Environment |
|
|
env:
  type: pusht
  task: PushT-v0
  fps: 10
|
|
|
|
|
# Policy Configuration |
|
|
policy:
  type: act

  # Vision Backbone
  vision_backbone: resnet18
  pretrained_backbone_weights: ResNet18_Weights.IMAGENET1K_V1
  replace_final_stride_with_dilation: false

  # Transformer Params
  pre_norm: false
  dim_model: 512
  n_heads: 8
  dim_feedforward: 3200
  feedforward_activation: relu
  n_encoder_layers: 4
  n_decoder_layers: 1

  # VAE Params
  use_vae: true
  latent_dim: 32
  n_vae_encoder_layers: 4

  # Action Chunking
  chunk_size: 100
  n_action_steps: 100
  n_obs_steps: 1

  # Training & Loss
  dropout: 0.1
  kl_weight: 10.0

  # Optimizer
  optimizer_lr: 2e-5
  optimizer_lr_backbone: 2e-5
  optimizer_weight_decay: 2e-4

  use_amp: true
|
|
``` |
|
|
</details> |
|
|
|
|
|
---
|
|
|
|
|
## 🚀 Evaluation
|
|
|
|
|
Run the following command to evaluate a local training checkpoint for 50 episodes and save the visualization videos (the `outputs/...` path below comes from the author's training run):
|
|
|
|
|
```bash |
|
|
python -m lerobot.scripts.lerobot_eval \ |
|
|
--policy.type act \ |
|
|
--policy.pretrained_path outputs/train/2025-12-02/00-28-32_pusht_ACT_PushT/checkpoints/last/pretrained_model \ |
|
|
--eval.n_episodes 50 \ |
|
|
--eval.batch_size 10 \ |
|
|
--env.type pusht \ |
|
|
--env.task PushT-v0 |
|
|
``` |
|
|
|
|
|
To evaluate the published checkpoint directly from the Hugging Face Hub instead, pass the repo id as the pretrained path:
|
|
|
|
|
```bash |
|
|
python -m lerobot.scripts.lerobot_eval \ |
|
|
--policy.type act \ |
|
|
--policy.pretrained_path Lemon-03/pusht_ACT_PushT_test \ |
|
|
--eval.n_episodes 50 \ |
|
|
--eval.batch_size 10 \ |
|
|
--env.type pusht \ |
|
|
--env.task PushT-v0 |
|
|
``` |
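For quick checks without the CLI, the policy can also be loaded in Python. The import path, the `reset()`/`select_action()` flow, and the observation key names below are assumptions based on the lerobot release and `lerobot/pusht` conventions used here.

```python
# Hedged sketch: query the uploaded checkpoint directly in Python.
# Import path and observation keys are assumptions (see above).
import torch
from lerobot.common.policies.act.modeling_act import ACTPolicy

policy = ACTPolicy.from_pretrained("Lemon-03/pusht_ACT_PushT_test")
policy.eval()
policy.reset()  # clears the internal action-chunk queue

# Dummy observation: one RGB frame plus the 2-D agent position,
# shaped to match the model card's input description (84x84).
obs = {
    "observation.image": torch.rand(1, 3, 84, 84),
    "observation.state": torch.zeros(1, 2),
}
with torch.no_grad():
    action = policy.select_action(obs)
print(action.shape)  # expected (1, 2): target x/y for the pusher
```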