ACT-MODIFIED - MetaWorld MT-1 Shelf-Place

Model Description

This is a trained, modified Action Chunking with Transformers (ACT) model for the MetaWorld MT-1 shelf-place-v3 task.

Architecture

Unlike standard ACT, the modified variant feeds image features into both the encoder and the decoder (visual conditioning on both sides of the CVAE).

  • Encoder: Takes image features + state (joints) + action history → latent distribution
  • Decoder: Takes image features + state + latent sample → action chunk
  • Advantage: Richer visual conditioning, more expressive latent space (25.43M parameters)
  • Hypothesis: Should perform better with more training data
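The data flow above can be sketched in PyTorch. This is an illustrative sketch only, not the repository's actual implementation: the class name, the use of pre-extracted image feature tokens, and the token layout are assumptions; only the dimensions (hidden 256, latent 32, 8 heads, 4 layers) come from the config below.

```python
import torch
import torch.nn as nn

class ModifiedACTSketch(nn.Module):
    """Illustrative sketch of the modified ACT forward pass (not the real code)."""
    def __init__(self, joint_dim=39, action_dim=4, hidden=256, latent=32,
                 n_heads=8, n_layers=4, chunk=100, img_feat=512):
        super().__init__()
        self.chunk = chunk
        self.img_proj = nn.Linear(img_feat, hidden)      # image tokens -> hidden
        self.state_proj = nn.Linear(joint_dim, hidden)   # 39-D state -> hidden
        self.act_proj = nn.Linear(action_dim, hidden)    # action history -> hidden
        enc_layer = nn.TransformerEncoderLayer(hidden, n_heads, 1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.to_latent = nn.Linear(hidden, 2 * latent)   # predicts (mu, logvar)
        self.z_proj = nn.Linear(latent, hidden)
        dec_layer = nn.TransformerEncoderLayer(hidden, n_heads, 1024, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, n_layers)
        self.queries = nn.Parameter(torch.zeros(chunk, hidden))  # one query per chunk step
        self.head = nn.Linear(hidden, action_dim)

    def forward(self, img_tokens, state, actions):
        # Encoder: image features + state + action history -> latent distribution
        tokens = torch.cat([self.img_proj(img_tokens),
                            self.state_proj(state).unsqueeze(1),
                            self.act_proj(actions)], dim=1)
        h = self.encoder(tokens).mean(dim=1)
        mu, logvar = self.to_latent(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        # Decoder: image features + state + latent sample -> action chunk
        ctx = torch.cat([self.img_proj(img_tokens),
                         self.state_proj(state).unsqueeze(1),
                         self.z_proj(z).unsqueeze(1),
                         self.queries.expand(img_tokens.size(0), -1, -1)], dim=1)
        out = self.decoder(ctx)[:, -self.chunk:]          # keep the query positions
        return self.head(out), mu, logvar

model = ModifiedACTSketch()
img = torch.randn(2, 4, 512)     # e.g. 4 image patch tokens per frame
state = torch.randn(2, 39)
acts = torch.randn(2, 100, 4)
chunk_pred, mu, logvar = model(img, state, acts)
print(chunk_pred.shape)  # torch.Size([2, 100, 4])
```

The key difference from vanilla ACT is that `img_tokens` enters both the encoder call and the decoder context, rather than the decoder alone.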

Training Details

  • Task: MetaWorld MT-1 shelf-place-v3
    • Single-task manipulation (place puck on shelf)
    • Varying object positions (randomized)
  • Observations:
    • State: 39-dimensional (joint positions, velocities, gripper info)
    • Images: 480×480 RGB (downsampled to 64×64 for processing)
  • Action Space: 4D continuous [Δx, Δy, Δz, gripper]
  • Training:
    • Demonstrations: 10 expert episodes (100% success)
    • Training samples: 4,500
    • Epochs: 50
    • Batch size: 8
    • Learning rate: 1e-4
    • Chunk size: 100 steps (note: the saved config below lists chunk_size: 50)
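The 480×480 → 64×64 downsampling mentioned above can be done in a few lines. A minimal sketch, assuming bilinear resizing and [0, 1] normalization (the repo's exact preprocessing may differ):

```python
import torch
import torch.nn.functional as F

def preprocess(frame: torch.Tensor) -> torch.Tensor:
    """Hypothetical preprocessing: (480, 480, 3) uint8 -> (1, 3, 64, 64) float in [0, 1]."""
    x = frame.permute(2, 0, 1).unsqueeze(0).float() / 255.0   # HWC uint8 -> NCHW float
    return F.interpolate(x, size=(64, 64), mode="bilinear", align_corners=False)

frame = torch.randint(0, 256, (480, 480, 3), dtype=torch.uint8)
img = preprocess(frame)
print(img.shape)  # torch.Size([1, 3, 64, 64])
```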

Performance

  • Success Rate: 0% (the policy has not yet solved the task, likely due to the small 10-episode demonstration set)
  • Status: Training loss converged; evaluation with a larger dataset is pending

Usage

Installation

# Clone repo and install
git clone https://huggingface.co/aryannzzz/act-metaworld-shelf-modified
pip install torch torchvision

Loading the Model

import torch
from pathlib import Path

# Load checkpoint
device = 'cuda' if torch.cuda.is_available() else 'cpu'
checkpoint = torch.load('model_modified.pt', map_location=device)

# Model config is in checkpoint['config']
model_config = checkpoint['config']
print("Model configuration:", model_config)

# The checkpoint contains:
# - model_state_dict: Model weights
# - config: Model architecture config
# - training_config: Training hyperparameters
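At rollout time, ACT-style policies typically combine overlapping action chunks with temporal ensembling; the config below includes a temporal_ensemble_weight of 0.01. The sketch below shows one common scheme (exponential weighting by prediction age); the function name and buffer layout are illustrative, not from this repo:

```python
import numpy as np

def ensemble_action(chunk_buffer, t, m=0.01):
    """Average all chunk predictions that cover timestep t.

    chunk_buffer: list of (start_step, chunk) pairs, chunk shape (T, 4).
    Older predictions are down-weighted by exp(-m * age), with
    m = temporal_ensemble_weight (0.01 in this config).
    """
    preds, weights = [], []
    for start, chunk in chunk_buffer:
        if start <= t < start + len(chunk):
            preds.append(chunk[t - start])
            weights.append(np.exp(-m * (t - start)))
    w = np.array(weights) / np.sum(weights)
    return (np.stack(preds) * w[:, None]).sum(axis=0)

# Two overlapping 100-step chunks voting on the action at t = 1
buf = [(0, np.ones((100, 4))), (1, np.zeros((100, 4)))]
action = ensemble_action(buf, t=1)
```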

Model Architecture Details

Configuration

{
  "dataset": {
    "batch_size": 8,
    "num_workers": 2,
    "val_split": 0.2
  },
  "model": {
    "joint_dim": 39,
    "action_dim": 4,
    "hidden_dim": 256,
    "latent_dim": 32,
    "n_encoder_layers": 4,
    "n_decoder_layers": 4,
    "n_heads": 8,
    "feedforward_dim": 1024,
    "dropout": 0.1
  },
  "chunking": {
    "chunk_size": 50,
    "temporal_ensemble_weight": 0.01
  },
  "training": {
    "epochs": 50,
    "learning_rate": 0.0001,
    "weight_decay": 0.0001,
    "kl_weight": 10.0,
    "grad_clip": 1.0
  },
  "env": {
    "task": "shelf-place-v3",
    "image_size": [
      480,
      480
    ],
    "action_space": 4,
    "state_space": 39
  },
  "logging": {
    "use_wandb": false,
    "log_every": 10,
    "save_every": 10
  }
}
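As a quick sanity check after loading, the model and env sections of the config should agree on dimensions. A minimal sketch (the JSON fragment is inlined here for illustration; in practice read `checkpoint['config']`):

```python
import json

config = json.loads("""
{"model": {"joint_dim": 39, "action_dim": 4},
 "env":   {"state_space": 39, "action_space": 4}}
""")
# State and action dimensions must match between model and environment
assert config["model"]["joint_dim"] == config["env"]["state_space"]
assert config["model"]["action_dim"] == config["env"]["action_space"]
print("config dimensions consistent")
```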

Citation

If you use this model, please cite:

@article{zhao2023learning,
  title={Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware},
  author={Zhao, Tony Z and Kumar, Vikash and Levine, Sergey and Finn, Chelsea},
  journal={arXiv preprint arXiv:2304.13705},
  year={2023}
}

License

Apache License 2.0


Uploaded: 2025-12-11 22:12:29
Variant: modified
Repository: https://huggingface.co/aryannzzz/act-metaworld-shelf-modified
