Fetch Pick-and-Place: Behavior Cloning & DAgger policies

PyTorch policies for FetchPickAndPlace-v4 (Gymnasium-Robotics / MuJoCo), trained by imitation learning from a SAC+HER expert. This repo accompanies a controlled comparison of Behavior Cloning (BC), DAgger, and RL. See the full write-up, figures, video, and training code on GitHub: fetch-imitation-learning.

Expert / demonstrator (separate repo): hhmm1122/fetch-pickandplace-sac-her.

Honest scope: this is an engineering and analysis project, not novel research. The weights are MLP policies trained with an MSE loss to imitate a tanh-squashed SAC policy; the strongest checkpoints reach the ceiling of this imitation policy class (~0.8-0.9 success), not the expert's 1.0. The value is the clean, reproducible BC-vs-DAgger-vs-RL comparison and the quantified distribution-shift analysis.

Files

Each file is a representative seed-0 checkpoint. "This ckpt" = that exact model's success on the 100-seed eval harness; "3-seed mean" = mean +/- std over training seeds 0-2.

| File | Method | Demo budget | Online expert queries | This ckpt | 3-seed mean |
|---|---|---|---|---|---|
| bc_d50_s0.pt | Behavior Cloning | 50 demos | - | 0.35 | 0.31 +/- 0.05 |
| bc_d200_s0.pt | Behavior Cloning | 200 demos | - | 0.72 | 0.81 +/- 0.06 |
| dagger_d25_s0.pt | DAgger | 25 demos init | 4,000 | 0.82 | 0.79 +/- 0.05 |
| dagger_d50_s0.pt | DAgger | 50 demos init | 4,000 | 0.92 | 0.89 +/- 0.02 |

Reference: the SAC+HER expert scores 1.000 on the same 100-seed harness.

How to load and run

The .pt files contain a state dict plus the input-normalization statistics. This snippet is self-contained (only needs torch, numpy, gymnasium, gymnasium-robotics):

import numpy as np, torch, torch.nn as nn
import gymnasium as gym, gymnasium_robotics

class MLPPolicy(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(28, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )
    def forward(self, x):
        return self.net(x)

def load_policy(path):
    # weights_only=False unpickles arbitrary objects; only load checkpoints you trust.
    ckpt = torch.load(path, map_location="cpu", weights_only=False)
    net = MLPPolicy(ckpt["hidden"])
    net.load_state_dict(ckpt["state_dict"])
    net.eval()
    # Input-normalization statistics stored alongside the weights.
    mean = np.asarray(ckpt["obs_mean"], np.float32)
    std = np.asarray(ckpt["obs_std"], np.float32)

    def act(obs):
        # 28-d input: observation (25) followed by desired_goal (3).
        x = np.concatenate([obs["observation"], obs["desired_goal"]]).astype(np.float32)
        x = (x - mean) / std
        with torch.no_grad():
            a = net(torch.as_tensor(x).unsqueeze(0)).squeeze(0).numpy()
        return np.clip(a, -1.0, 1.0).astype(np.float32)

    return act

gym.register_envs(gymnasium_robotics)
env = gym.make("FetchPickAndPlace-v4", max_episode_steps=50)
policy = load_policy("dagger_d50_s0.pt")
obs, info = env.reset(seed=0)
success = 0.0
for _ in range(50):
    obs, _r, term, trunc, info = env.step(policy(obs))
    success = float(info["is_success"])  # the value from the final step is what counts
    if term or trunc:
        break
env.close()
print("success:", success)

Policy input is 28-d: observation (25) concatenated with desired_goal (3); achieved_goal is dropped (redundant). Actions are 4-d [dx, dy, dz, gripper] clipped to [-1, 1].
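To check the shapes described above without loading MuJoCo, the sketch below builds the 28-d input from a dict observation and clips a raw 4-d output. The dummy observation, the zero mean, and the unit std are placeholders chosen for illustration, not values taken from the checkpoints, and the helper names `build_input` and `clip_action` are hypothetical.

```python
import numpy as np

def build_input(obs, mean, std):
    """Concatenate observation (25) and desired_goal (3), then standardize."""
    x = np.concatenate([obs["observation"], obs["desired_goal"]]).astype(np.float32)
    return (x - mean) / std

def clip_action(a):
    """Clip a raw 4-d output [dx, dy, dz, gripper] to [-1, 1]."""
    return np.clip(np.asarray(a, dtype=np.float32), -1.0, 1.0)

# Placeholder observation with the same keys and shapes as the env's dict obs.
dummy_obs = {
    "observation": np.zeros(25, dtype=np.float32),
    "achieved_goal": np.zeros(3, dtype=np.float32),  # not fed to the policy
    "desired_goal": np.ones(3, dtype=np.float32),
}

# Placeholder statistics (assumption): zero mean, unit std.
x = build_input(dummy_obs, mean=np.zeros(28, np.float32), std=np.ones(28, np.float32))
a = clip_action([1.7, -0.2, -3.0, 0.5])
print(x.shape, a)
```

In real use, `mean` and `std` would come from the checkpoint's `obs_mean` and `obs_std` entries, as in the loading snippet.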

Training (summary)

  • BC: supervised MSE on (state, expert-action) pairs from deterministic, success-filtered expert rollouts. 2x256 ReLU MLP, input standardization, best-val-MSE checkpoint.
  • DAgger: start from BC, then iterate - roll out the learner, query the expert at the learner's visited states, aggregate, retrain. 8 rounds x 10 episodes; beta decays 1->0 over the first 5 rounds; 4,000 cumulative expert queries.
  • Expert: deterministic SAC+HER.
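The DAgger loop summarized above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the training code from the repository: the linear shape of the beta decay, the per-step mixing of expert and learner actions, and the toy 1-D environment, expert, and `fit` function are all hypothetical stand-ins. Only the round, episode, and step counts come from the summary, and they reproduce the stated query budget (8 rounds x 10 episodes x 50 steps = 4,000).

```python
import random

def beta_schedule(round_idx, decay_rounds=5):
    """Assumed linear decay from 1 to 0 across the first `decay_rounds` rounds."""
    return max(0.0, 1.0 - round_idx / (decay_rounds - 1))

def dagger(expert, fit, reset, step, init_data,
           rounds=8, episodes=10, horizon=50, seed=0):
    rng = random.Random(seed)
    data = list(init_data)      # aggregated (state, expert_action) pairs
    learner = fit(data)         # BC initialization
    queries = 0
    for r in range(rounds):
        beta = beta_schedule(r)
        for _ in range(episodes):
            s = reset()
            for _ in range(horizon):
                a_expert = expert(s)    # expert labels the visited state
                queries += 1
                data.append((s, a_expert))
                # Mixture policy: execute the expert action with probability beta.
                a = a_expert if rng.random() < beta else learner(s)
                s = step(s, a)
        learner = fit(data)     # retrain on the aggregated dataset
    return learner, queries

# Toy 1-D stand-ins (hypothetical) so the sketch runs without MuJoCo.
def toy_expert(s):
    return max(-1.0, min(1.0, -s))

def toy_fit(data):
    # Least-squares fit of a = k * s.
    num = sum(s * a for s, a in data)
    den = sum(s * s for s, _ in data) or 1.0
    k = num / den
    return lambda s: k * s

_env_rng = random.Random(1)

def toy_reset():
    return _env_rng.uniform(-1.0, 1.0)

def toy_step(s, a):
    return s + 0.1 * a

init = [(s, toy_expert(s)) for s in (-0.5, 0.25, 0.8)]
learner, queries = dagger(toy_expert, toy_fit, toy_reset, toy_step, init)
betas = [beta_schedule(r) for r in range(8)]
print(betas)
print(queries)
```

With the defaults, the expert is queried once per visited state, so the counter ends at 4,000, and beta is zero from the fifth round onward, meaning the later rounds roll out the learner alone.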

Evaluation protocol (identical for every method)

  • FetchPickAndPlace-v4, max_episode_steps=50, sparse reward.
  • Success = info["is_success"] on the final step. 100 eval episodes, seeds 0-99.
  • Env: Python 3.10, gymnasium 1.2.3, gymnasium-robotics 1.4.2, stable-baselines3 2.8.0, mujoco 3.9.0, torch 2.4.0+cpu, numpy 2.2.6.
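The protocol above amounts to a small loop over seeds. The sketch below is one way to write it, assuming a single environment instance is reused across seeds and that `policy` is a callable mapping an observation to an action; the `evaluate` function and the `StubEnv` class are hypothetical and exist only so the example runs without MuJoCo.

```python
def evaluate(make_env, policy, seeds=range(100), max_steps=50):
    """Success rate: mean of info["is_success"] read on each episode's final step."""
    env = make_env()
    successes = []
    for seed in seeds:
        obs, info = env.reset(seed=seed)
        success = 0.0
        for _ in range(max_steps):
            obs, _reward, terminated, truncated, info = env.step(policy(obs))
            success = float(info["is_success"])
            if terminated or truncated:
                break
        successes.append(success)
    env.close()
    return sum(successes) / len(successes)

class StubEnv:
    """Hypothetical stand-in with the Gymnasium reset/step signature.

    It reports success on even seeds only, so the expected rate is known.
    """
    def reset(self, seed=None):
        self.seed, self.t = seed, 0
        return {}, {}

    def step(self, action):
        self.t += 1
        truncated = self.t >= 50
        info = {"is_success": float(self.seed % 2 == 0)}
        return {}, -1.0, False, truncated, info

    def close(self):
        pass

rate = evaluate(StubEnv, policy=lambda obs: None)
print(rate)  # 0.5 for this stub: 50 of the 100 seeds are even
```

For the real task, `make_env` could be `lambda: gym.make("FetchPickAndPlace-v4", max_episode_steps=50)` and `policy` the callable returned by `load_policy`.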

Limitations

  • Reaches the imitation policy-class ceiling (~0.8-0.9), not the expert's 1.0.
  • The expert is also the demonstrator and the RL baseline, so the IL-vs-RL comparison is a sample-efficiency / distribution-shift study, not a claim that IL beats RL.
  • Single task, state-based (no pixels).

Code, figures, distribution-shift analysis, and side-by-side video: https://github.com/IAMHassanMehmood/fetch-imitation-learning
