Fetch Pick-and-Place: Behavior Cloning & DAgger policies
PyTorch policies for FetchPickAndPlace-v4 (Gymnasium-Robotics / MuJoCo), trained by
imitation learning from a SAC+HER expert. This repo accompanies a controlled comparison
of Behavior Cloning (BC), DAgger, and RL. The full write-up, figures, video,
and training code are on GitHub:
fetch-imitation-learning.
Expert / demonstrator (separate repo): hhmm1122/fetch-pickandplace-sac-her.
Honest scope: this is an engineering and analysis project, not novel research. The weights are an MLP trained with MSE loss to imitate a tanh-squashed SAC policy; they reach the ceiling of this imitation policy class (~0.8-0.9 success), not the expert's 1.0. The value lies in the clean, reproducible BC-vs-DAgger-vs-RL comparison and the quantified distribution-shift analysis.
Files
Each file is a representative seed-0 checkpoint. "This ckpt" = that exact model's success on the 100-seed eval harness; "3-seed mean" = mean +/- std over training seeds 0-2.
| File | Method | Demo budget | Online expert queries | This ckpt | 3-seed mean |
|---|---|---|---|---|---|
| bc_d50_s0.pt | Behavior Cloning | 50 demos | - | 0.35 | 0.31 +/- 0.05 |
| bc_d200_s0.pt | Behavior Cloning | 200 demos | - | 0.72 | 0.81 +/- 0.06 |
| dagger_d25_s0.pt | DAgger | 25 demos init | 4,000 | 0.82 | 0.79 +/- 0.05 |
| dagger_d50_s0.pt | DAgger | 50 demos init | 4,000 | 0.92 | 0.89 +/- 0.02 |
Reference: the SAC+HER expert scores 1.000 on the same 100-seed harness.
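To make the two result columns concrete, here is a minimal sketch of how they could be derived from per-episode success flags. The flags below are synthetic placeholders generated on the spot, not this project's results, and the use of the population standard deviation (`ddof=0`) is an assumption; the repo may use a different convention.

```python
import numpy as np

# Synthetic placeholder flags, NOT this project's results:
# one row per training seed (0-2), one 0/1 success flag per eval episode (100).
rng = np.random.default_rng(0)
flags = rng.integers(0, 2, size=(3, 100))

per_seed_rate = flags.mean(axis=1)  # success rate of each seed's checkpoint
mean = per_seed_rate.mean()         # "3-seed mean"
std = per_seed_rate.std()           # assumption: population std (ddof=0)

# "This ckpt" corresponds to the seed-0 entry only.
print(f"seed-0 checkpoint: {per_seed_rate[0]:.2f}")
print(f"3-seed mean: {mean:.2f} +/- {std:.2f}")
```

Because "This ckpt" is a single seed, it can sit above or below the 3-seed mean, as it does in the table.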
How to load and run
The .pt files contain a state dict plus the input-normalization statistics. This snippet is
self-contained (only needs torch, numpy, gymnasium, gymnasium-robotics):
import numpy as np
import torch
import torch.nn as nn
import gymnasium as gym
import gymnasium_robotics


class MLPPolicy(nn.Module):
    """Two-hidden-layer ReLU MLP: 28-d input -> 4-d action."""

    def __init__(self, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(28, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, x):
        return self.net(x)


def load_policy(path):
    # weights_only=False unpickles arbitrary objects, so only load checkpoints you trust.
    ckpt = torch.load(path, map_location="cpu", weights_only=False)
    net = MLPPolicy(ckpt["hidden"])
    net.load_state_dict(ckpt["state_dict"])
    net.eval()
    mean = np.asarray(ckpt["obs_mean"], np.float32)
    std = np.asarray(ckpt["obs_std"], np.float32)

    def act(obs):
        # Build the 28-d input and standardize it with the stored statistics.
        x = np.concatenate([obs["observation"], obs["desired_goal"]]).astype(np.float32)
        x = (x - mean) / std
        with torch.no_grad():
            a = net(torch.as_tensor(x).unsqueeze(0)).squeeze(0).numpy()
        return np.clip(a, -1.0, 1.0).astype(np.float32)

    return act


gym.register_envs(gymnasium_robotics)
env = gym.make("FetchPickAndPlace-v4", max_episode_steps=50)
policy = load_policy("dagger_d50_s0.pt")

obs, info = env.reset(seed=0)
success = 0.0
for _ in range(50):
    obs, _r, term, trunc, info = env.step(policy(obs))
    success = float(info["is_success"])  # value on the final step is what counts
    if term or trunc:
        break
print("success:", success)
env.close()
Policy input is 28-d: observation (25) concatenated with desired_goal (3); achieved_goal
is dropped (redundant). Actions are 4-d [dx, dy, dz, gripper] clipped to [-1, 1].
Training (summary)
- BC: supervised MSE on (state, expert-action) pairs from deterministic, success-filtered expert rollouts. 2x256 ReLU MLP, input standardization, best-val-MSE checkpoint.
- DAgger: start from BC, then iterate - roll out the learner, query the expert at the learner's visited states, aggregate, retrain. 8 rounds x 10 episodes; beta decays 1->0 over the first 5 rounds; 4,000 cumulative expert queries.
- Expert: deterministic SAC+HER.
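The DAgger numbers above can be checked with a short sketch. It assumes a linear beta decay that reaches 0 at round index 5 (the summary only says beta goes 1->0 over the first 5 rounds, so the exact shape is a guess) and that every rollout runs the full 50 steps. The names `beta_for_round` and `choose_action` are hypothetical and not taken from the training code.

```python
import numpy as np

ROUNDS = 8               # DAgger rounds (from the summary above)
EPISODES_PER_ROUND = 10  # learner rollouts per round (from the summary above)
STEPS_PER_EPISODE = 50   # max_episode_steps used throughout this README
DECAY_ROUNDS = 5         # assumption: beta hits 0 at round index 5


def beta_for_round(round_idx, decay_rounds=DECAY_ROUNDS):
    """Probability of executing the expert's action in a given round.

    Assumption: linear decay from 1.0 at round 0 to 0.0 at `decay_rounds`.
    """
    return max(0.0, 1.0 - round_idx / decay_rounds)


def choose_action(expert_action, learner_action, beta, rng):
    """Beta-mixture rollout: execute the expert's action with probability beta.

    The expert's action is stored as the training label either way.
    """
    return expert_action if rng.random() < beta else learner_action


betas = [beta_for_round(r) for r in range(ROUNDS)]
total_queries = ROUNDS * EPISODES_PER_ROUND * STEPS_PER_EPISODE

print("beta per round:", betas)
print("cumulative expert queries:", total_queries)
```

Under these assumptions the query budget is 8 x 10 x 50 = 4,000, which matches the figure quoted for DAgger.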
Evaluation protocol (identical for every method)
- FetchPickAndPlace-v4, max_episode_steps=50, sparse reward.
- Success = info["is_success"] on the final step. 100 eval episodes, seeds 0-99.
- Env: Python 3.10, gymnasium 1.2.3, gymnasium-robotics 1.4.2, stable-baselines3 2.8.0, mujoco 3.9.0, torch 2.4.0+cpu, numpy 2.2.6.
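The protocol above can be sketched as a small evaluation loop. This is an illustration under stated assumptions, not the repo's actual harness: `evaluate_success_rate` is a hypothetical helper, and it only assumes an environment with the Gymnasium `reset(seed=...)` / `step(action)` interface and a policy that maps an observation to an action.

```python
import numpy as np


def evaluate_success_rate(env, policy, seeds=range(100), max_steps=50):
    """Fraction of episodes whose final step reports info["is_success"]."""
    outcomes = []
    for seed in seeds:
        obs, info = env.reset(seed=seed)
        success = 0.0
        for _ in range(max_steps):
            obs, _reward, terminated, truncated, info = env.step(policy(obs))
            # Overwritten every step, so the value from the last step wins.
            success = float(info["is_success"])
            if terminated or truncated:
                break
        outcomes.append(success)
    return float(np.mean(outcomes))
```

With the `env` and `policy` objects from the loading snippet, `evaluate_success_rate(env, policy)` would run seeds 0-99 and return a single success rate comparable to the "This ckpt" column.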
Limitations
- Reaches the imitation policy-class ceiling (~0.8-0.9), not the expert's 1.0.
- The expert is also the demonstrator and the RL baseline, so the IL-vs-RL comparison is a sample-efficiency / distribution-shift study, not a claim that IL beats RL.
- Single task, state-based (no pixels).
Code, figures, distribution-shift analysis, and side-by-side video: https://github.com/IAMHassanMehmood/fetch-imitation-learning