# Reward Rush: Walker2d SAC (Stripped Model)
This repository contains a stripped Soft Actor-Critic (SAC) actor trained on the Walker2d-v4 environment.
The checkpoint has been cleaned to ensure:
- CPU-only compatibility
- Safe loading with `weights_only=True` (see the loading sketch below)
- No CUDA tensors
- No pickle or NumPy object issues
- Easy inference without training code
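A minimal loading sketch (repo id and filename as in the test script further down):

```python
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="Nharen/Reward_Rush_SAC_Walker",
    filename="Walker.pth",
)
# weights_only=True restricts unpickling to tensors and plain Python
# containers, so no arbitrary code can run while loading.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=True)
print(list(ckpt.keys()))  # includes obs_dim, act_dim, actor_state_dict
```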
## Model Architecture
The model is a Gaussian SAC actor evaluated deterministically using the mean action.
Observation space:

- Dimension stored in `obs_dim` inside the checkpoint

Action space:

- Dimension stored in `act_dim`
- Continuous actions in the range [-1, 1] (cross-checked against the environment below)
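As an optional cross-check of those dimensions against the environment (a sketch; for Walker2d-v4 the observation is 17-dimensional and the action 6-dimensional):

```python
import gymnasium as gym

# The checkpoint's obs_dim/act_dim should match the environment's spaces.
env = gym.make("Walker2d-v4")
print(env.observation_space.shape)  # (17,)
print(env.action_space.shape)       # (6,)
env.close()
```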
Network structure:
Input โ Linear(obs_dim, hidden_dim) โ ReLU
โ Linear(hidden_dim, hidden_dim) โ ReLU
โ Linear(hidden_dim, hidden_dim) โ ReLU
โ Mean head: Linear(hidden_dim, act_dim)
โ Log-Std head: Linear(hidden_dim, act_dim)
Only the mean head is used during evaluation.
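For context, SAC trains a stochastic tanh-Gaussian policy but is commonly evaluated with the deterministic mean action. The sketch below contrasts the two; the module names match the test script, and the sampling branch is illustrative only and must not be used with this checkpoint:

```python
import torch

def act(actor, obs, deterministic=True):
    x = actor.net(obs)
    if deterministic:
        # Evaluation: tanh-squashed mean action in [-1, 1].
        return torch.tanh(actor.mean(x))
    # Training-time SAC samples from the tanh-squashed Gaussian instead
    # (illustrative only; do not use log_std with this checkpoint).
    std = actor.log_std(x).exp()
    return torch.tanh(actor.mean(x) + std * torch.randn_like(std))
```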
## Common Mistakes to Avoid

**Incorrect number of layers.** The checkpoint contains `net.0`, `net.2`, and `net.4`, which means the actor has three hidden Linear layers. Using fewer layers will cause unexpected-key errors when loading; a sanity check after this list shows how to verify the keys.

**Renaming layers.** The model must use `self.net = nn.Sequential(...)`. Renaming layers to `fc1`, `fc2`, etc. will break loading.

**Hardcoding dimensions.** Do not hardcode observation or action sizes. Always read `obs_dim` and `act_dim` from the checkpoint.

**Using `weights_only=False`.** The model is already stripped. Always load with `weights_only=True`.

**Sampling actions.** This model is evaluated deterministically. Do not sample from a distribution or use `log_std`.
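A quick sanity check (a sketch; it assumes the checkpoint layout described above and reuses the repo id and filename from the test script):

```python
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="Nharen/Reward_Rush_SAC_Walker",
    filename="Walker.pth",
)
state = torch.load(ckpt_path, map_location="cpu", weights_only=True)["actor_state_dict"]

# nn.Sequential numbers its children 0..5; ReLU layers carry no parameters,
# so only the Linear layers (indices 0, 2, 4) appear in the state dict.
expected = {"net.0.weight", "net.2.weight", "net.4.weight",
            "mean.weight", "log_std.weight"}
missing = expected - set(state)
assert not missing, f"Unexpected checkpoint structure, missing: {missing}"
```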
## Download and Test Code

The script below downloads the model from the Hugging Face Hub and evaluates it for 100 episodes.
```python
import torch
import gymnasium as gym
import numpy as np
from huggingface_hub import hf_hub_download

# Download the stripped checkpoint from the Hugging Face Hub.
ckpt_path = hf_hub_download(
    repo_id="Nharen/Reward_Rush_SAC_Walker",
    filename="Walker.pth",
)
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=True)

# Read the network dimensions from the checkpoint instead of hardcoding them.
obs_dim = ckpt["obs_dim"]
act_dim = ckpt["act_dim"]
hidden_dim = ckpt.get("hidden_dim", 256)


class SACActor(torch.nn.Module):
    """Gaussian SAC actor, evaluated deterministically via the mean head."""

    def __init__(self, obs_dim, act_dim, hidden_dim=256):
        super().__init__()
        # Three hidden Linear layers -> state-dict keys net.0, net.2, net.4.
        self.net = torch.nn.Sequential(
            torch.nn.Linear(obs_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.ReLU(),
        )
        self.mean = torch.nn.Linear(hidden_dim, act_dim)
        # Present in the checkpoint, but unused during deterministic evaluation.
        self.log_std = torch.nn.Linear(hidden_dim, act_dim)

    def forward(self, obs):
        x = self.net(obs)
        # Deterministic action: tanh-squashed mean, in [-1, 1].
        return torch.tanh(self.mean(x))


actor = SACActor(obs_dim, act_dim, hidden_dim)
actor.load_state_dict(ckpt["actor_state_dict"])
actor.eval()

env = gym.make("Walker2d-v4")
num_episodes = 100
episode_rewards = []

for ep in range(num_episodes):
    obs, _ = env.reset()
    done = False
    ep_reward = 0.0
    while not done:
        with torch.no_grad():
            obs_t = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)
            action = actor(obs_t).squeeze(0).cpu().numpy()
        obs, reward, terminated, truncated, _ = env.step(action)
        ep_reward += reward
        done = terminated or truncated
    episode_rewards.append(ep_reward)
    print(f"Episode {ep + 1:3d} | Reward: {ep_reward:.2f}")

env.close()

episode_rewards = np.array(episode_rewards)
print("Episodes:   ", num_episodes)
print("Mean reward:", episode_rewards.mean())
print("Std reward: ", episode_rewards.std())
print("Min reward: ", episode_rewards.min())
print("Max reward: ", episode_rewards.max())
```
## Evaluation Results

All values are self-reported on Walker2d-v3:

- Mean reward: 4283.036
- Std reward: 128.536
- Mean episode length: 997.180