Reward Rush: Walker2d SAC (Stripped Model)

This repository contains a stripped Soft Actor-Critic (SAC) actor trained on the Walker2d-v4 environment.

The checkpoint has been cleaned to ensure:

  • CPU-only compatibility
  • Safe loading with weights_only=True (see the snippet after this list)
  • No CUDA tensors
  • No pickle or NumPy object issues
  • Easy inference without training code
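
As a quick sanity check, the checkpoint can be opened with the safe loader and inspected. This is a minimal sketch; it uses the same repo and filename as the full script further down.

import torch
from huggingface_hub import hf_hub_download

# weights_only=True refuses arbitrary pickled objects, so loading is safe.
path = hf_hub_download(
    repo_id="Nharen/Reward_Rush_SAC_Walker",
    filename="Walker.pth"
)
ckpt = torch.load(path, map_location="cpu", weights_only=True)

print(ckpt["obs_dim"], ckpt["act_dim"])         # dimensions stored alongside the weights
print(sorted(ckpt["actor_state_dict"].keys()))  # plain CPU tensors only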

Model Architecture

The model is a Gaussian SAC actor evaluated deterministically using the mean action.

Observation space:

  • Dimension stored in obs_dim inside the checkpoint

Action space:

  • Dimension stored in act_dim
  • Continuous actions in the range [-1, 1]

Network structure:

Input → Linear(obs_dim, hidden_dim) → ReLU
      → Linear(hidden_dim, hidden_dim) → ReLU
      → Linear(hidden_dim, hidden_dim) → ReLU
      → Mean head: Linear(hidden_dim, act_dim)
      → Log-Std head: Linear(hidden_dim, act_dim)

Only the mean head is used during evaluation.
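
Because the trunk is an nn.Sequential, its Linear parameters sit at indices 0, 2, and 4, while the ReLU modules at indices 1, 3, and 5 hold no parameters. The checkpoint's actor_state_dict should therefore contain exactly these keys:

net.0.weight, net.0.bias        # Linear(obs_dim, hidden_dim)
net.2.weight, net.2.bias        # Linear(hidden_dim, hidden_dim)
net.4.weight, net.4.bias        # Linear(hidden_dim, hidden_dim)
mean.weight, mean.bias          # mean head
log_std.weight, log_std.bias    # log-std head (unused at evaluation time)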


Common Mistakes to Avoid

  1. Incorrect number of layers
    The checkpoint contains net.0, net.2, and net.4.
    This means the actor has three hidden Linear layers.
    Using fewer layers raises missing/unexpected key errors from
    load_state_dict; a sanity check is sketched after this list.

  2. Renaming layers
    The model must use self.net = nn.Sequential(...).
    Renaming layers to fc1, fc2, etc. will break loading.

  3. Hardcoding dimensions
    Do not hardcode observation or action sizes.
    Always read obs_dim and act_dim from the checkpoint.

  4. Using weights_only=False
    The model is already stripped.
    Always load with weights_only=True.

  5. Sampling actions
    This model is evaluated deterministically.
    Do not sample from a distribution or use log_std.
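
To catch mistakes 1-3 early, compare the checkpoint's keys against the model's expected keys before loading. The sketch below assumes the ckpt dictionary and SACActor class from the script in the next section.

model = SACActor(ckpt["obs_dim"], ckpt["act_dim"], ckpt.get("hidden_dim", 256))
expected = set(model.state_dict().keys())
found = set(ckpt["actor_state_dict"].keys())

# Any difference means the architecture or layer names do not match;
# load_state_dict would raise a RuntimeError listing the same keys.
assert expected == found, (expected - found, found - expected)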


Download and Test Code

The script below downloads the model from Hugging Face and evaluates it for 100 episodes.

import torch
import gymnasium as gym
import numpy as np
from huggingface_hub import hf_hub_download

# Download the stripped checkpoint from the Hugging Face Hub.
ckpt_path = hf_hub_download(
    repo_id="Nharen/Reward_Rush_SAC_Walker",
    filename="Walker.pth"
)

# weights_only=True is safe here: the checkpoint holds only tensors
# and plain Python values, with no pickled objects.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=True)

# Read the dimensions from the checkpoint instead of hardcoding them.
obs_dim = ckpt["obs_dim"]
act_dim = ckpt["act_dim"]
hidden_dim = ckpt.get("hidden_dim", 256)

class SACActor(torch.nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_dim=256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(obs_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.ReLU(),
        )
        self.mean = torch.nn.Linear(hidden_dim, act_dim)
        self.log_std = torch.nn.Linear(hidden_dim, act_dim)

    def forward(self, obs):
        # Deterministic evaluation: squashed mean action; log_std is unused.
        x = self.net(obs)
        return torch.tanh(self.mean(x))

# Rebuild the actor, load the stripped weights, and switch to eval mode.
actor = SACActor(obs_dim, act_dim, hidden_dim)
actor.load_state_dict(ckpt["actor_state_dict"])
actor.eval()

env = gym.make("Walker2d-v4")

num_episodes = 100
episode_rewards = []

for ep in range(num_episodes):
    obs, _ = env.reset()
    done = False
    ep_reward = 0.0

    while not done:
        with torch.no_grad():
            # Add a batch dimension for the forward pass, then strip it off.
            obs_t = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)
            action = actor(obs_t).squeeze(0).cpu().numpy()

        obs, reward, terminated, truncated, _ = env.step(action)
        ep_reward += reward
        done = terminated or truncated

    episode_rewards.append(ep_reward)
    print(f"Episode {ep + 1:3d} | Reward: {ep_reward:.2f}")

env.close()

episode_rewards = np.array(episode_rewards)

print("Episodes:", num_episodes)
print("Mean reward:", episode_rewards.mean())
print("Std reward:", episode_rewards.std())
print("Min reward:", episode_rewards.min())
print("Max reward:", episode_rewards.max())