Reward Rush: HalfCheetah SAC

This repository contains a Soft Actor-Critic (SAC) agent trained for the HalfCheetah-v4 environment.

Model Architecture

The SAC actor is a multi-layer perceptron with the following specifications:

  • Input: 17-dimensional state observation
  • Output: 6 continuous actions
  • Architecture:
    • Linear(17, 256) -> ReLU
    • Linear(256, 256) -> ReLU
    • Linear(256, 6) for mean + Linear(256, 6) for log_std
  • Note: The actor outputs a mean and a log standard deviation for each action dimension. For inference, only the mean is used, passed through a tanh activation to bound actions to [-1, 1]; a training-time sampling sketch follows this list.
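
For training or stochastic evaluation, SAC draws actions from a tanh-squashed Gaussian built from the mean and log_std heads. The following is a minimal sketch of that standard sampling path, assuming an actor that exposes the same net, mean, and log_std modules as the checkpoint; the clamp range and the 1e-6 epsilon are common SAC conventions, not values stored in this repository.

import torch

LOG_STD_MIN, LOG_STD_MAX = -20, 2  # common SAC clamp range (assumption)

def sample_action(actor, obs):
    # Reparameterized sample from the tanh-squashed Gaussian policy.
    x = actor.net(obs)
    mean = actor.mean(x)
    log_std = actor.log_std(x).clamp(LOG_STD_MIN, LOG_STD_MAX)
    dist = torch.distributions.Normal(mean, log_std.exp())
    pre_tanh = dist.rsample()      # reparameterization trick keeps gradients
    action = torch.tanh(pre_tanh)  # squash to the [-1, 1] action bounds
    # Change-of-variables correction for the tanh squashing
    log_prob = (dist.log_prob(pre_tanh)
                - torch.log1p(-action.pow(2) + 1e-6)).sum(-1)
    return action, log_prob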

Common Mistakes to Avoid

  • Layer Names: The checkpoint uses net, mean, and log_std. Do not define the layers under different names (e.g. fc1, fc2) unless you remap the keys (see the remapping sketch after this list).
  • Output Dimensions: Ensure the actor matches the checkpoint dimensions (6 actions).
  • Continuous Actions: HalfCheetah expects a NumPy array of shape (6,) per step. Squeeze the batch dimension and convert tensors to NumPy before calling env.step.
  • Episode Evaluation: Always evaluate over full episodes (100 episodes recommended) so the mean reward is a reliable estimate of performance.
  • Checkpoint Loading: Use weights_only=True when loading .pth state dicts for safety.
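
If your actor names its layers differently (for example fc1 and fc2), a minimal remapping sketch along the following lines can translate the checkpoint keys before loading. The target names below are hypothetical and must match your own module definitions.

# Hypothetical name mapping: checkpoint prefixes -> your module prefixes.
key_map = {
    "net.0.": "fc1.",  # first Linear in the Sequential
    "net.2.": "fc2.",  # second Linear (index 2; the ReLUs have no parameters)
    "mean.": "mu.",    # example alternative name for the mean head
}

def remap_keys(state_dict, key_map):
    remapped = {}
    for key, value in state_dict.items():
        for old, new in key_map.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        remapped[key] = value
    return remapped

# my_actor.load_state_dict(remap_keys(ckpt["actor_state_dict"], key_map))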

Download and Test Code

import torch
import torch.nn as nn
import gymnasium as gym
import numpy as np
from huggingface_hub import hf_hub_download

# Load the stripped checkpoint (weights_only=True avoids unpickling arbitrary code)
ckpt = torch.load(
    hf_hub_download("Nharen/Reward_Rush_SAC_Half_Cheetah", "half_cheetah.pth"),
    weights_only=True
)

obs_dim = ckpt["obs_dim"]
act_dim = ckpt["act_dim"]
hidden_dim = ckpt.get("hidden_dim", 256)

# SAC Gaussian actor (deterministic mean action at inference)
class SACActor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
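        # Two output heads: `mean` for the action mean, `log_std` for the
        # state-dependent log standard deviation used during training.
        # log_std must be defined so load_state_dict finds its keys.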
        self.mean = nn.Linear(hidden_dim, act_dim)
        self.log_std = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs):
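        # Deterministic evaluation policy: use only the mean head and squash
        # with tanh to the [-1, 1] action bounds (no sampling at test time).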
        x = self.net(obs)
        mean = self.mean(x)
        return torch.tanh(mean)

# Instantiate actor
actor = SACActor(obs_dim, act_dim, hidden_dim)
actor.load_state_dict(ckpt["actor_state_dict"])
actor.eval()

# Environment
env = gym.make("HalfCheetah-v4")
num_episodes = 100
episode_rewards = []

# Run evaluation
for ep in range(num_episodes):
    obs, _ = env.reset()
    done = False
    ep_reward = 0.0

    while not done:
        with torch.no_grad():
            obs_t = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)
            action = actor(obs_t).squeeze(0).cpu().numpy()
        obs, reward, terminated, truncated, _ = env.step(action)
        ep_reward += reward
        done = terminated or truncated
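        # HalfCheetah normally ends only via time-limit truncation
        # (1000 steps by default), not termination.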

    episode_rewards.append(ep_reward)
    print(f"Episode {ep+1:3d} | Reward: {ep_reward:.2f}")

env.close()

# Results
episode_rewards = np.array(episode_rewards)
print("\n===== Evaluation Summary =====")
print(f"Episodes run: {num_episodes}")
print(f"Mean reward: {episode_rewards.mean():.2f}")
print(f"Std reward:  {episode_rewards.std():.2f}")
print(f"Min reward:  {episode_rewards.min():.2f}")
print(f"Max reward:  {episode_rewards.max():.2f}")