Reward Rush: HalfCheetah SAC
This repository contains a Soft Actor-Critic (SAC) agent trained for the HalfCheetah-v4 environment.
Model Architecture
The SAC actor is a multi-layer perceptron with the following specifications:
- Input: 17-dimensional state observation
- Output: 6 continuous actions
- Architecture:
  - Linear(17, 256) -> ReLU
  - Linear(256, 256) -> ReLU
  - Linear(256, 6) for `mean` + Linear(256, 6) for `log_std`
- Note: The actor outputs a mean and a log standard deviation for each action. For inference, only the mean is used, passed through a tanh activation to bound actions to [-1, 1]; the sketch below shows how the log standard deviation comes into play during training.
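For context, a conventional SAC actor draws training-time actions from a squashed Gaussian built from both heads. The following is a minimal sketch of that standard formulation, not code shipped with this checkpoint; the clamp bounds and epsilon are common defaults assumed here for illustration.

```python
import torch

# Sketch of standard squashed-Gaussian SAC sampling (illustrative only;
# inference with this checkpoint uses just the deterministic mean path).
def sample_action(mean, log_std):
    log_std = torch.clamp(log_std, -20.0, 2.0)  # common numerical-stability bounds
    dist = torch.distributions.Normal(mean, log_std.exp())
    pre_tanh = dist.rsample()                   # reparameterized sample
    action = torch.tanh(pre_tanh)               # squash to [-1, 1]
    # log-prob with the tanh change-of-variables correction
    log_prob = (dist.log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)).sum(-1)
    return action, log_prob
```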
Common Mistakes to Avoid
- Layer Names: The checkpoint uses
net,mean, andlog_std. Do not try to redefine layers with different names (fc1,fc2) unless you remap the keys. - Output Dimensions: Ensure the actor matches the checkpoint dimensions (6 actions).
- Continuous Actions: HalfCheetah expects NumPy arrays as actions. Remove the batch dimension and convert tensors to NumPy before calling `env.step`.
- Episode Evaluation: Always test over full episodes (100 recommended) to properly evaluate performance.
- Checkpoint Loading: Use `weights_only=True` when loading `.pth` state dicts for safety.
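If your actor class does use other layer names, the checkpoint keys can be renamed before loading. A minimal sketch, assuming the `Sequential` layout above (so the checkpoint keys look like `net.0.weight`) and hypothetical `fc1`/`fc2` names on your side:

```python
# Hypothetical key remap: the checkpoint stores "net.0.*" / "net.2.*";
# rename them to "fc1.*" / "fc2.*" if that is what your module expects.
RENAME = {"net.0.": "fc1.", "net.2.": "fc2."}

def remap_keys(state_dict):
    remapped = {}
    for key, value in state_dict.items():
        for old, new in RENAME.items():
            key = key.replace(old, new)
        remapped[key] = value
    return remapped

# usage: my_actor.load_state_dict(remap_keys(ckpt["actor_state_dict"]))
```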
Download and Test Code
```python
import torch
import torch.nn as nn
import gymnasium as gym
import numpy as np
from huggingface_hub import hf_hub_download

# Load stripped checkpoint (weights plus the dimensions needed to rebuild the actor)
ckpt = torch.load(
    hf_hub_download("Nharen/Reward_Rush_SAC_Half_Cheetah", "half_cheetah.pth"),
    weights_only=True
)
obs_dim = ckpt["obs_dim"]
act_dim = ckpt["act_dim"]
hidden_dim = ckpt.get("hidden_dim", 256)

# SAC Gaussian Actor
class SACActor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.mean = nn.Linear(hidden_dim, act_dim)
        # log_std head is defined so the state-dict keys match the checkpoint,
        # even though deterministic inference below uses only the mean
        self.log_std = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs):
        x = self.net(obs)
        mean = self.mean(x)
        return torch.tanh(mean)  # bound actions to [-1, 1]

# Instantiate actor and load the trained weights
actor = SACActor(obs_dim, act_dim, hidden_dim)
actor.load_state_dict(ckpt["actor_state_dict"])
actor.eval()

# Environment
env = gym.make("HalfCheetah-v4")
num_episodes = 100
episode_rewards = []

# Run evaluation
for ep in range(num_episodes):
    obs, _ = env.reset()
    done = False
    ep_reward = 0.0
    while not done:
        with torch.no_grad():
            obs_t = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)
            action = actor(obs_t).squeeze(0).cpu().numpy()
        obs, reward, terminated, truncated, _ = env.step(action)
        ep_reward += reward
        done = terminated or truncated
    episode_rewards.append(ep_reward)
    print(f"Episode {ep+1:3d} | Reward: {ep_reward:.2f}")
env.close()

# Results
episode_rewards = np.array(episode_rewards)
print("\n===== Evaluation Summary =====")
print(f"Episodes run: {num_episodes}")
print(f"Mean reward: {episode_rewards.mean():.2f}")
print(f"Std reward: {episode_rewards.std():.2f}")
print(f"Min reward: {episode_rewards.min():.2f}")
print(f"Max reward: {episode_rewards.max():.2f}")
```
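To watch the policy rather than just score it, the same inference loop works with a rendering environment. A minimal variation, assuming a local MuJoCo display is available (`render_mode="human"` is the standard Gymnasium option):

```python
# Render one episode with the trained actor (assumes local MuJoCo rendering works).
env = gym.make("HalfCheetah-v4", render_mode="human")
obs, _ = env.reset()
done = False
while not done:
    with torch.no_grad():
        obs_t = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)
        action = actor(obs_t).squeeze(0).numpy()
    obs, _, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
env.close()
```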
Evaluation results
- Avg reward on HalfCheetah-v4 (self-reported): 9692.192
- Max reward on HalfCheetah-v4 (self-reported): 9969.899
- Min reward on HalfCheetah-v4 (self-reported): 9408.777