# Reward Rush: Lunar Lander DQN
This repository contains a Deep Q-Network (DQN) agent trained on the Gymnasium LunarLander-v3 environment.
## Model Architecture
The model uses a simple multi-layer perceptron with the following specifications (a parameter-count sanity check follows the list):

- Input: 8 state observations
- Output: 4 discrete actions
- Architecture:
  - Linear(8, 32) -> ReLU
  - Linear(32, 32) -> ReLU
  - Linear(32, 4)
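As a quick sanity check on these dimensions, here is a minimal sketch (not part of the training code) that builds the three layers under the same names the checkpoint expects and counts parameters:

```python
import torch.nn as nn

# Same layer names and sizes the checkpoint uses (fc1, fc2, fc3).
layers = nn.ModuleDict({
    "fc1": nn.Linear(8, 32),
    "fc2": nn.Linear(32, 32),
    "fc3": nn.Linear(32, 4),
})
total = sum(p.numel() for p in layers.parameters())
print(total)  # (8*32 + 32) + (32*32 + 32) + (32*4 + 4) = 1476
```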
## Common Implementation Mistakes to Avoid
- Layer Names: The weights are saved under specific keys (`fc1`, `fc2`, `fc3`). Wrapping the layers in `nn.Sequential` causes a key mismatch error, because the model then expects names like `fc.0` and `fc.2` (see the key-inspection sketch after this list).
- Hidden Dimensions: This specific model was trained with 32 neurons per hidden layer. Using 64 or 128 will produce a size mismatch error.
- Checkpoint Dictionary: The `.pth` file contains a dictionary, not a bare state dict. The weights must be accessed via the `"policy_net_state_dict"` key.
- Inference Output: Running the model returns a tensor of Q-values, not an action. Use `argmax(dim=1)` to extract the action index before passing it to `env.step()`.
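If loading fails, it helps to print what the checkpoint actually contains before debugging the model class. The sketch below downloads the file and lists the stored keys and tensor shapes; the expected shapes in the comments follow from the architecture above:

```python
import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Nharen/Reward_Rush_DQN_Lunar_Lander",
    filename="lunar_lander_dqn.pth",
)
checkpoint = torch.load(path, map_location="cpu", weights_only=True)
print(list(checkpoint.keys()))  # should include "policy_net_state_dict"

state_dict = checkpoint["policy_net_state_dict"]
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
# Expected:
# fc1.weight (32, 8)
# fc1.bias (32,)
# fc2.weight (32, 32)
# fc2.bias (32,)
# fc3.weight (4, 32)
# fc3.bias (4,)
```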
## Import and Test Code
```python
import torch
import torch.nn as nn
import gymnasium as gym
import numpy as np
from huggingface_hub import hf_hub_download


class LunarNet(nn.Module):
    """MLP matching the saved checkpoint: 8 -> 32 -> 32 -> 4."""

    def __init__(self):
        super().__init__()
        # Layer names must be fc1/fc2/fc3 to match the checkpoint keys.
        self.fc1 = nn.Linear(8, 32)
        self.fc2 = nn.Linear(32, 32)
        self.fc3 = nn.Linear(32, 4)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)  # raw Q-values, one per action


def run_evaluation():
    repo_id = "Nharen/Reward_Rush_DQN_Lunar_Lander"
    filename = "lunar_lander_dqn.pth"
    path = hf_hub_download(repo_id=repo_id, filename=filename)

    model = LunarNet()
    checkpoint = torch.load(path, map_location="cpu", weights_only=True)
    # The .pth file stores a dictionary; the weights live under
    # "policy_net_state_dict" (fall back to a bare state dict just in case).
    if isinstance(checkpoint, dict) and "policy_net_state_dict" in checkpoint:
        state_dict = checkpoint["policy_net_state_dict"]
    else:
        state_dict = checkpoint
    model.load_state_dict(state_dict)
    model.eval()

    env = gym.make("LunarLander-v3")
    total_rewards = []
    for _ in range(100):
        state, _ = env.reset()
        episode_reward = 0.0
        done = False
        while not done:
            # Batch dimension of 1 so argmax(dim=1) works as expected.
            state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
            with torch.no_grad():
                action = model(state_t).argmax(dim=1).item()
            state, reward, terminated, truncated, _ = env.step(action)
            episode_reward += reward
            done = terminated or truncated
        total_rewards.append(episode_reward)

    print(f"Average Reward: {np.mean(total_rewards):.2f}")
    env.close()


if __name__ == "__main__":
    run_evaluation()
```
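To watch the agent fly rather than just average its rewards, the same policy can drive a rendered environment. A brief sketch, assuming `LunarNet` from the script above is in scope, a local display is available, and the `gymnasium[box2d]` extra is installed:

```python
import torch
import gymnasium as gym
from huggingface_hub import hf_hub_download

# LunarNet is the class defined in the evaluation script above.
model = LunarNet()
ckpt = torch.load(
    hf_hub_download("Nharen/Reward_Rush_DQN_Lunar_Lander", "lunar_lander_dqn.pth"),
    map_location="cpu",
    weights_only=True,
)
model.load_state_dict(ckpt["policy_net_state_dict"])
model.eval()

env = gym.make("LunarLander-v3", render_mode="human")
state, _ = env.reset()
done = False
while not done:
    state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        action = model(state_t).argmax(dim=1).item()
    state, _, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
env.close()
```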
## Evaluation results

- mean_reward on LunarLander-v3: 260.0 (self-reported)
- n_evaluation_episodes on LunarLander-v3: 100 (self-reported)