DQN Agent for Lunar Lander
A Deep Q-Network (DQN) agent trained on the LunarLander-v3 environment from Gymnasium, implemented from scratch using PyTorch.
Model Description
This model implements a DQN agent with experience replay and target network for stable training. The agent learns to land a lunar module safely between two flags on the moon's surface.
Architecture
- Input: 8-dimensional state space (x/y position, x/y velocity, angle, angular velocity, and left/right leg contact flags); see the network sketch after this list
- Hidden Layer: 256 neurons with ReLU activation
- Output: 4 actions (do nothing, fire left engine, fire main engine, fire right engine)
- Loss Function: Smooth L1 Loss (Huber Loss)
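For reference, the network described above can be sketched in a few lines of PyTorch; this mirrors the DQN class defined in the Usage section below:

```python
import torch
import torch.nn as nn

# Equivalent sketch of the architecture above:
# 8-dim state -> 256 ReLU units -> 4 Q-values (one per action)
net = nn.Sequential(
    nn.Linear(8, 256),
    nn.ReLU(),
    nn.Linear(256, 4),
)

q_values = net(torch.randn(1, 8))  # a batch containing one state
print(q_values.shape)  # torch.Size([1, 4])
```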
Training Configuration
```json
{
  "algorithm": "DQN",
  "environment": "LunarLander-v3",
  "total_episodes": 1700,
  "replay_buffer_size": 50000,
  "batch_size": 128,
  "hidden_size": 256,
  "epsilon_init": 1.0,
  "epsilon_min": 0.01,
  "epsilon_decay": 0.998,
  "learning_rate": 0.0003,
  "gamma": 0.99,
  "tau": 0.005,
  "optimizer": "AdamW",
  "loss_function": "SmoothL1Loss"
}
```
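To get a feel for the exploration schedule this configuration implies, here is a minimal sketch of the epsilon decay, assuming epsilon is multiplied by epsilon_decay once per episode (a common convention; the original training loop may decay per step instead):

```python
epsilon_init, epsilon_min, epsilon_decay = 1.0, 0.01, 0.998

epsilon = epsilon_init
for episode in range(1700):
    # ... collect one episode, acting randomly with probability epsilon ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

print(round(epsilon, 3))  # 0.998 ** 1700 ≈ 0.033, still above epsilon_min
```

At this rate the floor of 0.01 would only be reached after roughly 2,300 episodes, so exploration never fully bottoms out within the 1,700 configured episodes.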
Training Results
- Total Episodes: 4000
- Final 100-Episode Average Reward: 105.12
- Best Reward: 309.32
- Average Episode Length (last 100): 624.98 steps
Key Features
- Experience Replay: 50,000-transition buffer for stable learning (see the sketch after this list)
- Target Network: Soft updates (τ=0.005) every 15 steps
- Epsilon-Greedy Exploration: Decaying from 1.0 to 0.01
- Gradient Clipping: Prevents exploding gradients
- Early Stopping: Episodes terminate at reward > 500
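A minimal sketch of the replay buffer described above; the class and method names are illustrative, not the original training code:

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ("state", "action", "reward", "next_state", "done"))

class ReplayBuffer:
    """Fixed-size FIFO buffer; uniform sampling breaks temporal correlation."""
    def __init__(self, capacity=50_000):
        self.memory = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size=128):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```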
Usage
Installation
```bash
pip install "gymnasium[box2d]" torch numpy matplotlib
```

The `box2d` extra is required for the Lunar Lander environments.
Loading and Using the Model
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import gymnasium as gym
from huggingface_hub import hf_hub_download

# Define the network architecture (must match training)
class DQN(nn.Module):
    def __init__(self, state_dim=8, action_dim=4, hidden_size=256):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

# Download model
model_path = hf_hub_download(
    repo_id="ketencrypt10n/dqn-lunar-lander",
    filename="model.pth"
)

# Load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
policy_net = DQN(state_dim=8, action_dim=4, hidden_size=256).to(device)
policy_net.load_state_dict(torch.load(model_path, map_location=device))
policy_net.eval()

# Test the agent
env = gym.make("LunarLander-v3", render_mode="human")
for episode in range(30):
    state, info = env.reset()
    done = False
    total_reward = 0
    while not done:
        state_tensor = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
        with torch.no_grad():
            # Greedy action: index of the largest Q-value
            action = policy_net(state_tensor).max(1).indices.item()
        state, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated
    print(f"Episode: {episode + 1}, Total Reward: {total_reward:.2f}")
env.close()
```
Visualizing Training Progress
```python
import numpy as np
import matplotlib.pyplot as plt
from huggingface_hub import hf_hub_download

# Download training history
reward_path = hf_hub_download(repo_id="ketencrypt10n/dqn-lunar-lander", filename="reward_history.npy")
duration_path = hf_hub_download(repo_id="ketencrypt10n/dqn-lunar-lander", filename="episode_durations.npy")
rewards = np.load(reward_path)
durations = np.load(duration_path)

# Plot rewards and episode durations with a 100-episode moving average
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(rewards, alpha=0.6)
plt.plot(np.convolve(rewards, np.ones(100) / 100, mode='valid'), linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('Training Rewards')
plt.legend(['Raw', '100-episode MA'])

plt.subplot(1, 2, 2)
plt.plot(durations, alpha=0.6)
plt.plot(np.convolve(durations, np.ones(100) / 100, mode='valid'), linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Steps')
plt.title('Episode Duration')
plt.legend(['Raw', '100-episode MA'])

plt.tight_layout()
plt.show()
```
Implementation Details
This implementation was built from scratch using:
- PyTorch for neural network implementation
- NumPy for numerical operations
- Gymnasium for the environment
- No high-level RL libraries (no Stable Baselines3, RLlib, etc.)
Key Implementation Choices
- Soft Target Updates: Instead of periodic hard updates, uses soft updates (τ=0.005) every 15 steps for smoother learning (sketched after this list)
- Huber Loss: More robust to outliers than MSE
- Gradient Clipping: Clips gradients to [-100, 100] for stability
- Early Episode Termination: Stops episodes when reward exceeds 500 to save computation
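Putting these choices together, a single update step might look like the sketch below; the function signature and variable names are assumptions, not the original code:

```python
import torch
import torch.nn.functional as F

def optimize(policy_net, target_net, optimizer, batch, gamma=0.99, tau=0.005):
    """One DQN update reflecting the choices above (a sketch, not the original code)."""
    # Pre-batched tensors: states/next_states are float [B, 8], actions long [B],
    # rewards float [B], dones float 0/1 [B]
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions actually taken
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target from the slowly-moving target network
    with torch.no_grad():
        next_q = target_net(next_states).max(1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    # Huber loss: quadratic near zero, linear for large TD errors
    loss = F.smooth_l1_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    # Clip gradients to [-100, 100] for stability
    torch.nn.utils.clip_grad_value_(policy_net.parameters(), 100)
    optimizer.step()

    # Soft target update: theta_target <- tau * theta + (1 - tau) * theta_target
    for p, tp in zip(policy_net.parameters(), target_net.parameters()):
        tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)
```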
Environment
The Lunar Lander environment challenges the agent to:
- Land between two yellow flags
- Minimize fuel consumption
- Avoid crashes
- Control position and velocity
Reward Structure (see the example after this list):
- Reward increases the closer the lander is to the landing pad, and decreases as it moves away
- Reward increases the slower the lander is moving, and decreases the faster it moves
- Legs touching ground: +10 each
- Side engine use: -0.03 per frame
- Main engine use: -0.3 per frame
- Crash: -100
- Safe landing: +100
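To see this reward structure in action, you can roll out a random policy and watch the per-step penalties and terminal bonus add up to the episode return:

```python
import gymnasium as gym

# A random policy usually crashes, so the per-step engine costs plus the
# terminal -100 crash penalty yield a clearly negative return.
env = gym.make("LunarLander-v3")
state, info = env.reset(seed=0)
total = 0.0
done = False
while not done:
    state, reward, terminated, truncated, info = env.step(env.action_space.sample())
    total += reward
    done = terminated or truncated
print(f"Random-policy episode return: {total:.2f}")
env.close()
```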
Citation
```bibtex
@misc{dqn_lunar_lander,
  author       = {ketencrypt10n},
  title        = {DQN Agent for Lunar Lander},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ketencrypt10n/dqn-lunar-lander}}
}
```
License
This model is released for educational and research purposes.