DQN Agent for Lunar Lander
A Deep Q-Network (DQN) agent trained on the LunarLander-v3 environment from Gymnasium, implemented from scratch using PyTorch.
Model Description
This model implements a DQN agent with experience replay and target network for stable training. The agent learns to land a lunar module safely between two flags on the moon's surface.
Architecture
- Input: 8-dimensional state space (x/y position, x/y velocity, angle, angular velocity, and left/right leg contact flags); see the network sketch after this list
- Hidden Layer: 256 neurons with ReLU activation
- Output: 4 actions (do nothing, fire left engine, fire main engine, fire right engine)
- Loss Function: Smooth L1 Loss (Huber Loss)
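For reference, the network described above can be sketched in a few lines of PyTorch; this mirrors the DQN class defined in the Usage section below:

```python
import torch
import torch.nn as nn

# Equivalent sketch of the architecture above:
# 8-dim state -> 256 ReLU units -> 4 Q-values (one per action)
net = nn.Sequential(
    nn.Linear(8, 256),
    nn.ReLU(),
    nn.Linear(256, 4),
)

q_values = net(torch.randn(1, 8))  # a batch containing one state
print(q_values.shape)  # torch.Size([1, 4])
```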
Training Configuration
```json
{
  "algorithm": "DQN",
  "environment": "LunarLander-v3",
  "total_episodes": 1700,
  "replay_buffer_size": 50000,
  "batch_size": 128,
  "hidden_size": 256,
  "epsilon_init": 1.0,
  "epsilon_min": 0.01,
  "epsilon_decay": 0.998,
  "learning_rate": 0.0003,
  "gamma": 0.99,
  "tau": 0.005,
  "optimizer": "AdamW",
  "loss_function": "SmoothL1Loss"
}
```
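To get a feel for the exploration schedule this configuration implies, here is a minimal sketch of the epsilon decay, assuming epsilon is multiplied by epsilon_decay once per episode (a common convention; the original training loop may decay per step instead):

```python
epsilon_init, epsilon_min, epsilon_decay = 1.0, 0.01, 0.998

epsilon = epsilon_init
for episode in range(1700):
    # ... collect one episode, acting randomly with probability epsilon ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

print(round(epsilon, 3))  # 0.998 ** 1700 ≈ 0.033, still above epsilon_min
```

At this rate the floor of 0.01 would only be reached after roughly 2,300 episodes, so exploration never fully bottoms out within the 1,700 configured episodes.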
Training Results
- Total Episodes: 4000
- Final 100-Episode Average Reward: 105.12
- Best Reward: 309.32
- Average Episode Length (last 100): 624.98 steps
Key Features
- Experience Replay: 50,000-transition buffer for stable learning (see the sketch after this list)
- Target Network: Soft updates (τ=0.005) every 15 steps
- Epsilon-Greedy Exploration: Decaying from 1.0 to 0.01
- Gradient Clipping: Prevents exploding gradients
- Early Stopping: Episodes terminate at reward > 500
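A minimal sketch of the replay buffer described above; the class and method names are illustrative, not the original training code:

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ("state", "action", "reward", "next_state", "done"))

class ReplayBuffer:
    """Fixed-size FIFO buffer; uniform sampling breaks temporal correlation."""
    def __init__(self, capacity=50_000):
        self.memory = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size=128):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```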
Usage
Installation
```bash
pip install "gymnasium[box2d]" torch numpy matplotlib
```

The `box2d` extra is required for the Lunar Lander environments.
Loading and Using the Model
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import gymnasium as gym
from huggingface_hub import hf_hub_download

# Define the network architecture (must match training)
class DQN(nn.Module):
    def __init__(self, state_dim=8, action_dim=4, hidden_size=256):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

# Download model
model_path = hf_hub_download(
    repo_id="ketencrypt10n/dqn-lunar-lander",
    filename="model.pth"
)

# Load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
policy_net = DQN(state_dim=8, action_dim=4, hidden_size=256).to(device)
policy_net.load_state_dict(torch.load(model_path, map_location=device))
policy_net.eval()

# Test the agent
env = gym.make("LunarLander-v3", render_mode="human")
for episode in range(30):
    state, info = env.reset()
    done = False
    total_reward = 0
    while not done:
        state_tensor = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
        with torch.no_grad():
            # Greedy action: index of the largest Q-value
            action = policy_net(state_tensor).max(1).indices.item()
        state, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated
    print(f"Episode: {episode + 1}, Total Reward: {total_reward:.2f}")
env.close()
```
Visualizing Training Progress
```python
import numpy as np
import matplotlib.pyplot as plt
from huggingface_hub import hf_hub_download

# Download training history
reward_path = hf_hub_download(repo_id="ketencrypt10n/dqn-lunar-lander", filename="reward_history.npy")
duration_path = hf_hub_download(repo_id="ketencrypt10n/dqn-lunar-lander", filename="episode_durations.npy")
rewards = np.load(reward_path)
durations = np.load(duration_path)

# Plot rewards and episode durations with a 100-episode moving average
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(rewards, alpha=0.6)
plt.plot(np.convolve(rewards, np.ones(100) / 100, mode='valid'), linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('Training Rewards')
plt.legend(['Raw', '100-episode MA'])

plt.subplot(1, 2, 2)
plt.plot(durations, alpha=0.6)
plt.plot(np.convolve(durations, np.ones(100) / 100, mode='valid'), linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Steps')
plt.title('Episode Duration')
plt.legend(['Raw', '100-episode MA'])

plt.tight_layout()
plt.show()
```
Implementation Details
This implementation was built from scratch using:
- PyTorch for neural network implementation
- NumPy for numerical operations
- Gymnasium for the environment
- No high-level RL libraries (no Stable Baselines3, RLlib, etc.)
Key Implementation Choices
- Soft Target Updates: Instead of periodic hard updates, uses soft updates (τ=0.005) every 15 steps for smoother learning (sketched after this list)
- Huber Loss: More robust to outliers than MSE
- Gradient Clipping: Clips gradients to [-100, 100] for stability
- Early Episode Termination: Stops episodes when reward exceeds 500 to save computation
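Putting these choices together, a single update step might look like the sketch below; the function signature and variable names are assumptions, not the original code:

```python
import torch
import torch.nn.functional as F

def optimize(policy_net, target_net, optimizer, batch, gamma=0.99, tau=0.005):
    """One DQN update reflecting the choices above (a sketch, not the original code)."""
    # Pre-batched tensors: states/next_states are float [B, 8], actions long [B],
    # rewards float [B], dones float 0/1 [B]
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions actually taken
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target from the slowly-moving target network
    with torch.no_grad():
        next_q = target_net(next_states).max(1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    # Huber loss: quadratic near zero, linear for large TD errors
    loss = F.smooth_l1_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    # Clip gradients to [-100, 100] for stability
    torch.nn.utils.clip_grad_value_(policy_net.parameters(), 100)
    optimizer.step()

    # Soft target update: theta_target <- tau * theta + (1 - tau) * theta_target
    for p, tp in zip(policy_net.parameters(), target_net.parameters()):
        tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)
```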
Environment
The Lunar Lander environment challenges the agent to:
- Land between two yellow flags
- Minimize fuel consumption
- Avoid crashes
- Control position and velocity
Reward Structure (see the example after this list):
- Reward increases the closer the lander is to the landing pad, and decreases as it moves away
- Reward increases the slower the lander is moving, and decreases the faster it moves
- Legs touching ground: +10 each
- Side engine use: -0.03 per frame
- Main engine use: -0.3 per frame
- Crash: -100
- Safe landing: +100
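To see this reward structure in action, you can roll out a random policy and watch the per-step penalties and terminal bonus add up to the episode return:

```python
import gymnasium as gym

# A random policy usually crashes, so the per-step engine costs plus the
# terminal -100 crash penalty yield a clearly negative return.
env = gym.make("LunarLander-v3")
state, info = env.reset(seed=0)
total = 0.0
done = False
while not done:
    state, reward, terminated, truncated, info = env.step(env.action_space.sample())
    total += reward
    done = terminated or truncated
print(f"Random-policy episode return: {total:.2f}")
env.close()
```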
Citation
```bibtex
@misc{dqn_lunar_lander,
  author       = {ketencrypt10n},
  title        = {DQN Agent for Lunar Lander},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ketencrypt10n/dqn-lunar-lander}}
}
```
License
This model is released for educational and research purposes.