Huggbottle committed
Commit 0c5a22e · verified · 1 Parent(s): c3785d0

Update README.md

Files changed (1)
1. README.md +6 -416
README.md CHANGED
@@ -6,7 +6,7 @@ tags:
  - custom-implementation
  - deep-rl-class
  model-index:
- - name: Pixelcopter-RL
+ - name: DeepRL_pixelcopter_policy
  results:
  - task:
  type: reinforcement-learning
@@ -16,422 +16,12 @@ model-index:
  type: Pixelcopter-PLE-v0
  metrics:
  - type: mean_reward
- value: 13.10 +/- 6.89
+ value: 31.50 +/- 29.61
  name: mean_reward
  verified: false
  ---
- # REINFORCE Agent for Pixelcopter-PLE-v0
-
- ## Model Description
-
- This repository contains a trained REINFORCE (Monte Carlo policy gradient) agent that plays Pixelcopter-PLE-v0, a challenging helicopter navigation game from the PyGame Learning Environment (PLE). The agent learns its flight-control policy through trial and error, directly from episode returns.
-
- ### Model Details
-
- - **Algorithm**: REINFORCE (Monte Carlo Policy Gradient)
- - **Environment**: Pixelcopter-PLE-v0 (PyGame Learning Environment)
- - **Framework**: Custom implementation following Deep RL Course guidelines
- - **Task Type**: Discrete control (binary actions)
- - **Action Space**: Discrete (2 actions: do nothing or thrust up)
- - **Observation Space**: Pixel-based or feature-based state representation
-
- ### Environment Overview
-
- Pixelcopter-PLE-v0 is a classic helicopter control game:
- - **Objective**: Navigate a helicopter through obstacles without crashing
- - **Challenge**: Avoiding the ceiling, floor, and obstacles requires precise timing and control
- - **Physics**: Gravity constantly pulls the helicopter down; the agent must apply thrust to maintain altitude
- - **Scoring**: Points are awarded for surviving longer and navigating through gaps
- - **Difficulty**: Success requires learning temporal dependencies and precise action timing
-
- ## Performance
-
- The trained REINFORCE agent achieves the following performance:
-
- - **Mean Reward**: 13.10 ± 6.89
- - **Performance Analysis**: Clearly above a random baseline, which is solid for this difficult environment
- - **Consistency**: The standard deviation indicates moderate episode-to-episode variability, as expected for a stochastic policy
-
- ### Performance Context
-
- A mean reward of 13.10 shows that the agent has learned to:
- - Navigate through multiple obstacles before crashing
- - Balance altitude control against obstacle avoidance
- - Time its thrust inputs
- - Survive consistently longer than a random policy
-
- The variability (±6.89) is characteristic of policy gradient methods: the learned policy is stochastic, so episode outcomes differ depending on which actions are sampled.
-
- ## Algorithm: REINFORCE
-
- REINFORCE is a foundational policy gradient algorithm:
- - **Direct Policy Learning**: Learns a parameterized policy directly, with no value function
- - **Monte Carlo Updates**: Uses complete episode returns for policy updates
- - **Policy Gradient**: Moves policy parameters in the direction of higher expected return (see the update rule sketched below)
- - **Stochastic Policy**: Samples actions probabilistically, which provides built-in exploration
-
- ### Key Advantages
- - Simple and intuitive policy gradient approach
- - Works with both discrete and continuous action spaces
- - No need for value function approximation
- - A good educational foundation for understanding policy gradients
-
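- Concretely, one common statement of the Monte Carlo policy-gradient estimator (notation is mine, not taken from this repo's code) is:
-
- ```latex
- % REINFORCE gradient estimate for one episode of length T,
- % where G_t is the discounted return from step t:
- G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k, \qquad
- \nabla_\theta J(\theta) \approx \sum_{t=0}^{T} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
- ```
-
- The `update_policy` method in the training code below implements this estimator, additionally normalizing the returns to reduce gradient variance.
-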
- ## Usage
-
- ### Installation Requirements
-
- ```bash
- # Core dependencies
- pip install torch
- pip install numpy matplotlib
- pip install pygame
-
- # PLE is not published on PyPI; install it from GitHub
- pip install git+https://github.com/ntasfi/PyGame-Learning-Environment.git
-
- # For visualization and analysis
- pip install pillow
- pip install imageio  # for gif creation
- ```
-
- ### Loading and Using the Model
-
- ```python
- import torch
- import numpy as np
- from ple import PLE
- from ple.games.pixelcopter import Pixelcopter
-
- # Load the trained model.
- # Note: adjust the path to your model file. Loading a fully pickled module
- # requires the PolicyNetwork class to be importable; on PyTorch >= 2.6 you
- # must also pass weights_only=False for this kind of checkpoint.
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
- model = torch.load("pixelcopter_reinforce_model.pth", map_location=device, weights_only=False)
- model.eval()
-
- # Create the environment
- def create_pixelcopter_env():
-     game = Pixelcopter()
-     env = PLE(game, fps=30, display_screen=True)  # display_screen=False for headless use
-     return env
-
- # Initialize environment
- env = create_pixelcopter_env()
- env.init()
- ACTION_SET = env.getActionSet()  # PLE's act() expects entries of this set, not bare indices
-
- # Preprocessing function (adjust based on your model's input requirements)
- def preprocess_state(state):
-     """
-     Preprocess the game state for the neural network.
-     This must match the preprocessing used during training.
-     """
-     if isinstance(state, np.ndarray) and len(state.shape) == 3:
-         # Image input: channel-first, pixels normalized to [0, 1]
-         state = np.transpose(state, (2, 0, 1))
-         state = state / 255.0
-         return state.flatten()  # or keep as an image, depending on the model
-     else:
-         # Feature input: env.getGameState() returns a dict of scalars
-         return np.array(list(state.values()))
-
- # Run trained agent
- def run_agent(model, env, episodes=5):
-     total_rewards = []
-
-     for episode in range(episodes):
-         env.reset_game()
-         episode_reward = 0
-
-         while not env.game_over():
-             # Get and preprocess the current state
-             state = env.getScreenRGB()  # or env.getGameState() if using features
-             state = preprocess_state(state)
-             state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
-
-             # Sample an action index from the policy
-             with torch.no_grad():
-                 action_probs = model(state_tensor)
-                 action = torch.multinomial(action_probs, 1).item()
-
-             # Execute the action, mapping the index through the PLE action set
-             # (verify the index order matches the mapping used during training)
-             reward = env.act(ACTION_SET[action])
-             episode_reward += reward
-
-         total_rewards.append(episode_reward)
-         print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}")
-
-     mean_reward = np.mean(total_rewards)
-     std_reward = np.std(total_rewards)
-     print(f"\nAverage Performance: {mean_reward:.2f} ± {std_reward:.2f}")
-
-     return total_rewards
-
- # Run the agent
- rewards = run_agent(model, env, episodes=10)
- ```
-
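- If you prefer the safer `state_dict` format over a fully pickled module, exporting and loading look like this sketch (the weights filename is hypothetical, not a file shipped in this repo; `PolicyNetwork`, `env`, and `preprocess_state` are defined in the surrounding sections):
-
- ```python
- import torch
-
- # Export: save only the parameters rather than pickling the whole module
- # torch.save(agent.policy_net.state_dict(), "pixelcopter_reinforce_weights.pth")
-
- # Load: rebuild the architecture, then restore the parameters.
- # The input size must match the training-time preprocessing.
- state_size = len(preprocess_state(env.getScreenRGB()))
- model = PolicyNetwork(state_size, action_size=2)
- model.load_state_dict(torch.load("pixelcopter_reinforce_weights.pth", map_location="cpu"))
- model.eval()
- ```
-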
- ### Training Your Own Agent
-
- ```python
- import torch
- import torch.nn as nn
- import torch.optim as optim
- import numpy as np
- from collections import deque
- from torch.distributions import Categorical
-
- class PolicyNetwork(nn.Module):
-     def __init__(self, state_size, action_size, hidden_size=64):
-         super().__init__()
-         self.fc1 = nn.Linear(state_size, hidden_size)
-         self.fc2 = nn.Linear(hidden_size, hidden_size)
-         self.fc3 = nn.Linear(hidden_size, action_size)
-         self.softmax = nn.Softmax(dim=1)
-
-     def forward(self, x):
-         x = torch.relu(self.fc1(x))
-         x = torch.relu(self.fc2(x))
-         x = self.fc3(x)
-         return self.softmax(x)
-
- class REINFORCEAgent:
-     def __init__(self, state_size, action_size, lr=0.001):
-         self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-         self.policy_net = PolicyNetwork(state_size, action_size).to(self.device)
-         self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)
-
-         self.saved_log_probs = []
-         self.rewards = []
-
-     def select_action(self, state):
-         state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
-         probs = self.policy_net(state)
-         dist = Categorical(probs)
-         action = dist.sample()
-
-         self.saved_log_probs.append(dist.log_prob(action))
-         return action.item()
-
-     def update_policy(self, gamma=0.99):
-         # Calculate discounted returns, working backwards through the episode
-         discounted_rewards = []
-         R = 0
-         for r in reversed(self.rewards):
-             R = r + gamma * R
-             discounted_rewards.insert(0, R)
-
-         # Normalize returns to reduce gradient variance
-         # (skipped for one-step episodes, where std() is undefined)
-         discounted_rewards = torch.FloatTensor(discounted_rewards).to(self.device)
-         if len(discounted_rewards) > 1:
-             discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-8)
-
-         # REINFORCE loss: -log pi(a_t|s_t) * G_t, summed over the episode
-         policy_loss = []
-         for log_prob, reward in zip(self.saved_log_probs, discounted_rewards):
-             policy_loss.append(-log_prob * reward)
-
-         # Update policy
-         self.optimizer.zero_grad()
-         policy_loss = torch.cat(policy_loss).sum()
-         policy_loss.backward()
-         self.optimizer.step()
-
-         # Clear episode data
-         self.saved_log_probs.clear()
-         self.rewards.clear()
-
-         return policy_loss.item()
-
- def train_agent(episodes=2000):
-     # Reuses create_pixelcopter_env and preprocess_state from the loading section
-     env = create_pixelcopter_env()
-     env.init()
-     action_set = env.getActionSet()
-
-     # Determine the state size from your preprocessing
-     state_size = len(preprocess_state(env.getScreenRGB()))
-     action_size = len(action_set)  # two actions for Pixelcopter
-
-     agent = REINFORCEAgent(state_size, action_size)
-     scores = deque(maxlen=100)
-
-     for episode in range(episodes):
-         env.reset_game()
-         episode_reward = 0
-
-         while not env.game_over():
-             state = preprocess_state(env.getScreenRGB())
-             action = agent.select_action(state)
-
-             reward = env.act(action_set[action])
-             agent.rewards.append(reward)
-             episode_reward += reward
-
-         # Update policy after each complete episode (Monte Carlo)
-         loss = agent.update_policy()
-         scores.append(episode_reward)
-
-         if episode % 100 == 0:
-             avg_score = np.mean(scores)
-             print(f"Episode {episode}, Average Score: {avg_score:.2f}, Loss: {loss:.4f}")
-
-     # Save the trained model (pickles the full module; see the loading note above)
-     torch.save(agent.policy_net, "pixelcopter_reinforce_model.pth")
-     return agent
-
- # Train a new agent
- # trained_agent = train_agent()
- ```
-
- ### Evaluation and Analysis
-
- ```python
- import matplotlib.pyplot as plt
- # Reuses torch, np, device, env, ACTION_SET and preprocess_state from the sections above
-
- def evaluate_agent_detailed(model, env, episodes=50):
-     """Detailed evaluation with statistics and visualization."""
-     rewards = []
-     episode_lengths = []
-
-     for episode in range(episodes):
-         env.reset_game()
-         episode_reward = 0
-         steps = 0
-
-         while not env.game_over():
-             state = preprocess_state(env.getScreenRGB())
-             state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
-
-             with torch.no_grad():
-                 action_probs = model(state_tensor)
-                 action = torch.multinomial(action_probs, 1).item()
-
-             reward = env.act(ACTION_SET[action])
-             episode_reward += reward
-             steps += 1
-
-         rewards.append(episode_reward)
-         episode_lengths.append(steps)
-
-         if (episode + 1) % 10 == 0:
-             print(f"Episodes {episode + 1}/{episodes} completed")
-
-     # Statistical analysis
-     mean_reward = np.mean(rewards)
-     std_reward = np.std(rewards)
-     median_reward = np.median(rewards)
-     max_reward = np.max(rewards)
-     min_reward = np.min(rewards)
-     mean_length = np.mean(episode_lengths)
-
-     print("\n--- Evaluation Results ---")
-     print(f"Episodes: {episodes}")
-     print(f"Mean Reward: {mean_reward:.2f} ± {std_reward:.2f}")
-     print(f"Median Reward: {median_reward:.2f}")
-     print(f"Max Reward: {max_reward:.2f}")
-     print(f"Min Reward: {min_reward:.2f}")
-     print(f"Mean Episode Length: {mean_length:.1f} steps")
-
-     # Visualization: per-episode rewards and their distribution
-     plt.figure(figsize=(12, 4))
-
-     plt.subplot(1, 2, 1)
-     plt.plot(rewards)
-     plt.axhline(y=mean_reward, color='r', linestyle='--', label=f'Mean: {mean_reward:.2f}')
-     plt.title('Episode Rewards')
-     plt.xlabel('Episode')
-     plt.ylabel('Reward')
-     plt.legend()
-
-     plt.subplot(1, 2, 2)
-     plt.hist(rewards, bins=20, alpha=0.7)
-     plt.axvline(x=mean_reward, color='r', linestyle='--', label=f'Mean: {mean_reward:.2f}')
-     plt.title('Reward Distribution')
-     plt.xlabel('Reward')
-     plt.ylabel('Frequency')
-     plt.legend()
-
-     plt.tight_layout()
-     plt.show()
-
-     return {
-         'rewards': rewards,
-         'episode_lengths': episode_lengths,
-         'stats': {
-             'mean': mean_reward,
-             'std': std_reward,
-             'median': median_reward,
-             'max': max_reward,
-             'min': min_reward
-         }
-     }
-
- # Run detailed evaluation
- # results = evaluate_agent_detailed(model, env, episodes=100)
- ```
-
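- The install list above includes `imageio` for GIF creation but no recording code; here is a minimal sketch (the function name and output path are mine, and it reuses `model`, `env`, `ACTION_SET`, `preprocess_state`, and `device` from the blocks above):
-
- ```python
- import imageio
-
- def record_episode_gif(model, env, path="pixelcopter_episode.gif", fps=30):
-     """Roll out one episode with the trained policy and save the frames as a GIF."""
-     frames = []
-     env.reset_game()
-     while not env.game_over():
-         # getScreenRGB() frames may need a transpose depending on PLE's screen orientation
-         frames.append(env.getScreenRGB())
-         state = preprocess_state(env.getScreenRGB())
-         state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
-         with torch.no_grad():
-             action = torch.multinomial(model(state_tensor), 1).item()
-         env.act(ACTION_SET[action])
-     imageio.mimsave(path, frames, fps=fps)
-
- # record_episode_gif(model, env)
- ```
-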
- ## Training Information
-
- ### Hyperparameters
-
- The exact hyperparameter values used for this checkpoint are not recorded here; the example training script above uses:
- - **Learning Rate**: 1e-3 with the Adam optimizer
- - **Discount Factor (γ)**: 0.99
- - **Network Architecture**: A multi-layer perceptron with two 64-unit hidden layers
- - **Training Episodes**: On the order of 2000, enough to learn the temporal patterns of the game
-
- ### Training Environment
-
- - **State Representation**: Processed game screen or extracted features
- - **Action Space**: Binary discrete actions (do nothing vs. thrust)
- - **Reward Signal**: Game score progression with survival bonus
- - **Training Episodes**: Extended training to achieve stable performance
-
- ### Algorithm Characteristics
-
- - **Sample Efficiency**: Moderate (typical for policy gradient methods)
- - **Stability**: Good convergence with proper hyperparameter tuning
- - **Exploration**: Built-in through the stochastic policy
- - **Interpretability**: Clear policy learning through gradient ascent
-
- ## Limitations and Considerations
-
- - **Sample Efficiency**: REINFORCE requires many episodes to learn effectively
- - **Variance**: Monte Carlo policy gradient estimates can have high variance
- - **Environment Specific**: Trained specifically for Pixelcopter's game mechanics; the policy does not transfer to other tasks
- - **Stochastic Performance**: Episode rewards vary because actions are sampled from the policy
- - **Real-time Performance**: Inference is a single small forward pass, fast enough for real-time play
-
- ## Related Work and Extensions
-
- This model serves as an educational example for:
- - **Policy Gradient Methods**: Understanding direct policy optimization
- - **Deep Reinforcement Learning**: Practical implementation of RL algorithms
- - **Game AI**: Learning complex temporal control tasks
- - **Baseline Comparisons**: A foundation for more advanced algorithms (A2C, PPO, etc.)
-
- ## Citation
-
- If you use this model in your research or educational projects, please cite:
-
- ```bibtex
- @misc{pixelcopter_reinforce_2024,
-   title={REINFORCE Agent for Pixelcopter-PLE-v0},
-   author={Adilbai},
-   year={2024},
-   publisher={Hugging Face},
-   howpublished={\url{https://huggingface.co/Adilbai/Pixelcopter-RL}},
-   note={Trained following Deep RL Course Unit 4}
- }
- ```
-
- ## Educational Resources
-
- This model was developed following **Unit 4 of the Deep Reinforcement Learning Course**:
- - **Course Link**: [https://huggingface.co/deep-rl-course/unit4/introduction](https://huggingface.co/deep-rl-course/unit4/introduction)
- - **Topic**: Policy Gradient Methods and REINFORCE
- - **Learning Objectives**: Understanding policy-based RL algorithms
-
- For comprehensive coverage of REINFORCE and policy gradient methods, refer to the complete course materials.
-
- ## License
-
- This model is distributed under the MIT License and is intended for educational and research purposes.
+
+ # **Reinforce** Agent playing **Pixelcopter-PLE-v0**
+ This is a trained model of a **Reinforce** agent playing **Pixelcopter-PLE-v0**.
+ To learn to use this model and train your own, check Unit 4 of the Deep Reinforcement Learning Course: https://huggingface.co/deep-rl-course/unit4/introduction
+