You can get the Unity environment from GitHub.


Model Card: PPO Agent on 12x12-GrassWorld Deterministic (HushToucans Environment)

Model Details

  • Model Type: Proximal Policy Optimization (PPO)
  • Framework: Stable-Baselines3
  • Environment: Custom Unity ML-Agents environment (HushToucans-12x12-GrassWorld Deterministic)
  • Author: Ahmed El Mahdi BENDOU
  • License: MIT
  • Status: Prototype (first-stage training, baseline policy)

This model is the first trained policy in a planned curriculum learning pipeline. It demonstrates the agent’s ability to learn basic navigation and the reward dynamics of the throw action.


Intended Use

  • Baseline reference for future curriculum learning setups.
  • Educational demonstration of Unity ML-Agents + PPO training.

Not intended for production or safety-critical applications.


Environment Specification

Name: 12x12-GrassWorld Deterministic
Grid size: 12 × 12

Agent Actions:

  • Move Forward
  • Move Backward
  • Turn Left / Turn Right
  • Throw Banana 🍌
  • Do nothing
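The discrete action set above can be sketched as a small enum. The class name and the integer IDs below are illustrative assumptions, not the environment's actual encoding:

```python
from enum import IntEnum

# Hypothetical mapping of the discrete action branch; the actual
# integer IDs used by the environment are not documented in this card.
class GrassWorldAction(IntEnum):
    DO_NOTHING = 0
    MOVE_FORWARD = 1
    MOVE_BACKWARD = 2
    TURN_LEFT = 3
    TURN_RIGHT = 4
    THROW_BANANA = 5
```

A random baseline policy would simply sample uniformly over these six actions each step.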

Rewards:

  • +1 for reaching/scoring a stationed toucan
  • -1 for bumping into walls
  • -0.01 penalty per step (encourages efficiency)
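The reward scheme above can be written as a short per-step function. This is a sketch of the card's description, assuming the step penalty applies on every step (including scoring and collision steps); the environment's exact combination rules are not documented here:

```python
def step_reward(scored_toucan: bool, hit_wall: bool) -> float:
    """Per-step reward as described in the card: +1 for scoring a
    toucan, -1 for bumping into a wall, and a constant -0.01 step
    penalty that encourages shorter episodes."""
    reward = -0.01  # existence penalty, applied every step
    if scored_toucan:
        reward += 1.0
    if hit_wall:
        reward -= 1.0
    return reward
```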

Special Mechanic:

  • Agent can throw the banana at a 27° angle to hit a distant toucan.
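For intuition about the 27° throw, the standard flat-ground ballistic range formula R = v² · sin(2θ) / g shows why a fixed angle yields a fixed reach per launch speed. This is purely illustrative: the environment's actual throw physics (launch speed, launch height, drag) are not documented in this card, and the default gravity value is an assumption:

```python
import math

def throw_range(speed: float, angle_deg: float = 27.0, g: float = 9.81) -> float:
    """Horizontal range of a projectile launched from ground level:
    R = v^2 * sin(2 * theta) / g. Illustrative only; the environment's
    real throw mechanics may differ."""
    theta = math.radians(angle_deg)
    return speed ** 2 * math.sin(2 * theta) / g
```

Note that for a given speed, 27° covers less ground than the maximum-range angle of 45°, so the mechanic rewards throwing from an appropriate distance rather than from anywhere.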


Training Details

  • Trainer: PPO
  • Max steps: 5,000,000
  • Checkpoint frequency: every 200,000 steps

Hyperparameters

  • Batch size: 512
  • Buffer size: 51,200
  • Learning rate: 0.0001 (linear decay)
  • β (entropy regularization): 0.001
  • ε (PPO clip range): 0.2
  • λ (GAE): 0.99
  • Epochs per update: 3
  • Time horizon: 1000

Network Settings

  • Hidden units: 128
  • Layers: 2 fully connected
  • Normalization: Enabled

Reward Signals

  • Extrinsic: γ = 0.99, strength = 1.0
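The settings above follow the shape of a Unity ML-Agents trainer configuration. A hedged reconstruction is shown below; the behavior name `GrassWorldAgent` and the file layout are assumptions, only the numeric values come from this card:

```yaml
behaviors:
  GrassWorldAgent:            # behavior name is an assumption
    trainer_type: ppo
    max_steps: 5000000
    checkpoint_interval: 200000
    time_horizon: 1000
    hyperparameters:
      batch_size: 512
      buffer_size: 51200
      learning_rate: 1.0e-4
      learning_rate_schedule: linear
      beta: 0.001              # entropy regularization
      epsilon: 0.2             # PPO clip range
      lambd: 0.99              # GAE lambda
      num_epoch: 3
    network_settings:
      hidden_units: 128
      num_layers: 2
      normalize: true
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
```

Such a file would typically be passed to `mlagents-learn` together with the built Unity environment.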

The policy has achieved basic competency: moving, avoiding walls, and occasionally scoring via the throw mechanic.


Evaluation

  • Observed Behavior: The agent successfully navigates to targets but still exhibits inefficient wandering. Throwing is used inconsistently.

  • Limitations:

    • Overfitting to deterministic transitions.
    • Suboptimal exploration.
    • No stochasticity introduced yet (to be addressed in future curriculum).

Future experiments will evaluate robustness in stochastic or adversarial variations of GrassWorld.


Future Work

This model is the first step in a broader curriculum learning experiment, which will involve:

  1. Scaling from deterministic → stochastic environments.
  2. Introducing dynamic rewards and multiple agents.
  3. Logging and reproducibility reports hosted on GitHub.

Citation

If you use this model, please cite:

@misc{bendou2025grassworldppo,
  author       = {Ahmed El Mahdi BENDOU},
  title        = {PPO Agent trained on ToucanHush 12x12-GrassWorld Deterministic (Unity ML-Agents)},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/partzel/ToucanHush-12x12GrassWorldDeterministic}},
}

Assets Pack

All assets were custom-made for this environment, and you can get them for free here.

