You can get the Unity environment from GitHub.


Model Card: PPO Agent on 12x12-GrassWorld Deterministic (HushToucans Environment)

Model Details

  • Model Type: Proximal Policy Optimization (PPO)
  • Framework: Stable-Baselines3
  • Environment: Custom Unity ML-Agents environment (HushToucans-12x12-GrassWorld Deterministic)
  • Author: Ahmed El Mahdi BENDOU
  • License: MIT
  • Status: Prototype (first-stage training, baseline policy)

This model is the first trained policy in a planned curriculum learning pipeline. It demonstrates the agent’s ability to learn basic navigation and the reward dynamics of the throw action.


Intended Use

  • Baseline reference for future curriculum learning setups.
  • Educational demonstration of Unity ML-Agents + PPO training.

Not intended for production or safety-critical applications.


Environment Specification

Name: 12x12-GrassWorld Deterministic
Grid size: 12 × 12

Agent Actions:

  • Move Forward
  • Move Backward
  • Turn Left / Turn Right
  • Throw Banana 🍌
  • Do nothing
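The discrete action set above can be sketched as a small enum. The class name and the integer IDs below are illustrative assumptions, not the environment's actual encoding:

```python
from enum import IntEnum

# Hypothetical mapping of the discrete action branch; the actual
# integer IDs used by the environment are not documented in this card.
class GrassWorldAction(IntEnum):
    DO_NOTHING = 0
    MOVE_FORWARD = 1
    MOVE_BACKWARD = 2
    TURN_LEFT = 3
    TURN_RIGHT = 4
    THROW_BANANA = 5
```

A random baseline policy would simply sample uniformly over these six actions each step.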

Rewards:

  • +1 for reaching/scoring a stationed toucan
  • -1 for bumping into walls
  • -0.01 penalty per step (encourages efficiency)
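The reward scheme above can be written as a short per-step function. This is a sketch of the card's description, assuming the step penalty applies on every step (including scoring and collision steps); the environment's exact combination rules are not documented here:

```python
def step_reward(scored_toucan: bool, hit_wall: bool) -> float:
    """Per-step reward as described in the card: +1 for scoring a
    toucan, -1 for bumping into a wall, and a constant -0.01 step
    penalty that encourages shorter episodes."""
    reward = -0.01  # existence penalty, applied every step
    if scored_toucan:
        reward += 1.0
    if hit_wall:
        reward -= 1.0
    return reward
```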

Special Mechanic:

  • Agent can throw the banana at a 27° angle to hit a distant toucan.
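For intuition about the 27° throw, the standard flat-ground ballistic range formula R = v² · sin(2θ) / g shows why a fixed angle yields a fixed reach per launch speed. This is purely illustrative: the environment's actual throw physics (launch speed, launch height, drag) are not documented in this card, and the default gravity value is an assumption:

```python
import math

def throw_range(speed: float, angle_deg: float = 27.0, g: float = 9.81) -> float:
    """Horizontal range of a projectile launched from ground level:
    R = v^2 * sin(2 * theta) / g. Illustrative only; the environment's
    real throw mechanics may differ."""
    theta = math.radians(angle_deg)
    return speed ** 2 * math.sin(2 * theta) / g
```

Note that for a given speed, 27° covers less ground than the maximum-range angle of 45°, so the mechanic rewards throwing from an appropriate distance rather than from anywhere.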


Training Details

  • Trainer: PPO
  • Max steps: 5,000,000
  • Checkpoint frequency: every 200,000 steps

Hyperparameters

  • Batch size: 512
  • Buffer size: 51,200
  • Learning rate: 0.0001 (linear decay)
  • β (entropy regularization): 0.001
  • ε (PPO clip range): 0.2
  • λ (GAE): 0.99
  • Epochs per update: 3
  • Time horizon: 1000

Network Settings

  • Hidden units: 128
  • Layers: 2 fully connected
  • Normalization: Enabled

Reward Signals

  • Extrinsic: γ = 0.99, strength = 1.0
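The settings above follow the shape of a Unity ML-Agents trainer configuration. A hedged reconstruction is shown below; the behavior name `GrassWorldAgent` and the file layout are assumptions, only the numeric values come from this card:

```yaml
behaviors:
  GrassWorldAgent:            # behavior name is an assumption
    trainer_type: ppo
    max_steps: 5000000
    checkpoint_interval: 200000
    time_horizon: 1000
    hyperparameters:
      batch_size: 512
      buffer_size: 51200
      learning_rate: 1.0e-4
      learning_rate_schedule: linear
      beta: 0.001              # entropy regularization
      epsilon: 0.2             # PPO clip range
      lambd: 0.99              # GAE lambda
      num_epoch: 3
    network_settings:
      hidden_units: 128
      num_layers: 2
      normalize: true
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
```

Such a file would typically be passed to `mlagents-learn` together with the built Unity environment.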

The policy has achieved basic competency: moving, avoiding walls, and occasionally scoring via the throw mechanic.


Evaluation

  • Observed Behavior: The agent successfully navigates to targets but still exhibits inefficient wandering. Throwing is used inconsistently.

  • Limitations:

    • Overfitting to deterministic transitions.
    • Suboptimal exploration.
    • No stochasticity introduced yet (to be addressed in future curriculum).

Future experiments will evaluate robustness in stochastic or adversarial variations of GrassWorld.


Future Work

This model is the first step in a broader curriculum learning experiment, which will involve:

  1. Scaling from deterministic → stochastic environments.
  2. Introducing dynamic rewards and multiple agents.
  3. Logging and reproducibility reports hosted on GitHub.

Citation

If you use this model, please cite:

@misc{bendou2025grassworldppo,
  author       = {Ahmed El Mahdi BENDOU},
  title        = {PPO Agent trained on ToucanHush 12x12-GrassWorld Deterministic (Unity ML-Agents)},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/partzel/ToucanHush-12x12GrassWorldDeterministic}},
}

Assets Pack

All assets were custom-made for this environment, and you can get them for free here.

