# Model Card: PPO Agent on 12x12-GrassWorld Deterministic (HushToucans Environment)

The Unity environment is available on GitHub.
## Model Details
- Model Type: Proximal Policy Optimization (PPO)
- Framework: Stable-Baselines3
- Environment: Custom Unity ML-Agents environment (HushToucans-12x12-GrassWorld Deterministic)
- Author: Ahmed El Mahdi BENDOU
- License: MIT
- Status: Prototype (first-stage training, baseline policy)
This model is the first trained policy in a planned curriculum learning pipeline. It demonstrates the agent's ability to learn basic navigation and the reward dynamics of the throw action.
## Intended Use
- Baseline reference for future curriculum learning setups.
- Educational demonstration of Unity ML-Agents + PPO training.
Not intended for production or safety-critical applications.
## Environment Specification
Name: 12x12-GrassWorld Deterministic
Grid size: 12 × 12
Agent Actions:
- Move Forward
- Move Backward
- Turn Left / Turn Right
- Throw Banana 🍌
- Do nothing
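For reference, the six-action discrete space above could be encoded as follows. The index assignments here are an illustrative assumption, not the environment's actual action mapping:

```python
from enum import IntEnum

class GrassWorldAction(IntEnum):
    # Hypothetical index mapping for the 6-action discrete space;
    # the ordering used by the actual Unity environment may differ.
    DO_NOTHING = 0
    MOVE_FORWARD = 1
    MOVE_BACKWARD = 2
    TURN_LEFT = 3
    TURN_RIGHT = 4
    THROW_BANANA = 5
```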
Rewards:
- +1 for reaching/scoring a stationed toucan
- -1 for bumping into walls
- -0.01 penalty per step (encourages efficiency)
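The reward scheme above can be sketched as a single per-step function. Whether the step penalty is also applied on terminal events is an assumption here:

```python
def step_reward(reached_toucan: bool, hit_wall: bool) -> float:
    """Per-step reward as described in the card: +1 for scoring a
    toucan, -1 for bumping into a wall, -0.01 living penalty."""
    reward = -0.01  # step penalty, encourages shorter episodes
    if reached_toucan:
        reward += 1.0
    if hit_wall:
        reward -= 1.0
    return reward
```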
Special Mechanic:
- The agent can throw at a 27° angle to hit a distant toucan.
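Interpreting the 27° as a launch elevation, the throw's reach follows the standard drag-free projectile-range formula. The throw speed and the drag-free assumption are illustrative; the actual Unity physics may differ:

```python
import math

def throw_range(speed: float, angle_deg: float = 27.0, g: float = 9.81) -> float:
    """Ideal horizontal range of a banana thrown at `angle_deg`
    above the horizontal: R = v^2 * sin(2*theta) / g."""
    theta = math.radians(angle_deg)
    return speed ** 2 * math.sin(2 * theta) / g
```

Note that 27° trades some range for a flatter trajectory compared to the range-maximizing 45°.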
## Training Details
Trainer: PPO
Max steps: 5,000,000
Checkpoint frequency: every 200,000 steps
### Hyperparameters
Batch size: 512
Buffer size: 51,200
Learning rate: 0.0001 (linear decay)
β (entropy regularization): 0.001
ε (PPO clip range): 0.2
λ (GAE): 0.99
Epochs per update: 3
Time horizon: 1000
### Network Settings
Hidden units: 128
Layers: 2 fully connected
Normalization: Enabled
### Reward Signals
Extrinsic:
γ = 0.99
Strength = 1.0
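The training settings, network settings, and reward signal above follow the ML-Agents trainer configuration schema; a sketch of the corresponding `mlagents-learn` YAML (the behavior name `GrassWorld` is a placeholder):

```yaml
behaviors:
  GrassWorld:                  # placeholder behavior name
    trainer_type: ppo
    max_steps: 5000000
    checkpoint_interval: 200000
    time_horizon: 1000
    hyperparameters:
      batch_size: 512
      buffer_size: 51200
      learning_rate: 0.0001
      learning_rate_schedule: linear
      beta: 0.001               # entropy regularization
      epsilon: 0.2              # PPO clip range
      lambd: 0.99               # GAE lambda
      num_epoch: 3
    network_settings:
      hidden_units: 128
      num_layers: 2
      normalize: true
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
```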
## Evaluation

The policy has achieved basic competency: moving, avoiding walls, and occasionally scoring via the throw mechanic.
Observed Behavior: The agent successfully navigates to targets but still exhibits inefficient wandering. Throwing is used inconsistently.
Limitations:
- Overfitting to deterministic transitions.
- Suboptimal exploration.
- No stochasticity introduced yet (to be addressed in future curriculum).
Future experiments will evaluate robustness in stochastic or adversarial variations of GrassWorld.
## Future Work
This model is the first step in a broader curriculum learning experiment, which will involve:
- Scaling from deterministic → stochastic environments.
- Introducing dynamic rewards and multiple agents.
- Logging and reproducibility reports hosted on GitHub.
## Citation
If you use this model, please cite:
```bibtex
@misc{bendou2025grassworldppo,
  author       = {Ahmed El Mahdi BENDOU},
  title        = {PPO Agent trained on ToucanHush 12x12-GrassWorld Deterministic (Unity ML-Agents)},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/partzel/ToucanHush-12x12GrassWorldDeterministic}},
}
```
## Assets Pack

All assets have been custom-made for this environment, and you can get them for free here.
