# openenv-smogon-rl
A LoRA adapter on top of Qwen3-4B-Instruct trained with GRPO to play competitive Pokemon Showdown (Gen 4 OU). This is run 3 – the best-performing checkpoint in the series.
Links:

- GitHub – OpenEnv-WolfeClick: full environment, training pipeline, replay viewer
- Blog post – How I turned competitive Pokemon into an RL environment for LLMs
- Live demo (HF Space) – watch the model play a recorded battle
## Why Pokemon?

Competitive Pokemon isn't just a children's game – it's a partially observable, long-horizon decision problem. Each turn the model must:
- Manage hidden information (opponent team, items, abilities)
- Make tradeoffs that pay off 5–10 turns later (stat setup, sacrifice plays)
- Output valid structured JSON within a constrained legal action space
That makes it a legitimate benchmark for LLM planning and reasoning, and a natural fit for GRPO training with real environment rewards.
## Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-4B-Instruct-2507 |
| Adapter type | LoRA (PEFT) |
| LoRA rank | 32 |
| LoRA alpha | 32 |
| Dropout | 0.0 |
| Target modules | q/k/v/o proj, gate/up/down proj |
| Training method | GRPO on live battle trajectories |
| Format | Gen 4 OU (competitive; an older, more stable format) |
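The adapter configuration in the table above can be expressed as a PEFT `LoraConfig`; this is a sketch for reference (the exact training config lives in the GitHub repo, and `task_type` is an assumption for a causal-LM adapter):

```python
from peft import LoraConfig

# Hyperparameters taken from the Model Details table above.
lora_config = LoraConfig(
    r=32,             # LoRA rank
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```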
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
tokenizer = AutoTokenizer.from_pretrained("Atharva2099/openenv-smogon-rl")
model = PeftModel.from_pretrained(base, "Atharva2099/openenv-smogon-rl")

# The model expects a structured markdown battle state and outputs a JSON action:
# {"action": "move" | "switch", "choice": "Exact Move or Pokemon Name"}
```
For full environment setup and battle loop, see the GitHub repo.
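Before an action reaches the simulator, the model's JSON output has to be checked against the legal action space; a minimal parsing/validation sketch (the `legal_moves`/`legal_switches` names are illustrative, not the repo's API):

```python
import json

def parse_action(raw: str, legal_moves: set, legal_switches: set):
    """Parse the model's JSON output and reject illegal actions."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON is treated as an illegal action
    kind, choice = action.get("action"), action.get("choice")
    if kind == "move" and choice in legal_moves:
        return ("move", choice)
    if kind == "switch" and choice in legal_switches:
        return ("switch", choice)
    return None  # hallucinated move/Pokemon -> illegal-action penalty

# Example: only actions drawn from the legal sets are accepted.
parse_action('{"action": "move", "choice": "Stealth Rock"}',
             {"Stealth Rock", "Earthquake"}, {"Rotom-Wash"})
```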
## Training Pipeline

```
Base: Qwen3-4B-Instruct
        |
[JSON SFT warm-up]    teach the model to output valid action JSON
        |
[Rollout collection]  live battles against RandomPlayer on local Showdown
        |
[GRPO training]       optimize on real shaped rewards from the environment
        |
LoRA checkpoint -> HF Hub
```
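GRPO's core update signal is a group-relative advantage: each rollout's reward is normalized against the other rollouts in its group. A simplified illustration (not the repo's trainer):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: A_i = (r_i - mean(r)) / (std(r) + eps),
    computed over a group of rollouts from the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# Rollouts that beat their group's mean reward get positive advantage.
group_relative_advantages([2.0, -1.0, 2.0, -3.0])
```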
Reward signal (dense, per-turn):

- +1.0 per 10% opponent HP dealt / -1.0 per 10% HP lost
- +3.0 opponent faint / -3.0 self faint
- +0.5 super-effective hit / -1.0 immune/no-effect
- +0.5 per stat stage gained (capped), +1.0 per 10% healed (capped)
- -10.0 illegal action (hallucinated move/Pokemon)
- -0.05 step penalty (anti-stall)
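The shaping above can be sketched as a per-turn function of battle-state deltas. The argument names and the exact cap values are assumptions for illustration; the real reward lives in the training pipeline:

```python
def turn_reward(opp_hp_frac_dealt,    # fraction of opponent HP removed this turn
                own_hp_frac_lost,     # fraction of own HP lost this turn
                opp_fainted, self_fainted,
                super_effective, no_effect,
                stat_stages_gained, hp_frac_healed,
                illegal):
    """Dense per-turn reward matching the shaping described above."""
    if illegal:
        return -10.0                        # hallucinated move/Pokemon
    r = -0.05                               # step penalty (anti-stall)
    r += 10.0 * opp_hp_frac_dealt           # +1.0 per 10% opponent HP dealt
    r -= 10.0 * own_hp_frac_lost            # -1.0 per 10% HP lost
    r += 3.0 if opp_fainted else 0.0
    r -= 3.0 if self_fainted else 0.0
    r += 0.5 if super_effective else 0.0
    r -= 1.0 if no_effect else 0.0
    r += 0.5 * min(stat_stages_gained, 2)   # capped (cap value assumed)
    r += 10.0 * min(hp_frac_healed, 0.5)    # capped (cap value assumed)
    return r
```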
## Checkpoints

| Branch | Notes |
|---|---|
| `grpo-qwen3-4b-run1` | First GRPO run, baseline |
| `grpo-qwen3-4b-run2` | Tuned reward shaping |
| `grpo-qwen3-4b-run3` | Best performing (this branch = `main`) |
## Limitations

- Trained against `RandomPlayer` – will struggle against strong competitive opponents
- Gen 4 OU format only; not tested on other generations or formats
- Small model (4B) with limited working memory for very long battles
- No search or lookahead – purely reactive turn-by-turn policy