# openenv-smogon-rl
A LoRA adapter on top of Qwen3-4B-Instruct trained with GRPO to play competitive Pokemon Showdown (Gen 4 OU). This is run 3 – the best-performing checkpoint in the series.
Links:

- GitHub – OpenEnv-WolfeClick: full environment, training pipeline, replay viewer
- Blog post – How I turned competitive Pokemon into an RL environment for LLMs
- Live demo (HF Space) – watch the model play a recorded battle
## Why Pokemon?

Competitive Pokemon isn't just a children's game – it's a partially observable, long-horizon decision problem. Each turn the model must:
- Manage hidden information (opponent team, items, abilities)
- Make tradeoffs that pay off 5–10 turns later (stat setup, sacrifice plays)
- Output valid structured JSON within a constrained legal action space
That makes it a legitimate benchmark for LLM planning and reasoning, and a natural fit for GRPO training with real environment rewards.
## Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-4B-Instruct-2507 |
| Adapter type | LoRA (PEFT) |
| LoRA rank | 32 |
| LoRA alpha | 32 |
| Dropout | 0.0 |
| Target modules | q/k/v/o proj, gate/up/down proj |
| Training method | GRPO on live battle trajectories |
| Format | Gen 4 OU (competitive; an older, more stable format) |
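The adapter configuration in the table above can be expressed as a PEFT `LoraConfig`; this is a sketch for reference (the exact training config lives in the GitHub repo, and `task_type` is an assumption for a causal-LM adapter):

```python
from peft import LoraConfig

# Hyperparameters taken from the Model Details table above.
lora_config = LoraConfig(
    r=32,             # LoRA rank
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```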
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
tokenizer = AutoTokenizer.from_pretrained("Atharva2099/openenv-smogon-rl")
model = PeftModel.from_pretrained(base, "Atharva2099/openenv-smogon-rl")

# The model expects a structured markdown battle state and outputs a JSON action:
# {"action": "move" | "switch", "choice": "Exact Move or Pokemon Name"}
```
For full environment setup and battle loop, see the GitHub repo.
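Before an action reaches the simulator, the model's JSON output has to be checked against the legal action space; a minimal parsing/validation sketch (the `legal_moves`/`legal_switches` names are illustrative, not the repo's API):

```python
import json

def parse_action(raw: str, legal_moves: set, legal_switches: set):
    """Parse the model's JSON output and reject illegal actions."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON is treated as an illegal action
    kind, choice = action.get("action"), action.get("choice")
    if kind == "move" and choice in legal_moves:
        return ("move", choice)
    if kind == "switch" and choice in legal_switches:
        return ("switch", choice)
    return None  # hallucinated move/Pokemon -> illegal-action penalty

# Example: only actions drawn from the legal sets are accepted.
parse_action('{"action": "move", "choice": "Stealth Rock"}',
             {"Stealth Rock", "Earthquake"}, {"Rotom-Wash"})
```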
## Training Pipeline

```
Base: Qwen3-4B-Instruct
        |
[JSON SFT warm-up]    teach the model to output valid action JSON
        |
[Rollout collection]  live battles against RandomPlayer on local Showdown
        |
[GRPO training]       optimize on real shaped rewards from the environment
        |
LoRA checkpoint -> HF Hub
```
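GRPO's core update signal is a group-relative advantage: each rollout's reward is normalized against the other rollouts in its group. A simplified illustration (not the repo's trainer):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: A_i = (r_i - mean(r)) / (std(r) + eps),
    computed over a group of rollouts from the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# Rollouts that beat their group's mean reward get positive advantage.
group_relative_advantages([2.0, -1.0, 2.0, -3.0])
```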
Reward signal (dense, per-turn):

- +1.0 per 10% opponent HP dealt / -1.0 per 10% HP lost
- +3.0 opponent faint / -3.0 self faint
- +0.5 super-effective hit / -1.0 immune/no-effect
- +0.5 per stat stage gained (capped), +1.0 per 10% healed (capped)
- -10.0 illegal action (hallucinated move/Pokemon)
- -0.05 step penalty (anti-stall)
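The shaping above can be sketched as a per-turn function of battle-state deltas. The argument names and the exact cap values are assumptions for illustration; the real reward lives in the training pipeline:

```python
def turn_reward(opp_hp_frac_dealt,    # fraction of opponent HP removed this turn
                own_hp_frac_lost,     # fraction of own HP lost this turn
                opp_fainted, self_fainted,
                super_effective, no_effect,
                stat_stages_gained, hp_frac_healed,
                illegal):
    """Dense per-turn reward matching the shaping described above."""
    if illegal:
        return -10.0                        # hallucinated move/Pokemon
    r = -0.05                               # step penalty (anti-stall)
    r += 10.0 * opp_hp_frac_dealt           # +1.0 per 10% opponent HP dealt
    r -= 10.0 * own_hp_frac_lost            # -1.0 per 10% HP lost
    r += 3.0 if opp_fainted else 0.0
    r -= 3.0 if self_fainted else 0.0
    r += 0.5 if super_effective else 0.0
    r -= 1.0 if no_effect else 0.0
    r += 0.5 * min(stat_stages_gained, 2)   # capped (cap value assumed)
    r += 10.0 * min(hp_frac_healed, 0.5)    # capped (cap value assumed)
    return r
```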
## Checkpoints

| Branch | Notes |
|---|---|
| `grpo-qwen3-4b-run1` | First GRPO run, baseline |
| `grpo-qwen3-4b-run2` | Tuned reward shaping |
| `grpo-qwen3-4b-run3` | Best performing (this branch = `main`) |
## Limitations

- Trained against `RandomPlayer` – will struggle against strong competitive opponents
- Gen 4 OU format only; not tested on other generations or formats
- Small model (4B) with limited working memory for very long battles
- No search or lookahead – purely reactive turn-by-turn policy