openenv-smogon-rl

A LoRA adapter on top of Qwen3-4B-Instruct, trained with GRPO to play competitive Pokemon Showdown (Gen 4 OU). This is run 3, the best-performing checkpoint in the series.

Why Pokemon?

Competitive Pokemon isn't just a children's game: it's a partially observable, long-horizon decision problem. Each turn the model must:

  • Manage hidden information (opponent team, items, abilities)
  • Make tradeoffs that pay off 5-10 turns later (stat setup, sacrifice plays)
  • Output valid structured JSON within a constrained legal action space

That makes it a legitimate benchmark for LLM planning and reasoning, and a natural fit for GRPO training with real environment rewards.


Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-4B-Instruct-2507 |
| Adapter type | LoRA (PEFT) |
| LoRA rank | 32 |
| LoRA alpha | 32 |
| Dropout | 0.0 |
| Target modules | q/k/v/o proj, gate/up/down proj |
| Training method | GRPO on live battle trajectories |
| Format | Gen 4 OU (competitive, older format; more stable) |
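For reference, the adapter hyperparameters above correspond roughly to the following PEFT configuration. This is an illustrative sketch, not the actual training config from the repo:

```python
from peft import LoraConfig

# Sketch of a LoraConfig matching the table above
# (illustrative only; the real training config may differ).
lora_config = LoraConfig(
    r=32,                 # LoRA rank
    lora_alpha=32,        # LoRA alpha
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```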

How to Use

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
tokenizer = AutoTokenizer.from_pretrained("Atharva2099/openenv-smogon-rl")
model = PeftModel.from_pretrained(base, "Atharva2099/openenv-smogon-rl")

# The model expects a structured markdown battle state and outputs a JSON action:
# {"action": "move" | "switch", "choice": "Exact Move or Pokemon Name"}

For full environment setup and battle loop, see the GitHub repo.
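Since illegal actions are penalized heavily during training (see the reward table below), the battle loop has to check each emitted JSON action against the turn's legal options. A minimal sketch of such a check, with hypothetical function and parameter names (the real repo's validation code may differ):

```python
import json

# Hypothetical helper: validate a raw model completion against the
# {"action": ..., "choice": ...} schema and the current legal options.
def parse_action(completion, legal_moves, legal_switches):
    try:
        action = json.loads(completion)
    except json.JSONDecodeError:
        return None  # malformed JSON is treated as an illegal action
    if action.get("action") == "move" and action.get("choice") in legal_moves:
        return action
    if action.get("action") == "switch" and action.get("choice") in legal_switches:
        return action
    return None  # hallucinated move/pokemon
```

A `None` result would then map to the -10.0 illegal-action penalty rather than being sent to the simulator.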


Training Pipeline

Base: Qwen3-4B-Instruct
    |
[JSON SFT warm-up]         teach the model to output valid action JSON
    |
[Rollout collection]       live battles against RandomPlayer on local Showdown
    |
[GRPO training]            optimize on real shaped rewards from environment
    |
LoRA checkpoint -> HF Hub
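The rollout-collection stage in the diagram can be sketched as a simple episode loop. None of these names are the real environment API; the stub classes below only illustrate the control flow under assumed `reset`/`step` semantics:

```python
# Hypothetical sketch of rollout collection against a random opponent.
# DummyEnv stands in for the local Showdown environment; the real API differs.
class DummyEnv:
    def reset(self):
        self.t = 0
        return "battle start"

    def step(self, action):
        self.t += 1
        # Returns (next state, shaped reward, done); battles end after 3 turns here.
        return f"turn {self.t}", -0.05, self.t >= 3

def collect_rollouts(policy, env, n_battles=2):
    trajectories = []
    for _ in range(n_battles):
        state, done, turns = env.reset(), False, []
        while not done:
            action = policy(state)  # JSON action string from the LLM
            state, reward, done = env.step(action)
            turns.append((action, reward))
        trajectories.append(turns)
    return trajectories

rollouts = collect_rollouts(
    lambda s: '{"action": "move", "choice": "Tackle"}', DummyEnv()
)
```

GRPO then optimizes the policy on the per-turn rewards gathered in these trajectories.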

Reward signal (dense, per-turn):

  • +1.0 per 10% opponent HP dealt / -1.0 per 10% HP lost
  • +3.0 opponent faint / -3.0 self faint
  • +0.5 super effective hit / -1.0 immune/no-effect
  • +0.5 per stat stage gained (capped), +1.0 per 10% healed (capped)
  • -10.0 illegal action (hallucinated move/pokemon)
  • -0.05 step penalty (anti-stall)
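The shaped reward above can be written as a single per-turn function. This is an illustrative sketch; all names are hypothetical, and the cap values are assumptions since the card does not state them:

```python
# Illustrative per-turn reward following the table above.
# Names and cap values are assumptions; the real shaping code may differ.
def turn_reward(dmg_dealt_pct, dmg_taken_pct, opp_faints, self_faints,
                super_effective, no_effect, stat_stages, heal_pct,
                illegal_action):
    r = 0.0
    r += 1.0 * (dmg_dealt_pct / 10)    # +1.0 per 10% opponent HP dealt
    r -= 1.0 * (dmg_taken_pct / 10)    # -1.0 per 10% HP lost
    r += 3.0 * opp_faints - 3.0 * self_faints
    if super_effective:
        r += 0.5
    if no_effect:
        r -= 1.0                       # immune / no-effect hit
    r += 0.5 * min(stat_stages, 2)     # capped (cap of 2 assumed)
    r += 1.0 * min(heal_pct / 10, 2)   # capped (cap of 2 assumed)
    if illegal_action:
        r -= 10.0
    r -= 0.05                          # step penalty (anti-stall)
    return r
```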

Checkpoints

| Branch | Notes |
|---|---|
| grpo-qwen3-4b-run1 | First GRPO run, baseline |
| grpo-qwen3-4b-run2 | Tuned reward shaping |
| grpo-qwen3-4b-run3 | Best performing (this branch = main) |

Limitations

  • Trained against RandomPlayer; will struggle against strong competitive opponents
  • Gen 4 OU format only; not tested on other generations or formats
  • Small model (4B) with limited working memory for very long battles
  • No search or lookahead โ€” purely reactive turn-by-turn policy

Author

Atharva (Medium | GitHub)
