WolfeClick / MODEL_CARD.md
Atharva
Add HF Space battle viewer, polish docs, and deployment files
16c6b0b
metadata
license: mit
base_model: Qwen/Qwen3-4B-Instruct-2507
tags:
  - pokemon
  - reinforcement-learning
  - grpo
  - openenv
  - poke-env
  - lora
datasets: []
pipeline_tag: text-generation

OpenEnv-WolfeClick: GRPO-Trained Pokemon Battler

LoRA adapters for Qwen3-4B-Instruct trained with GRPO to play competitive Pokemon Showdown battles.

Model Description

These adapters are trained on real rollout trajectories collected from live Pokemon Showdown battles (gen4randombattle format). The training pipeline:

  1. JSON warm-up SFT — teaches the model to output valid {"action": "move"|"switch", "choice": "..."} JSON
  2. GRPO training — optimizes the policy using shaped rewards from the OpenEnv-WolfeClick environment

The environment provides dense reward signals including damage dealt/taken, knockouts, healing, setup moves, type effectiveness, and penalties for illegal actions.

Checkpoints

Branch Description
grpo-qwen3-4b-run1 First GRPO run
grpo-qwen3-4b-run2 Second run with tuned reward shaping
grpo-qwen3-4b-run3 Third run, best performing

Usage

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "Qwen/Qwen3-4B-Instruct-2507"
ADAPTER_REPO = "Atharva2099/openenv-smogon-rl"
REVISION = "grpo-qwen3-4b-run3"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, ADAPTER_REPO, revision=REVISION)
model.eval()

# Example battle state prompt
messages = [
    {"role": "system", "content": (
        "You are a competitive Pokemon battler. "
        "Output ONLY a single JSON object.\n"
        '{"action": "move" | "switch", "choice": "Exact Name"}'
    )},
    {"role": "user", "content": "<battle state markdown here>"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=24, temperature=0.3)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Training Details

  • Base model: Qwen3-4B-Instruct-2507
  • Method: LoRA + GRPO (Group Relative Policy Optimization)
  • Environment: gen4randombattle via poke-env + local Pokemon Showdown
  • Reward: Multi-component shaped reward (damage, KOs, healing, setup, type effectiveness, illegal action penalty)
  • Max steps: 30 per battle
  • Battle opponent: RandomPlayer (poke-env built-in)

Links