# RL Training with OpenEnv: 2048 Game

This tutorial covers training a language model to play the 2048 game using
reinforcement learning with GRPO (Group Relative Policy Optimization).

```{note}
**Time**: ~45 minutes | **Difficulty**: Advanced | **GPU Required**: Yes (T4 or better)
```

## What You'll Learn

- **Model Setup**: Load and configure LLMs with Unsloth for efficient RL
- **Environment Connection**: Connect to the 2048 OpenEnv environment
- **Reward Design**: Create effective reward functions
- **GRPO Training**: Train models with reinforcement learning
- **Deployment**: Save and deploy trained models

## Prerequisites

Before starting this tutorial, you should have completed the
[Getting Started](/auto_getting_started/index) series to understand:

- How OpenEnv environments work
- The reset/step/state API pattern
- How to connect to environments

You'll also need:

- A GPU (free T4 on Google Colab works)
- Basic understanding of PyTorch
- ~30 minutes for training

## Part 1: Environment Setup

### Installation

```bash
# Install required packages.
# The leading "!" is notebook (Jupyter/Colab) shell syntax; drop it in a regular terminal.
!pip install -q unsloth openenv-core trl

# For Google Colab, also run:
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
```

### Imports

```python
import torch
from dataclasses import dataclass
from typing import List, Optional, Dict, Any
import random

# Check GPU availability
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
```

## Part 2: Model Configuration

We use Unsloth for memory-efficient training with LoRA adapters.

### Configuration Classes

```python
@dataclass
class ModelConfig:
    """Configuration for loading LLM models."""
    model_name: str = "unsloth/Qwen2.5-1.5B"
    max_seq_length: int = 768
    load_in_4bit: bool = True
    dtype: Optional[str] = None  # Auto-detect


@dataclass
class LoRAConfig:
    """Configuration for LoRA fine-tuning."""
    r: int = 16
    lora_alpha: int = 32
    # None means "use the default attention and MLP projections" (set below)
    target_modules: Optional[List[str]] = None
    lora_dropout: float = 0.0

    def __post_init__(self):
        if self.target_modules is None:
            self.target_modules = [
                "q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj",
            ]
```

### Loading the Model

```python
from unsloth import FastLanguageModel

# Create configurations
model_config = ModelConfig()
lora_config = LoRAConfig()

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_config.model_name,
    max_seq_length=model_config.max_seq_length,
    load_in_4bit=model_config.load_in_4bit,
    dtype=model_config.dtype,
)

# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_config.r,
    target_modules=lora_config.target_modules,
    lora_alpha=lora_config.lora_alpha,
    lora_dropout=lora_config.lora_dropout,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Check parameter counts
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({trainable/total*100:.2f}%)")
```
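
To make the trainable-parameter count printed above less opaque, here is a small sketch of where LoRA's parameters come from: for a linear layer mapping `d_in` to `d_out` features, a rank-`r` adapter adds two small matrices with `r * (d_in + d_out)` weights in total. The layer dimensions below are assumptions for illustration (read the real ones from `model.config`), and `lora_param_count` is a hypothetical helper, not part of Unsloth or PEFT.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed layer shapes for illustration only; check model.config for real values.
hidden, intermediate, kv_dim, layers, r = 1536, 8960, 256, 28, 16

per_layer = (
    2 * lora_param_count(hidden, hidden, r)          # q_proj, o_proj
    + 2 * lora_param_count(hidden, kv_dim, r)        # k_proj, v_proj
    + 3 * lora_param_count(hidden, intermediate, r)  # gate_proj, up_proj, down_proj
)
total_lora = per_layer * layers
print(f"LoRA parameters under these assumptions: {total_lora:,}")
```

Because the count grows linearly in `r`, doubling the rank doubles the adapter size while the frozen base model stays unchanged.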

## Part 3: The 2048 Environment

### Game Overview

2048 is a sliding-tile puzzle played on a 4x4 grid. Each move slides every tile
in one direction, and two equal tiles that collide merge into a single tile of
twice the value.

**Actions:**
- `0` = UP
- `1` = RIGHT
- `2` = DOWN
- `3` = LEFT

**Goal:** Create a tile with value 2048 (or higher!)
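
How tiles combine can be sketched for a single row. This is an illustrative re-implementation, not the environment's code: the OpenSpiel environment applies the rules itself, and `merge_row_left` is a hypothetical helper name.

```python
from typing import List, Tuple


def merge_row_left(row: List[int]) -> Tuple[List[int], int]:
    """Slide one row to the left, merging each pair of equal neighbours once."""
    tiles = [v for v in row if v != 0]  # drop empty cells
    merged: List[int] = []
    gained = 0  # sum of the tiles created by merges
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)
            gained += tiles[i] * 2
            i += 2  # both tiles are consumed by the merge
        else:
            merged.append(tiles[i])
            i += 1
    merged += [0] * (len(row) - len(merged))  # pad with empty cells
    return merged, gained


print(merge_row_left([2, 2, 4, 0]))  # ([4, 4, 0, 0], 4)
print(merge_row_left([2, 2, 2, 2]))  # ([4, 4, 0, 0], 8)
```

The other three directions follow by reversing or transposing the board before applying the same row operation.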

### Connecting to the Environment

```python
from envs.openspiel_env import OpenSpielEnv, OpenSpielAction

# Connect to 2048 environment
# Option 1: From Hub
env = OpenSpielEnv.from_hub("openenv/openspiel-env")

# Option 2: From running server
# env = OpenSpielEnv(base_url="http://localhost:8000")

# Test connection
with env:
    result = env.reset()
    print("Game started!")
    print(f"Legal actions: {result.observation.legal_actions}")

    # Take a test action
    action = OpenSpielAction(action_id=0, game_name="2048")
    result = env.step(action)
    print(f"After UP: reward={result.reward}, done={result.done}")
```

### Board Utilities

```python
import numpy as np
from typing import List

def info_state_to_board(info_state: List[int], size: int = 4) -> List[List[int]]:
    """Convert flat info_state to 2D board."""
    return np.array(info_state, dtype=int).reshape(size, size).tolist()

def render_board(board: List[List[int]]) -> str:
    """Render board as ASCII string."""
    lines = ["+------" * len(board[0]) + "+"]
    for row in board:
        # Each cell is 5 characters wide so it lines up with the border
        cells = [f"{v:5d}" if v > 0 else "    ." for v in row]
        lines.append("|" + " |".join(cells) + " |")
        lines.append("+------" * len(row) + "+")
    return "\n".join(lines)

def get_max_tile(board: List[List[int]]) -> int:
    """Get highest tile value."""
    return max(cell for row in board for cell in row)
```

## Part 4: Reward Function Design

The reward function determines what the policy learns to optimize. Ours considers three things:

1. **Success**: Did we reach 2048?
2. **Progress**: What's the highest tile achieved?
3. **Code Quality**: Did the generated code execute correctly?

### Reward Implementation

```python
import math

def calculate_reward(
    max_tile: int,
    success: bool,
    code_error: bool = False
) -> float:
    """
    Calculate reward for a 2048 game outcome.

    Args:
        max_tile: Highest tile achieved (2, 4, 8, ..., 2048)
        success: Whether we reached 2048
        code_error: Whether generated code had errors

    Returns:
        Float reward value
    """
    if code_error:
        return -0.5  # Penalty for invalid code

    if success:
        return 1.0  # Full reward for winning

    # Progress reward: log scale from 0, capped at 0.9 so that only a win earns 1.0
    if max_tile > 0:
        progress = math.log2(max_tile) / math.log2(2048)
        return min(0.9, progress)

    return 0.0

# Test reward function
test_cases = [
    (2048, True, False, "Won!"),
    (1024, False, False, "Got to 1024"),
    (512, False, False, "Got to 512"),
    (64, False, False, "Early game"),
]

for max_tile, success, error, desc in test_cases:
    reward = calculate_reward(max_tile, success, error)
    print(f"{desc:20s} -> Reward: {reward:+.3f}")
```
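
To see how the log-scale progress term behaves, the sketch below tabulates it for every tile below 2048. Each doubling of the highest tile adds 1/11 (about 0.091) to the reward, and the cap only bites at 1024, where the uncapped value would be 10/11. `progress_reward` is a hypothetical standalone copy of the progress branch above, written out so the block runs on its own.

```python
import math


def progress_reward(max_tile: int, cap: float = 0.9) -> float:
    """Log-scale progress toward 2048, capped below the win reward of 1.0."""
    if max_tile <= 0:
        return 0.0
    return min(cap, math.log2(max_tile) / math.log2(2048))


# Tabulate the reward for tiles 2, 4, ..., 1024
for exponent in range(1, 11):
    tile = 2 ** exponent
    print(f"{tile:5d} -> {progress_reward(tile):.3f}")
```

A linear scale (`max_tile / 2048`) would give almost no signal in the early game, which is one reason to prefer the logarithm here.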

## Part 5: Strategy Generation

We'll train the model to generate Python strategy functions.

### Prompt Template

````python
SYSTEM_PROMPT = """You are an expert at playing 2048. Generate a Python function
that takes a board state and returns the best action (0=UP, 1=RIGHT, 2=DOWN, 3=LEFT).

The board is a 4x4 list of integers. Empty cells are 0.
Your function should analyze the board and return an optimal move.
"""

def create_prompt(board: List[List[int]]) -> str:
    """Create prompt for strategy generation."""
    board_str = "\n".join(str(row) for row in board)
    return f"""{SYSTEM_PROMPT}

Current board:
{board_str}

Generate a strategy function:
```python
def strategy(board):
    # Your code here
    return action  # 0, 1, 2, or 3
```"""
````

### Executing Generated Strategies

```python
import ast

def extract_and_execute_strategy(
    generated_code: str,
    board: List[List[int]],
    timeout: float = 5.0,
) -> tuple[int, bool]:
    """
    Extract and execute a generated strategy function.

    Note: `timeout` is currently unused; this function does not enforce a
    time limit. See "Preventing Reward Hacking" for time and memory limits.

    Returns:
        (action, success): The action to take and whether execution succeeded
    """
    try:
        # Extract the first fenced Python block, if any
        if "```python" in generated_code:
            code = generated_code.split("```python")[1].split("```")[0]
        else:
            code = generated_code

        # Parse first so syntax errors surface before anything runs
        tree = ast.parse(code)

        # Run in a fresh namespace. This is NOT a sandbox: the code still
        # has access to every builtin.
        namespace = {"board": board}
        exec(compile(tree, "<strategy>", "exec"), namespace)

        # Call the strategy function
        if "strategy" in namespace:
            action = namespace["strategy"](board)
            if action in [0, 1, 2, 3]:
                return action, True

        return 0, False  # Default action on failure

    except Exception as e:
        print(f"Strategy execution error: {e}")
        return 0, False
```
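
`ast.parse` on its own only checks syntax. A stricter pre-execution check could walk the tree and reject constructs a strategy function should never need. The sketch below is one possible version: the forbidden node types and names are assumptions chosen for illustration, `validate_strategy_source` is a hypothetical helper, and passing this check does not make untrusted code safe to run.

```python
import ast
from typing import List

# Assumed deny-lists for illustration; adjust to your own threat model.
FORBIDDEN_NODES = (ast.Import, ast.ImportFrom, ast.Global, ast.Nonlocal)
FORBIDDEN_NAMES = {"exec", "eval", "open", "compile", "__import__", "globals", "locals"}


def validate_strategy_source(code: str) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems: List[str] = []
    for node in ast.walk(tree):
        if isinstance(node, FORBIDDEN_NODES):
            problems.append(f"forbidden statement: {type(node).__name__}")
        elif isinstance(node, ast.Name) and node.id in FORBIDDEN_NAMES:
            problems.append(f"forbidden name: {node.id}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            problems.append(f"dunder attribute access: {node.attr}")

    # The prompt asks for a top-level function called `strategy`
    if not any(isinstance(n, ast.FunctionDef) and n.name == "strategy" for n in tree.body):
        problems.append("no top-level function named 'strategy'")
    return problems


print(validate_strategy_source("def strategy(board):\n    return 0"))
print(validate_strategy_source("import os\ndef strategy(board):\n    return 0"))
```

Rejecting a completion here, before `exec`, also gives a cheap signal that can feed the code-error penalty in the reward.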
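
## Part 6: GRPO Training

GRPO (Group Relative Policy Optimization) samples a group of completions for each
prompt and scores every completion against the group's average reward, so it
needs no separate value (critic) model.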
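
The "group relative" part can be illustrated with plain Python. In this sketch each completion's advantage is its reward minus the group mean, divided by the group's standard deviation. Using the population standard deviation and `eps = 1e-4` are assumptions made here; actual implementations differ in such normalization details, and `group_relative_advantages` is a hypothetical helper, not a TRL function.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize the rewards of one prompt's completions against the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for the same prompt, scored by a reward function
rewards = [1.0, 0.5, 0.0, 0.5]
print(group_relative_advantages(rewards))
```

If every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is why reward functions with some spread across completions tend to work better.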

### Training Configuration

```python
from trl import GRPOConfig, GRPOTrainer
from unsloth import is_bfloat16_supported

grpo_config = GRPOConfig(
    # Learning rate
    learning_rate=2e-6,

    # Batch sizes
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,

    # Training duration
    max_steps=200,

    # Memory optimization: use bf16 where the GPU supports it, else fp16
    # (a T4 does not support bf16)
    bf16=is_bfloat16_supported(),
    fp16=not is_bfloat16_supported(),
    gradient_checkpointing=True,

    # Logging
    logging_steps=1,
    output_dir="./2048_grpo_output",
    report_to="none",
)
```

### Training Loop

```python
def train_2048_agent(
    model,
    tokenizer,
    env,
    config: GRPOConfig,
    num_episodes: int = 100,
):
    """
    Roll out the current policy on 2048 and collect training samples.

    Note: this loop only gathers (prompt, response, reward) records. It never
    calls the optimizer, and `config` is not used here.
    """
    # Prepare model for training
    FastLanguageModel.for_training(model)

    training_data = []

    for episode in range(num_episodes):
        # Reset environment
        result = env.reset()
        board = info_state_to_board(result.observation.info_state)

        episode_reward = 0
        steps = 0
        prompt, generated = "", ""  # defined even if the episode ends immediately

        while not result.done and steps < 1000:
            # Generate strategy
            prompt = create_prompt(board)
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                temperature=0.7,
                do_sample=True,
            )

            # Decode only the newly generated tokens. The prompt itself contains
            # a ```python template, which would otherwise be extracted instead
            # of the model's answer.
            prompt_len = inputs["input_ids"].shape[1]
            generated = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)

            # Execute strategy
            action, success = extract_and_execute_strategy(generated, board)

            # Take action in environment
            env_action = OpenSpielAction(action_id=action, game_name="2048")
            result = env.step(env_action)

            # Update board
            board = info_state_to_board(result.observation.info_state)
            episode_reward += result.reward if result.reward else 0
            steps += 1

        # Calculate final reward
        max_tile = get_max_tile(board)
        final_reward = calculate_reward(max_tile, max_tile >= 2048)

        # Store for training (the last prompt/response of the episode)
        training_data.append({
            "prompt": prompt,
            "response": generated,
            "reward": final_reward,
        })

        if episode % 10 == 0:
            print(f"Episode {episode}: Max tile={max_tile}, Reward={final_reward:.3f}")

    return training_data
```
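
To connect rewards to `GRPOTrainer`, TRL expects reward functions that take a batch of completions and return one float per completion. The sketch below follows that convention as I understand it from the TRL documentation (plain-string completions, extra dataset columns arriving as keyword arguments); `code_validity_reward` is a hypothetical function that only checks whether a completion contains a parsable `strategy` function, mirroring the `-0.5` code-error penalty used earlier.

```python
import ast
from typing import Any, List

FENCE = "`" * 3  # three backticks, built this way to keep this example fence-safe


def code_validity_reward(completions: List[str], **kwargs: Any) -> List[float]:
    """Return -0.5 for completions without a parsable `strategy` function, else 0.0."""
    rewards = []
    for text in completions:
        # Take the first fenced Python block if present, else the whole text
        opener = FENCE + "python"
        code = text.split(opener)[1].split(FENCE)[0] if opener in text else text
        try:
            tree = ast.parse(code)
            ok = any(
                isinstance(node, ast.FunctionDef) and node.name == "strategy"
                for node in tree.body
            )
        except (SyntaxError, ValueError):
            ok = False
        rewards.append(0.0 if ok else -0.5)
    return rewards


# Assumed wiring, not executed here (the dataset needs a "prompt" column):
# trainer = GRPOTrainer(
#     model=model,
#     reward_funcs=[code_validity_reward],
#     args=grpo_config,
#     train_dataset=prompt_dataset,  # hypothetical dataset of board prompts
# )
# trainer.train()
```

A game-outcome reward would follow the same shape: execute each completion's strategy in the environment and return the resulting `calculate_reward` values as the list.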

## Part 7: Deployment

After training, save and deploy your model.

### Saving the Model

```python
# Save LoRA adapters only
model.save_pretrained("./2048_strategy_model")
tokenizer.save_pretrained("./2048_strategy_model")

# Save merged model for inference
model.save_pretrained_merged(
    "./2048_strategy_model_merged",
    tokenizer,
    save_method="merged_16bit",
)
```

### Push to Hugging Face Hub

```python
# Push the merged 16-bit weights to the Hub with Unsloth's merged-upload helper
model.push_to_hub_merged(
    "your-username/2048-strategy-model",
    tokenizer,
    save_method="merged_16bit",
    private=False,
)

print("Model deployed to: huggingface.co/your-username/2048-strategy-model")
```

### Using the Trained Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load trained model
model = AutoModelForCausalLM.from_pretrained("your-username/2048-strategy-model")
tokenizer = AutoTokenizer.from_pretrained("your-username/2048-strategy-model")

# Generate strategy
def get_action(board: List[List[int]]) -> int:
    prompt = create_prompt(board)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    # Decode only the completion; the prompt contains its own ```python template
    prompt_len = inputs["input_ids"].shape[1]
    generated = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
    action, _ = extract_and_execute_strategy(generated, board)
    return action

# Play a game
with OpenSpielEnv.from_hub("openenv/openspiel-env") as env:
    result = env.reset()
    board = info_state_to_board(result.observation.info_state)

    while not result.done:
        action = get_action(board)
        result = env.step(OpenSpielAction(action_id=action, game_name="2048"))
        board = info_state_to_board(result.observation.info_state)

    print(f"Final max tile: {get_max_tile(board)}")
```

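## Preventing Reward Hacking

Because the model's output is executed as code, be aware of potential reward hacking strategies:

1. **Code that modifies rewards** - Run in a sandboxed environment
2. **Infinite loops** - Set execution timeouts
3. **Memory exhaustion** - Limit resource usage

```python
import resource
import signal

# Small allowlist of builtins. Restricting __builtins__ limits accidents but is
# not a security boundary on its own.
SAFE_BUILTINS = {
    "len": len, "max": max, "min": min, "sum": sum, "abs": abs,
    "range": range, "enumerate": enumerate, "sorted": sorted,
}

def safe_execute(code: str, board: List[List[int]], timeout: float = 5.0) -> int:
    """Execute strategy with safety limits (Unix only, main thread only)."""

    def handler(signum, frame):
        raise TimeoutError("Strategy timed out")

    # Set timeout. setitimer accepts fractional seconds, whereas
    # signal.alarm(int(timeout)) would cancel the alarm for timeouts below 1s.
    old_handler = signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, timeout)

    # RLIMIT_AS caps the address space of the *whole* process, so call this
    # from a lightweight worker process, not from the one holding the model.
    old_limits = resource.getrlimit(resource.RLIMIT_AS)

    try:
        # Set memory limit (100MB), keeping the existing hard limit
        resource.setrlimit(resource.RLIMIT_AS, (100 * 1024 * 1024, old_limits[1]))

        # Execute in restricted namespace
        namespace = {"board": board, "__builtins__": SAFE_BUILTINS}
        exec(code, namespace)

        return namespace.get("strategy", lambda b: 0)(board)
    finally:
        # Always undo the timer, the signal handler, and the memory limit
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, old_handler)
        resource.setrlimit(resource.RLIMIT_AS, old_limits)
```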
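
An alternative that keeps the limits away from the training process is to run each strategy in a separate interpreter. The sketch below is one way to do that with the standard library on a POSIX system: `run_strategy_isolated`, the 1 GiB default cap, and the JSON-over-stdin protocol are all assumptions made for this example, and process isolation of this kind reduces risk but is still not a full security sandbox.

```python
import json
import resource
import subprocess
import sys
from typing import Callable, List

# Program run by the child interpreter: read {"code", "board"} from stdin,
# define the strategy, and print the chosen action as JSON.
_RUNNER = """
import json, sys
payload = json.load(sys.stdin)
namespace = {}
exec(payload["code"], namespace)
print(json.dumps(namespace["strategy"](payload["board"])))
"""


def _limit_memory(max_bytes: int) -> Callable[[], None]:
    def apply() -> None:
        try:
            # Runs in the child only, so the parent's limits are untouched
            resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
        except (ValueError, OSError):
            pass  # limit unsupported on this platform; continue without it
    return apply


def run_strategy_isolated(
    code: str,
    board: List[List[int]],
    timeout: float = 5.0,
    max_bytes: int = 1024 * 1024 * 1024,
    default: int = 0,
) -> int:
    """Run a strategy in a child Python process with time and memory limits."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", _RUNNER],
            input=json.dumps({"code": code, "board": board}),
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed when this expires
            preexec_fn=_limit_memory(max_bytes),
        )
        action = json.loads(proc.stdout)
    except (subprocess.TimeoutExpired, ValueError):
        return default  # timed out, crashed, or printed something unparsable
    if isinstance(action, int) and action in (0, 1, 2, 3):
        return int(action)
    return default


empty_board = [[0] * 4 for _ in range(4)]
print(run_strategy_isolated("def strategy(board):\n    return 2", empty_board))
```

Starting an interpreter per call costs tens of milliseconds, which is usually small next to the time spent generating the strategy with the model.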

## Summary

In this tutorial, you learned:

1. **Model Setup**: Loading LLMs with Unsloth and LoRA
2. **Environment Connection**: Using OpenEnv's 2048 environment
3. **Reward Design**: Creating balanced reward functions
4. **GRPO Training**: Training with reinforcement learning
5. **Deployment**: Saving and sharing trained models

## Next Steps

- Try different model architectures
- Experiment with reward function designs
- Train on other OpenEnv environments
- Share your trained models on Hugging Face Hub!

## Related Resources

- [OpenEnv Getting Started](../auto_getting_started/index)
- [Building Custom Environments](../auto_getting_started/plot_03_building_environments)
- [GRPO Documentation](https://huggingface.co/docs/trl/grpo_trainer)
- [Unsloth Documentation](https://github.com/unslothai/unsloth)