| # OpenEnv: Production RL Made Simple |
|
|
| <div align="center"> |
|
|
| <img src="https://upload.wikimedia.org/wikipedia/commons/1/10/PyTorch_logo_icon.svg" width="200" alt="PyTorch"> |
|
|
| ## From "Hello World" to RL Training in 5 Minutes โจ |
|
|
| **What if RL environments were as easy to use as REST APIs?** |
|
|
| That's OpenEnv. Type-safe. Isolated. Production-ready. ๐ฏ |
|
|
| [Open in Colab](https://colab.research.google.com/github/meta-pytorch/OpenEnv/blob/main/examples/OpenEnv_Tutorial.ipynb) |
| [GitHub repository](https://github.com/meta-pytorch/OpenEnv) |
| [License: BSD 3-Clause](https://opensource.org/licenses/BSD-3-Clause) |
| [PyTorch](https://pytorch.org/) |
|
|
| Author: [Sanyam Bhutani](http://twitter.com/bhutanisanyam1/) |
|
|
| </div> |
|
|
| ## Why OpenEnv? |
|
|
| Let's take a trip down memory lane: |
|
|
| It's 2016 and RL is popular. You read some papers, and it looks promising. |
|
|
| But in the real world, Cartpole is about the best you can run on a gaming GPU. |
|
|
| What do you do beyond Cartpole? |
|
|
| Fast-forward to 2025: GRPO is awesome, and this time it's not JUST theory. It works well in practice and is really here! |
|
|
| The problem remains: how do you take these RL algorithms beyond Cartpole? |
|
|
| A huge part of RL is giving your algorithms access to environments they can learn from. |
|
|
| We are excited to introduce an Environment Spec for adding Open Environments for RL training. It lets you focus on your experiments and lets everyone bring their own environments. |
|
|
| Focus on experiments, use OpenEnvironments, and build agents that go beyond Cartpole on a single spec. |
|
|
| --- |
|
|
| ## ๐ What You'll Learn |
|
|
| <table> |
| <tr> |
| <td width="50%"> |
|
|
| **๐ฏ Part 1-2: The Fundamentals** |
|
|
| - โก RL in 60 seconds |
| - ๐ค Why existing solutions fall short |
| - ๐ก The OpenEnv solution |
|
|
| </td> |
| <td width="50%"> |
|
|
| **๐๏ธ Part 3-5: The Architecture** |
|
|
| - ๐ง How OpenEnv works |
| - ๐ Exploring real code |
| - ๐ฎ OpenSpiel integration example |
|
|
| </td> |
| </tr> |
| <tr> |
| <td width="50%"> |
|
|
| **๐ฎ Part 6-8: Hands-On Demo** |
|
|
| - ๐ Use existing OpenSpiel environment |
| - ๐ค Test 4 different policies |
| - ๐ Watch learning happen live |
|
|
| </td> |
| <td width="50%"> |
|
|
| **๐ง Part 9-10: Going Further** |
|
|
| - ๐ฎ Switch to other OpenSpiel games |
| - โจ Build your own integration |
| - ๐ Deploy to production |
|
|
| </td> |
| </tr> |
| </table> |
|
|
| !!! tip "Pro Tip" |
| This notebook is designed to run top-to-bottom in Google Colab with zero setup! |
| |
| โฑ๏ธ **Time**: ~5 minutes | ๐ **Difficulty**: Beginner-friendly | ๐ฏ **Outcome**: Production-ready RL knowledge |
| |
| --- |
|
|
| ## ๐ Table of Contents |
|
|
| ### Foundation |
|
|
| - [Part 1: RL in 60 Seconds โฑ๏ธ](#part-1-rl-in-60-seconds) |
| - [Part 2: The Problem with Traditional RL ๐ค](#part-2-the-problem-with-traditional-rl) |
| - [Part 3: Setup ๐ ๏ธ](#part-3-setup) |
|
|
| ### Architecture |
|
|
| - [Part 4: The OpenEnv Pattern ๐๏ธ](#part-4-the-openenv-pattern) |
| - [Part 5: Example Integration - OpenSpiel ๐ฎ](#part-5-example-integration---openspiel) |
|
|
| ### Hands-On Demo |
|
|
| - [Part 6: Using Real OpenSpiel 🎮](#part-6-using-real-openspiel) |
| - [Part 7: Four Policies ๐ค](#part-7-four-policies) |
| - [Part 8: Policy Competition! ๐](#part-8-policy-competition) |
|
|
| ### Advanced |
|
|
| - [Part 9: Switching to Other Games 🎮](#part-9-switching-to-other-games) |
| - [Part 10: Create Your Own Integration ๐ ๏ธ](#part-10-create-your-own-integration) |
|
|
| ### Wrap Up |
|
|
| - [Summary: Your Journey ๐](#summary-your-journey) |
| - [Resources ๐](#resources) |
|
|
| --- |
|
|
| (part-1-rl-in-60-seconds)= |
| ## Part 1: RL in 60 Seconds ⏱️ |
|
|
| **Reinforcement Learning is simpler than you think.** |
|
|
| It's just a loop: |
|
|
| ```python |
| while not done: |
|     observation = environment.observe() |
|     action = policy.choose(observation) |
|     reward = environment.step(action) |
|     policy.learn(reward) |
| ``` |
|
|
| That's it. That's RL. |
|
|
| Let's see it in action: |
|
|
| ```python |
| import random |
| |
| print("🎲 " + "="*58 + " 🎲") |
| print(" Number Guessing Game - The Simplest RL Example") |
| print("🎲 " + "="*58 + " 🎲") |
| |
| # Environment setup |
| target = random.randint(1, 10) |
| guesses_left = 3 |
| |
| print("\n🎯 I'm thinking of a number between 1 and 10...") |
| print(f"💭 You have {guesses_left} guesses. Let's see how random guessing works!\n") |
| |
| # The RL Loop - Pure random policy (no learning!) |
| while guesses_left > 0: |
|     # Policy: Random guessing (no learning yet!) |
|     guess = random.randint(1, 10) |
|     guesses_left -= 1 |
| |
|     print(f"💭 Guess #{3-guesses_left}: {guess}", end=" → ") |
| |
|     # Reward signal (but we're not using it!) |
|     if guess == target: |
|         print("🎉 Correct! +10 points") |
|         break |
|     elif abs(guess - target) <= 2: |
|         print("🔥 Warm! (close)") |
|     else: |
|         print("❄️ Cold! (far)") |
| else: |
|     # while/else: runs only when the loop ends without hitting `break` |
|     print(f"\n💔 Out of guesses. The number was {target}.") |
| |
| print("\n" + "="*62) |
| print("💡 This is RL: Observe → Act → Reward → Repeat") |
| print(" But this policy is terrible! It doesn't learn from rewards.") |
| print("="*62 + "\n") |
| ``` |
|
|
| **Output:** |
| ``` |
| 🎲 ========================================================== 🎲 |
| Number Guessing Game - The Simplest RL Example |
| 🎲 ========================================================== 🎲 |
| |
| 🎯 I'm thinking of a number between 1 and 10... |
| 💭 You have 3 guesses. Let's see how random guessing works! |
| |
| 💭 Guess #1: 2 → ❄️ Cold! (far) |
| 💭 Guess #2: 10 → 🎉 Correct! +10 points |
| |
| ============================================================== |
| 💡 This is RL: Observe → Act → Reward → Repeat |
| But this policy is terrible! It doesn't learn from rewards. |
| ============================================================== |
| ``` |
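To contrast with the random guesser above, here is a minimal sketch of a policy that actually uses the feedback signal. The helper names (`feedback`, `play_with_elimination`, `win_rate`) and the candidate-elimination strategy are illustrative assumptions, not part of OpenEnv; the point is only that using the warm/cold signal changes the outcome.

```python
import random
from typing import Optional


def feedback(guess: int, target: int) -> str:
    """Same signal as the game above: correct, warm (within 2), or cold."""
    if guess == target:
        return "correct"
    return "warm" if abs(guess - target) <= 2 else "cold"


def play_with_elimination(target: int, max_guesses: int = 3,
                          rng: Optional[random.Random] = None) -> bool:
    """Guess randomly, but only among numbers consistent with past feedback."""
    rng = rng or random.Random()
    candidates = list(range(1, 11))
    for _ in range(max_guesses):
        guess = rng.choice(candidates)
        signal = feedback(guess, target)
        if signal == "correct":
            return True
        # Keep only the numbers that would have produced the same signal.
        candidates = [c for c in candidates if feedback(guess, c) == signal]
    return False


def win_rate(games: int = 2000, seed: int = 0) -> float:
    """Fraction of games won by the elimination policy (seeded for repeatability)."""
    rng = random.Random(seed)
    wins = sum(play_with_elimination(rng.randint(1, 10), rng=rng)
               for _ in range(games))
    return wins / games
```

Three independent random guesses win about 27% of the time (1 - 0.9³), so calling `win_rate()` gives a quick way to check how much the feedback helps under these assumptions.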
|
|
| --- |
|
|
| (part-2-the-problem-with-traditional-rl)= |
| ## Part 2: The Problem with Traditional RL 🤔 |
|
|
| ### ๐ค Why Can't We Just Use OpenAI Gym? |
|
|
| Good question! Gym is great for research, but production needs more... |
|
|
| | Challenge | Traditional Approach | OpenEnv Solution | |
| |-----------|---------------------|------------------| |
| | **Type Safety** | ❌ `obs[0][3]` - what is this? | ✅ `obs.info_state` - IDE knows! | |
| | **Isolation** | ❌ Same process (can crash your training) | ✅ Docker containers (fully isolated) | |
| | **Deployment** | ❌ "Works on my machine" 🤷 | ✅ Same container everywhere 🐳 | |
| | **Scaling** | ❌ Hard to distribute | ✅ Deploy to Kubernetes ☸️ | |
| | **Language** | ❌ Python only | ✅ Any language (HTTP API) | |
| | **Debugging** | ❌ Cryptic numpy errors | ✅ Clear type errors | |
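The "Type Safety" row can be made concrete with a small sketch. `TypedObservation` below is a hypothetical stand-in for a typed observation such as `OpenSpielObservation`; its fields are chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TypedObservation:
    """Hypothetical stand-in for a typed observation (not the OpenEnv class)."""
    info_state: List[float]
    legal_actions: List[int]
    done: bool = False


# Untyped style: the reader has to remember what each index means.
raw_obs = [[0.0, 1.0, 0.0], [0, 1, 2], False]
legal_from_raw = raw_obs[1]

# Typed style: the field name documents itself and IDEs can autocomplete it.
typed_obs = TypedObservation(info_state=[0.0, 1.0, 0.0], legal_actions=[0, 1, 2])
legal_from_typed = typed_obs.legal_actions


def misspelled_access(obs: TypedObservation) -> str:
    """A typo in a field name fails loudly instead of silently reading wrong data."""
    try:
        return str(obs.legal_action)  # deliberate typo: missing the trailing "s"
    except AttributeError as exc:
        return f"caught: {exc}"
```

A static type checker would flag the misspelled field before the code runs; at runtime it surfaces as an `AttributeError` that names the missing attribute.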
|
|
| ### ๐ก The OpenEnv Philosophy |
|
|
| **"RL environments should be like microservices"** |
|
|
| Think of it like this: You don't run your database in the same process as your web server, right? Same principle! |
|
|
| - ๐ **Isolated**: Run in containers (security + stability) |
| - ๐ **Standard**: HTTP API, works everywhere |
| - ๐ฆ **Versioned**: Docker images (reproducibility!) |
| - ๐ **Scalable**: Deploy to cloud with one command |
| - ๐ก๏ธ **Type-safe**: Catch bugs before they happen |
| - ๐ **Portable**: Works on Mac, Linux, Windows, Cloud |
|
|
| ### The Architecture |
|
|
| ``` |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โ YOUR TRAINING CODE โ |
| โ โ |
| โ env = OpenSpielEnv(...) โ Import the client โ |
| โ result = env.reset() โ Type-safe! โ |
| โ result = env.step(action) โ Type-safe! โ |
| โ โ |
| โโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โ |
| โ HTTP/JSON (Language-Agnostic) |
| โ POST /reset, POST /step, GET /state |
| โ |
| โโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โ DOCKER CONTAINER โ |
| โ โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ FastAPI Server โ โ |
| โ โ โโ Environment (reset, step, state) โ โ |
| โ โ โโ Your Game/Simulation Logic โ โ |
| โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
| โ โ |
| โ Isolated โข Reproducible โข Secure โ |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| ``` |
|
|
| !!! info "Key Insight" |
| You never see HTTP details - just clean Python methods! |
| |
| ```python |
| env.reset() # Under the hood: HTTP POST to /reset |
| env.step(...) # Under the hood: HTTP POST to /step |
| env.state() # Under the hood: HTTP GET to /state |
| ``` |
| |
| The magic? OpenEnv handles all the plumbing. You focus on RL! โจ |
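To make the "under the hood" mapping tangible, here is a self-contained sketch that uses only the Python standard library. `ToyHandler` and `ToyClient` are hypothetical names, and the JSON shapes are assumptions chosen for the demo; OpenEnv's real server is FastAPI-based and its payloads may differ.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ToyHandler(BaseHTTPRequestHandler):
    """Hypothetical counter environment exposed over HTTP."""
    steps = 0  # episode state shared via the class (good enough for a demo)

    def _send(self, payload):
        body = json.dumps(payload).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        data = json.loads(self.rfile.read(length) or b"{}")
        if self.path == "/reset":
            ToyHandler.steps = 0
            self._send({"observation": 0, "reward": 0.0, "done": False})
        elif self.path == "/step":
            ToyHandler.steps += 1
            self._send({"observation": ToyHandler.steps,
                        "reward": float(data.get("action", 0)),
                        "done": ToyHandler.steps >= 3})
        else:
            self.send_error(404)

    def do_GET(self):
        if self.path == "/state":
            self._send({"step_count": ToyHandler.steps})
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo output quiet


class ToyClient:
    """Plain Python methods on the outside, HTTP requests on the inside."""

    def __init__(self, base_url):
        self.base_url = base_url

    def _call(self, method, path, payload=None):
        data = json.dumps(payload or {}).encode() if method == "POST" else None
        req = urllib.request.Request(self.base_url + path, data=data, method=method,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req, timeout=5) as resp:
            return json.loads(resp.read())

    def reset(self):
        return self._call("POST", "/reset")

    def step(self, action):
        return self._call("POST", "/step", {"action": action})

    def state(self):
        return self._call("GET", "/state")


def demo():
    """Start the toy server on a free local port, make three calls, shut down."""
    server = HTTPServer(("127.0.0.1", 0), ToyHandler)  # port 0 = pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        client = ToyClient(f"http://127.0.0.1:{server.server_port}")
        return client.reset(), client.step(1), client.state()
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` returns the three decoded JSON responses. The training code only ever touches `reset()`, `step()`, and `state()`, which is the separation the note above describes.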
| |
| --- |
|
|
| (part-3-setup)= |
| ## Part 3: Setup 🛠️ |
|
|
| **Running in Colab?** This cell will clone OpenEnv and install dependencies automatically. |
|
|
| **Running locally?** Make sure you're in the OpenEnv directory. |
|
|
| ```ipython3 |
| # Detect environment |
| try: |
|     import google.colab |
|     IN_COLAB = True |
|     print("Running in Google Colab - Perfect!") |
| except ImportError: |
|     IN_COLAB = False |
|     print("💻 Running locally - Nice!") |
| |
| if IN_COLAB: |
|     print("\n📦 Cloning OpenEnv repository...") |
|     !git clone https://github.com/meta-pytorch/OpenEnv.git > /dev/null 2>&1 |
|     %cd OpenEnv |
| |
|     print("Installing dependencies (this takes ~10 seconds)...") |
|     !pip install -q fastapi uvicorn requests |
| |
|     import sys |
|     sys.path.insert(0, './src') |
|     print("\n✅ Setup complete! Everything is ready to go!") |
| else: |
|     import sys |
|     from pathlib import Path |
|     # Assumes the notebook runs from a subdirectory of the repo (e.g. examples/) |
|     sys.path.insert(0, str(Path.cwd().parent / 'src')) |
|     print("✅ Using local OpenEnv installation") |
| |
| print("\nReady to explore OpenEnv and build amazing things!") |
| print("💡 Tip: Run cells top-to-bottom for the best experience.\n") |
| ``` |
|
|
| **Output:** |
| ``` |
| 💻 Running locally - Nice! |
| ✅ Using local OpenEnv installation |
| |
| Ready to explore OpenEnv and build amazing things! |
| 💡 Tip: Run cells top-to-bottom for the best experience. |
| ``` |
|
|
| --- |
|
|
| (part-4-the-openenv-pattern)= |
| ## Part 4: The OpenEnv Pattern ๐๏ธ |
|
|
| ### Every OpenEnv Environment Has 3 Components: |
|
|
| ``` |
| src/envs/your_env/ |
| โโโ ๐ models.py โ Type-safe contracts |
| โ (Action, Observation, State) |
| โ |
| โโโ ๐ฑ client.py โ What YOU import |
| โ (HTTPEnvClient implementation) |
| โ |
| โโโ ๐ฅ๏ธ server/ |
| โโโ environment.py โ Game/simulation logic |
| โโโ app.py โ FastAPI server |
| โโโ Dockerfile โ Container definition |
| ``` |
|
|
| Let's explore the actual OpenEnv code to see how this works: |
|
|
| ```python |
| # Import OpenEnv's core abstractions |
| from core.env_server import Environment, Action, Observation, State |
| from core.http_env_client import HTTPEnvClient |
| |
| print("="*70) |
| print(" ๐งฉ OPENENV CORE ABSTRACTIONS") |
| print("="*70) |
| |
| print(""" |
| ๐ฅ๏ธ SERVER SIDE (runs in Docker): |
| |
| class Environment(ABC): |
| '''Base class for all environment implementations''' |
| |
| @abstractmethod |
| def reset(self) -> Observation: |
| '''Start new episode''' |
| |
| @abstractmethod |
| def step(self, action: Action) -> Observation: |
| '''Execute action, return observation''' |
| |
| @property |
| def state(self) -> State: |
| '''Get episode metadata''' |
| |
| ๐ฑ CLIENT SIDE (your training code): |
| |
| class HTTPEnvClient(ABC): |
| '''Base class for HTTP clients''' |
| |
| def reset(self) -> StepResult: |
| # HTTP POST /reset |
| |
| def step(self, action) -> StepResult: |
| # HTTP POST /step |
| |
| def state(self) -> State: |
| # HTTP GET /state |
| """) |
| |
| print("="*70) |
| print("\nโจ Same interface on both sides - communication via HTTP!") |
| print("๐ฏ You focus on RL, OpenEnv handles the infrastructure.\n") |
| ``` |
|
|
| **Output:** |
| ``` |
| ====================================================================== |
| ๐งฉ OPENENV CORE ABSTRACTIONS |
| ====================================================================== |
|
|
| ๐ฅ๏ธ SERVER SIDE (runs in Docker): |
|
|
| class Environment(ABC): |
| '''Base class for all environment implementations''' |
| |
| @abstractmethod |
| def reset(self) -> Observation: |
| '''Start new episode''' |
| |
| @abstractmethod |
| def step(self, action: Action) -> Observation: |
| '''Execute action, return observation''' |
| |
| @property |
| def state(self) -> State: |
| '''Get episode metadata''' |
| |
| ๐ฑ CLIENT SIDE (your training code): |
|
|
| class HTTPEnvClient(ABC): |
| '''Base class for HTTP clients''' |
| |
| def reset(self) -> StepResult: |
| # HTTP POST /reset |
| |
| def step(self, action) -> StepResult: |
| # HTTP POST /step |
| |
| def state(self) -> State: |
| # HTTP GET /state |
| |
| ====================================================================== |
|
|
| โจ Same interface on both sides - communication via HTTP! |
| ๐ฏ You focus on RL, OpenEnv handles the infrastructure. |
| ``` |
| |
| --- |
| |
| (part-5-example-integration---openspiel)= |
| ## Part 5: Example Integration - OpenSpiel 🎮 |
| |
| ### What is OpenSpiel? |
| |
| **OpenSpiel** is a library from DeepMind with **70+ game environments** for RL research. |
| |
| ### OpenEnv's Integration |
| |
| We've wrapped **6 OpenSpiel games** following the OpenEnv pattern: |
| |
| | **๐ฏ Single-Player** | **๐ฅ Multi-Player** | |
| |---------------------|---------------------| |
| | 1. **Catch** - Catch falling ball | 5. **Tic-Tac-Toe** - Classic 3×3 | |
| | 2. **Cliff Walking** - Navigate grid | 6. **Kuhn Poker** - Imperfect info poker | |
| | 3. **2048** - Tile puzzle | | |
| | 4. **Blackjack** - Card game | | |
| |
| This shows how OpenEnv can wrap **any** existing RL library! |
| |
| ```python |
| from envs.openspiel_env.client import OpenSpielEnv |
| |
| print("="*70) |
| print(" ๐ HOW OPENENV WRAPS OPENSPIEL") |
| print("="*70) |
| |
| print(""" |
| class OpenSpielEnv(HTTPEnvClient[OpenSpielAction, OpenSpielObservation]): |
| |
| def _step_payload(self, action: OpenSpielAction) -> dict: |
| '''Convert typed action to JSON for HTTP''' |
| return { |
| "action_id": action.action_id, |
| "game_name": action.game_name, |
| } |
| |
| def _parse_result(self, payload: dict) -> StepResult: |
| '''Parse HTTP JSON response into typed observation''' |
| return StepResult( |
| observation=OpenSpielObservation(...), |
| reward=payload['reward'], |
| done=payload['done'] |
| ) |
| |
| """) |
| |
| print("โ" * 70) |
| print("\nโจ Usage (works for ALL OpenEnv environments):") |
| print(""" |
| env = OpenSpielEnv(base_url="http://localhost:8000") |
|
|
| result = env.reset() |
| # Returns StepResult[OpenSpielObservation] - Type safe! |
|
|
| result = env.step(OpenSpielAction(action_id=2, game_name="catch")) |
| # Type checker knows this is valid! |
|
|
| state = env.state() |
| # Returns OpenSpielState |
| """) |
|
|
| print("โ" * 70) |
| print("\n๐ฏ This pattern works for ANY environment you want to wrap!\n") |
| ``` |
| |
| **Output:** |
| ``` |
| ====================================================================== |
| ๐ HOW OPENENV WRAPS OPENSPIEL |
| ====================================================================== |
|
|
| class OpenSpielEnv(HTTPEnvClient[OpenSpielAction, OpenSpielObservation]): |
|
|
| def _step_payload(self, action: OpenSpielAction) -> dict: |
| '''Convert typed action to JSON for HTTP''' |
| return { |
| "action_id": action.action_id, |
| "game_name": action.game_name, |
| } |
| |
| def _parse_result(self, payload: dict) -> StepResult: |
| '''Parse HTTP JSON response into typed observation''' |
| return StepResult( |
| observation=OpenSpielObservation(...), |
| reward=payload['reward'], |
| done=payload['done'] |
| ) |
| |
|
|
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
|
|
| โจ Usage (works for ALL OpenEnv environments): |
|
|
| env = OpenSpielEnv(base_url="http://localhost:8000") |
| |
| result = env.reset() |
| # Returns StepResult[OpenSpielObservation] - Type safe! |
| |
| result = env.step(OpenSpielAction(action_id=2, game_name="catch")) |
| # Type checker knows this is valid! |
| |
| state = env.state() |
| # Returns OpenSpielState |
| |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| |
| ๐ฏ This pattern works for ANY environment you want to wrap! |
| ``` |
| |
| ### Type-Safe Models |
| |
| ```python |
| # Import OpenSpiel integration models |
| from envs.openspiel_env.models import ( |
|     OpenSpielAction, |
|     OpenSpielObservation, |
|     OpenSpielState |
| ) |
| from dataclasses import fields |
| |
| print("="*70) |
| print(" 🎮 OPENSPIEL INTEGRATION - TYPE-SAFE MODELS") |
| print("="*70) |
| |
| print("\n📤 OpenSpielAction (what you send):") |
| print(" " + "─" * 64) |
| for field in fields(OpenSpielAction): |
|     print(f" • {field.name:20s} : {field.type}") |
| |
| print("\n📥 OpenSpielObservation (what you receive):") |
| print(" " + "─" * 64) |
| for field in fields(OpenSpielObservation): |
|     print(f" • {field.name:20s} : {field.type}") |
| |
| print("\nOpenSpielState (episode metadata):") |
| print(" " + "─" * 64) |
| for field in fields(OpenSpielState): |
|     print(f" • {field.name:20s} : {field.type}") |
| |
| print("\n" + "="*70) |
| print("\n💡 Type safety means:") |
| print(" ✅ Your IDE autocompletes these fields") |
| print(" ✅ Typos are caught before running") |
| print(" ✅ Refactoring is safe") |
| print(" ✅ Self-documenting code\n") |
| ``` |
| |
| **Output:** |
| ``` |
| ====================================================================== |
| ๐ฎ OPENSPIEL INTEGRATION - TYPE-SAFE MODELS |
| ====================================================================== |
| |
| ๐ค OpenSpielAction (what you send): |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โข metadata : typing.Dict[str, typing.Any] |
| โข action_id : int |
| โข game_name : str |
| โข game_params : Dict[str, Any] |
| |
| ๐ฅ OpenSpielObservation (what you receive): |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โข done : <class 'bool'> |
| โข reward : typing.Union[bool, int, float, NoneType] |
| โข metadata : typing.Dict[str, typing.Any] |
| โข info_state : List[float] |
| โข legal_actions : List[int] |
| โข game_phase : str |
| โข current_player_id : int |
| โข opponent_last_action : Optional[int] |
| |
| ๐ OpenSpielState (episode metadata): |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โข episode_id : typing.Optional[str] |
| โข step_count : <class 'int'> |
| โข game_name : str |
| โข agent_player : int |
| โข opponent_policy : str |
| โข game_params : Dict[str, Any] |
| โข num_players : int |
| |
| ====================================================================== |
| |
| 💡 Type safety means: |
| ✅ Your IDE autocompletes these fields |
| ✅ Typos are caught before running |
| ✅ Refactoring is safe |
| ✅ Self-documenting code |
| ``` |
| |
| ### How the Client Works |
| |
| The client **inherits from HTTPEnvClient** and implements 3 methods: |
| |
| 1. `_step_payload()` - Convert action โ JSON |
| 2. `_parse_result()` - Parse JSON โ typed observation |
| 3. `_parse_state()` - Parse JSON โ state |
| |
| That's it! The base class handles all HTTP communication. |
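The division of labour between the base class and the three methods can be sketched without any network at all. Everything below is a hypothetical stand-in: `ToyClientBase` plays the role of `HTTPEnvClient`, `fake_transport` replaces the HTTP call, and the nested JSON layout is an assumption made for the demo.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class ToyAction:
    action_id: int
    game_name: str = "catch"


@dataclass
class ToyObservation:
    info_state: List[float]
    legal_actions: List[int]


@dataclass
class ToyStepResult:
    observation: ToyObservation
    reward: float
    done: bool


@dataclass
class ToyState:
    episode_id: str
    step_count: int


Transport = Callable[[str, Dict[str, Any]], Dict[str, Any]]


class ToyClientBase:
    """Hypothetical stand-in for the base class: it owns the 'HTTP' part."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def step(self, action: ToyAction) -> ToyStepResult:
        payload = self._step_payload(action)       # 1. action -> JSON
        response = self._transport("/step", payload)
        return self._parse_result(response)        # 2. JSON -> typed result

    def state(self) -> ToyState:
        return self._parse_state(self._transport("/state", {}))  # 3. JSON -> state


class ToyCatchClient(ToyClientBase):
    """The subclass only translates between typed objects and JSON dicts."""

    def _step_payload(self, action: ToyAction) -> Dict[str, Any]:
        return {"action_id": action.action_id, "game_name": action.game_name}

    def _parse_result(self, payload: Dict[str, Any]) -> ToyStepResult:
        obs = payload["observation"]
        return ToyStepResult(
            observation=ToyObservation(obs["info_state"], obs["legal_actions"]),
            reward=payload["reward"],
            done=payload["done"],
        )

    def _parse_state(self, payload: Dict[str, Any]) -> ToyState:
        return ToyState(payload["episode_id"], payload["step_count"])


def fake_transport(path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Canned JSON responses; a real client would send an HTTP request here."""
    if path == "/step":
        return {"observation": {"info_state": [0.0, 1.0], "legal_actions": [0, 1, 2]},
                "reward": 0.0, "done": False}
    return {"episode_id": "demo", "step_count": 1}


client = ToyCatchClient(fake_transport)
result = client.step(ToyAction(action_id=2))
state = client.state()
```

Swapping `fake_transport` for a function that performs real requests would leave `ToyCatchClient` unchanged, which is the reason the translation methods are kept separate from the transport.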
| |
| --- |
| |
| (part-6-using-real-openspiel)= |
| ## Part 6: Using Real OpenSpiel 🎮 |
| |
| <div style="text-align: center; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 30px; border-radius: 15px; margin: 30px 0;"> |
| |
| ### Now let's USE a production environment! |
| |
| We'll play **Catch** using OpenEnv's **OpenSpiel integration** ๐ฏ |
| |
| This is a REAL environment running in production at companies! |
| |
| **Get ready for:** |
| |
| - ๐ Using existing environments (not building) |
| - ๐ค Testing policies against real games |
| - ๐ Live gameplay visualization |
| - ๐ฏ Production-ready patterns |
| |
| </div> |
| |
| ### The Game: Catch ๐ด๐ |
| |
| ``` |
| โฌ โฌ ๐ด โฌ โฌ |
| โฌ โฌ โฌ โฌ โฌ |
| โฌ โฌ โฌ โฌ โฌ Ball |
| โฌ โฌ โฌ โฌ โฌ |
| โฌ โฌ โฌ โฌ โฌ falls |
| โฌ โฌ โฌ โฌ โฌ |
| โฌ โฌ โฌ โฌ โฌ down |
| โฌ โฌ โฌ โฌ โฌ |
| โฌ โฌ โฌ โฌ โฌ |
| โฌ โฌ ๐ โฌ โฌ |
| Paddle |
| ``` |
| |
| **Rules:** |
| |
| - 10×5 grid (10 rows, 5 columns) |
| - Ball falls from random column |
| - Move paddle left/right to catch it |
| |
| **Actions:** |
| |
| - `0` = Move LEFT ⬅️ |
| - `1` = STAY 🛑 |
| - `2` = Move RIGHT ➡️ |
| |
| **Reward:** |
| |
| - `+1` if caught ๐ |
| - `0` if missed ๐ข |
| |
| !!! note "Why Catch?" |
| - Simple rules (easy to understand) |
| - Fast episodes (about 10 steps, one per grid row) |
| - Clear success/failure |
| - Part of OpenSpiel's 70+ games! |
| |
| **๐ก The Big Idea:** |
| Instead of building this from scratch, we'll USE OpenEnv's existing OpenSpiel integration. Same interface, but production-ready! |
| |
| ```python |
| from envs.openspiel_env import OpenSpielEnv |
| from envs.openspiel_env.models import ( |
|     OpenSpielAction, |
|     OpenSpielObservation, |
|     OpenSpielState |
| ) |
| from dataclasses import fields |
| |
| print("🎮 " + "="*64 + " 🎮") |
| print(" ✅ Importing Real OpenSpiel Environment!") |
| print("🎮 " + "="*64 + " 🎮\n") |
| |
| print("📦 What we just imported:") |
| print(" • OpenSpielEnv - HTTP client for OpenSpiel games") |
| print(" • OpenSpielAction - Type-safe actions") |
| print(" • OpenSpielObservation - Type-safe observations") |
| print(" • OpenSpielState - Episode metadata\n") |
| |
| print("OpenSpielObservation fields:") |
| print(" " + "─" * 60) |
| for field in fields(OpenSpielObservation): |
|     print(f" • {field.name:25s} : {field.type}") |
| |
| print("\n" + "="*70) |
| print("\n💡 This is REAL OpenEnv code - used in production!") |
| print(" • Wraps 6 OpenSpiel games (Catch, Tic-Tac-Toe, Poker, etc.)") |
| print(" • Type-safe actions and observations") |
| print(" • Works via HTTP (we'll see that next!)\n") |
| ``` |
| |
| **Output:** |
| ``` |
| 🎮 ================================================================ 🎮 |
| ✅ Importing Real OpenSpiel Environment! |
| 🎮 ================================================================ 🎮 |
| |
| ๐ฆ What we just imported: |
| โข OpenSpielEnv - HTTP client for OpenSpiel games |
| โข OpenSpielAction - Type-safe actions |
| โข OpenSpielObservation - Type-safe observations |
| โข OpenSpielState - Episode metadata |
| |
| ๐ OpenSpielObservation fields: |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
| โข done : <class 'bool'> |
| โข reward : typing.Union[bool, int, float, NoneType] |
| โข metadata : typing.Dict[str, typing.Any] |
| โข info_state : List[float] |
| โข legal_actions : List[int] |
| โข game_phase : str |
| โข current_player_id : int |
| โข opponent_last_action : Optional[int] |
| |
| ====================================================================== |
| |
| ๐ก This is REAL OpenEnv code - used in production! |
| โข Wraps 6 OpenSpiel games (Catch, Tic-Tac-Toe, Poker, etc.) |
| โข Type-safe actions and observations |
| โข Works via HTTP (we'll see that next!) |
| ``` |
| |
| --- |
| |
| (part-7-four-policies)= |
| ## Part 7: Four Policies 🤖 |
| |
| Let's test 4 different AI strategies: |
| |
| | Policy | Strategy | Expected Performance | |
| |--------|----------|----------------------| |
| | **๐ฒ Random** | Pick random action every step | ~20% (pure luck) | |
| | **๐ Always Stay** | Never move, hope ball lands in center | ~20% (terrible!) | |
| | **๐ง Smart** | Move paddle toward ball | 100% (optimal!) | |
| | **๐ Learning** | Start random, learn smart strategy | ~85% (improves over time) | |
| |
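| **💡 All four policies share one interface, `select_action(obs)`, so the same pattern carries over to other OpenSpiel games. The paddle heuristics themselves are specific to Catch.** |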
| |
| ```python |
| import random |
| |
| # ============================================================================ |
| # POLICIES - Different AI strategies (adapted for OpenSpiel) |
| # ============================================================================ |
| |
| class RandomPolicy: |
|     """Baseline: Pure random guessing.""" |
|     name = "🎲 Random Guesser" |
| |
|     def select_action(self, obs: OpenSpielObservation) -> int: |
|         return random.choice(obs.legal_actions) |
| |
| |
| class AlwaysStayPolicy: |
|     """Bad strategy: Never moves.""" |
|     name = "🛑 Always Stay" |
| |
|     def select_action(self, obs: OpenSpielObservation) -> int: |
|         return 1  # STAY |
| |
| |
| class SmartPolicy: |
|     """Optimal: Move paddle toward ball.""" |
|     name = "🧠 Smart Heuristic" |
| |
|     def select_action(self, obs: OpenSpielObservation) -> int: |
|         # For Catch: info_state is a flattened 10x5 grid (50 values). |
|         # The ball and the paddle are both encoded as 1.0 in that vector. |
|         info_state = obs.info_state |
|         grid_size = 5  # number of columns |
| |
|         ball_col = None |
|         paddle_col = None |
| |
|         # Ball: the first 1.0 when scanning from the top row down |
|         for idx, val in enumerate(info_state): |
|             if abs(val - 1.0) < 0.01: |
|                 ball_col = idx % grid_size |
|                 break |
| |
|         # Paddle: the 1.0 in the last row (tolerant lookup, no ValueError) |
|         last_row = info_state[-grid_size:] |
|         for col, val in enumerate(last_row): |
|             if abs(val - 1.0) < 0.01: |
|                 paddle_col = col |
|                 break |
| |
|         if ball_col is not None and paddle_col is not None: |
|             if paddle_col < ball_col: |
|                 return 2  # Move RIGHT |
|             elif paddle_col > ball_col: |
|                 return 0  # Move LEFT |
| |
|         return 1  # STAY (aligned, or fallback) |
| |
| |
| class LearningPolicy: |
|     """Simulated RL: Epsilon-greedy exploration.""" |
|     name = "📈 Learning Agent" |
| |
|     def __init__(self): |
|         self.steps = 0 |
|         self.smart_policy = SmartPolicy() |
| |
|     def select_action(self, obs: OpenSpielObservation) -> int: |
|         self.steps += 1 |
| |
|         # Decay exploration rate from 1.0 to a floor of 0.1 over the first 90 steps |
|         epsilon = max(0.1, 1.0 - (self.steps / 100)) |
| |
|         if random.random() < epsilon: |
|             # Explore: random action |
|             return random.choice(obs.legal_actions) |
|         else: |
|             # Exploit: use smart strategy |
|             return self.smart_policy.select_action(obs) |
| |
| |
| print("🤖 " + "="*64 + " 🤖") |
| print(" ✅ 4 Policies Created (Adapted for OpenSpiel)!") |
| print("🤖 " + "="*64 + " 🤖\n") |
| |
| policies = [RandomPolicy(), AlwaysStayPolicy(), SmartPolicy(), LearningPolicy()] |
| for i, policy in enumerate(policies, 1): |
|     print(f" {i}. {policy.name}") |
| |
| print("\n💡 These policies work with OpenSpielObservation!") |
| print(" • Read info_state (flattened grid)") |
| print(" • Use legal_actions") |
| print(" • Work with ANY OpenSpiel game that exposes these!\n") |
| ``` |
| |
| **Output:** |
| ``` |
| 🤖 ================================================================ 🤖 |
| ✅ 4 Policies Created (Adapted for OpenSpiel)! |
| 🤖 ================================================================ 🤖 |
| |
| 1. ๐ฒ Random Guesser |
| 2. ๐ Always Stay |
| 3. ๐ง Smart Heuristic |
| 4. ๐ Learning Agent |
| |
| ๐ก These policies work with OpenSpielObservation! |
| โข Read info_state (flattened grid) |
| โข Use legal_actions |
| โข Work with ANY OpenSpiel game that exposes these! |
| ``` |
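The grid decoding inside `SmartPolicy` is easier to follow on a concrete vector. The sketch below assumes the encoding described in the code comments above (a flattened 10×5 grid where the ball and the paddle are both `1.0`); `make_info_state`, `locate`, and `action_toward_ball` are hypothetical helper names used only for illustration.

```python
from typing import List, Optional, Tuple

ROWS, COLS = 10, 5  # Catch grid size stated in the rules above


def make_info_state(ball_row: int, ball_col: int, paddle_col: int) -> List[float]:
    """Build a flattened grid with the ball and the paddle marked as 1.0."""
    grid = [0.0] * (ROWS * COLS)
    grid[ball_row * COLS + ball_col] = 1.0
    grid[(ROWS - 1) * COLS + paddle_col] = 1.0  # paddle lives in the last row
    return grid


def locate(info_state: List[float]) -> Tuple[Optional[int], Optional[int]]:
    """Return (ball_col, paddle_col); either is None if it cannot be found."""
    above = info_state[:-COLS]      # every row except the paddle row
    last_row = info_state[-COLS:]   # the paddle row
    ball_col = next((i % COLS for i, v in enumerate(above)
                     if abs(v - 1.0) < 0.01), None)
    paddle_col = next((c for c, v in enumerate(last_row)
                       if abs(v - 1.0) < 0.01), None)
    return ball_col, paddle_col


def action_toward_ball(info_state: List[float]) -> int:
    """0 = LEFT, 1 = STAY, 2 = RIGHT, matching the action list above."""
    ball_col, paddle_col = locate(info_state)
    if ball_col is None or paddle_col is None or ball_col == paddle_col:
        return 1
    return 2 if paddle_col < ball_col else 0
```

For example, `action_toward_ball(make_info_state(2, 3, 1))` picks RIGHT because the ball is in column 3 and the paddle is in column 1. The modulo works because index `row * COLS + col` leaves `col` as the remainder when divided by `COLS`.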
| |
| --- |
| |
| (part-8-policy-competition)= |
| ## Part 8: Policy Competition! ๐ |
| |
| Let's run **50 episodes** for each policy against **REAL OpenSpiel** and see who wins! |
| |
| This is production code - every action is an HTTP call to the OpenSpiel server! |
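The evaluation code in this part calls a `run_episode(env, policy, visualize=False)` helper that is not shown in this section. A minimal sketch of what such a helper could look like follows. It is an assumption, not the tutorial's own implementation, and it uses a hypothetical in-process `ToyCatchEnv` so it runs without a server; with the real client you would pass something like `make_action=lambda action_id: OpenSpielAction(action_id=action_id, game_name="catch")`.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyAction:
    action_id: int
    game_name: str = "catch"


@dataclass
class ToyObservation:
    info_state: List[float]
    legal_actions: List[int] = field(default_factory=lambda: [0, 1, 2])


@dataclass
class ToyStepResult:
    observation: ToyObservation
    reward: float = 0.0
    done: bool = False


class ToyCatchEnv:
    """In-process stand-in with the same reset()/step() shape as the client."""

    def __init__(self, rows: int = 10, cols: int = 5, seed: int = 0) -> None:
        self.rows, self.cols = rows, cols
        self._rng = random.Random(seed)

    def _result(self, reward: float = 0.0, done: bool = False) -> ToyStepResult:
        grid = [0.0] * (self.rows * self.cols)
        grid[self.ball_row * self.cols + self.ball_col] = 1.0
        grid[(self.rows - 1) * self.cols + self.paddle_col] = 1.0
        return ToyStepResult(ToyObservation(grid), reward, done)

    def reset(self) -> ToyStepResult:
        self.ball_row = 0
        self.ball_col = self._rng.randrange(self.cols)
        self.paddle_col = self.cols // 2
        return self._result()

    def step(self, action: ToyAction) -> ToyStepResult:
        move = action.action_id - 1  # 0/1/2 -> left/stay/right
        self.paddle_col = min(self.cols - 1, max(0, self.paddle_col + move))
        self.ball_row += 1           # the ball falls one row per step
        done = self.ball_row == self.rows - 1
        caught = done and self.ball_col == self.paddle_col
        return self._result(1.0 if caught else 0.0, done)


def run_episode(env, policy, visualize: bool = False, max_steps: int = 100,
                make_action=ToyAction) -> bool:
    """Play one episode and return True if it ended with a positive reward."""
    result = env.reset()
    for step in range(max_steps):
        if result.done:
            break
        action_id = policy.select_action(result.observation)
        result = env.step(make_action(action_id=action_id))
        if visualize:
            print(f"step {step + 1}: action={action_id} reward={result.reward}")
    return bool(result.done and (result.reward or 0) > 0)


class StayPolicy:
    """Tiny example policy: never move."""

    def select_action(self, obs: ToyObservation) -> int:
        return 1
```

Returning a `bool` keeps `sum(run_episode(...) for ...)` usable as a success counter, which is how the evaluation loop consumes it.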
| |
| ```python |
| def evaluate_policies(env, num_episodes=50): |
|     """Compare all policies over many episodes using real OpenSpiel.""" |
|     policies = [ |
|         RandomPolicy(), |
|         AlwaysStayPolicy(), |
|         SmartPolicy(), |
|         LearningPolicy(), |
|     ] |
| |
|     print("\n🏆 " + "="*66 + " 🏆") |
|     print(f" POLICY SHOWDOWN - {num_episodes} Episodes Each") |
|     print(" Playing against REAL OpenSpiel Catch!") |
|     print("🏆 " + "="*66 + " 🏆\n") |
| |
|     results = [] |
|     for policy in policies: |
|         print(f"⚡ Testing {policy.name}...", end=" ") |
|         # run_episode returns a truthy value when the ball is caught |
|         successes = sum(run_episode(env, policy, visualize=False) |
|                         for _ in range(num_episodes)) |
|         success_rate = (successes / num_episodes) * 100 |
|         results.append((policy.name, success_rate, successes)) |
|         print("✓ Done!") |
| |
|     print("\n" + "="*70) |
|     print(" FINAL RESULTS") |
|     print("="*70 + "\n") |
| |
|     # Sort by success rate (descending) |
|     results.sort(key=lambda x: x[1], reverse=True) |
| |
|     # Award medals to top 3 |
|     medals = ["🥇", "🥈", "🥉", "  "] |
| |
|     for i, (name, rate, successes) in enumerate(results): |
|         medal = medals[i] |
|         bar = "█" * int(rate / 2)  # 50 characters wide at 100% |
|         print(f"{medal} {name:25s} [{bar:<50}] {rate:5.1f}% ({successes}/{num_episodes})") |
| |
|     print("\n" + "="*70) |
|     print("\n✨ Key Insights:") |
|     print(" • Random (~20%): Baseline - pure luck 🎲") |
|     print(" • Always Stay (~20%): Bad strategy - stays center 🛑") |
|     print(" • Smart (100%): Optimal - perfect play! 🧠") |
|     print(" • Learning (~85%): Improves over time 📈") |
|     print("\nThis is Reinforcement Learning + OpenEnv in action:") |
|     print(" 1. We USED existing OpenSpiel environment (didn't build it)") |
|     print(" 2. Type-safe communication over HTTP") |
|     print(" 3. Same code works for ANY OpenSpiel game") |
|     print(" 4. Production-ready architecture\n") |
| |
| # Run the epic competition! |
| # `client` is the OpenSpielEnv instance connected to the running server, |
| # and `run_episode` must already be defined. |
| print("🎮 Starting the showdown against REAL OpenSpiel...\n") |
| evaluate_policies(client, num_episodes=50) |
| ``` |
| |
| --- |
| |
| (part-9-switching-to-other-games)= |
| ## Part 9: Switching to Other Games 🎮 |
| |
| ### What We Just Used: Real OpenSpiel! ๐ |
| |
| In Parts 6-8, we **USED** the existing OpenSpiel Catch environment: |
| |
| | What We Did | How It Works | |
| |-------------|--------------| |
| | **Imported** | OpenSpielEnv client (pre-built) | |
| | **Started** | OpenSpiel server via uvicorn | |
| | **Connected** | HTTP client to server | |
| | **Played** | Real OpenSpiel Catch game | |
| |
| **๐ฏ This is production code!** Every action was an HTTP call to a real OpenSpiel environment. |
| |
| ### ๐ฎ 6 Games Available - Same Interface! |
| |
| The beauty of OpenEnv? **Same code, different games!** |
| |
| ```python |
| # We just used Catch |
| env = OpenSpielEnv(base_url="http://localhost:8000") |
| # game_name="catch" was set via environment variable |
|
|
| # Want Tic-Tac-Toe instead? Just change the game! |
| # Start server with: OPENSPIEL_GAME=tic_tac_toe uvicorn ... |
| # Same client code works! |
| ``` |
| |
| **๐ฎ All 6 Games:** |
| |
| 1. ✅ **`catch`** - What we just used! |
| 2. **`tic_tac_toe`** - Classic 3×3 |
| 3. **`kuhn_poker`** - Imperfect information poker |
| 4. **`cliff_walking`** - Grid navigation |
| 5. **`2048`** - Tile puzzle |
| 6. **`blackjack`** - Card game |
| |
| **All use the exact same OpenSpielEnv client!** |
| |
| ### Try Another Game (Optional): |
| |
| ```python |
| import os |
| import subprocess |
| import sys |
| |
| # Stop the current server (kill the server_process) |
| # Then start a new game. |
| # `work_dir` must point at your OpenEnv checkout. |
| |
| server_process = subprocess.Popen( |
|     [sys.executable, "-m", "uvicorn", |
|      "envs.openspiel_env.server.app:app", |
|      "--host", "0.0.0.0", |
|      "--port", "8000"], |
|     env={**os.environ, |
|          "PYTHONPATH": f"{work_dir}/src", |
|          "OPENSPIEL_GAME": "tic_tac_toe",  # Changed! |
|          "OPENSPIEL_AGENT_PLAYER": "0", |
|          "OPENSPIEL_OPPONENT_POLICY": "random"}, |
|     # ... rest of config |
| ) |
| |
| # Same client works! |
| client = OpenSpielEnv(base_url="http://localhost:8000") |
| result = client.reset()  # Now playing Tic-Tac-Toe! |
| ``` |
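A freshly started server needs a moment before it accepts connections, so calling `reset()` immediately after `Popen` can fail. One way to handle this, sketched here with a hypothetical `wait_for_port` helper built only on the standard library, is to poll the port before creating the client.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0,
                  interval: float = 0.2) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not ready yet, retry shortly
    return False
```

With the server above this would be used as `wait_for_port("localhost", 8000)` before constructing the client. A successful TCP connect only shows that the port is open, not that the application has finished loading, so a production setup may prefer polling an application-level endpoint instead.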
| |
| **๐ก Key Insight**: You don't rebuild anything - you just USE different games with the same client! |
| |
| --- |
| |
| (part-10-create-your-own-integration)= |
| ## Part 10: Create Your Own Integration 🛠️ |
| |
| ### The 5-Step Pattern |
| |
| Want to wrap your own environment in OpenEnv? Here's how: |
| |
| ### Step 1: Define Types (`models.py`) |
| |
| ```python |
| from dataclasses import dataclass |
| from core.env_server import Action, Observation, State |
|
|
| @dataclass |
| class YourAction(Action): |
| action_value: int |
| # Add your action fields |
| |
| @dataclass |
| class YourObservation(Observation): |
| state_data: List[float] |
| done: bool |
| reward: float |
| # Add your observation fields |
| |
| @dataclass |
| class YourState(State): |
| episode_id: str |
| step_count: int |
| # Add your state fields |
| ``` |
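
Because the types are plain dataclasses, they can be turned into JSON-friendly dicts and rebuilt on the other side with the standard library alone. The sketch below uses stand-in base classes so it runs without OpenEnv installed (in a real integration the bases come from `core.env_server`). `GuessAction` and `GuessObservation` are hypothetical names used only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


# Stand-ins for OpenEnv's base classes (assumption: the real ones are dataclasses too).
@dataclass
class Action:
    pass


@dataclass
class Observation:
    pass


@dataclass
class GuessAction(Action):
    action_value: int


@dataclass
class GuessObservation(Observation):
    state_data: List[float] = field(default_factory=list)
    done: bool = False
    reward: float = 0.0


action = GuessAction(action_value=3)
wire = json.dumps(asdict(action))           # the string that would travel over HTTP
restored = GuessAction(**json.loads(wire))  # typed object rebuilt on the other side

obs = GuessObservation(state_data=[0.5, 1.0], done=True, reward=1.0)
obs_roundtrip = GuessObservation(**json.loads(json.dumps(asdict(obs))))
```

A misspelled or missing field fails loudly when the dataclass is constructed, which is the practical benefit of typed models over raw dicts.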
| |
| ### Step 2: Implement Environment (`server/environment.py`) |
|
|
| ```python |
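| from core.env_server import Action, Environment, Observation, State |
| |
| # Assumes the Step 1 types live in models.py, one package level above server/ |
| from ..models import YourObservation, YourState |
| |
| |
| class YourEnvironment(Environment): |
|     def __init__(self): |
|         super().__init__() |
|         self._state = YourState(episode_id="", step_count=0) |
| |
|     def reset(self) -> Observation: |
|         # Initialize your game/simulation |
|         return YourObservation(...) |
| |
|     def step(self, action: Action) -> Observation: |
|         # Execute action, update state |
|         return YourObservation(...) |
| |
|     @property |
|     def state(self) -> State: |
|         return self._state |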
| ``` |
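
To see the `reset` / `step` / `state` contract with the placeholders filled in, here is a toy environment that runs on its own. It is a sketch rather than an OpenEnv class: it skips the `Environment` base class so it has no dependencies, and `CounterEnvironment`, `CounterObservation`, and `CounterState` are hypothetical names.

```python
import uuid
from dataclasses import dataclass


@dataclass
class CounterObservation:
    value: int
    done: bool
    reward: float


@dataclass
class CounterState:
    episode_id: str
    step_count: int


class CounterEnvironment:
    """Toy task: reach `target` by adding the action value at each step."""

    def __init__(self, target: int = 3, max_steps: int = 10):
        self.target = target
        self.max_steps = max_steps
        self._value = 0
        self._state = CounterState(episode_id="", step_count=0)

    def reset(self) -> CounterObservation:
        # Start a fresh episode with a new id and a zeroed counter.
        self._value = 0
        self._state = CounterState(episode_id=str(uuid.uuid4()), step_count=0)
        return CounterObservation(value=0, done=False, reward=0.0)

    def step(self, action_value: int) -> CounterObservation:
        # Apply the action, then decide whether the episode is over.
        self._value += action_value
        self._state.step_count += 1
        reached = self._value == self.target
        done = reached or self._state.step_count >= self.max_steps
        return CounterObservation(self._value, done, 1.0 if reached else 0.0)

    @property
    def state(self) -> CounterState:
        return self._state
```

The episode ends either when the target is reached (reward 1.0) or when `max_steps` runs out (reward 0.0), which mirrors the usual done/reward bookkeeping an environment has to do.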
|
|
| ### Step 3: Create Client (`client.py`) |
|
|
| ```python |
| from core.http_env_client import HTTPEnvClient |
| from core.types import StepResult |
| |
| from .models import YourAction, YourObservation, YourState  # types from Step 1 |
| |
| |
| class YourEnv(HTTPEnvClient[YourAction, YourObservation]): |
|     def _step_payload(self, action: YourAction) -> dict: |
|         """Convert action to JSON""" |
|         return {"action_value": action.action_value} |
| |
|     def _parse_result(self, payload: dict) -> StepResult: |
|         """Parse JSON to observation""" |
|         return StepResult( |
|             observation=YourObservation(...), |
|             reward=payload["reward"], |
|             done=payload["done"], |
|         ) |
| |
|     def _parse_state(self, payload: dict) -> YourState: |
|         return YourState(...) |
| ``` |
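
The two hooks are just a pair of conversions: typed action to dict on the way out, dict to typed result on the way back. The sketch below shows that round trip as plain functions, with a stand-in `StepResult` so it runs without OpenEnv. The nesting of observation fields under an `"observation"` key is an assumption made for this example, not a documented wire format.

```python
from dataclasses import dataclass
from typing import Generic, List, Optional, TypeVar

ObsT = TypeVar("ObsT")


@dataclass
class StepResult(Generic[ObsT]):
    # Stand-in for core.types.StepResult; field names follow the snippet above.
    observation: ObsT
    reward: Optional[float]
    done: bool


@dataclass
class YourAction:
    action_value: int


@dataclass
class YourObservation:
    state_data: List[float]


def step_payload(action: YourAction) -> dict:
    """Typed action -> JSON-serializable dict."""
    return {"action_value": action.action_value}


def parse_result(payload: dict) -> StepResult:
    """JSON dict from the server -> typed result (payload shape is assumed)."""
    obs = YourObservation(state_data=payload["observation"]["state_data"])
    return StepResult(
        observation=obs,
        reward=payload.get("reward"),
        done=payload.get("done", False),
    )
```

Using `.get` with defaults for `reward` and `done` is one possible choice; indexing directly, as in the step above, fails fast when the server omits a field.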
|
|
| ### Step 4: Create Server (`server/app.py`) |
|
|
| ```python |
| from core.env_server import create_fastapi_app |
| from .environment import YourEnvironment  # server/environment.py from Step 2 |
| |
| env = YourEnvironment() |
| app = create_fastapi_app(env) |
| |
| # That's it! OpenEnv creates all endpoints for you. |
| ``` |
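
To make "creates all endpoints" more concrete, here is the same idea sketched with only the standard library: an environment object exposed through HTTP POST routes and called from a small client function. The `/reset` and `/step` routes, the JSON shapes, and `EchoEnvironment` are illustrative assumptions. `create_fastapi_app` builds a FastAPI app, not this handler.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEnvironment:
    """Tiny environment: the observation echoes the last action value."""

    def reset(self) -> dict:
        return {"observation": {"echo": None}, "reward": 0.0, "done": False}

    def step(self, action: dict) -> dict:
        return {"observation": {"echo": action["action_value"]}, "reward": 1.0, "done": False}


def make_handler(env: EchoEnvironment):
    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = json.loads(self.rfile.read(length) or b"{}")
            if self.path == "/reset":
                result = env.reset()
            elif self.path == "/step":
                result = env.step(body)
            else:
                self.send_error(404)
                return
            data = json.dumps(result).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

        def log_message(self, *args):  # keep the example quiet
            pass

    return Handler


# Ignore any system proxy settings so localhost requests go direct.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def post(url: str, payload: dict) -> dict:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with _opener.open(req, timeout=5) as resp:
        return json.loads(resp.read())


server = HTTPServer(("127.0.0.1", 0), make_handler(EchoEnvironment()))  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}"

reset_result = post(f"{base_url}/reset", {})
step_result = post(f"{base_url}/step", {"action_value": 7})

server.shutdown()
server.server_close()
```

Since the client only needs to send JSON over HTTP, the same server could be driven from any language with an HTTP library.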
|
|
| ### Step 5: Dockerize (`server/Dockerfile`) |
|
|
| ```dockerfile |
| FROM python:3.11-slim |
| |
| WORKDIR /app |
| COPY requirements.txt . |
| RUN pip install --no-cache-dir -r requirements.txt |
| |
| COPY . . |
| CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"] |
| ``` |
|
|
| ### Examples to Study |
|
|
| OpenEnv includes 3 complete examples: |
|
|
| 1. **`src/envs/echo_env/`** |
|    - Simplest possible environment |
|    - Great for testing and learning |
| |
| 2. **`src/envs/openspiel_env/`** |
|    - Wraps external library (OpenSpiel) |
|    - Shows integration pattern |
|    - 6 games in one integration |
| |
| 3. **`src/envs/coding_env/`** |
|    - Python code execution environment |
|    - Shows complex use case |
|    - Security considerations |
| |
| **💡 Study these to understand the patterns!** |
| |
| --- |
| |
| (summary-your-journey)= |
| ## Summary: Your Journey |
| |
| ### What You Learned |
| |
| <table> |
| <tr> |
| <td width="50%" style="vertical-align: top;"> |
| |
| ### Concepts |
| |
| ✅ **RL Fundamentals** |
| |
| - The observe-act-reward loop |
| - What makes good policies |
| - Exploration vs exploitation |
| |
| ✅ **OpenEnv Architecture** |
| |
| - Client-server separation |
| - Type-safe contracts |
| - HTTP communication layer |
| |
| ✅ **Production Patterns** |
| |
| - Docker isolation |
| - API design |
| - Reproducible deployments |
| |
| </td> |
| <td width="50%" style="vertical-align: top;"> |
| |
| ### 🛠️ Skills |
| |
| ✅ **Using Environments** |
| |
| - Import OpenEnv clients |
| - Call reset/step/state |
| - Work with typed observations |
| |
| ✅ **Building Environments** |
| |
| - Define type-safe models |
| - Implement Environment class |
| - Create HTTPEnvClient |
| |
| ✅ **Testing & Debugging** |
| |
| - Compare policies |
| - Visualize episodes |
| - Measure performance |
| |
| </td> |
| </tr> |
| </table> |
| |
| ### OpenEnv vs Traditional RL |
| |
| | Feature | Traditional (Gym) | OpenEnv | Winner | |
| |---------|------------------|---------|--------| |
| | **Type Safety** | ❌ Arrays, dicts | ✅ Dataclasses | OpenEnv | |
| | **Isolation** | ❌ Same process | ✅ Docker | OpenEnv | |
| | **Deployment** | ❌ Manual setup | ✅ K8s-ready | OpenEnv | |
| | **Language** | ❌ Python only | ✅ Any (HTTP) | OpenEnv | |
| | **Reproducibility** | ❌ "Works on my machine" | ✅ Same everywhere | OpenEnv | |
| | **Community** | ✅ Large ecosystem | 🟡 Growing | 🤝 Both! | |
| |
| !!! success "The Bottom Line" |
| OpenEnv brings **production engineering** to RL: |
| |
| - Same environments work locally and in production |
| - Type safety catches bugs early |
| - Docker isolation prevents conflicts |
| - HTTP API works with any language |
| |
| **It's RL for 2025 and beyond.** |
| |
| --- |
| |
| (resources)= |
| ## Resources |
| |
| ### Essential Links |
| |
| - **OpenEnv GitHub**: https://github.com/meta-pytorch/OpenEnv |
| - **🎮 OpenSpiel**: https://github.com/google-deepmind/open_spiel |
| - **⚡ FastAPI Docs**: https://fastapi.tiangolo.com/ |
| - **🐳 Docker Guide**: https://docs.docker.com/get-started/ |
| - **🔥 PyTorch**: https://pytorch.org/ |
| |
| ### Documentation Deep Dives |
| |
| - **Environment Creation Guide**: `src/envs/README.md` |
| - **OpenSpiel Integration**: `src/envs/openspiel_env/README.md` |
| - **Example Scripts**: `examples/` |
| - **RFC 001**: [Baseline API Specs](https://github.com/meta-pytorch/OpenEnv/pull/26) |
| |
| ### Community & Support |
| |
| **Supported by amazing organizations:** |
| |
| - 🔥 Meta PyTorch |
| - 🤗 Hugging Face |
| - ⚡ Unsloth AI |
| - Reflection AI |
| - And many more! |
| |
| **License**: BSD 3-Clause (very permissive!) |
| |
| **Contributions**: Always welcome! Check out the issues tab. |
| |
| --- |
| |
| ### What's Next? |
| |
| 1. ⭐ **Star the repo** to show support and stay updated |
| 2. **Try modifying** the Catch game (make it harder? bigger grid?) |
| 3. 🎮 **Explore** other OpenSpiel games |
| 4. 🛠️ **Build** your own environment integration |
| 5. 💬 **Share** what you build with the community! |
| |