Spaces:
Sleeping
Sleeping
2048 Example — Code-as-Policy Pattern
Full code: 2048_example.py
Key Insight: LLM Writes Code, Not Moves
The LLM does NOT play move-by-move. Instead:
- LLM receives a prompt asking it to write a
strategy(board)function - LLM generates Python code wrapped in triple backticks
- Code is extracted, sandboxed (no global access), and executed against the live game
- Reward is based on whether the strategy works
This is "code-as-policy" — the LLM generates an algorithm, not individual actions.
The Prompt
Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "0", "1", "2", "3" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
return "0" # Example
All helper functions should be inside def strategy. Only output the short function strategy.
## Three Reward Functions
| Function | Score | Condition |
|---|---|---|
| `function_works` | +1.0 | Valid Python that compiles |
| | -0.5 | Right structure but exec fails |
| | -2.0 | No function / syntax error |
| `no_cheating` | +1.0 | Only stdlib imports |
| | -20.0 | Non-stdlib imports |
| `strategy_succeeds` | +20.0 | Reaches tile 2048 |
| | +2.0 | Runs but doesn't win |
| | -1.0 | Timeout (>5 sec) |
| | -3.0 | Exception |
## Training Setup
- Model: `unsloth/gpt-oss-20b` with LoRA (r=4)
- Trainer: `trl.GRPOTrainer` with `trl.GRPOConfig`
- Dataset: 1000 copies of the same prompt (diversity from temperature=1.0)
- `num_generations=2`, `max_steps=600`, `lr=2e-4`, `optim=adamw_8bit`
- ~5 hours on T4, rewards start appearing after ~100 steps
## OpenEnv-Specific Patterns
```python
# Launch environment server
from unsloth import launch_openenv
launch_openenv = functools.partial(
launch_openenv,
working_directory=working_directory,
server="envs.openspiel_env.server.app:app",
environment={**os.environ, "OPENSPIEL_GAME": "2048", ...},
openenv_class=OpenSpielEnv,
)
# Reset and step
port, openenv_process = launch_openenv(port, openenv_process)
result = openenv_process.reset()
result = openenv_process.step(OpenSpielAction(action_id=0, game_name="2048"))
Safety Utilities (from Unsloth)
check_python_modules(code)— returns (ok, info); ok=True if only stdlib importscreate_locked_down_function(code)— sandboxed exec, no global variable leakageexecute_with_time_limit(seconds)— decorator for timeout enforcement