Spaces:

openenv-community
/

optigami

Sleeping

App Files Files Community

optigami / research /openenv /2048_pattern.md

sissississi

go-back (#6)

e9b7141 28 days ago

preview code

raw

history blame contribute delete

2.58 kB

2048 Example — Code-as-Policy Pattern

Full code: 2048_example.py

Key Insight: LLM Writes Code, Not Moves

The LLM does NOT play move-by-move. Instead:

LLM receives a prompt asking it to write a strategy(board) function
LLM generates Python code wrapped in triple backticks
Code is extracted, sandboxed (no global access), and executed against the live game
Reward is based on whether the strategy works

This is "code-as-policy" — the LLM generates an algorithm, not individual actions.

The Prompt

Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "0", "1", "2", "3" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
    return "0" # Example

All helper functions should be inside def strategy. Only output the short function strategy.


## Three Reward Functions

| Function | Score | Condition |
|---|---|---|
| `function_works` | +1.0 | Valid Python that compiles |
| | -0.5 | Right structure but exec fails |
| | -2.0 | No function / syntax error |
| `no_cheating` | +1.0 | Only stdlib imports |
| | -20.0 | Non-stdlib imports |
| `strategy_succeeds` | +20.0 | Reaches tile 2048 |
| | +2.0 | Runs but doesn't win |
| | -1.0 | Timeout (>5 sec) |
| | -3.0 | Exception |

## Training Setup

- Model: `unsloth/gpt-oss-20b` with LoRA (r=4)
- Trainer: `trl.GRPOTrainer` with `trl.GRPOConfig`
- Dataset: 1000 copies of the same prompt (diversity from temperature=1.0)
- `num_generations=2`, `max_steps=600`, `lr=2e-4`, `optim=adamw_8bit`
- ~5 hours on T4, rewards start appearing after ~100 steps

## OpenEnv-Specific Patterns

```python
# Launch environment server
from unsloth import launch_openenv
launch_openenv = functools.partial(
    launch_openenv,
    working_directory=working_directory,
    server="envs.openspiel_env.server.app:app",
    environment={**os.environ, "OPENSPIEL_GAME": "2048", ...},
    openenv_class=OpenSpielEnv,
)

# Reset and step
port, openenv_process = launch_openenv(port, openenv_process)
result = openenv_process.reset()
result = openenv_process.step(OpenSpielAction(action_id=0, game_name="2048"))

Safety Utilities (from Unsloth)

check_python_modules(code) — returns (ok, info); ok=True if only stdlib imports
create_locked_down_function(code) — sandboxed exec, no global variable leakage
execute_with_time_limit(seconds) — decorator for timeout enforcement