Spaces:

openenv-community
/

optigami

Sleeping

App Files Files Community

optigami / research /openenv /2048_pattern.md

sissississi

go-back (#6)

e9b7141 about 1 month ago

preview code

raw

history blame contribute delete

2.58 kB

	# 2048 Example — Code-as-Policy Pattern

	Full code: [2048_example.py](./2048_example.py)

	## Key Insight: LLM Writes Code, Not Moves

	The LLM does NOT play move-by-move. Instead:
	1. LLM receives a prompt asking it to write a `strategy(board)` function
	2. LLM generates Python code wrapped in triple backticks
	3. Code is extracted, sandboxed (no global access), and executed against the live game
	4. Reward is based on whether the strategy works

	This is "code-as-policy" — the LLM generates an algorithm, not individual actions.

	## The Prompt

	```
	Create a new short 2048 strategy using only native Python code.
	You are given a list of list of numbers for the current board state.
	Output one action for "0", "1", "2", "3" on what is the optimal next step.
	Output your new short function in backticks using the format below:
	```python
	def strategy(board):
	return "0" # Example
	```
	All helper functions should be inside def strategy. Only output the short function `strategy`.
	```

	## Three Reward Functions

	\| Function \| Score \| Condition \|
	\|---\|---\|---\|
	\| `function_works` \| +1.0 \| Valid Python that compiles \|
	\| \| -0.5 \| Right structure but exec fails \|
	\| \| -2.0 \| No function / syntax error \|
	\| `no_cheating` \| +1.0 \| Only stdlib imports \|
	\| \| -20.0 \| Non-stdlib imports \|
	\| `strategy_succeeds` \| +20.0 \| Reaches tile 2048 \|
	\| \| +2.0 \| Runs but doesn't win \|
	\| \| -1.0 \| Timeout (>5 sec) \|
	\| \| -3.0 \| Exception \|

	## Training Setup

	- Model: `unsloth/gpt-oss-20b` with LoRA (r=4)
	- Trainer: `trl.GRPOTrainer` with `trl.GRPOConfig`
	- Dataset: 1000 copies of the same prompt (diversity from temperature=1.0)
	- `num_generations=2`, `max_steps=600`, `lr=2e-4`, `optim=adamw_8bit`
	- ~5 hours on T4, rewards start appearing after ~100 steps

	## OpenEnv-Specific Patterns

	```python
	# Launch environment server
	from unsloth import launch_openenv
	launch_openenv = functools.partial(
	launch_openenv,
	working_directory=working_directory,
	server="envs.openspiel_env.server.app:app",
	environment={**os.environ, "OPENSPIEL_GAME": "2048", ...},
	openenv_class=OpenSpielEnv,
	)

	# Reset and step
	port, openenv_process = launch_openenv(port, openenv_process)
	result = openenv_process.reset()
	result = openenv_process.step(OpenSpielAction(action_id=0, game_name="2048"))
	```

	## Safety Utilities (from Unsloth)

	- `check_python_modules(code)` — returns (ok, info); ok=True if only stdlib imports
	- `create_locked_down_function(code)` — sandboxed exec, no global variable leakage
	- `execute_with_time_limit(seconds)` — decorator for timeout enforcement