timaeus/jaxgmg_checkpoints: README.md, last updated by davidquarel (commit 8fff083)

### OBSOLETE

Models trained with Matt's original training code while we were getting to grips with the setup.
We have since moved to our own training loop in PyTorch, and these checkpoints are no longer useful.


# JAXGMG checkpoints

Goal misgeneralisation models trained with [jaxgmg](https://github.com/timaeus-research/jaxgmg); see the `david` branch for more details.


## matt-ckpt

Path pattern:
```
checkpoints/dr-*/7168
```

Trained on a 15x15 grid.
The `dr-*` directory name encodes the alpha parameter, e.g. `dr-3eneg5` means alpha=3e-5.
Little is known about these models, but they are useful for debugging.
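This is a minimal sketch for recovering alpha from a `dr-*` name. It assumes every name follows the single documented example, with `neg` standing for the minus sign in the exponent. `decode_alpha` is a hypothetical helper and is not part of jaxgmg.

```bash
# decode_alpha NAME_OR_PATH
# Hypothetical helper: finds the dr-* component (e.g. dr-3eneg5, or a full
# path such as checkpoints/dr-3eneg5/7168) and prints the alpha it encodes.
# Assumes "neg" always encodes the minus sign of the exponent.
decode_alpha() {
    if [[ $1 =~ dr-([^/]+) ]]; then
        local encoded=${BASH_REMATCH[1]}
        echo "${encoded//neg/-}"    # 3eneg5 -> 3e-5
    else
        echo "not a dr-* name: $1" >&2
        return 1
    fi
}
```

For example, `for d in checkpoints/dr-*/7168; do echo "$d: alpha=$(decode_alpha "$d")"; done` labels each checkpoint with its alpha.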


## alpha-blocks-13x13-sweep

Path pattern:
```
checkpoints/run-alpha_${alpha}-steps_200M/files/checkpoints/${checkpoint_number}
```

All models were trained with the `blocks` environment layout (`--env-layout blocks`) on a 13x13 world.
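`${checkpoint_number}` is an integer, so a plain lexical listing puts `1024` before `128`. The sketch below lists a run's checkpoints in numeric order. It assumes the directory layout given by the path pattern, and `list_checkpoints` is a hypothetical helper.

```bash
# list_checkpoints RUN_DIR
# Hypothetical helper: prints the checkpoint numbers found under
# RUN_DIR/files/checkpoints, one per line, in numeric order.
list_checkpoints() {
    local ckpt_dir="$1/files/checkpoints"
    if [[ ! -d "$ckpt_dir" ]]; then
        echo "no checkpoint directory: $ckpt_dir" >&2
        return 1
    fi
    local d
    for d in "$ckpt_dir"/*/; do
        [[ -d "$d" ]] || continue    # skip the literal pattern if nothing matched
        d=${d%/}                     # drop the trailing slash
        echo "${d##*/}"              # keep only the directory name
    done | sort -n                   # numeric, not lexical, order
}
```

For example, `list_checkpoints checkpoints/run-alpha_1e1-steps_200M` lists the checkpoints of the alpha=1e-1 run named in this section.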

```bash
#!/bin/bash
# Sweep over alpha (--prob-shift): blocks layout, 13x13 world, 200M env steps per run.
for alpha in 1e-0 1e-1 1e-2 1e-3 1e-4 3.3e-1 3.3e-2 3.3e-3 3.3e-4; do
    python -m jaxgmg train corner \
        --num-total-env-steps 200_000_000 --keep-all-checkpoints --num-cycles-per-checkpoint 64 \
        --wandb-project jaxgmg2 --wandb-name "alpha:${alpha}-steps:200M-theta:0" \
        --prob-shift "${alpha}" --env-size 13 --env-layout blocks
done
```

Theta specifies the reward function:
`reward = proxy_goal * theta + true_goal * (1 - theta)`

All of these models were trained with theta=0, i.e. the reward comes only from the true goal of getting the cheese.
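As a worked check of the formula, here is a tiny hypothetical helper (not part of jaxgmg) that mixes a proxy reward and a true reward for a given theta. The inputs are illustrative per-episode values, such as 1 if the goal was reached and 0 otherwise.

```bash
# mixed_reward PROXY_GOAL TRUE_GOAL THETA
# Hypothetical helper: evaluates proxy_goal * theta + true_goal * (1 - theta).
# awk does the floating-point arithmetic that bash lacks.
mixed_reward() {
    awk -v proxy="$1" -v true_goal="$2" -v theta="$3" \
        'BEGIN { print proxy * theta + true_goal * (1 - theta) }'
}
```

With theta=0, as in these runs, `mixed_reward 1 0 0` prints `0` (reaching only the proxy goal earns nothing), while `mixed_reward 0 1 0` prints `1`.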

The alpha parameter (`--prob-shift`) controls the fraction of distinguishing vs. non-distinguishing environments.
For example, alpha=1 means the agent always sees distinguishing environments (ones where the cheese is not in the corner),
and alpha=0 means the agent always sees non-distinguishing environments (ones where the cheese is always in the corner).
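One way to picture this assumes, without confirmation here, that each training level is independently made distinguishing with probability alpha. The sketch below counts how many of N simulated levels would be distinguishing. `sample_levels` is hypothetical and does not reproduce jaxgmg's level generator.

```bash
# sample_levels ALPHA N SEED
# Hypothetical sketch: each of N levels is distinguishing (cheese away from the
# corner) with probability ALPHA; prints how many of the N were distinguishing.
sample_levels() {
    awk -v alpha="$1" -v n="$2" -v seed="$3" 'BEGIN {
        srand(seed)
        count = 0
        for (i = 0; i < n; i++)
            if (rand() < alpha) count++   # rand() is uniform on [0, 1)
        print count
    }'
}
```

The extremes match the description above: `sample_levels 1 1000 0` prints `1000` and `sample_levels 0 1000 0` prints `0`. Intermediate values such as `sample_levels 3.3e-2 1000 0` give roughly alpha × N.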

Checkpoints are taken every 64 cycles. At roughly 512 env steps per cycle, that would be roughly 64 × 512 ≈ 32k env steps per checkpoint, though these figures are unconfirmed.

For example, the path `run-alpha_1e1-steps_200M/files/checkpoints/128` corresponds to training with alpha=1e-1, after 128 cycles (the second checkpoint).
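To turn a checkpoint number into an approximate env-step count, the sketch below assumes, as the example path suggests, that the checkpoint number is the cycle count. The steps-per-cycle figure is a parameter because the ~512 estimate is unconfirmed. `ckpt_to_env_steps` is hypothetical.

```bash
# ckpt_to_env_steps CHECKPOINT_NUMBER [ENV_STEPS_PER_CYCLE]
# Hypothetical helper: assumes the checkpoint number equals the number of
# training cycles completed; the default of 512 env steps per cycle is the
# unconfirmed estimate from this README, so pass the real value if known.
ckpt_to_env_steps() {
    local cycles=$1
    local steps_per_cycle=${2:-512}
    echo $(( cycles * steps_per_cycle ))
}
```

Under the default, `ckpt_to_env_steps 128` prints `65536`, i.e. two checkpoints of 64 × 512 = 32768 env steps each.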

## alpha-tree-13x13-sweep

Same as `alpha-blocks-13x13-sweep`, but trained with `--env-layout tree`.