### Not strictly obsolete, but reduced precision was found to add noise to the training dynamics, so these runs were discontinued.

A series of training runs with alpha=0.47 and gamma (discount_rate) = 0.99,
trained in bfloat16 reduced precision. We found that these training
runs were much less stable under bfloat16 [(wandb here)](https://wandb.ai/devinterp/jaxgmg_3phase_bf16), so
this path was abandoned. The models are kept for posterity.
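As a rough illustration of why bfloat16 can perturb training, the sketch below emulates bfloat16 rounding in plain Python and applies an update of the size of the learning rate used here (`lr=5e-05`) to a weight of 1.0. The helper `to_bf16` is hypothetical and not part of the training code, and this is only one possible mechanism: the source does not state which tensors were held in bfloat16, so it may not be what caused the instability in these runs.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even).

    Hypothetical helper for illustration only; NaN and infinity are not handled.
    """
    # Reinterpret the value as float32 bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of float32; round to nearest, ties to even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


lr = 5e-05  # learning rate from the hyperparameters of these runs

# bfloat16 has 8 bits of significand precision, so the next value above 1.0
# is 1.0 + 2**-7, far larger than a single step of size lr.
next_above_one = to_bf16(1.0 + 2.0 ** -7)

# Apply 100 tiny updates, rounding to bfloat16 after each one.
w_bf16 = 1.0
w_full = 1.0
for _ in range(100):
    w_bf16 = to_bf16(w_bf16 - lr)  # the update is rounded away every time
    w_full = w_full - lr           # full-precision reference

print(next_above_one, w_bf16, w_full)
```

In this toy setting the bfloat16 weight never moves, while the full-precision weight drifts by about 0.005. Real mixed-precision setups often keep a higher-precision copy of the weights to avoid exactly this, so treat the example as intuition rather than a diagnosis.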

Hyperparams:
```
rl_action=train
num_rollout_steps=64
lr=5e-05
discount_rate=0.99
eff_horizon=None
eval_every=1
use_wandb=True
use_hf=True
use_log=True
num_total_env_steps=5000000000
checkpoint=al_0.47_g_0.99_100_bf16
render_sixel=True
sixel_loc=(7, 7)
seed=100
mask_type=first_episode
penalize_time=False
optim=adam
live_monitor=False
use_bf16=True
checkpoint_schedule=0:8
grad_acc_per_chunk=16
num_rollout_chunks=1
cheese_loc=any
env_layout=open
alpha=0.47
env_size=13
num_levels=9600
f_str_ckpt=al_{alpha}_g_{discount_rate}_{seed}_bf16
wandb_project=jaxgmg_3phase_bf16
ckpt_dir=jaxgmg_3phase_bf16
duplication_factor=-1
smoke=False
num_chains=6
num_draws=3000
on_policy=True
nbeta=3000
localization=10
exact_solver_each_draw=False
llc_optimizer=sgld
iw_clip_eps=None
rmsprop_burnin=20
llc_data_file=llc_scan_open_reinforce.pkl
llc_checkpoint_index=0
repo_id=davidquarel/jaxgmg_ckpt_zip
use_shuffled_checkpoints=0
force_re_download=False
```
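The `checkpoint` value above appears to be `f_str_ckpt` filled in with `alpha`, `discount_rate`, and `seed`. A minimal sketch of that relationship, assuming the template is expanded with Python's `str.format` (the training code may do this differently; `parse_hparams` is a hypothetical helper for reading the `key=value` dump):

```python
# A few lines copied from the hyperparameter dump above.
dump = """\
alpha=0.47
discount_rate=0.99
seed=100
f_str_ckpt=al_{alpha}_g_{discount_rate}_{seed}_bf16
checkpoint=al_0.47_g_0.99_100_bf16
"""


def parse_hparams(text: str) -> dict:
    """Parse `key=value` lines into a dict of strings (hypothetical helper)."""
    hparams = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        hparams[key] = value
    return hparams


hparams = parse_hparams(dump)

# Assumption: the template is filled with str.format using the other fields.
name = hparams["f_str_ckpt"].format(**hparams)
print(name)
```

Under that assumption, the expanded name matches the `checkpoint` field, which is handy when locating a particular run's files by its alpha, discount rate, and seed.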