### Not strictly obsolete, but reduced precision was found to add noise to training dynamics, so these runs were discontinued.

A series of training runs with alpha=0.47 and gamma (`discount_rate`) = 0.99, using bfloat16 reduced precision. We found that these training runs were not nearly as stable ([wandb here](https://wandb.ai/devinterp/jaxgmg_3phase_bf16)), so this path was abandoned. The models are kept for posterity.

Hyperparams:

```
rl_action=train
num_rollout_steps=64
lr=5e-05
discount_rate=0.99
eff_horizon=None
eval_every=1
use_wandb=True
use_hf=True
use_log=True
num_total_env_steps=5000000000
checkpoint=al_0.47_g_0.99_100_bf16
render_sixel=True
sixel_loc=(7, 7)
seed=100
mask_type=first_episode
penalize_time=False
optim=adam
live_monitor=False
use_bf16=True
checkpoint_schedule=0:8
grad_acc_per_chunk=16
num_rollout_chunks=1
cheese_loc=any
env_layout=open
alpha=0.47
env_size=13
num_levels=9600
f_str_ckpt=al_{alpha}_g_{discount_rate}_{seed}_bf16
wandb_project=jaxgmg_3phase_bf16
ckpt_dir=jaxgmg_3phase_bf16
duplication_factor=-1
smoke=False
num_chains=6
num_draws=3000
on_policy=True
nbeta=3000
localization=10
exact_solver_each_draw=False
llc_optimizer=sgld
iw_clip_eps=None
rmsprop_burnin=20
llc_data_file=llc_scan_open_reinforce.pkl
llc_checkpoint_index=0
repo_id=davidquarel/jaxgmg_ckpt_zip
use_shuffled_checkpoints=0
force_re_download=False
```
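
To compare these settings against other runs, it can help to load the `key=value` dump above into a dictionary. The following is a minimal sketch, not part of the training code: the helper name `parse_hyperparams` is hypothetical, and it assumes keys are plain identifiers and that values never contain an `=` sign. It accepts the pairs separated by spaces or by newlines, and keeps values such as `(7, 7)` intact.

```python
import ast
import re

# A key is a run of word characters followed by "=". The value runs lazily
# up to the next " key=" or to the end of the text, so values that contain
# spaces, such as "(7, 7)", stay in one piece.
_PAIR = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|\s*$)", re.DOTALL)


def parse_hyperparams(text):
    """Parse a whitespace-separated key=value dump into a dict (hypothetical helper)."""
    params = {}
    for key, raw in _PAIR.findall(text):
        raw = raw.strip()
        try:
            # Converts numbers, booleans, None, and tuples to Python values.
            params[key] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything else (e.g. "adam", "0:8") is kept as a plain string.
            params[key] = raw
    return params


if __name__ == "__main__":
    sample = "lr=5e-05 discount_rate=0.99 sixel_loc=(7, 7) use_bf16=True optim=adam"
    for name, value in parse_hyperparams(sample).items():
        print(f"{name}: {value!r}")
```

Values that are not Python literals, such as `checkpoint_schedule=0:8` or the `f_str_ckpt` template, are deliberately left as strings rather than interpreted.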